Discussion:
[ceph-users] any recommendation of using EnhanceIO?
German Anders
2015-07-01 20:13:03 UTC
Permalink
Hi cephers,

Is anyone out there who has implemented EnhanceIO in a production
environment? Any recommendations? Any perf output to share showing the
difference between using it and not?

Thanks in advance,

*German*
Dominik Zalewski
2015-07-01 21:29:43 UTC
Permalink
Hi,


I asked the same question a week or so ago (just search the mailing list
archives for EnhanceIO :) and got some interesting answers.


Looks like the project is pretty much dead since it was bought out by HGST.
Even their website has some broken links regarding EnhanceIO.


I'm keen to try flashcache or bcache (it's been in the mainline kernel for
some time).
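Checking whether a given kernel already ships it only takes a moment (a quick sketch; the config file location varies by distro):

# bcache was merged in mainline in 3.10
grep BCACHE /boot/config-$(uname -r)
modprobe bcache && ls /sys/fs/bcache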


Dominik

On Wed, Jul 1, 2015 at 9:13 PM, German Anders <***@despegar.com> wrote:

> Hi cephers,
>
> Is anyone out there that implement enhanceIO in a production
> environment? any recommendation? any perf output to share with the diff
> between using it and not?
>
> Thanks in advance,
>
> *German*
>
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
Burkhard Linke
2015-07-02 07:06:47 UTC
Permalink
Hi,

On 07/01/2015 10:13 PM, German Anders wrote:
> Hi cephers,
>
> Is anyone out there that implement enhanceIO in a production
> environment? any recommendation? any perf output to share with the
> diff between using it and not?

I've used EnhanceIO as an accelerator for our MySQL server, but I had to
discard it after a fatal kernel crash related to the module.

In my experience it is stable in write-through mode, but write-back is
buggy. Since the latter is the interesting one in almost any use case, I
would not recommend using it.
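For anyone who still wants to evaluate it in the safer write-through mode, the cache is normally created with EnhanceIO's eio_cli tool. A rough sketch only; device and cache names are placeholders, and the exact options and /proc paths should be checked against the eio_cli version you build:

# write-through cache in front of the MySQL data disk
# (assumes /dev/sdb is the slow disk and /dev/sdc the SSD partition)
eio_cli create -d /dev/sdb -s /dev/sdc -m wt -c mysql_cache

# statistics are exposed under /proc/enhanceio/<cache_name>/ (path assumed)
cat /proc/enhanceio/mysql_cache/stats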

Best regards,
Burkhard Linke
Emmanuel Florac
2015-07-02 09:29:49 UTC
Permalink
Le Wed, 1 Jul 2015 17:13:03 -0300
German Anders <***@despegar.com> écrivait:

> Hi cephers,
>
> Is anyone out there that implement enhanceIO in a production
> environment? any recommendation? any perf output to share with the
> diff between using it and not?

I've tried EnhanceIO back when it wasn't too stale, but never put it in
production. I've set up bcache on trial; it has its problems (load is
stuck at 1.0 because of the bcache_writeback kernel thread, and I
suspect a crash was due to it) but it works pretty well overall.

--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <***@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
Jan Schermer
2015-07-02 10:27:56 UTC
Permalink
I think I posted my experience here ~1 month ago.

My advice for EnhanceIO: don’t use it.

But you didn’t exactly say what you want to cache - do you want to cache the OSD filestore disks? RBD devices on hosts? RBD devices inside guests?

Jan

> On 02 Jul 2015, at 11:29, Emmanuel Florac <***@intellique.com> wrote:
>
> Le Wed, 1 Jul 2015 17:13:03 -0300
> German Anders <***@despegar.com> écrivait:
>
>> Hi cephers,
>>
>> Is anyone out there that implement enhanceIO in a production
>> environment? any recommendation? any perf output to share with the
>> diff between using it and not?
>
> I've tried EnhanceIO back when it wasn't too stale, but never put it in
> production. I've set up bcache on trial, it has its problems (load is
> stuck at 1.0 because of the bcache_writeback kernel thread, and I
> suspect a crash was due to it) but works pretty well overall.
>
> --
> ------------------------------------------------------------------------
> Emmanuel Florac | Direction technique
> | Intellique
> | <***@intellique.com>
> | +33 1 78 94 84 02
> ------------------------------------------------------------------------
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
German Anders
2015-07-02 10:48:48 UTC
Permalink
The idea is to cache RBD at the host level. It could also be possible to
cache at the OSD level. We have high iowait and we need to lower it a bit,
since we are already getting the maximum from our SAS disks, 100-110 IOPS
per disk (3TB OSDs). Any advice? Flashcache?


On Thursday, July 2, 2015, Jan Schermer <***@schermer.cz> wrote:

> I think I posted my experience here ~1 month ago.
>
> My advice for EnhanceIO: don’t use it.
>
> But you didn’t exactly say what you want to cache - do you want to cache
> the OSD filestore disks? RBD devices on hosts? RBD devices inside guests?
>
> Jan
>
> > On 02 Jul 2015, at 11:29, Emmanuel Florac <***@intellique.com
> <javascript:;>> wrote:
> >
> > Le Wed, 1 Jul 2015 17:13:03 -0300
> > German Anders <***@despegar.com <javascript:;>> écrivait:
> >
> >> Hi cephers,
> >>
> >> Is anyone out there that implement enhanceIO in a production
> >> environment? any recommendation? any perf output to share with the
> >> diff between using it and not?
> >
> > I've tried EnhanceIO back when it wasn't too stale, but never put it in
> > production. I've set up bcache on trial, it has its problems (load is
> > stuck at 1.0 because of the bcache_writeback kernel thread, and I
> > suspect a crash was due to it) but works pretty well overall.
> >
> > --
> > ------------------------------------------------------------------------
> > Emmanuel Florac | Direction technique
> > | Intellique
> > | <***@intellique.com <javascript:;>>
> > | +33 1 78 94 84 02
> > ------------------------------------------------------------------------
> > _______________________________________________
> > ceph-users mailing list
> > ceph-***@lists.ceph.com <javascript:;>
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>

--

*German Anders*
Storage System Engineer Leader
*Despegar* | IT Team
*office* +54 11 4894 3500 x3408
*mobile* +54 911 3493 7262
*mail* ***@despegar.com
Jan Schermer
2015-07-02 10:51:10 UTC
Permalink
Tune the OSDs or add more OSDs (if the problem is really in the disks).

Can you post iostat output for the disks that are loaded? (iostat -mx 1 /dev/sdX, a few lines)
What drives are those? What controller?
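Something like the following is enough to get a feel for it (a short sketch; adjust the device list to the actual OSD data disks):

# five one-second samples, extended stats in MB/s, OSD data disks only
iostat -mx 1 5 /dev/sd[c-n]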

Jan

> On 02 Jul 2015, at 12:48, German Anders <***@despegar.com> wrote:
>
> The idea is to cache rbd at a host level. Also could be possible to cache at the osd level. We have high iowait and we need to lower it a bit, since we are getting the max from our sas disks 100-110 iops per disk (3TB osd's), any advice? Flashcache?
>
>
> On Thursday, July 2, 2015, Jan Schermer <***@schermer.cz <mailto:***@schermer.cz>> wrote:
> I think I posted my experience here ~1 month ago.
>
> My advice for EnhanceIO: don’t use it.
>
> But you didn’t exactly say what you want to cache - do you want to cache the OSD filestore disks? RBD devices on hosts? RBD devices inside guests?
>
> Jan
>
> > On 02 Jul 2015, at 11:29, Emmanuel Florac <***@intellique.com <javascript:;>> wrote:
> >
> > Le Wed, 1 Jul 2015 17:13:03 -0300
> > German Anders <***@despegar.com <javascript:;>> écrivait:
> >
> >> Hi cephers,
> >>
> >> Is anyone out there that implement enhanceIO in a production
> >> environment? any recommendation? any perf output to share with the
> >> diff between using it and not?
> >
> > I've tried EnhanceIO back when it wasn't too stale, but never put it in
> > production. I've set up bcache on trial, it has its problems (load is
> > stuck at 1.0 because of the bcache_writeback kernel thread, and I
> > suspect a crash was due to it) but works pretty well overall.
> >
> > --
> > ------------------------------------------------------------------------
> > Emmanuel Florac | Direction technique
> > | Intellique
> > | <***@intellique.com <javascript:;>>
> > | +33 1 78 94 84 02
> > ------------------------------------------------------------------------
> > _______________________________________________
> > ceph-users mailing list
> > ceph-***@lists.ceph.com <javascript:;>
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
>
>
>
> --
>
> German Anders
> Storage System Engineer Leader
> Despegar | IT Team
> office +54 11 4894 3500 x3408
> mobile +54 911 3493 7262
> mail ***@despegar.com <mailto:***@despegar.com>
Emmanuel Florac
2015-07-02 10:59:25 UTC
Permalink
bcache has the advantage of being natively integrated into the Linux
kernel, so it feels more "proper". It seems slightly faster than flashcache
too, but YMMV. However, you cannot add bcache as an afterthought to an
existing volume, whereas you apparently can set up flashcache this way.
I had a few crashes with bcache on different machines but never had any
corruption, so it looks production-safe.
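The "afterthought" limitation comes from bcache keeping its own superblock on the backing device, so the device has to be formatted as a bcache backing device before it holds data. A minimal setup sketch (device names are placeholders):

# format the backing (slow) device and the caching (SSD) device
make-bcache -B /dev/sdb
make-bcache -C /dev/nvme0n1

# attach the cache set to the backing device
# (cache set UUID comes from bcache-super-show /dev/nvme0n1)
echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach

# writethrough is the default; switch to writeback if desired
echo writeback > /sys/block/bcache0/bcache/cache_mode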


Le Thu, 2 Jul 2015 07:48:48 -0300
German Anders <***@despegar.com> écrivait:

> The idea is to cache rbd at a host level. Also could be possible to
> cache at the osd level. We have high iowait and we need to lower it a
> bit, since we are getting the max from our sas disks 100-110 iops per
> disk (3TB osd's), any advice? Flashcache?

--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <***@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
Jan Schermer
2015-07-02 11:32:23 UTC
Permalink
There are scripts to integrate an existing device into bcache (not sure how well they work).

Jan

> On 02 Jul 2015, at 12:59, Emmanuel Florac <***@intellique.com> wrote:
>
>
> bcache has the advantage of being natively integrated to the linux
> kernel, feeling more "proper". It seems slightly faster than flashcache
> too, but YMMV. However you cannot add bcache as an afterthought to an
> existing volume, but you can set up flashcache this way apparently.
> I had a few crashes with bcache on different machines but never had any
> corruption, so it looks production-safe.
>
>
> Le Thu, 2 Jul 2015 07:48:48 -0300
> German Anders <***@despegar.com> écrivait:
>
>> The idea is to cache rbd at a host level. Also could be possible to
>> cache at the osd level. We have high iowait and we need to lower it a
>> bit, since we are getting the max from our sas disks 100-110 iops per
>> disk (3TB osd's), any advice? Flashcache?
>
> --
> ------------------------------------------------------------------------
> Emmanuel Florac | Direction technique
> | Intellique
> | <***@intellique.com>
> | +33 1 78 94 84 02
> ------------------------------------------------------------------------
Lionel Bouton
2015-07-02 11:15:50 UTC
Permalink
On 07/02/15 12:48, German Anders wrote:
> The idea is to cache rbd at a host level. Also could be possible to
> cache at the osd level. We have high iowait and we need to lower it a
> bit, since we are getting the max from our sas disks 100-110 iops per
> disk (3TB osd's), any advice? Flashcache?

It's hard to suggest anything without knowing more about your setup. Is
your I/O mostly reads or writes? Reads: can you add enough RAM on your
guests or on your OSDs to cache your working set? Writes: do you use SSDs
for journals already?

Lionel
German Anders
2015-07-02 11:49:20 UTC
Permalink
output from iostat:

*CEPHOSD01:*

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdc(ceph-0) 0.00 0.00 1.00 389.00 0.00 35.98 188.96 60.32 120.12 16.00 120.39 1.26 49.20
sdd(ceph-1) 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf(ceph-2) 0.00 1.00 6.00 521.00 0.02 60.72 236.05 143.10 309.75 484.00 307.74 1.90 100.00
sdg(ceph-3) 0.00 0.00 11.00 535.00 0.04 42.41 159.22 139.25 279.72 394.18 277.37 1.83 100.00
sdi(ceph-4) 0.00 1.00 4.00 560.00 0.02 54.87 199.32 125.96 187.07 562.00 184.39 1.65 93.20
sdj(ceph-5) 0.00 0.00 0.00 566.00 0.00 61.41 222.19 109.13 169.62 0.00 169.62 1.53 86.40
sdl(ceph-6) 0.00 0.00 8.00 0.00 0.09 0.00 23.00 0.12 12.00 12.00 0.00 2.50 2.00
sdm(ceph-7) 0.00 0.00 2.00 481.00 0.01 44.59 189.12 116.64 241.41 268.00 241.30 2.05 99.20
sdn(ceph-8) 0.00 0.00 1.00 0.00 0.00 0.00 8.00 0.01 8.00 8.00 0.00 8.00 0.80
fioa 0.00 0.00 0.00 1016.00 0.00 19.09 38.47 0.00 0.06 0.00 0.06 0.00 0.00

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdc(ceph-0) 0.00 1.00 10.00 278.00 0.04 26.07 185.69 60.82 257.97 309.60 256.12 2.83 81.60
sdd(ceph-1) 0.00 0.00 2.00 0.00 0.02 0.00 20.00 0.02 10.00 10.00 0.00 10.00 2.00
sdf(ceph-2) 0.00 1.00 6.00 579.00 0.02 54.16 189.68 142.78 246.55 328.67 245.70 1.71 100.00
sdg(ceph-3) 0.00 0.00 10.00 75.00 0.05 5.32 129.41 4.94 185.08 11.20 208.27 4.05 34.40
sdi(ceph-4) 0.00 0.00 19.00 147.00 0.09 12.61 156.63 17.88 230.89 114.32 245.96 3.37 56.00
sdj(ceph-5) 0.00 1.00 2.00 629.00 0.01 43.66 141.72 143.00 223.35 426.00 222.71 1.58 100.00
sdl(ceph-6) 0.00 0.00 10.00 0.00 0.04 0.00 8.00 0.16 18.40 18.40 0.00 5.60 5.60
sdm(ceph-7) 0.00 0.00 11.00 4.00 0.05 0.01 8.00 0.48 35.20 25.82 61.00 14.13 21.20
sdn(ceph-8) 0.00 0.00 9.00 0.00 0.07 0.00 15.11 0.07 8.00 8.00 0.00 4.89 4.40
fioa 0.00 0.00 0.00 6415.00 0.00 125.81 40.16 0.00 0.14 0.00 0.14 0.00 0.00

*CEPHOSD02:*

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdc1(ceph-9) 0.00 0.00 13.00 0.00 0.11 0.00 16.62 0.17 13.23 13.23 0.00 4.92 6.40
sdd1(ceph-10) 0.00 0.00 15.00 0.00 0.13 0.00 18.13 0.26 17.33 17.33 0.00 1.87 2.80
sdf1(ceph-11) 0.00 0.00 22.00 650.00 0.11 51.75 158.04 143.27 212.07 308.55 208.81 1.49 100.00
sdg1(ceph-12) 0.00 0.00 12.00 282.00 0.05 54.60 380.68 13.16 120.52 352.00 110.67 2.91 85.60
sdi1(ceph-13) 0.00 0.00 1.00 0.00 0.00 0.00 8.00 0.01 8.00 8.00 0.00 8.00 0.80
sdj1(ceph-14) 0.00 0.00 20.00 0.00 0.08 0.00 8.00 0.26 12.80 12.80 0.00 3.60 7.20
sdl1(ceph-15) 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdm1(ceph-16) 0.00 0.00 20.00 424.00 0.11 32.20 149.05 89.69 235.30 243.00 234.93 2.14 95.20
sdn1(ceph-17) 0.00 0.00 5.00 411.00 0.02 45.47 223.94 98.32 182.28 1057.60 171.63 2.40 100.00

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdc1(ceph-9) 0.00 0.00 26.00 383.00 0.11 34.32 172.44 86.92 258.64 297.08 256.03 2.29 93.60
sdd1(ceph-10) 0.00 0.00 8.00 31.00 0.09 1.86 101.95 0.84 178.15 94.00 199.87 6.46 25.20
sdf1(ceph-11) 0.00 1.00 5.00 409.00 0.05 48.34 239.34 90.94 219.43 383.20 217.43 2.34 96.80
sdg1(ceph-12) 0.00 0.00 0.00 238.00 0.00 1.64 14.12 58.34 143.60 0.00 143.60 1.83 43.60
sdi1(ceph-13) 0.00 0.00 11.00 0.00 0.05 0.00 10.18 0.16 14.18 14.18 0.00 5.09 5.60
sdj1(ceph-14) 0.00 0.00 1.00 0.00 0.00 0.00 8.00 0.02 16.00 16.00 0.00 16.00 1.60
sdl1(ceph-15) 0.00 0.00 1.00 0.00 0.03 0.00 64.00 0.01 12.00 12.00 0.00 12.00 1.20
sdm1(ceph-16) 0.00 1.00 4.00 587.00 0.03 50.09 173.69 143.32 244.97 296.00 244.62 1.69 100.00
sdn1(ceph-17) 0.00 0.00 0.00 375.00 0.00 23.68 129.34 69.76 182.51 0.00 182.51 2.47 92.80

The other OSD server had pretty much the same load.

The config of the OSD servers is the following:

- 2x Intel Xeon E5-2609 v2 @ 2.50GHz (4C)
- 128G RAM
- 2x 120G SSD Intel SSDSC2BB12 (RAID-1) for OS
- 2x 10GbE ADPT DP
- Journals are configured to run on RAMDISK (tmpfs), but on the first OSD
server we have the journals going to a FusionIO (/dev/fioa) adapter with
battery backup.

The CRUSH map is the following:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23
device 24 osd.24
device 25 osd.25
device 26 osd.26
device 27 osd.27
device 28 osd.28
device 29 osd.29
device 30 osd.30
device 31 osd.31
device 32 osd.32
device 33 osd.33
device 34 osd.34
device 35 osd.35

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host cephosd03 {
id -4 # do not change unnecessarily
# weight 24.570
alg straw
hash 0 # rjenkins1
item osd.18 weight 2.730
item osd.19 weight 2.730
item osd.20 weight 2.730
item osd.21 weight 2.730
item osd.22 weight 2.730
item osd.23 weight 2.730
item osd.24 weight 2.730
item osd.25 weight 2.730
item osd.26 weight 2.730
}
host cephosd04 {
id -5 # do not change unnecessarily
# weight 24.570
alg straw
hash 0 # rjenkins1
item osd.27 weight 2.730
item osd.28 weight 2.730
item osd.29 weight 2.730
item osd.30 weight 2.730
item osd.31 weight 2.730
item osd.32 weight 2.730
item osd.33 weight 2.730
item osd.34 weight 2.730
item osd.35 weight 2.730
}
root default {
id -1 # do not change unnecessarily
# weight 49.140
alg straw
hash 0 # rjenkins1
item cephosd03 weight 24.570
item cephosd04 weight 24.570
}
host cephosd01 {
id -2 # do not change unnecessarily
# weight 24.570
alg straw
hash 0 # rjenkins1
item osd.0 weight 2.730
item osd.1 weight 2.730
item osd.2 weight 2.730
item osd.3 weight 2.730
item osd.4 weight 2.730
item osd.5 weight 2.730
item osd.6 weight 2.730
item osd.7 weight 2.730
item osd.8 weight 2.730
}
host cephosd02 {
id -3 # do not change unnecessarily
# weight 24.570
alg straw
hash 0 # rjenkins1
item osd.9 weight 2.730
item osd.10 weight 2.730
item osd.11 weight 2.730
item osd.12 weight 2.730
item osd.13 weight 2.730
item osd.14 weight 2.730
item osd.15 weight 2.730
item osd.16 weight 2.730
item osd.17 weight 2.730
}
root fusionio {
id -6 # do not change unnecessarily
# weight 49.140
alg straw
hash 0 # rjenkins1
item cephosd01 weight 24.570
item cephosd02 weight 24.570
}

# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule fusionio_ruleset {
ruleset 1
type replicated
min_size 0
max_size 10
step take fusionio
step chooseleaf firstn 1 type host
step emit
step take default
step chooseleaf firstn -1 type host
step emit
}

# end crush map
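As a side note, the fusionio_ruleset mapping (first replica from the fusionio root, the rest from the default root) can be sanity-checked offline with crushtool before pools rely on it; a rough sketch (file names are placeholders):

# dump the compiled map and test which OSDs rule 1 would select
ceph osd getcrushmap -o crush.bin
crushtool -i crush.bin --test --rule 1 --num-rep 3 --show-mappings | head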





*German*

2015-07-02 8:15 GMT-03:00 Lionel Bouton <lionel+***@bouton.name>:

> On 07/02/15 12:48, German Anders wrote:
> > The idea is to cache rbd at a host level. Also could be possible to
> > cache at the osd level. We have high iowait and we need to lower it a
> > bit, since we are getting the max from our sas disks 100-110 iops per
> > disk (3TB osd's), any advice? Flashcache?
>
> It's hard to suggest anything without knowing more about your setup. Are
> your I/O mostly reads or writes? Reads: can you add enough RAM on your
> guests or on your OSD to cache your working set? Writes: do you use SSD
> for journals already?
>
> Lionel
>
Jan Schermer
2015-07-02 12:04:05 UTC
Permalink
And those disks are spindles?
Looks like there are simply too few of them.

Jan

> On 02 Jul 2015, at 13:49, German Anders <***@despegar.com> wrote:
>
> output from iostat:
>
> [...]
German Anders
2015-07-02 12:05:47 UTC
Permalink
Yeah, 3TB SAS disks.

*German Anders*
Storage System Engineer Leader
*Despegar* | IT Team
*office* +54 11 4894 3500 x3408
*mobile* +54 911 3493 7262
*mail* ***@despegar.com

2015-07-02 9:04 GMT-03:00 Jan Schermer <***@schermer.cz>:

> And those disks are spindles?
> Looks like there are simply too few of them.
>
> Jan
>
> On 02 Jul 2015, at 13:49, German Anders <***@despegar.com> wrote:
>
> output from iostat:
>
> [...]
Lionel Bouton
2015-07-02 12:48:18 UTC
Permalink
On 07/02/15 13:49, German Anders wrote:
> output from iostat:
>
> [...]

If the iostat output is typical, it seems you are limited by random
writes on a subset of your OSDs (you have 9 on each server, but only 4 to
6 of them are taking writes, and the wMB/s vs w/s ratio points to a
moderately random access pattern).
You should find out why. You may have a configuration problem, or the
access to your rbds may be focused on a few 4MB (if you used the defaults)
sections of the devices.

>
> The other OSD server had pretty much the same load.
>
> The config of the OSD's is the following:
>
> - 2x Intel Xeon E5-2609 v2 @ 2.50GHz (4C)
> - 128G RAM
> - 2x 120G SSD Intel SSDSC2BB12 (RAID-1) for OS
> - 2x 10GbE ADPT DP
> - Journals are configured to run on RAMDISK (TMPFS), but in the first
> OSD serv we've the journals going on to a FusionIO (/dev/fioa) ADPT
> with batt.

I suppose this is not yet production (tmpfs journals). You only have
128G RAM for 9 OSDs; what is the size of your journals when you use tmpfs,
and more importantly what is the value of filestore max sync interval?
I'm not sure how the OSD will react with a journal with multi-GB/s
write bandwidth: the default filestore max sync interval might be too
high (it should prevent the journal from filling up). On the other hand, a
low max sync interval will prevent the OS from reordering writes to hard
drives to avoid too much random IO.

So there are two causes I can see that might lead to performance
problems:
- the IO load might not be distributed to all your OSDs, limiting your
total bandwidth,
- you might have IO freezes when your tmpfs journals fill up if you have
very high bursts (probably unlikely, but the consequences might be dire).

Another problem I see is cost: Fusion-io speed and cost (and tmpfs
speed) are probably overkill for the journals in your case. With your
setup, 2x Intel DC S3500 would probably be enough (unless you need more
write endurance).
With what you save by not using a Fusion-io card in each server you could
probably add more servers and get far better performance overall.

If you do, use a 10GB journal size and a filestore max sync interval
allowing only half of it to be written to. With 2x 500MB/s write
bandwidth divided between 9 balanced OSDs this would be about 110MB/s per
OSD, so you could use 30s with room to spare.
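In ceph.conf terms that would look roughly like this (values are only an illustration of the sizing above):

[osd]
osd journal size = 10240            # 10 GB journal (value is in MB)
filestore max sync interval = 30    # seconds between filestore syncs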

This assumes you can distribute IOs to all OSDs; you might have to
convert your rbds to a lower order or use striping to achieve this if
you have atypical access patterns.
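Both are chosen at image creation time; a rough sketch of what that looks like with rbd (sizes and image names are just examples):

# default is order 22 (4 MB objects); order 20 gives 1 MB objects
rbd create --size 102400 --order 20 rbd/small-objects

# or keep the default object size but stripe each write across several objects
rbd create --size 102400 --stripe-unit 65536 --stripe-count 4 rbd/striped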

Lionel
Dominik Zalewski
2015-07-01 21:28:10 UTC
Permalink
Hi,

I asked the same question a week or so ago (just search the mailing list archives for EnhanceIO :) and got some interesting answers.

Looks like the project is pretty much dead since it was bought out by HGST. Even their website has some broken links regarding EnhanceIO.

I'm keen to try flashcache or bcache (it's been in the mainline kernel for some time).

Dominik

> On 1 Jul 2015, at 21:13, German Anders <***@despegar.com> wrote:
>
> Hi cephers,
>
> Is anyone out there that implement enhanceIO in a production environment? any recommendation? any perf output to share with the diff between using it and not?
>
> Thanks in advance,
>
>
> German
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Daniel Gryniewicz
2015-07-23 15:02:25 UTC
Permalink
I did some (non-ceph) work on these, and concluded that bcache was the best supported, most stable, and fastest. This was ~1 year ago, so take it with a grain of salt, but that's what I would recommend.

Daniel


----- Original Message -----

From: "Dominik Zalewski" <***@optlink.net>
To: "German Anders" <***@despegar.com>
Cc: "ceph-users" <ceph-***@lists.ceph.com>
Sent: Wednesday, July 1, 2015 5:28:10 PM
Subject: Re: [ceph-users] any recommendation of using EnhanceIO?

Hi,

I’ve asked same question last weeks or so (just search the mailing list archives for EnhanceIO :) and got some interesting answers.

Looks like the project is pretty much dead since it was bought out by HGST. Even their website has some broken links in regards to EnhanceIO

I’m keen to try flashcache or bcache (its been in the mainline kernel for some time)

Dominik




On 1 Jul 2015, at 21:13, German Anders < ***@despegar.com > wrote:

Hi cephers,

Is anyone out there that implement enhanceIO in a production environment? any recommendation? any perf output to share with the diff between using it and not?

Thanks in advance,



German
_______________________________________________
ceph-users mailing list
ceph-***@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





_______________________________________________
ceph-users mailing list
ceph-***@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Alex Gorbachev
2015-08-17 23:49:21 UTC
Permalink
What about https://github.com/Frontier314/EnhanceIO? Last commit 2
months ago, but no external contributors :(

The nice thing about EnhanceIO is that there is no need to change the
device name, unlike with bcache, flashcache, etc.

Best regards,
Alex

On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz <***@redhat.com> wrote:
> I did some (non-ceph) work on these, and concluded that bcache was the best
> supported, most stable, and fastest. This was ~1 year ago, to take it with
> a grain of salt, but that's what I would recommend.
>
> Daniel
>
>
> [...]
Jan Schermer
2015-08-18 09:00:37 UTC
Permalink
I already evaluated EnhanceIO in combination with CentOS 6 (and backported 3.10 and 4.0 kernel-lt if I remember correctly).
It worked fine during benchmarks and stress tests, but once we ran DB2 on it, it panicked within minutes and took all the data with it (almost literally: files that weren't touched, like OS binaries, were b0rked and the filesystem was unsalvageable).
If you disregard this warning, the performance gains weren't that great either, at least in a VM. It had problems when flushing to disk after reaching the dirty watermark, and the block size has some not-well-documented implications (not sure now, but I think it only cached IO _larger_ than the block size, so if your database keeps incrementing an XX-byte counter it will go straight to disk).

Flashcache doesn't respect barriers (or does it now?) - if that's OK for you then go for it; it should be stable, and I used it in the past in production without problems.
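For reference, a flashcache device is created with the flashcache_create helper; a minimal sketch (device and cache names are placeholders, and the barrier caveat above still applies):

# write-back cache on an SSD partition in front of a slow disk
flashcache_create -p back osd_cache /dev/sdc1 /dev/sdb

# write-through ("thru") and write-around ("around") modes also exist
flashcache_create -p thru osd_cache /dev/sdc1 /dev/sdb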

bcache seemed to work fine, but I needed to
a) use it for root
b) disable and enable it on the fly (doh)
c) make it non-persistent (flush it) before reboot - not sure if that was possible either.
d) all that in a customer's VM, and that customer didn't have a strong technical background to be able to fiddle with it...
So I haven't tested it heavily.

Bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-)
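For what it's worth, flushing and detaching a bcache cache before a reboot can be driven through sysfs; a rough sketch, assuming the device shows up as bcache0:

# stop caching new writes and let dirty data drain to the backing device
echo writethrough > /sys/block/bcache0/bcache/cache_mode
cat /sys/block/bcache0/bcache/dirty_data     # wait until this reaches 0

# then detach the cache set from the backing device
echo 1 > /sys/block/bcache0/bcache/detach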

Jan


> On 18 Aug 2015, at 01:49, Alex Gorbachev <***@iss-integration.com> wrote:
>
> What about https://github.com/Frontier314/EnhanceIO? Last commit 2
> months ago, but no external contributors :(
>
> The nice thing about EnhanceIO is there is no need to change device
> name, unlike bcache, flashcache etc.
>
> Best regards,
> Alex
>
> [...]
Nick Fisk
2015-08-18 09:12:59 UTC
Permalink
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf Of
> Jan Schermer
> Sent: 18 August 2015 10:01
> To: Alex Gorbachev <***@iss-integration.com>
> Cc: Dominik Zalewski <***@optlink.net>; ceph-***@lists.ceph.com
> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>
> I already evaluated EnhanceIO in combination with CentOS 6 (and
> backported 3.10 and 4.0 kernel-lt if I remember correctly).
> It worked fine during benchmarks and stress tests, but once we run DB2 on it
> it panicked within minutes and took all the data with it (almost literally - files
> that werent touched, like OS binaries were b0rked and the filesystem was
> unsalvageable).
> If you disregard this warning - the performance gains weren't that great
> either, at least in a VM. It had problems when flushing to disk after reaching
> dirty watermark and the block size has some not-well-documented
> implications (not sure now, but I think it only cached IO _larger_than the
> block size, so if your database keeps incrementing an XX-byte counter it will
> go straight to disk).
>
> Flashcache doesn't respect barriers (or does it now?) - if that's ok for you
> than go for it, it should be stable and I used it in the past in production
> without problems.
>
> bcache seemed to work fine, but I needed to
> a) use it for root
> b) disable and enable it on the fly (doh)
> c) make it non-persisent (flush it) before reboot - not sure if that was
> possible either.
> d) all that in a customer's VM, and that customer didn't have a strong
> technical background to be able to fiddle with it...
> So I haven't tested it heavily.
>
> Bcache should be the obvious choice if you are in control of the environment.
> At least you can cry on LKML's shoulder when you lose data :-)

Please note, it looks like the main (only?) dev of bcache has started making a new version of bcache, bcachefs. At this stage I'm not sure what this means for the ongoing support of the existing bcache project.

>
> Jan
>
>
> > [...]
Emmanuel Florac
2015-08-18 11:25:47 UTC
Permalink
Le Tue, 18 Aug 2015 10:12:59 +0100
Nick Fisk <***@fisk.me.uk> écrivait:

> > Bcache should be the obvious choice if you are in control of the
> > environment. At least you can cry on LKML's shoulder when you lose
> > data :-)
>
> Please note, it looks like the main(only?) dev of Bcache has started
> making a new version of bcache, bcachefs. At this stage I'm not sure
> what this means for the ongoing support of the existing bcache
> project.

bcachefs is more than a "new version of bcache", it's a complete POSIX
filesystem with integrated caching. Looks like a silly idea if you ask
me (because we already have several excellent filesystems; because
developing a reliable filesystem is DAMN HARD; because building a
feature-complete FS is CRAZY HARD; because FTL sucks anyway; etc).

--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <***@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
Nick Fisk
2015-08-18 12:06:21 UTC
Permalink
> -----Original Message-----
> From: Emmanuel Florac [mailto:***@intellique.com]
> Sent: 18 August 2015 12:26
> To: Nick Fisk <***@fisk.me.uk>
> Cc: 'Jan Schermer' <***@schermer.cz>; 'Alex Gorbachev' <***@iss-
> integration.com>; 'Dominik Zalewski' <***@optlink.net>; ceph-
> ***@lists.ceph.com
> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>
> Le Tue, 18 Aug 2015 10:12:59 +0100
> Nick Fisk <***@fisk.me.uk> écrivait:
>
> > > Bcache should be the obvious choice if you are in control of the
> > > environment. At least you can cry on LKML's shoulder when you lose
> > > data :-)
> >
> > Please note, it looks like the main(only?) dev of Bcache has started
> > making a new version of bcache, bcachefs. At this stage I'm not sure
> > what this means for the ongoing support of the existing bcache
> > project.
>
> bcachefs is more than a "new version of bcache", it's a complete POSIX
> filesystem with integrated caching. Looks like a silly idea if you ask me
> (because we already have several excellent filesystems; because developing
> a reliable filesystem is DAMN HARD; because building a feature-complete FS
> is CRAZY HARD; because FTL sucks anyway; etc).

Agreed, it's such a shame that there isn't a simple, reliable and maintained caching solution out there for Linux. When I started seeing all these projects spring up 5-6 years ago I was full of optimism, but we still don't have anything I would call fully usable.

>
> --
> ------------------------------------------------------------------------
> Emmanuel Florac | Direction technique
> | Intellique
> | <***@intellique.com>
> | +33 1 78 94 84 02
> ------------------------------------------------------------------------
Mark Nelson
2015-08-18 11:33:02 UTC
Permalink
Hi Jan,

Out of curiosity did you ever try dm-cache? I've been meaning to give
it a spin but haven't had the spare cycles.

Mark

On 08/18/2015 04:00 AM, Jan Schermer wrote:
> I already evaluated EnhanceIO in combination with CentOS 6 (and backported 3.10 and 4.0 kernel-lt if I remember correctly).
> It worked fine during benchmarks and stress tests, but once we run DB2 on it it panicked within minutes and took all the data with it (almost literally - files that werent touched, like OS binaries were b0rked and the filesystem was unsalvageable).
> If you disregard this warning - the performance gains weren't that great either, at least in a VM. It had problems when flushing to disk after reaching dirty watermark and the block size has some not-well-documented implications (not sure now, but I think it only cached IO _larger_than the block size, so if your database keeps incrementing an XX-byte counter it will go straight to disk).
>
> Flashcache doesn't respect barriers (or does it now?) - if that's ok for you than go for it, it should be stable and I used it in the past in production without problems.
>
> bcache seemed to work fine, but I needed to
> a) use it for root
> b) disable and enable it on the fly (doh)
> c) make it non-persisent (flush it) before reboot - not sure if that was possible either.
> d) all that in a customer's VM, and that customer didn't have a strong technical background to be able to fiddle with it...
> So I haven't tested it heavily.
>
> Bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-)
>
> Jan
>
>
>> On 18 Aug 2015, at 01:49, Alex Gorbachev <***@iss-integration.com> wrote:
>>
>> What about https://github.com/Frontier314/EnhanceIO? Last commit 2
>> months ago, but no external contributors :(
>>
>> The nice thing about EnhanceIO is there is no need to change device
>> name, unlike bcache, flashcache etc.
>>
>> Best regards,
>> Alex
>>
>> On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz <***@redhat.com> wrote:
>>> I did some (non-ceph) work on these, and concluded that bcache was the best
>>> supported, most stable, and fastest. This was ~1 year ago, to take it with
>>> a grain of salt, but that's what I would recommend.
>>>
>>> Daniel
>>>
>>>
>>> ________________________________
>>> From: "Dominik Zalewski" <***@optlink.net>
>>> To: "German Anders" <***@despegar.com>
>>> Cc: "ceph-users" <ceph-***@lists.ceph.com>
>>> Sent: Wednesday, July 1, 2015 5:28:10 PM
>>> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>>>
>>>
>>> Hi,
>>>
>>> I’ve asked same question last weeks or so (just search the mailing list
>>> archives for EnhanceIO :) and got some interesting answers.
>>>
>>> Looks like the project is pretty much dead since it was bought out by HGST.
>>> Even their website has some broken links in regards to EnhanceIO
>>>
>>> I’m keen to try flashcache or bcache (its been in the mainline kernel for
>>> some time)
>>>
>>> Dominik
>>>
>>> On 1 Jul 2015, at 21:13, German Anders <***@despegar.com> wrote:
>>>
>>> Hi cephers,
>>>
>>> Is anyone out there that implement enhanceIO in a production environment?
>>> any recommendation? any perf output to share with the diff between using it
>>> and not?
>>>
>>> Thanks in advance,
>>>
>>> German
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-***@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-***@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-***@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-***@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
Stefan Priebe - Profihost AG
2015-08-18 11:36:10 UTC
Permalink
We've been using an extra caching layer for Ceph since the beginning for our
older Ceph deployments. All new deployments go with full SSDs.

I've tested so far:
- EnhanceIO
- Flashcache
- Bcache
- dm-cache
- dm-writeboost

The best working solution was and is bcache, except for its buggy code.
The current code in the 4.2-rc7 vanilla kernel still contains bugs; for
example, discards result in a corrupted FS after reboots. But it's still
the fastest for Ceph.

The 2nd-best solution, which we already use in production, is
dm-writeboost (https://github.com/akiradeveloper/dm-writeboost).

Everything else is too slow.
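
If anyone wants to try it, the basic bcache wiring is simple. A minimal
sketch - device names are only placeholders, and given the discard bugs
above I would test carefully before trusting it with real data:

  make-bcache -C /dev/nvme0n1      # format the SSD as a cache device
  make-bcache -B /dev/sdb          # format the spinner as a backing device
  echo /dev/nvme0n1 > /sys/fs/bcache/register
  echo /dev/sdb > /sys/fs/bcache/register
  # attach the backing device to the cache set
  # (the cset UUID comes from bcache-super-show /dev/nvme0n1)
  echo <cset-uuid> > /sys/block/bcache0/bcache/attach
  echo writeback > /sys/block/bcache0/bcache/cache_mode

The OSD filestore then lives on /dev/bcache0 instead of the raw disk.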

Stefan
Am 18.08.2015 um 13:33 schrieb Mark Nelson:
> Hi Jan,
>
> Out of curiosity did you ever try dm-cache? I've been meaning to give
> it a spin but haven't had the spare cycles.
>
> Mark
>
> On 08/18/2015 04:00 AM, Jan Schermer wrote:
>> I already evaluated EnhanceIO in combination with CentOS 6 (and
>> backported 3.10 and 4.0 kernel-lt if I remember correctly).
>> It worked fine during benchmarks and stress tests, but once we run DB2
>> on it it panicked within minutes and took all the data with it (almost
>> literally - files that werent touched, like OS binaries were b0rked
>> and the filesystem was unsalvageable).
>> If you disregard this warning - the performance gains weren't that
>> great either, at least in a VM. It had problems when flushing to disk
>> after reaching dirty watermark and the block size has some
>> not-well-documented implications (not sure now, but I think it only
>> cached IO _larger_than the block size, so if your database keeps
>> incrementing an XX-byte counter it will go straight to disk).
>>
>> Flashcache doesn't respect barriers (or does it now?) - if that's ok
>> for you than go for it, it should be stable and I used it in the past
>> in production without problems.
>>
>> bcache seemed to work fine, but I needed to
>> a) use it for root
>> b) disable and enable it on the fly (doh)
>> c) make it non-persisent (flush it) before reboot - not sure if that
>> was possible either.
>> d) all that in a customer's VM, and that customer didn't have a strong
>> technical background to be able to fiddle with it...
>> So I haven't tested it heavily.
>>
>> Bcache should be the obvious choice if you are in control of the
>> environment. At least you can cry on LKML's shoulder when you lose
>> data :-)
>>
>> Jan
>>
>>
>>> On 18 Aug 2015, at 01:49, Alex Gorbachev <***@iss-integration.com> wrote:
>>>
>>> What about https://github.com/Frontier314/EnhanceIO? Last commit 2
>>> months ago, but no external contributors :(
>>>
>>> The nice thing about EnhanceIO is there is no need to change device
>>> name, unlike bcache, flashcache etc.
>>>
>>> Best regards,
>>> Alex
>>>
>>> On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz <***@redhat.com>
>>> wrote:
>>>> I did some (non-ceph) work on these, and concluded that bcache was
>>>> the best
>>>> supported, most stable, and fastest. This was ~1 year ago, to take
>>>> it with
>>>> a grain of salt, but that's what I would recommend.
>>>>
>>>> Daniel
>>>>
>>>>
>>>> ________________________________
>>>> From: "Dominik Zalewski" <***@optlink.net>
>>>> To: "German Anders" <***@despegar.com>
>>>> Cc: "ceph-users" <ceph-***@lists.ceph.com>
>>>> Sent: Wednesday, July 1, 2015 5:28:10 PM
>>>> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>>>>
>>>>
>>>> Hi,
>>>>
>>>> I’ve asked same question last weeks or so (just search the mailing list
>>>> archives for EnhanceIO :) and got some interesting answers.
>>>>
>>>> Looks like the project is pretty much dead since it was bought out
>>>> by HGST.
>>>> Even their website has some broken links in regards to EnhanceIO
>>>>
>>>> I’m keen to try flashcache or bcache (its been in the mainline
>>>> kernel for
>>>> some time)
>>>>
>>>> Dominik
>>>>
>>>> On 1 Jul 2015, at 21:13, German Anders <***@despegar.com> wrote:
>>>>
>>>> Hi cephers,
>>>>
>>>> Is anyone out there that implement enhanceIO in a production
>>>> environment?
>>>> any recommendation? any perf output to share with the diff between
>>>> using it
>>>> and not?
>>>>
>>>> Thanks in advance,
>>>>
>>>> German
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-***@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-***@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-***@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-***@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-***@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Campbell, Bill
2015-08-18 13:43:24 UTC
Permalink
Hey Stefan,
Are you using your Ceph cluster for virtualization storage? Is dm-writeboost configured on the OSD nodes themselves?

----- Original Message -----

From: "Stefan Priebe - Profihost AG" <***@profihost.ag>
To: "Mark Nelson" <***@redhat.com>, ceph-***@lists.ceph.com
Sent: Tuesday, August 18, 2015 7:36:10 AM
Subject: Re: [ceph-users] any recommendation of using EnhanceIO?

We're using an extra caching layer for ceph since the beginning for our
older ceph deployments. All new deployments go with full SSDs.

I've tested so far:
- EnhanceIO
- Flashcache
- Bcache
- dm-cache
- dm-writeboost

The best working solution was and is bcache except for it's buggy code.
The current code in 4.2-rc7 vanilla kernel still contains bugs. f.e.
discards result in crashed FS after reboots and so on. But it's still
the fastest for ceph.

The 2nd best solution which we already use in production is
dm-writeboost (https://github.com/akiradeveloper/dm-writeboost).

Everything else is too slow.

Stefan
Am 18.08.2015 um 13:33 schrieb Mark Nelson:
> Hi Jan,
>
> Out of curiosity did you ever try dm-cache? I've been meaning to give
> it a spin but haven't had the spare cycles.
>
> Mark
>
> On 08/18/2015 04:00 AM, Jan Schermer wrote:
>> I already evaluated EnhanceIO in combination with CentOS 6 (and
>> backported 3.10 and 4.0 kernel-lt if I remember correctly).
>> It worked fine during benchmarks and stress tests, but once we run DB2
>> on it it panicked within minutes and took all the data with it (almost
>> literally - files that werent touched, like OS binaries were b0rked
>> and the filesystem was unsalvageable).
>> If you disregard this warning - the performance gains weren't that
>> great either, at least in a VM. It had problems when flushing to disk
>> after reaching dirty watermark and the block size has some
>> not-well-documented implications (not sure now, but I think it only
>> cached IO _larger_than the block size, so if your database keeps
>> incrementing an XX-byte counter it will go straight to disk).
>>
>> Flashcache doesn't respect barriers (or does it now?) - if that's ok
>> for you than go for it, it should be stable and I used it in the past
>> in production without problems.
>>
>> bcache seemed to work fine, but I needed to
>> a) use it for root
>> b) disable and enable it on the fly (doh)
>> c) make it non-persisent (flush it) before reboot - not sure if that
>> was possible either.
>> d) all that in a customer's VM, and that customer didn't have a strong
>> technical background to be able to fiddle with it...
>> So I haven't tested it heavily.
>>
>> Bcache should be the obvious choice if you are in control of the
>> environment. At least you can cry on LKML's shoulder when you lose
>> data :-)
>>
>> Jan
>>
>>
>>> On 18 Aug 2015, at 01:49, Alex Gorbachev <***@iss-integration.com> wrote:
>>>
>>> What about https://github.com/Frontier314/EnhanceIO? Last commit 2
>>> months ago, but no external contributors :(
>>>
>>> The nice thing about EnhanceIO is there is no need to change device
>>> name, unlike bcache, flashcache etc.
>>>
>>> Best regards,
>>> Alex
>>>
>>> On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz <***@redhat.com>
>>> wrote:
>>>> I did some (non-ceph) work on these, and concluded that bcache was
>>>> the best
>>>> supported, most stable, and fastest. This was ~1 year ago, to take
>>>> it with
>>>> a grain of salt, but that's what I would recommend.
>>>>
>>>> Daniel
>>>>
>>>>
>>>> ________________________________
>>>> From: "Dominik Zalewski" <***@optlink.net>
>>>> To: "German Anders" <***@despegar.com>
>>>> Cc: "ceph-users" <ceph-***@lists.ceph.com>
>>>> Sent: Wednesday, July 1, 2015 5:28:10 PM
>>>> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>>>>
>>>>
>>>> Hi,
>>>>
>>>> I’ve asked same question last weeks or so (just search the mailing list
>>>> archives for EnhanceIO :) and got some interesting answers.
>>>>
>>>> Looks like the project is pretty much dead since it was bought out
>>>> by HGST.
>>>> Even their website has some broken links in regards to EnhanceIO
>>>>
>>>> I’m keen to try flashcache or bcache (its been in the mainline
>>>> kernel for
>>>> some time)
>>>>
>>>> Dominik
>>>>
>>>> On 1 Jul 2015, at 21:13, German Anders <***@despegar.com> wrote:
>>>>
>>>> Hi cephers,
>>>>
>>>> Is anyone out there that implement enhanceIO in a production
>>>> environment?
>>>> any recommendation? any perf output to share with the diff between
>>>> using it
>>>> and not?
>>>>
>>>> Thanks in advance,
>>>>
>>>> German
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-***@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-***@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-***@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-***@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-***@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-***@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Stefan Priebe - Profihost AG
2015-08-19 06:59:05 UTC
Permalink
Am 18.08.2015 um 15:43 schrieb Campbell, Bill:
> Hey Stefan,
> Are you using your Ceph cluster for virtualization storage?
Yes

> Is dm-writeboost configured on the OSD nodes themselves?
Yes

Stefan

>
> ------------------------------------------------------------------------
> *From: *"Stefan Priebe - Profihost AG" <***@profihost.ag>
> *To: *"Mark Nelson" <***@redhat.com>, ceph-***@lists.ceph.com
> *Sent: *Tuesday, August 18, 2015 7:36:10 AM
> *Subject: *Re: [ceph-users] any recommendation of using EnhanceIO?
>
> We're using an extra caching layer for ceph since the beginning for our
> older ceph deployments. All new deployments go with full SSDs.
>
> I've tested so far:
> - EnhanceIO
> - Flashcache
> - Bcache
> - dm-cache
> - dm-writeboost
>
> The best working solution was and is bcache except for it's buggy code.
> The current code in 4.2-rc7 vanilla kernel still contains bugs. f.e.
> discards result in crashed FS after reboots and so on. But it's still
> the fastest for ceph.
>
> The 2nd best solution which we already use in production is
> dm-writeboost (https://github.com/akiradeveloper/dm-writeboost).
>
> Everything else is too slow.
>
> Stefan
> Am 18.08.2015 um 13:33 schrieb Mark Nelson:
>> Hi Jan,
>>
>> Out of curiosity did you ever try dm-cache? I've been meaning to give
>> it a spin but haven't had the spare cycles.
>>
>> Mark
>>
>> On 08/18/2015 04:00 AM, Jan Schermer wrote:
>>> I already evaluated EnhanceIO in combination with CentOS 6 (and
>>> backported 3.10 and 4.0 kernel-lt if I remember correctly).
>>> It worked fine during benchmarks and stress tests, but once we run DB2
>>> on it it panicked within minutes and took all the data with it (almost
>>> literally - files that werent touched, like OS binaries were b0rked
>>> and the filesystem was unsalvageable).
>>> If you disregard this warning - the performance gains weren't that
>>> great either, at least in a VM. It had problems when flushing to disk
>>> after reaching dirty watermark and the block size has some
>>> not-well-documented implications (not sure now, but I think it only
>>> cached IO _larger_than the block size, so if your database keeps
>>> incrementing an XX-byte counter it will go straight to disk).
>>>
>>> Flashcache doesn't respect barriers (or does it now?) - if that's ok
>>> for you than go for it, it should be stable and I used it in the past
>>> in production without problems.
>>>
>>> bcache seemed to work fine, but I needed to
>>> a) use it for root
>>> b) disable and enable it on the fly (doh)
>>> c) make it non-persisent (flush it) before reboot - not sure if that
>>> was possible either.
>>> d) all that in a customer's VM, and that customer didn't have a strong
>>> technical background to be able to fiddle with it...
>>> So I haven't tested it heavily.
>>>
>>> Bcache should be the obvious choice if you are in control of the
>>> environment. At least you can cry on LKML's shoulder when you lose
>>> data :-)
>>>
>>> Jan
>>>
>>>
>>>> On 18 Aug 2015, at 01:49, Alex Gorbachev <***@iss-integration.com> wrote:
>>>>
>>>> What about https://github.com/Frontier314/EnhanceIO? Last commit 2
>>>> months ago, but no external contributors :(
>>>>
>>>> The nice thing about EnhanceIO is there is no need to change device
>>>> name, unlike bcache, flashcache etc.
>>>>
>>>> Best regards,
>>>> Alex
>>>>
>>>> On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz <***@redhat.com>
>>>> wrote:
>>>>> I did some (non-ceph) work on these, and concluded that bcache was
>>>>> the best
>>>>> supported, most stable, and fastest. This was ~1 year ago, to take
>>>>> it with
>>>>> a grain of salt, but that's what I would recommend.
>>>>>
>>>>> Daniel
>>>>>
>>>>>
>>>>> ________________________________
>>>>> From: "Dominik Zalewski" <***@optlink.net>
>>>>> To: "German Anders" <***@despegar.com>
>>>>> Cc: "ceph-users" <ceph-***@lists.ceph.com>
>>>>> Sent: Wednesday, July 1, 2015 5:28:10 PM
>>>>> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> I’ve asked same question last weeks or so (just search the mailing list
>>>>> archives for EnhanceIO :) and got some interesting answers.
>>>>>
>>>>> Looks like the project is pretty much dead since it was bought out
>>>>> by HGST.
>>>>> Even their website has some broken links in regards to EnhanceIO
>>>>>
>>>>> I’m keen to try flashcache or bcache (its been in the mainline
>>>>> kernel for
>>>>> some time)
>>>>>
>>>>> Dominik
>>>>>
>>>>> On 1 Jul 2015, at 21:13, German Anders <***@despegar.com> wrote:
>>>>>
>>>>> Hi cephers,
>>>>>
>>>>> Is anyone out there that implement enhanceIO in a production
>>>>> environment?
>>>>> any recommendation? any perf output to share with the diff between
>>>>> using it
>>>>> and not?
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> German
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-***@lists.ceph.com
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-***@lists.ceph.com
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-***@lists.ceph.com
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-***@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-***@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-***@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
Jan Schermer
2015-08-18 11:44:12 UTC
Permalink
I did not. Not sure why now - probably for the same reason I didn't extensively test bcache.
I'm not a real fan of device mapper though, so if I had to choose I'd still go for bcache :-)

Jan

> On 18 Aug 2015, at 13:33, Mark Nelson <***@redhat.com> wrote:
>
> Hi Jan,
>
> Out of curiosity did you ever try dm-cache? I've been meaning to give it a spin but haven't had the spare cycles.
>
> Mark
>
> On 08/18/2015 04:00 AM, Jan Schermer wrote:
>> I already evaluated EnhanceIO in combination with CentOS 6 (and backported 3.10 and 4.0 kernel-lt if I remember correctly).
>> It worked fine during benchmarks and stress tests, but once we run DB2 on it it panicked within minutes and took all the data with it (almost literally - files that werent touched, like OS binaries were b0rked and the filesystem was unsalvageable).
>> If you disregard this warning - the performance gains weren't that great either, at least in a VM. It had problems when flushing to disk after reaching dirty watermark and the block size has some not-well-documented implications (not sure now, but I think it only cached IO _larger_than the block size, so if your database keeps incrementing an XX-byte counter it will go straight to disk).
>>
>> Flashcache doesn't respect barriers (or does it now?) - if that's ok for you than go for it, it should be stable and I used it in the past in production without problems.
>>
>> bcache seemed to work fine, but I needed to
>> a) use it for root
>> b) disable and enable it on the fly (doh)
>> c) make it non-persisent (flush it) before reboot - not sure if that was possible either.
>> d) all that in a customer's VM, and that customer didn't have a strong technical background to be able to fiddle with it...
>> So I haven't tested it heavily.
>>
>> Bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-)
>>
>> Jan
>>
>>
>>> On 18 Aug 2015, at 01:49, Alex Gorbachev <***@iss-integration.com> wrote:
>>>
>>> What about https://github.com/Frontier314/EnhanceIO? Last commit 2
>>> months ago, but no external contributors :(
>>>
>>> The nice thing about EnhanceIO is there is no need to change device
>>> name, unlike bcache, flashcache etc.
>>>
>>> Best regards,
>>> Alex
>>>
>>> On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz <***@redhat.com> wrote:
>>>> I did some (non-ceph) work on these, and concluded that bcache was the best
>>>> supported, most stable, and fastest. This was ~1 year ago, to take it with
>>>> a grain of salt, but that's what I would recommend.
>>>>
>>>> Daniel
>>>>
>>>>
>>>> ________________________________
>>>> From: "Dominik Zalewski" <***@optlink.net>
>>>> To: "German Anders" <***@despegar.com>
>>>> Cc: "ceph-users" <ceph-***@lists.ceph.com>
>>>> Sent: Wednesday, July 1, 2015 5:28:10 PM
>>>> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>>>>
>>>>
>>>> Hi,
>>>>
>>>> I’ve asked same question last weeks or so (just search the mailing list
>>>> archives for EnhanceIO :) and got some interesting answers.
>>>>
>>>> Looks like the project is pretty much dead since it was bought out by HGST.
>>>> Even their website has some broken links in regards to EnhanceIO
>>>>
>>>> I’m keen to try flashcache or bcache (its been in the mainline kernel for
>>>> some time)
>>>>
>>>> Dominik
>>>>
>>>> On 1 Jul 2015, at 21:13, German Anders <***@despegar.com> wrote:
>>>>
>>>> Hi cephers,
>>>>
>>>> Is anyone out there that implement enhanceIO in a production environment?
>>>> any recommendation? any perf output to share with the diff between using it
>>>> and not?
>>>>
>>>> Thanks in advance,
>>>>
>>>> German
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-***@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-***@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-***@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-***@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-***@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Nick Fisk
2015-08-18 11:47:55 UTC
Permalink
Just to chime in, I gave dm-cache a limited test but its lack of a proper writeback cache ruled it out for me. It only performs write-back caching on blocks already on the SSD, whereas I need something that works like a battery-backed RAID controller, caching all writes.

It's amazing the 100x performance increase you get with RBDs doing sync writes when you give them something like just 1GB of write-back cache with flashcache.
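
To give an idea of the sort of setup I mean, creating a small write-back
flashcache device on top of an RBD looks roughly like this (the names and
the size are only an example - check the flashcache README for the exact
syntax of your version):

  flashcache_create -p back -s 1g rbd_wb_cache /dev/sdb1 /dev/rbd0

where /dev/sdb1 is a partition on the local SSD and /dev/rbd0 is the
mapped RBD; you then point the VM/filesystem at /dev/mapper/rbd_wb_cache
instead of the RBD directly.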


> -----Original Message-----
> From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf Of
> Jan Schermer
> Sent: 18 August 2015 12:44
> To: Mark Nelson <***@redhat.com>
> Cc: ceph-***@lists.ceph.com
> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>
> I did not. Not sure why now - probably for the same reason I didn't
> extensively test bcache.
> I'm not a real fan of device mapper though, so if I had to choose I'd still go for
> bcache :-)
>
> Jan
>
> > On 18 Aug 2015, at 13:33, Mark Nelson <***@redhat.com> wrote:
> >
> > Hi Jan,
> >
> > Out of curiosity did you ever try dm-cache? I've been meaning to give it a
> spin but haven't had the spare cycles.
> >
> > Mark
> >
> > On 08/18/2015 04:00 AM, Jan Schermer wrote:
> >> I already evaluated EnhanceIO in combination with CentOS 6 (and
> backported 3.10 and 4.0 kernel-lt if I remember correctly).
> >> It worked fine during benchmarks and stress tests, but once we run DB2
> on it it panicked within minutes and took all the data with it (almost literally -
> files that werent touched, like OS binaries were b0rked and the filesystem
> was unsalvageable).
> >> If you disregard this warning - the performance gains weren't that great
> either, at least in a VM. It had problems when flushing to disk after reaching
> dirty watermark and the block size has some not-well-documented
> implications (not sure now, but I think it only cached IO _larger_than the
> block size, so if your database keeps incrementing an XX-byte counter it will
> go straight to disk).
> >>
> >> Flashcache doesn't respect barriers (or does it now?) - if that's ok for you
> than go for it, it should be stable and I used it in the past in production
> without problems.
> >>
> >> bcache seemed to work fine, but I needed to
> >> a) use it for root
> >> b) disable and enable it on the fly (doh)
> >> c) make it non-persisent (flush it) before reboot - not sure if that was
> possible either.
> >> d) all that in a customer's VM, and that customer didn't have a strong
> technical background to be able to fiddle with it...
> >> So I haven't tested it heavily.
> >>
> >> Bcache should be the obvious choice if you are in control of the
> >> environment. At least you can cry on LKML's shoulder when you lose
> >> data :-)
> >>
> >> Jan
> >>
> >>
> >>> On 18 Aug 2015, at 01:49, Alex Gorbachev <***@iss-integration.com>
> wrote:
> >>>
> >>> What about https://github.com/Frontier314/EnhanceIO? Last commit 2
> >>> months ago, but no external contributors :(
> >>>
> >>> The nice thing about EnhanceIO is there is no need to change device
> >>> name, unlike bcache, flashcache etc.
> >>>
> >>> Best regards,
> >>> Alex
> >>>
> >>> On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz <***@redhat.com>
> wrote:
> >>>> I did some (non-ceph) work on these, and concluded that bcache was
> >>>> the best supported, most stable, and fastest. This was ~1 year
> >>>> ago, to take it with a grain of salt, but that's what I would recommend.
> >>>>
> >>>> Daniel
> >>>>
> >>>>
> >>>> ________________________________
> >>>> From: "Dominik Zalewski" <***@optlink.net>
> >>>> To: "German Anders" <***@despegar.com>
> >>>> Cc: "ceph-users" <ceph-***@lists.ceph.com>
> >>>> Sent: Wednesday, July 1, 2015 5:28:10 PM
> >>>> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> >>>>
> >>>>
> >>>> Hi,
> >>>>
> >>>> I’ve asked same question last weeks or so (just search the mailing
> >>>> list archives for EnhanceIO :) and got some interesting answers.
> >>>>
> >>>> Looks like the project is pretty much dead since it was bought out by
> HGST.
> >>>> Even their website has some broken links in regards to EnhanceIO
> >>>>
> >>>> I’m keen to try flashcache or bcache (its been in the mainline
> >>>> kernel for some time)
> >>>>
> >>>> Dominik
> >>>>
> >>>> On 1 Jul 2015, at 21:13, German Anders <***@despegar.com>
> wrote:
> >>>>
> >>>> Hi cephers,
> >>>>
> >>>> Is anyone out there that implement enhanceIO in a production
> environment?
> >>>> any recommendation? any perf output to share with the diff between
> >>>> using it and not?
> >>>>
> >>>> Thanks in advance,
> >>>>
> >>>> German
> >>>> _______________________________________________
> >>>> ceph-users mailing list
> >>>> ceph-***@lists.ceph.com
> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>
> >>>>
> >>>>
> >>>> _______________________________________________
> >>>> ceph-users mailing list
> >>>> ceph-***@lists.ceph.com
> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>
> >>>>
> >>>> _______________________________________________
> >>>> ceph-users mailing list
> >>>> ceph-***@lists.ceph.com
> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>
> >>> _______________________________________________
> >>> ceph-users mailing list
> >>> ceph-***@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-***@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> > _______________________________________________
> > ceph-users mailing list
> > ceph-***@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Mark Nelson
2015-08-18 13:50:59 UTC
Permalink
On 08/18/2015 06:47 AM, Nick Fisk wrote:
> Just to chime in, I gave dmcache a limited test but its lack of proper writeback cache ruled it out for me. It only performs write back caching on blocks already on the SSD, whereas I need something that works like a Battery backed raid controller caching all writes.
>
> It's amazing the 100x performance increase you get with RBD's when doing sync writes and give it something like just 1GB write back cache with flashcache.

For your use case, is it ok that data may live on the flashcache for
some amount of time before making it to Ceph to be replicated? We've
wondered internally whether this kind of trade-off is acceptable to
customers should the flashcache SSD fail.

>
>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf Of
>> Jan Schermer
>> Sent: 18 August 2015 12:44
>> To: Mark Nelson <***@redhat.com>
>> Cc: ceph-***@lists.ceph.com
>> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>>
>> I did not. Not sure why now - probably for the same reason I didn't
>> extensively test bcache.
>> I'm not a real fan of device mapper though, so if I had to choose I'd still go for
>> bcache :-)
>>
>> Jan
>>
>>> On 18 Aug 2015, at 13:33, Mark Nelson <***@redhat.com> wrote:
>>>
>>> Hi Jan,
>>>
>>> Out of curiosity did you ever try dm-cache? I've been meaning to give it a
>> spin but haven't had the spare cycles.
>>>
>>> Mark
>>>
>>> On 08/18/2015 04:00 AM, Jan Schermer wrote:
>>>> I already evaluated EnhanceIO in combination with CentOS 6 (and
>> backported 3.10 and 4.0 kernel-lt if I remember correctly).
>>>> It worked fine during benchmarks and stress tests, but once we run DB2
>> on it it panicked within minutes and took all the data with it (almost literally -
>> files that werent touched, like OS binaries were b0rked and the filesystem
>> was unsalvageable).
>>>> If you disregard this warning - the performance gains weren't that great
>> either, at least in a VM. It had problems when flushing to disk after reaching
>> dirty watermark and the block size has some not-well-documented
>> implications (not sure now, but I think it only cached IO _larger_than the
>> block size, so if your database keeps incrementing an XX-byte counter it will
>> go straight to disk).
>>>>
>>>> Flashcache doesn't respect barriers (or does it now?) - if that's ok for you
>> than go for it, it should be stable and I used it in the past in production
>> without problems.
>>>>
>>>> bcache seemed to work fine, but I needed to
>>>> a) use it for root
>>>> b) disable and enable it on the fly (doh)
>>>> c) make it non-persisent (flush it) before reboot - not sure if that was
>> possible either.
>>>> d) all that in a customer's VM, and that customer didn't have a strong
>> technical background to be able to fiddle with it...
>>>> So I haven't tested it heavily.
>>>>
>>>> Bcache should be the obvious choice if you are in control of the
>>>> environment. At least you can cry on LKML's shoulder when you lose
>>>> data :-)
>>>>
>>>> Jan
>>>>
>>>>
>>>>> On 18 Aug 2015, at 01:49, Alex Gorbachev <***@iss-integration.com>
>> wrote:
>>>>>
>>>>> What about https://github.com/Frontier314/EnhanceIO? Last commit 2
>>>>> months ago, but no external contributors :(
>>>>>
>>>>> The nice thing about EnhanceIO is there is no need to change device
>>>>> name, unlike bcache, flashcache etc.
>>>>>
>>>>> Best regards,
>>>>> Alex
>>>>>
>>>>> On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz <***@redhat.com>
>> wrote:
>>>>>> I did some (non-ceph) work on these, and concluded that bcache was
>>>>>> the best supported, most stable, and fastest. This was ~1 year
>>>>>> ago, to take it with a grain of salt, but that's what I would recommend.
>>>>>>
>>>>>> Daniel
>>>>>>
>>>>>>
>>>>>> ________________________________
>>>>>> From: "Dominik Zalewski" <***@optlink.net>
>>>>>> To: "German Anders" <***@despegar.com>
>>>>>> Cc: "ceph-users" <ceph-***@lists.ceph.com>
>>>>>> Sent: Wednesday, July 1, 2015 5:28:10 PM
>>>>>> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I’ve asked same question last weeks or so (just search the mailing
>>>>>> list archives for EnhanceIO :) and got some interesting answers.
>>>>>>
>>>>>> Looks like the project is pretty much dead since it was bought out by
>> HGST.
>>>>>> Even their website has some broken links in regards to EnhanceIO
>>>>>>
>>>>>> I’m keen to try flashcache or bcache (its been in the mainline
>>>>>> kernel for some time)
>>>>>>
>>>>>> Dominik
>>>>>>
>>>>>> On 1 Jul 2015, at 21:13, German Anders <***@despegar.com>
>> wrote:
>>>>>>
>>>>>> Hi cephers,
>>>>>>
>>>>>> Is anyone out there that implement enhanceIO in a production
>> environment?
>>>>>> any recommendation? any perf output to share with the diff between
>>>>>> using it and not?
>>>>>>
>>>>>> Thanks in advance,
>>>>>>
>>>>>> German
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list
>>>>>> ceph-***@lists.ceph.com
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list
>>>>>> ceph-***@lists.ceph.com
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list
>>>>>> ceph-***@lists.ceph.com
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-***@lists.ceph.com
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-***@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-***@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-***@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
Jan Schermer
2015-08-18 14:24:01 UTC
Permalink
> On 18 Aug 2015, at 15:50, Mark Nelson <***@redhat.com> wrote:
>
>
>
> On 08/18/2015 06:47 AM, Nick Fisk wrote:
>> Just to chime in, I gave dmcache a limited test but its lack of proper writeback cache ruled it out for me. It only performs write back caching on blocks already on the SSD, whereas I need something that works like a Battery backed raid controller caching all writes.
>>
>> It's amazing the 100x performance increase you get with RBD's when doing sync writes and give it something like just 1GB write back cache with flashcache.
>
> For your use case, is it ok that data may live on the flashcache for some amount of time before making to ceph to be replicated? We've wondered internally if this kind of trade-off is acceptable to customers or not should the flashcache SSD fail.
>

Was it me pestering you about it? :-)
All my customers need this desperately - people don't care about having RPO=0 seconds when all hell breaks loose.
People care about their apps being slow all the time, which is effectively an "outage".
I (sysadmin) care about having consistent data where all I have to do is start up the VMs.

Any ideas how to approach this? I think even checkpoints (like reverting to a known point in the past) would be great and sufficient for most people...


>>
>>
>>> -----Original Message-----
>>> From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf Of
>>> Jan Schermer
>>> Sent: 18 August 2015 12:44
>>> To: Mark Nelson <***@redhat.com>
>>> Cc: ceph-***@lists.ceph.com
>>> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>>>
>>> I did not. Not sure why now - probably for the same reason I didn't
>>> extensively test bcache.
>>> I'm not a real fan of device mapper though, so if I had to choose I'd still go for
>>> bcache :-)
>>>
>>> Jan
>>>
>>>> On 18 Aug 2015, at 13:33, Mark Nelson <***@redhat.com> wrote:
>>>>
>>>> Hi Jan,
>>>>
>>>> Out of curiosity did you ever try dm-cache? I've been meaning to give it a
>>> spin but haven't had the spare cycles.
>>>>
>>>> Mark
>>>>
>>>> On 08/18/2015 04:00 AM, Jan Schermer wrote:
>>>>> I already evaluated EnhanceIO in combination with CentOS 6 (and
>>> backported 3.10 and 4.0 kernel-lt if I remember correctly).
>>>>> It worked fine during benchmarks and stress tests, but once we run DB2
>>> on it it panicked within minutes and took all the data with it (almost literally -
>>> files that werent touched, like OS binaries were b0rked and the filesystem
>>> was unsalvageable).
>>>>> If you disregard this warning - the performance gains weren't that great
>>> either, at least in a VM. It had problems when flushing to disk after reaching
>>> dirty watermark and the block size has some not-well-documented
>>> implications (not sure now, but I think it only cached IO _larger_than the
>>> block size, so if your database keeps incrementing an XX-byte counter it will
>>> go straight to disk).
>>>>>
>>>>> Flashcache doesn't respect barriers (or does it now?) - if that's ok for you
>>> than go for it, it should be stable and I used it in the past in production
>>> without problems.
>>>>>
>>>>> bcache seemed to work fine, but I needed to
>>>>> a) use it for root
>>>>> b) disable and enable it on the fly (doh)
>>>>> c) make it non-persisent (flush it) before reboot - not sure if that was
>>> possible either.
>>>>> d) all that in a customer's VM, and that customer didn't have a strong
>>> technical background to be able to fiddle with it...
>>>>> So I haven't tested it heavily.
>>>>>
>>>>> Bcache should be the obvious choice if you are in control of the
>>>>> environment. At least you can cry on LKML's shoulder when you lose
>>>>> data :-)
>>>>>
>>>>> Jan
>>>>>
>>>>>
>>>>>> On 18 Aug 2015, at 01:49, Alex Gorbachev <***@iss-integration.com>
>>> wrote:
>>>>>>
>>>>>> What about https://github.com/Frontier314/EnhanceIO? Last commit 2
>>>>>> months ago, but no external contributors :(
>>>>>>
>>>>>> The nice thing about EnhanceIO is there is no need to change device
>>>>>> name, unlike bcache, flashcache etc.
>>>>>>
>>>>>> Best regards,
>>>>>> Alex
>>>>>>
>>>>>> On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz <***@redhat.com>
>>> wrote:
>>>>>>> I did some (non-ceph) work on these, and concluded that bcache was
>>>>>>> the best supported, most stable, and fastest. This was ~1 year
>>>>>>> ago, to take it with a grain of salt, but that's what I would recommend.
>>>>>>>
>>>>>>> Daniel
>>>>>>>
>>>>>>>
>>>>>>> ________________________________
>>>>>>> From: "Dominik Zalewski" <***@optlink.net>
>>>>>>> To: "German Anders" <***@despegar.com>
>>>>>>> Cc: "ceph-users" <ceph-***@lists.ceph.com>
>>>>>>> Sent: Wednesday, July 1, 2015 5:28:10 PM
>>>>>>> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I’ve asked same question last weeks or so (just search the mailing
>>>>>>> list archives for EnhanceIO :) and got some interesting answers.
>>>>>>>
>>>>>>> Looks like the project is pretty much dead since it was bought out by
>>> HGST.
>>>>>>> Even their website has some broken links in regards to EnhanceIO
>>>>>>>
>>>>>>> I’m keen to try flashcache or bcache (its been in the mainline
>>>>>>> kernel for some time)
>>>>>>>
>>>>>>> Dominik
>>>>>>>
>>>>>>> On 1 Jul 2015, at 21:13, German Anders <***@despegar.com>
>>> wrote:
>>>>>>>
>>>>>>> Hi cephers,
>>>>>>>
>>>>>>> Is anyone out there that implement enhanceIO in a production
>>> environment?
>>>>>>> any recommendation? any perf output to share with the diff between
>>>>>>> using it and not?
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>>
>>>>>>> German
>>>>>>> _______________________________________________
>>>>>>> ceph-users mailing list
>>>>>>> ceph-***@lists.ceph.com
>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> ceph-users mailing list
>>>>>>> ceph-***@lists.ceph.com
>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> ceph-users mailing list
>>>>>>> ceph-***@lists.ceph.com
>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list
>>>>>> ceph-***@lists.ceph.com
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-***@lists.ceph.com
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-***@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-***@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>>
Mark Nelson
2015-08-18 14:55:23 UTC
Permalink
On 08/18/2015 09:24 AM, Jan Schermer wrote:
>
>> On 18 Aug 2015, at 15:50, Mark Nelson <***@redhat.com> wrote:
>>
>>
>>
>> On 08/18/2015 06:47 AM, Nick Fisk wrote:
>>> Just to chime in, I gave dmcache a limited test but its lack of proper writeback cache ruled it out for me. It only performs write back caching on blocks already on the SSD, whereas I need something that works like a Battery backed raid controller caching all writes.
>>>
>>> It's amazing the 100x performance increase you get with RBD's when doing sync writes and give it something like just 1GB write back cache with flashcache.
>>
>> For your use case, is it ok that data may live on the flashcache for some amount of time before making to ceph to be replicated? We've wondered internally if this kind of trade-off is acceptable to customers or not should the flashcache SSD fail.
>>
>
> Was it me pestering you about it? :-)
> All my customers need this desperately - people don't care about having RPO=0 seconds when all hell breaks loose.
> People care about their apps being slow all the time which is effectively an "outage".
> I (sysadmin) care about having consistent data where all I have to do is start up the VMs.
>
> Any ideas how to approach this? I think even checkpoints (like reverting to a known point in the past) would be great and sufficient for most people...

Here's kind of how I see the field right now:

1) Cache at the client level. Likely fastest but obvious issues like
above. RAID1 might be an option at increased cost. Lack of barriers in
some implementations scary.

2) Cache below the OSD. Not much recent data on this. Not likely as
fast as client side cache, but likely cheaper (fewer OSD nodes than
client nodes?). Lack of barriers in some implementations scary.

3) Ceph Cache Tiering.  Network overhead and write amplification on
promotion make this primarily useful when workloads fit mostly into the
cache tier.  Overall a safe design, but care must be taken not to
over-promote (example commands below).

4) separate SSD pool. Manual and not particularly flexible, but perhaps
best for applications that need consistently high performance.
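
For 3), the basic wiring looks roughly like this (pool names and the
threshold are just examples):

  ceph osd tier add rbd cache-pool
  ceph osd tier cache-mode cache-pool writeback
  ceph osd tier set-overlay rbd cache-pool
  ceph osd pool set cache-pool hit_set_type bloom
  ceph osd pool set cache-pool target_max_bytes 500000000000

with promotion/flushing behavior then tuned via the
cache_target_dirty_ratio and cache_target_full_ratio pool settings.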

>
>
>>>
>>>
>>>> -----Original Message-----
>>>> From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf Of
>>>> Jan Schermer
>>>> Sent: 18 August 2015 12:44
>>>> To: Mark Nelson <***@redhat.com>
>>>> Cc: ceph-***@lists.ceph.com
>>>> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>>>>
>>>> I did not. Not sure why now - probably for the same reason I didn't
>>>> extensively test bcache.
>>>> I'm not a real fan of device mapper though, so if I had to choose I'd still go for
>>>> bcache :-)
>>>>
>>>> Jan
>>>>
>>>>> On 18 Aug 2015, at 13:33, Mark Nelson <***@redhat.com> wrote:
>>>>>
>>>>> Hi Jan,
>>>>>
>>>>> Out of curiosity did you ever try dm-cache? I've been meaning to give it a
>>>> spin but haven't had the spare cycles.
>>>>>
>>>>> Mark
>>>>>
>>>>> On 08/18/2015 04:00 AM, Jan Schermer wrote:
>>>>>> I already evaluated EnhanceIO in combination with CentOS 6 (and
>>>> backported 3.10 and 4.0 kernel-lt if I remember correctly).
>>>>>> It worked fine during benchmarks and stress tests, but once we run DB2
>>>> on it it panicked within minutes and took all the data with it (almost literally -
>>>> files that werent touched, like OS binaries were b0rked and the filesystem
>>>> was unsalvageable).
>>>>>> If you disregard this warning - the performance gains weren't that great
>>>> either, at least in a VM. It had problems when flushing to disk after reaching
>>>> dirty watermark and the block size has some not-well-documented
>>>> implications (not sure now, but I think it only cached IO _larger_than the
>>>> block size, so if your database keeps incrementing an XX-byte counter it will
>>>> go straight to disk).
>>>>>>
>>>>>> Flashcache doesn't respect barriers (or does it now?) - if that's ok for you
>>>> than go for it, it should be stable and I used it in the past in production
>>>> without problems.
>>>>>>
>>>>>> bcache seemed to work fine, but I needed to
>>>>>> a) use it for root
>>>>>> b) disable and enable it on the fly (doh)
>>>>>> c) make it non-persisent (flush it) before reboot - not sure if that was
>>>> possible either.
>>>>>> d) all that in a customer's VM, and that customer didn't have a strong
>>>> technical background to be able to fiddle with it...
>>>>>> So I haven't tested it heavily.
>>>>>>
>>>>>> Bcache should be the obvious choice if you are in control of the
>>>>>> environment. At least you can cry on LKML's shoulder when you lose
>>>>>> data :-)
>>>>>>
>>>>>> Jan
>>>>>>
>>>>>>
>>>>>>> On 18 Aug 2015, at 01:49, Alex Gorbachev <***@iss-integration.com>
>>>> wrote:
>>>>>>>
>>>>>>> What about https://github.com/Frontier314/EnhanceIO? Last commit 2
>>>>>>> months ago, but no external contributors :(
>>>>>>>
>>>>>>> The nice thing about EnhanceIO is there is no need to change device
>>>>>>> name, unlike bcache, flashcache etc.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Alex
>>>>>>>
>>>>>>> On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz <***@redhat.com>
>>>> wrote:
>>>>>>>> I did some (non-ceph) work on these, and concluded that bcache was
>>>>>>>> the best supported, most stable, and fastest. This was ~1 year
>>>>>>>> ago, to take it with a grain of salt, but that's what I would recommend.
>>>>>>>>
>>>>>>>> Daniel
>>>>>>>>
>>>>>>>>
>>>>>>>> ________________________________
>>>>>>>> From: "Dominik Zalewski" <***@optlink.net>
>>>>>>>> To: "German Anders" <***@despegar.com>
>>>>>>>> Cc: "ceph-users" <ceph-***@lists.ceph.com>
>>>>>>>> Sent: Wednesday, July 1, 2015 5:28:10 PM
>>>>>>>> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I’ve asked same question last weeks or so (just search the mailing
>>>>>>>> list archives for EnhanceIO :) and got some interesting answers.
>>>>>>>>
>>>>>>>> Looks like the project is pretty much dead since it was bought out by
>>>> HGST.
>>>>>>>> Even their website has some broken links in regards to EnhanceIO
>>>>>>>>
>>>>>>>> I’m keen to try flashcache or bcache (its been in the mainline
>>>>>>>> kernel for some time)
>>>>>>>>
>>>>>>>> Dominik
>>>>>>>>
>>>>>>>> On 1 Jul 2015, at 21:13, German Anders <***@despegar.com>
>>>> wrote:
>>>>>>>>
>>>>>>>> Hi cephers,
>>>>>>>>
>>>>>>>> Is anyone out there that implement enhanceIO in a production
>>>> environment?
>>>>>>>> any recommendation? any perf output to share with the diff between
>>>>>>>> using it and not?
>>>>>>>>
>>>>>>>> Thanks in advance,
>>>>>>>>
>>>>>>>> German
>>>>>>>> _______________________________________________
>>>>>>>> ceph-users mailing list
>>>>>>>> ceph-***@lists.ceph.com
>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> ceph-users mailing list
>>>>>>>> ceph-***@lists.ceph.com
>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> ceph-users mailing list
>>>>>>>> ceph-***@lists.ceph.com
>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> ceph-users mailing list
>>>>>>> ceph-***@lists.ceph.com
>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list
>>>>>> ceph-***@lists.ceph.com
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-***@lists.ceph.com
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-***@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>>
>>>
>
Nick Fisk
2015-08-18 16:08:59 UTC
Permalink
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf Of
> Mark Nelson
> Sent: 18 August 2015 15:55
> To: Jan Schermer <***@schermer.cz>
> Cc: ceph-***@lists.ceph.com; Nick Fisk <***@fisk.me.uk>
> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>
>
>
> On 08/18/2015 09:24 AM, Jan Schermer wrote:
> >
> >> On 18 Aug 2015, at 15:50, Mark Nelson <***@redhat.com> wrote:
> >>
> >>
> >>
> >> On 08/18/2015 06:47 AM, Nick Fisk wrote:
> >>> Just to chime in, I gave dmcache a limited test but its lack of proper
> writeback cache ruled it out for me. It only performs write back caching on
> blocks already on the SSD, whereas I need something that works like a
> Battery backed raid controller caching all writes.
> >>>
> >>> It's amazing the 100x performance increase you get with RBD's when
> doing sync writes and give it something like just 1GB write back cache with
> flashcache.
> >>
> >> For your use case, is it ok that data may live on the flashcache for some
> amount of time before making to ceph to be replicated? We've wondered
> internally if this kind of trade-off is acceptable to customers or not should the
> flashcache SSD fail.
> >>
> >
> > Was it me pestering you about it? :-)
> > All my customers need this desperately - people don't care about having
> RPO=0 seconds when all hell breaks loose.
> > People care about their apps being slow all the time which is effectively an
> "outage".
> > I (sysadmin) care about having consistent data where all I have to do is start
> up the VMs.
> >
> > Any ideas how to approach this? I think even checkpoints (like reverting to
> a known point in the past) would be great and sufficient for most people...
>
> Here's kind of how I see the field right now:
>
> 1) Cache at the client level. Likely fastest but obvious issues like above.
> RAID1 might be an option at increased cost. Lack of barriers in some
> implementations scary.

Agreed.

>
> 2) Cache below the OSD. Not much recent data on this. Not likely as fast as
> client side cache, but likely cheaper (fewer OSD nodes than client nodes?).
> Lack of barriers in some implementations scary.

This also has the benefit of caching the LevelDB on the OSD, so you get a big performance gain there too for small sequential writes. I looked at using Flashcache for this as well but decided it was adding too much complexity and risk.

I thought I read somewhere that RocksDB allows you to move its WAL to SSD; is there anything in the pipeline for something like moving the filestore to use RocksDB?

>
> 3) Ceph Cache Tiering. Network overhead and write amplification on
> promotion makes this primarily useful when workloads fit mostly into the
> cache tier. Overall safe design but care must be taken to not over-promote.
>
> 4) separate SSD pool. Manual and not particularly flexible, but perhaps best
> for applications that need consistently high performance.

I think it depends on the definition of performance. Currently even very fast CPUs and SSDs in their own pool will still struggle to get less than 1ms of write latency. If your performance requirements are for large queue depths then you will probably be alright. If you require something that mirrors the performance of a traditional write-back cache, then even pure SSD pools can start to struggle.


To give a real-world example of what I see when doing various tests, here is a rough guide to IOPs when removing a snapshot on an ESX server:

Traditional Array 10K disks = 300-600 IOPs
Ceph 7.2K + SSD Journal = 100-200 IOPs (LevelDB syncing on the OSD seems to be the main limitation)
Ceph Pure SSD Pool = 500 IOPs (Intel S3700 SSDs)
Ceph Cache Tiering = 10-500 IOPs (as we know, misses can be very painful)
Ceph + RBD Caching with Flashcache = 200-1000 IOPs (readahead can give high bursts if snapshot blocks are sequential)

And when copying VMs to the datastore (ESXi does this in sequential 64k IOs... yes, silly I know):

Traditional Array 10K disks = ~100MB/s (limited by the 1Gb interface; on other arrays I guess this scales)
Ceph 7.2K + SSD Journal = ~20MB/s (again the LevelDB sync seems to limit sequential writes here)
Ceph Pure SSD Pool = ~50MB/s (a Ceph CPU bottleneck is occurring)
Ceph Cache Tiering = ~50MB/s when writing to a new block, <10MB/s on promote+overwrite
Ceph + RBD Caching with Flashcache = as fast as the SSD will go
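
For anyone wanting to reproduce that last copy test, a minimal sketch with fio against a mapped RBD might look like the following (image, pool and device names are examples only, and it will overwrite the test image):

  rbd map rbd/testimg                        # example image; mapping gives e.g. /dev/rbd0
  # 64k sequential writes, one outstanding IO, O_DIRECT + O_SYNC,
  # roughly what ESXi generates when copying a VM over iSCSI
  fio --name=seq64k --filename=/dev/rbd0 \
      --rw=write --bs=64k --iodepth=1 \
      --ioengine=libaio --direct=1 --sync=1 \
      --runtime=60 --time_based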


>
> >
> >
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On
> >>>> Behalf Of Jan Schermer
> >>>> Sent: 18 August 2015 12:44
> >>>> To: Mark Nelson <***@redhat.com>
> >>>> Cc: ceph-***@lists.ceph.com
> >>>> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> >>>>
> >>>> I did not. Not sure why now - probably for the same reason I didn't
> >>>> extensively test bcache.
> >>>> I'm not a real fan of device mapper though, so if I had to choose
> >>>> I'd still go for bcache :-)
> >>>>
> >>>> Jan
> >>>>
> >>>>> On 18 Aug 2015, at 13:33, Mark Nelson <***@redhat.com>
> wrote:
> >>>>>
> >>>>> Hi Jan,
> >>>>>
> >>>>> Out of curiosity did you ever try dm-cache? I've been meaning to
> >>>>> give it a
> >>>> spin but haven't had the spare cycles.
> >>>>>
> >>>>> Mark
> >>>>>
> >>>>> On 08/18/2015 04:00 AM, Jan Schermer wrote:
> >>>>>> I already evaluated EnhanceIO in combination with CentOS 6 (and
> >>>> backported 3.10 and 4.0 kernel-lt if I remember correctly).
> >>>>>> It worked fine during benchmarks and stress tests, but once we
> >>>>>> run DB2
> >>>> on it it panicked within minutes and took all the data with it
> >>>> (almost literally - files that werent touched, like OS binaries
> >>>> were b0rked and the filesystem was unsalvageable).
> >>>>>> If you disregard this warning - the performance gains weren't
> >>>>>> that great
> >>>> either, at least in a VM. It had problems when flushing to disk
> >>>> after reaching dirty watermark and the block size has some
> >>>> not-well-documented implications (not sure now, but I think it only
> >>>> cached IO _larger_than the block size, so if your database keeps
> >>>> incrementing an XX-byte counter it will go straight to disk).
> >>>>>>
> >>>>>> Flashcache doesn't respect barriers (or does it now?) - if that's
> >>>>>> ok for you
> >>>> than go for it, it should be stable and I used it in the past in
> >>>> production without problems.
> >>>>>>
> >>>>>> bcache seemed to work fine, but I needed to
> >>>>>> a) use it for root
> >>>>>> b) disable and enable it on the fly (doh)
> >>>>>> c) make it non-persisent (flush it) before reboot - not sure if
> >>>>>> that was
> >>>> possible either.
> >>>>>> d) all that in a customer's VM, and that customer didn't have a
> >>>>>> strong
> >>>> technical background to be able to fiddle with it...
> >>>>>> So I haven't tested it heavily.
> >>>>>>
> >>>>>> Bcache should be the obvious choice if you are in control of the
> >>>>>> environment. At least you can cry on LKML's shoulder when you
> >>>>>> lose data :-)
> >>>>>>
> >>>>>> Jan
> >>>>>>
> >>>>>>
> >>>>>>> On 18 Aug 2015, at 01:49, Alex Gorbachev
> >>>>>>> <***@iss-integration.com>
> >>>> wrote:
> >>>>>>>
> >>>>>>> What about https://github.com/Frontier314/EnhanceIO? Last
> >>>>>>> commit 2 months ago, but no external contributors :(
> >>>>>>>
> >>>>>>> The nice thing about EnhanceIO is there is no need to change
> >>>>>>> device name, unlike bcache, flashcache etc.
> >>>>>>>
> >>>>>>> Best regards,
> >>>>>>> Alex
> >>>>>>>
> >>>>>>> On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz
> >>>>>>> <***@redhat.com>
> >>>> wrote:
> >>>>>>>> I did some (non-ceph) work on these, and concluded that bcache
> >>>>>>>> was the best supported, most stable, and fastest. This was ~1
> >>>>>>>> year ago, to take it with a grain of salt, but that's what I would
> recommend.
> >>>>>>>>
> >>>>>>>> Daniel
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> ________________________________
> >>>>>>>> From: "Dominik Zalewski" <***@optlink.net>
> >>>>>>>> To: "German Anders" <***@despegar.com>
> >>>>>>>> Cc: "ceph-users" <ceph-***@lists.ceph.com>
> >>>>>>>> Sent: Wednesday, July 1, 2015 5:28:10 PM
> >>>>>>>> Subject: Re: [ceph-users] any recommendation of using
> EnhanceIO?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I’ve asked same question last weeks or so (just search the
> >>>>>>>> mailing list archives for EnhanceIO :) and got some interesting
> answers.
> >>>>>>>>
> >>>>>>>> Looks like the project is pretty much dead since it was bought
> >>>>>>>> out by
> >>>> HGST.
> >>>>>>>> Even their website has some broken links in regards to
> >>>>>>>> EnhanceIO
> >>>>>>>>
> >>>>>>>> I’m keen to try flashcache or bcache (its been in the mainline
> >>>>>>>> kernel for some time)
> >>>>>>>>
> >>>>>>>> Dominik
> >>>>>>>>
> >>>>>>>> On 1 Jul 2015, at 21:13, German Anders
> <***@despegar.com>
> >>>> wrote:
> >>>>>>>>
> >>>>>>>> Hi cephers,
> >>>>>>>>
> >>>>>>>> Is anyone out there that implement enhanceIO in a production
> >>>> environment?
> >>>>>>>> any recommendation? any perf output to share with the diff
> >>>>>>>> between using it and not?
> >>>>>>>>
> >>>>>>>> Thanks in advance,
> >>>>>>>>
> >>>>>>>> German


Nick Fisk
Technical Support Engineer

Mark Nelson
2015-08-18 16:24:07 UTC
Permalink
On 08/18/2015 11:08 AM, Nick Fisk wrote:
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf Of
>> Mark Nelson
>> Sent: 18 August 2015 15:55
>> To: Jan Schermer <***@schermer.cz>
>> Cc: ceph-***@lists.ceph.com; Nick Fisk <***@fisk.me.uk>
>> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>>
>>
>>
>> On 08/18/2015 09:24 AM, Jan Schermer wrote:
>>>
>>>> On 18 Aug 2015, at 15:50, Mark Nelson <***@redhat.com> wrote:
>>>>
>>>>
>>>>
>>>> On 08/18/2015 06:47 AM, Nick Fisk wrote:
>>>>> Just to chime in, I gave dmcache a limited test but its lack of proper
>> writeback cache ruled it out for me. It only performs write back caching on
>> blocks already on the SSD, whereas I need something that works like a
>> Battery backed raid controller caching all writes.
>>>>>
>>>>> It's amazing the 100x performance increase you get with RBD's when
>> doing sync writes and give it something like just 1GB write back cache with
>> flashcache.
>>>>
>>>> For your use case, is it ok that data may live on the flashcache for some
>> amount of time before making to ceph to be replicated? We've wondered
>> internally if this kind of trade-off is acceptable to customers or not should the
>> flashcache SSD fail.
>>>>
>>>
>>> Was it me pestering you about it? :-)
>>> All my customers need this desperately - people don't care about having
>> RPO=0 seconds when all hell breaks loose.
>>> People care about their apps being slow all the time which is effectively an
>> "outage".
>>> I (sysadmin) care about having consistent data where all I have to do is start
>> up the VMs.
>>>
>>> Any ideas how to approach this? I think even checkpoints (like reverting to
>> a known point in the past) would be great and sufficient for most people...
>>
>> Here's kind of how I see the field right now:
>>
>> 1) Cache at the client level. Likely fastest but obvious issues like above.
>> RAID1 might be an option at increased cost. Lack of barriers in some
>> implementations scary.
>
> Agreed.
>
>>
>> 2) Cache below the OSD. Not much recent data on this. Not likely as fast as
>> client side cache, but likely cheaper (fewer OSD nodes than client nodes?).
>> Lack of barriers in some implementations scary.
>
> This also has the benefit of caching the leveldb on the OSD, so get a big performance gain from there too for small sequential writes. I looked at using Flashcache for this too but decided it was adding to much complexity and risk.
>
> I thought I read somewhere that RocksDB allows you to move its WAL to SSD, is there anything in the pipeline for something like moving the filestore to use RocksDB?

I believe you can already do this, though I haven't tested it. You can
certainly move the monitors to rocksdb (tested) and newstore uses
rocksdb as well.
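
As a very rough, untested sketch of what that might look like (the "filestore omap backend" option name is an assumption to verify against your build before relying on it):

  # untested: switch the filestore omap store from leveldb to rocksdb on the OSDs
  cat >> /etc/ceph/ceph.conf <<'EOF'
  [osd]
  filestore omap backend = rocksdb
  EOF
  # existing OSDs may need to be recreated and backfilled, since the on-disk
  # omap data is not converted automatically - check before relying on this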

>
>>
>> 3) Ceph Cache Tiering. Network overhead and write amplification on
>> promotion makes this primarily useful when workloads fit mostly into the
>> cache tier. Overall safe design but care must be taken to not over-promote.
>>
>> 4) separate SSD pool. Manual and not particularly flexible, but perhaps best
>> for applications that need consistently high performance.
>
> I think it depends on the definition of performance. Currently even very fast CPU's and SSD's in their own pool will still struggle to get less than 1ms of write latency. If your performance requirements are for large queue depths then you will probably be alright. If you require something that mirrors the performance of traditional write back cache, then even pure SSD Pools can start to struggle.

Agreed. This is definitely the crux of the problem. The example below
is a great start! It would be fantastic if we could get more feedback
from the list on the relative importance of low latency operations vs
high IOPS through concurrency. We have general suspicions but not a ton
of actual data regarding what folks are seeing in practice and under
what scenarios.

>
>
> To give a real world example of what I see when doing various tests, here is a rough guide to IOP's when removing a snapshot on a ESX server
>
> Traditional Array 10K disks = 300-600 IOPs
> Ceph 7.2K + SSD Journal = 100-200 IOPs (LevelDB syncing on OSD seems to be the main limitation)
> Ceph Pure SSD Pool = 500 IOPs (Intel s3700 SSD's)

I'd be curious to see how much jemalloc or tcmalloc 2.4 + a 128MB thread cache help
here. Sandisk and Intel have both done some very useful investigations;
I've got some additional tests replicating some of their findings coming
shortly.
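
For reference, one way to try the bigger thread cache on a single OSD by hand (the 128MB value is just an illustration, and how the environment variable reaches the daemon depends on your init scripts):

  # gperftools reads this at startup; 134217728 bytes = 128MB thread cache
  export TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728
  ceph-osd -i 0 --cluster ceph      # start osd.0 by hand with the variable set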

> Ceph Cache Tiering = 10-500 IOPs (As we know, misses can be very painful)

Indeed. There's some work going on in this area too. Hopefully we'll
know how some of our ideas pan out later this week. Assuming excessive
promotions aren't a problem, the jemalloc/tcmalloc improvements I
suspect will generally make cache tiering more interesting (though
buffer cache will still be the primary source of really hot cached reads).

> Ceph + RBD Caching with Flashcache = 200-1000 IOPs (Readahead can give high bursts if snapshot blocks are sequential)

Good to know!

>
> And when copying VM's to datastore (ESXi does this in sequential 64k IO's.....yes silly I know)
>
> Traditional Array 10K disks = ~100MB/s (Limited by 1GB interface, on other arrays I guess this scales)
> Ceph 7.2K + SSD Journal = ~20MB/s (Again LevelDB sync seems to limit here for sequential writes)

This is pretty bad. Is RBD cache enabled?

> Ceph Pure SSD Pool = ~50MB/s (Ceph CPU bottleneck is occurring)

Again, seems pretty rough compared to what I'd expect to see!

> Ceph Cache Tiering = ~50MB/s when writing to new block, <10MB/s when promote+overwrite
> Ceph + RBD Caching with Flashcache = As fast as the SSD will go
>
>
>>
>>>
>>>
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On
>>>>>> Behalf Of Jan Schermer
>>>>>> Sent: 18 August 2015 12:44
>>>>>> To: Mark Nelson <***@redhat.com>
>>>>>> Cc: ceph-***@lists.ceph.com
>>>>>> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>>>>>>
>>>>>> I did not. Not sure why now - probably for the same reason I didn't
>>>>>> extensively test bcache.
>>>>>> I'm not a real fan of device mapper though, so if I had to choose
>>>>>> I'd still go for bcache :-)
>>>>>>
>>>>>> Jan
>>>>>>
>>>>>>> On 18 Aug 2015, at 13:33, Mark Nelson <***@redhat.com>
>> wrote:
>>>>>>>
>>>>>>> Hi Jan,
>>>>>>>
>>>>>>> Out of curiosity did you ever try dm-cache? I've been meaning to
>>>>>>> give it a
>>>>>> spin but haven't had the spare cycles.
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>> On 08/18/2015 04:00 AM, Jan Schermer wrote:
>>>>>>>> I already evaluated EnhanceIO in combination with CentOS 6 (and
>>>>>> backported 3.10 and 4.0 kernel-lt if I remember correctly).
>>>>>>>> It worked fine during benchmarks and stress tests, but once we
>>>>>>>> run DB2
>>>>>> on it it panicked within minutes and took all the data with it
>>>>>> (almost literally - files that werent touched, like OS binaries
>>>>>> were b0rked and the filesystem was unsalvageable).
>>>>>>>> If you disregard this warning - the performance gains weren't
>>>>>>>> that great
>>>>>> either, at least in a VM. It had problems when flushing to disk
>>>>>> after reaching dirty watermark and the block size has some
>>>>>> not-well-documented implications (not sure now, but I think it only
>>>>>> cached IO _larger_than the block size, so if your database keeps
>>>>>> incrementing an XX-byte counter it will go straight to disk).
>>>>>>>>
>>>>>>>> Flashcache doesn't respect barriers (or does it now?) - if that's
>>>>>>>> ok for you
>>>>>> than go for it, it should be stable and I used it in the past in
>>>>>> production without problems.
>>>>>>>>
>>>>>>>> bcache seemed to work fine, but I needed to
>>>>>>>> a) use it for root
>>>>>>>> b) disable and enable it on the fly (doh)
>>>>>>>> c) make it non-persisent (flush it) before reboot - not sure if
>>>>>>>> that was
>>>>>> possible either.
>>>>>>>> d) all that in a customer's VM, and that customer didn't have a
>>>>>>>> strong
>>>>>> technical background to be able to fiddle with it...
>>>>>>>> So I haven't tested it heavily.
>>>>>>>>
>>>>>>>> Bcache should be the obvious choice if you are in control of the
>>>>>>>> environment. At least you can cry on LKML's shoulder when you
>>>>>>>> lose data :-)
>>>>>>>>
>>>>>>>> Jan
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 18 Aug 2015, at 01:49, Alex Gorbachev
>>>>>>>>> <***@iss-integration.com>
>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> What about https://github.com/Frontier314/EnhanceIO? Last
>>>>>>>>> commit 2 months ago, but no external contributors :(
>>>>>>>>>
>>>>>>>>> The nice thing about EnhanceIO is there is no need to change
>>>>>>>>> device name, unlike bcache, flashcache etc.
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Alex
>>>>>>>>>
>>>>>>>>> On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz
>>>>>>>>> <***@redhat.com>
>>>>>> wrote:
>>>>>>>>>> I did some (non-ceph) work on these, and concluded that bcache
>>>>>>>>>> was the best supported, most stable, and fastest. This was ~1
>>>>>>>>>> year ago, to take it with a grain of salt, but that's what I would
>> recommend.
>>>>>>>>>>
>>>>>>>>>> Daniel
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ________________________________
>>>>>>>>>> From: "Dominik Zalewski" <***@optlink.net>
>>>>>>>>>> To: "German Anders" <***@despegar.com>
>>>>>>>>>> Cc: "ceph-users" <ceph-***@lists.ceph.com>
>>>>>>>>>> Sent: Wednesday, July 1, 2015 5:28:10 PM
>>>>>>>>>> Subject: Re: [ceph-users] any recommendation of using
>> EnhanceIO?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I’ve asked same question last weeks or so (just search the
>>>>>>>>>> mailing list archives for EnhanceIO :) and got some interesting
>> answers.
>>>>>>>>>>
>>>>>>>>>> Looks like the project is pretty much dead since it was bought
>>>>>>>>>> out by
>>>>>> HGST.
>>>>>>>>>> Even their website has some broken links in regards to
>>>>>>>>>> EnhanceIO
>>>>>>>>>>
>>>>>>>>>> I’m keen to try flashcache or bcache (its been in the mainline
>>>>>>>>>> kernel for some time)
>>>>>>>>>>
>>>>>>>>>> Dominik
>>>>>>>>>>
>>>>>>>>>> On 1 Jul 2015, at 21:13, German Anders
>> <***@despegar.com>
>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi cephers,
>>>>>>>>>>
>>>>>>>>>> Is anyone out there that implement enhanceIO in a production
>>>>>> environment?
>>>>>>>>>> any recommendation? any perf output to share with the diff
>>>>>>>>>> between using it and not?
>>>>>>>>>>
>>>>>>>>>> Thanks in advance,
>>>>>>>>>>
>>>>>>>>>> German
Nick Fisk
2015-08-18 16:52:55 UTC
Permalink
<snip>
> >>
> >> Here's kind of how I see the field right now:
> >>
> >> 1) Cache at the client level. Likely fastest but obvious issues like above.
> >> RAID1 might be an option at increased cost. Lack of barriers in some
> >> implementations scary.
> >
> > Agreed.
> >
> >>
> >> 2) Cache below the OSD. Not much recent data on this. Not likely as
> >> fast as client side cache, but likely cheaper (fewer OSD nodes than client
> nodes?).
> >> Lack of barriers in some implementations scary.
> >
> > This also has the benefit of caching the leveldb on the OSD, so get a big
> performance gain from there too for small sequential writes. I looked at
> using Flashcache for this too but decided it was adding to much complexity
> and risk.
> >
> > I thought I read somewhere that RocksDB allows you to move its WAL to
> SSD, is there anything in the pipeline for something like moving the filestore
> to use RocksDB?
>
> I believe you can already do this, though I haven't tested it. You can certainly
> move the monitors to rocksdb (tested) and newstore uses rocksdb as well.
>

Interesting, I might have a look into this.

> >
> >>
> >> 3) Ceph Cache Tiering. Network overhead and write amplification on
> >> promotion makes this primarily useful when workloads fit mostly into the
> >> cache tier. Overall safe design but care must be taken to not over-
> promote.
> >>
> >> 4) separate SSD pool. Manual and not particularly flexible, but perhaps
> best
> >> for applications that need consistently high performance.
> >
> > I think it depends on the definition of performance. Currently even very
> fast CPU's and SSD's in their own pool will still struggle to get less than 1ms of
> write latency. If your performance requirements are for large queue depths
> then you will probably be alright. If you require something that mirrors the
> performance of traditional write back cache, then even pure SSD Pools can
> start to struggle.
>
> Agreed. This is definitely the crux of the problem. The example below
> is a great start! It'd would be fantastic if we could get more feedback
> from the list on the relative importance of low latency operations vs
> high IOPS through concurrency. We have general suspicions but not a ton
> of actual data regarding what folks are seeing in practice and under
> what scenarios.
>

If you have any specific questions that you think I might be able to answer, please let me know. The only other main app that I can really think of where this sort of write latency is critical is SQL, particularly the transaction logs.

> >
> >
> > To give a real world example of what I see when doing various tests, here
> is a rough guide to IOP's when removing a snapshot on a ESX server
> >
> > Traditional Array 10K disks = 300-600 IOPs
> > Ceph 7.2K + SSD Journal = 100-200 IOPs (LevelDB syncing on OSD seems to
> be the main limitation)
> > Ceph Pure SSD Pool = 500 IOPs (Intel s3700 SSD's)
>
> I'd be curious to see how much jemalloc or tcmalloc 2.4 + 128MB TC help
> here. Sandisk and Intel have both done some very useful investigations,
> I've got some additional tests replicating some of their findings coming
> shortly.

OK, it will be interesting to see. I will see if I can change it in my environment and whether it gives any improvement. I think I came to the conclusion that Ceph takes a certain amount of time to do a write, and by the time you add in a replica copy I was struggling to get much below 2ms per IO with my 2.1GHz CPUs. 2ms = ~500 IOPs.
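
A quick way to put a number on that per-write cost is a single-threaded small-write bench straight at a pool (the pool name is just an example); the average latency is roughly the inverse of the IOPs figure:

  rados -p testpool bench 30 write -b 4096 -t 1   # 30s of 4k writes, one in flight
  rados -p testpool cleanup                       # remove the benchmark objects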

>
> > Ceph Cache Tiering = 10-500 IOPs (As we know, misses can be very painful)
>
> Indeed. There's some work going on in this area too. Hopefully we'll
> know how some of our ideas pan out later this week. Assuming excessive
> promotions aren't a problem, the jemalloc/tcmalloc improvements I
> suspect will generally make cache teiring more interesting (though
> buffer cache will still be the primary source of really hot cached reads)
>
> > Ceph + RBD Caching with Flashcache = 200-1000 IOPs (Readahead can give
> high bursts if snapshot blocks are sequential)
>
> Good to know!
>
> >
> > And when copying VM's to datastore (ESXi does this in sequential 64k
> IO's.....yes silly I know)
> >
> > Traditional Array 10K disks = ~100MB/s (Limited by 1GB interface, on other
> arrays I guess this scales)
> > Ceph 7.2K + SSD Journal = ~20MB/s (Again LevelDB sync seems to limit here
> for sequential writes)
>
> This is pretty bad. Is RBD cache enabled?

Tell me about it, moving a 2TB VM is a painful experience. Yes, the librbd cache is on, but iSCSI effectively turns all writes into sync writes, which bypass the cache, so you are dependent on the time it takes for each OSD to ACK the write. In this case, waiting each time for 64kb IOs to complete due to the LevelDB sync, you end up with transfer speeds somewhere in the region of 15-20MB/s. You can reproduce the same thing with something like IOMeter (64k, sequential write, direct IO, QD=1).

NFS is even worse, as every ESX write also requires an FS journal sync on the filesystem being used for NFS. So you have to wait for two ACKs from Ceph, normally meaning <10MB/s.

Here is a thread from earlier in the year where I stumbled on the reason behind this (last post):
http://comments.gmane.org/gmane.comp.file-systems.ceph.user/18393
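
For completeness, this is roughly what having the librbd cache on looks like in ceph.conf (the values are only illustrative); it doesn't help in this case because the sync writes still have to wait for the OSD acks:

  cat >> /etc/ceph/ceph.conf <<'EOF'
  [client]
  rbd cache = true
  rbd cache size = 67108864                  # 64MB per-image cache
  rbd cache max dirty = 50331648             # start flushing once 48MB is dirty
  rbd cache writethrough until flush = true  # stay in writethrough until the guest flushes
  EOF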


>
> > Ceph Pure SSD Pool = ~50MB/s (Ceph CPU bottleneck is occurring)
>
> Again, seems pretty rough compared to what I'd expect to see!

Same as above, but in my findings CPU replaces the LevelDB sync as the bottleneck.

>
> > Ceph Cache Tiering = ~50MB/s when writing to new block, <10MB/s when
> promote+overwrite
> > Ceph + RBD Caching with Flashcache = As fast as the SSD will go
Mark Nelson
2015-08-18 17:50:38 UTC
Permalink
On 08/18/2015 11:52 AM, Nick Fisk wrote:
> <snip>
>>>>
>>>> Here's kind of how I see the field right now:
>>>>
>>>> 1) Cache at the client level. Likely fastest but obvious issues like above.
>>>> RAID1 might be an option at increased cost. Lack of barriers in some
>>>> implementations scary.
>>>
>>> Agreed.
>>>
>>>>
>>>> 2) Cache below the OSD. Not much recent data on this. Not likely as
>>>> fast as client side cache, but likely cheaper (fewer OSD nodes than client
>> nodes?).
>>>> Lack of barriers in some implementations scary.
>>>
>>> This also has the benefit of caching the leveldb on the OSD, so get a big
>> performance gain from there too for small sequential writes. I looked at
>> using Flashcache for this too but decided it was adding to much complexity
>> and risk.
>>>
>>> I thought I read somewhere that RocksDB allows you to move its WAL to
>> SSD, is there anything in the pipeline for something like moving the filestore
>> to use RocksDB?
>>
>> I believe you can already do this, though I haven't tested it. You can certainly
>> move the monitors to rocksdb (tested) and newstore uses rocksdb as well.
>>
>
> Interesting, I might have a look into this.
>
>>>
>>>>
>>>> 3) Ceph Cache Tiering. Network overhead and write amplification on
>>>> promotion makes this primarily useful when workloads fit mostly into the
>>>> cache tier. Overall safe design but care must be taken to not over-
>> promote.
>>>>
>>>> 4) separate SSD pool. Manual and not particularly flexible, but perhaps
>> best
>>>> for applications that need consistently high performance.
>>>
>>> I think it depends on the definition of performance. Currently even very
>> fast CPU's and SSD's in their own pool will still struggle to get less than 1ms of
>> write latency. If your performance requirements are for large queue depths
>> then you will probably be alright. If you require something that mirrors the
>> performance of traditional write back cache, then even pure SSD Pools can
>> start to struggle.
>>
>> Agreed. This is definitely the crux of the problem. The example below
>> is a great start! It'd would be fantastic if we could get more feedback
>> from the list on the relative importance of low latency operations vs
>> high IOPS through concurrency. We have general suspicions but not a ton
>> of actual data regarding what folks are seeing in practice and under
>> what scenarios.
>>
>
> If you have any specific questions that you think I might be able to answer, please let me know. The only other main app that I can really think of where these sort of write latency is critical is SQL, particularly the transaction logs.

Probably the big question is what are the pain points? The most common
answer we get when asking folks what applications they run on top of
Ceph is "everything!". This is wonderful, but not helpful when trying
to figure out what performance issues matter most! :)

I.e., should we be focusing on IOPS? Latency? Finding a way to avoid
journal overhead for large writes? Are there specific use cases where
we should specifically be focusing attention? General iSCSI? S3?
Databases directly on RBD? Etc. There are tons of different areas that we
can work on (general OSD threading improvements, different messenger
implementations, newstore, client-side bottlenecks, etc.) but all of
those things tackle different kinds of problems.

Mark
Alex Gorbachev
2015-08-18 18:08:24 UTC
Permalink
> IE, should we be focusing on IOPS? Latency? Finding a way to avoid journal
> overhead for large writes? Are there specific use cases where we should
> specifically be focusing attention? general iscsi? S3? databases directly
> on RBD? etc. There's tons of different areas that we can work on (general
> OSD threading improvements, different messenger implementations, newstore,
> client side bottlenecks, etc) but all of those things tackle different kinds
> of problems.
>

Mark, my take is definitely write latency. Based on this discussion,
there is no real safe solution for write caching outside Ceph.
Nick Fisk
2015-08-18 19:48:26 UTC
Permalink
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf Of
> Mark Nelson
> Sent: 18 August 2015 18:51
> To: Nick Fisk <***@fisk.me.uk>; 'Jan Schermer' <***@schermer.cz>
> Cc: ceph-***@lists.ceph.com
> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>
>
>
> On 08/18/2015 11:52 AM, Nick Fisk wrote:
> > <snip>
> >>>>
> >>>> Here's kind of how I see the field right now:
> >>>>
> >>>> 1) Cache at the client level. Likely fastest but obvious issues like
above.
> >>>> RAID1 might be an option at increased cost. Lack of barriers in
> >>>> some implementations scary.
> >>>
> >>> Agreed.
> >>>
> >>>>
> >>>> 2) Cache below the OSD. Not much recent data on this. Not likely
> >>>> as fast as client side cache, but likely cheaper (fewer OSD nodes
> >>>> than client
> >> nodes?).
> >>>> Lack of barriers in some implementations scary.
> >>>
> >>> This also has the benefit of caching the leveldb on the OSD, so get
> >>> a big
> >> performance gain from there too for small sequential writes. I looked
> >> at using Flashcache for this too but decided it was adding to much
> >> complexity and risk.
> >>>
> >>> I thought I read somewhere that RocksDB allows you to move its WAL
> >>> to
> >> SSD, is there anything in the pipeline for something like moving the
> >> filestore to use RocksDB?
> >>
> >> I believe you can already do this, though I haven't tested it. You
> >> can certainly move the monitors to rocksdb (tested) and newstore uses
> rocksdb as well.
> >>
> >
> > Interesting, I might have a look into this.
> >
> >>>
> >>>>
> >>>> 3) Ceph Cache Tiering. Network overhead and write amplification on
> >>>> promotion makes this primarily useful when workloads fit mostly
> >>>> into the cache tier. Overall safe design but care must be taken to
> >>>> not over-
> >> promote.
> >>>>
> >>>> 4) separate SSD pool. Manual and not particularly flexible, but
> >>>> perhaps
> >> best
> >>>> for applications that need consistently high performance.
> >>>
> >>> I think it depends on the definition of performance. Currently even
> >>> very
> >> fast CPU's and SSD's in their own pool will still struggle to get
> >> less than 1ms of write latency. If your performance requirements are
> >> for large queue depths then you will probably be alright. If you
> >> require something that mirrors the performance of traditional write
> >> back cache, then even pure SSD Pools can start to struggle.
> >>
> >> Agreed. This is definitely the crux of the problem. The example
> >> below is a great start! It'd would be fantastic if we could get more
> >> feedback from the list on the relative importance of low latency
> >> operations vs high IOPS through concurrency. We have general
> >> suspicions but not a ton of actual data regarding what folks are
> >> seeing in practice and under what scenarios.
> >>
> >
> > If you have any specific questions that you think I might be able to
answer,
> please let me know. The only other main app that I can really think of
where
> these sort of write latency is critical is SQL, particularly the
transaction logs.
>
> Probably the big question is what are the pain points? The most common
> answer we get when asking folks what applications they run on top of Ceph
> is "everything!". This is wonderful, but not helpful when trying to
figure out
> what performance issues matter most! :)

Sort of like someone telling you their PC is broken and, when asked for
details, getting "It's not working" in return.

In general I think a lot of it comes down to people not appreciating the
differences between Ceph and, say, a RAID array. For most things, like larger
block IO, performance tends to scale with cluster size, and the cost
effectiveness of Ceph makes it a no-brainer to just add a handful of
extra OSDs.

I will try and be more precise. Here is my list of pain points / wishes that
I have come across in the last 12 months of running Ceph.

1. Improve small IO write latency
As discussed in depth in this thread. If it's possible just to make Ceph a
lot faster then great, but I fear even a doubling in performance will still
fall short compared to caching writes at the client. Most things
in Ceph tend to improve with scale, but write latency is the same with 2
OSDs as it is with 2000. I would urge some sort of investigation into the
possibility of some sort of persistent librbd caching. This will probably
help across a large number of scenarios, as in the end most things are
affected by latency, and I think it will provide across-the-board improvements.

2. Cache Tiering
I know a lot of work is going into this currently, but I will cover my
experience.
2A) Deletion of large RBDs takes forever. It seems to have to promote all
objects, even non-existent ones, to the cache tier before it can delete them.
Operationally this is really poor, as it has a negative effect on the cache
tier contents as well.
2B) Erasure coding requires all writes to be promoted first. I think it should
be pretty easy to allow proxy writes for erasure-coded pools if the IO size
= object size. A lot of backup applications can be configured to write out
in static-sized blocks and would be an ideal candidate for this sort of
enhancement.
2C) General performance; hopefully this will be fixed by upcoming changes.
2D) Don't count consecutive sequential reads to the same object as a trigger
for promotion. I currently have problems where reading sequentially through
a large RBD causes it to be completely promoted, because the read IO size is
smaller than the underlying object size (see the settings sketch after this list).

3. Kernel RBD Client
Either implement striping or see if it's possible to configure the readahead
and max_sectors_kb sizes to be larger than the object size. I started a thread
about this a few days ago if you are interested in more details.

4. Disk-based OSD with SSD journal performance
As I touched on earlier, I would expect a disk-based OSD with an SSD
journal to have similar performance to a pure SSD OSD when dealing with
small sequential IOs. Currently the LevelDB sync and potentially other
things slow this down.

5. iSCSI
I know Mike Christie is doing a lot of good work in getting LIO to work with
Ceph, but currently it feels like a bit of an amateur affair getting it
going.

6. Slow xattr problem
I've hit a weird problem a couple of times, where RBDs with data that hasn't
been written to for a while seem to start performing reads very slowly. With
the help of Somnath in a thread here we managed to track it down to an xattr
taking a very long time to be retrieved, but we have no idea why. Overwriting
the RBD with fresh data seemed to stop it happening. Hopefully Newstore might
stop this happening in the future.
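
To expand on 2D, these are roughly the knobs I'd expect to be involved (the pool name and values are examples; min_read_recency_for_promote is a Hammer-era option, so check that your release has it):

  ceph osd pool set cachepool hit_set_type bloom
  ceph osd pool set cachepool hit_set_count 4
  ceph osd pool set cachepool hit_set_period 1200
  # only promote an object once it has shown up in more than one recent hit set,
  # so a single sequential pass over an RBD doesn't drag every object into the tier
  ceph osd pool set cachepool min_read_recency_for_promote 2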

>
> IE, should we be focusing on IOPS? Latency? Finding a way to avoid
journal
> overhead for large writes? Are there specific use cases where we should
> specifically be focusing attention? general iscsi? S3?
> databases directly on RBD? etc. There's tons of different areas that we
can
> work on (general OSD threading improvements, different messenger
> implementations, newstore, client side bottlenecks, etc) but all of those
> things tackle different kinds of problems.
>
> Mark
Samuel Just
2015-08-18 20:38:02 UTC
Permalink
1. We've kicked this around a bit. What kind of failure semantics
would you be comfortable with here (that is, what would be reasonable
behavior if the client-side cache fails)?
2. We've got a branch which should merge soon (tomorrow probably)
which actually does allow writes to be proxied, so that should
alleviate some of these pain points somewhat. I'm not sure it is
clever enough to allow through writefulls for an EC base tier though
(but it would be a good idea!).
-Sam
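
For anyone following along, the layout being discussed here - an EC base pool behind a replicated writeback cache tier - is created roughly like this (names, PG counts and the EC profile are examples only):

  ceph osd erasure-code-profile set exampleprofile k=4 m=2
  # with the default failure domain this wants k+m (here 6) hosts
  ceph osd pool create ecbase 256 256 erasure exampleprofile
  ceph osd pool create cachepool 128
  ceph osd tier add ecbase cachepool
  ceph osd tier cache-mode cachepool writeback
  ceph osd tier set-overlay ecbase cachepool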

On Tue, Aug 18, 2015 at 12:48 PM, Nick Fisk <***@fisk.me.uk> wrote:
>
>
>
>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf Of
>> Mark Nelson
>> Sent: 18 August 2015 18:51
>> To: Nick Fisk <***@fisk.me.uk>; 'Jan Schermer' <***@schermer.cz>
>> Cc: ceph-***@lists.ceph.com
>> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>>
>>
>>
>> On 08/18/2015 11:52 AM, Nick Fisk wrote:
>> > <snip>
>> >>>>
>> >>>> Here's kind of how I see the field right now:
>> >>>>
>> >>>> 1) Cache at the client level. Likely fastest but obvious issues like
> above.
>> >>>> RAID1 might be an option at increased cost. Lack of barriers in
>> >>>> some implementations scary.
>> >>>
>> >>> Agreed.
>> >>>
>> >>>>
>> >>>> 2) Cache below the OSD. Not much recent data on this. Not likely
>> >>>> as fast as client side cache, but likely cheaper (fewer OSD nodes
>> >>>> than client
>> >> nodes?).
>> >>>> Lack of barriers in some implementations scary.
>> >>>
>> >>> This also has the benefit of caching the leveldb on the OSD, so get
>> >>> a big
>> >> performance gain from there too for small sequential writes. I looked
>> >> at using Flashcache for this too but decided it was adding to much
>> >> complexity and risk.
>> >>>
>> >>> I thought I read somewhere that RocksDB allows you to move its WAL
>> >>> to
>> >> SSD, is there anything in the pipeline for something like moving the
>> >> filestore to use RocksDB?
>> >>
>> >> I believe you can already do this, though I haven't tested it. You
>> >> can certainly move the monitors to rocksdb (tested) and newstore uses
>> rocksdb as well.
>> >>
>> >
>> > Interesting, I might have a look into this.
>> >
>> >>>
>> >>>>
>> >>>> 3) Ceph Cache Tiering. Network overhead and write amplification on
>> >>>> promotion makes this primarily useful when workloads fit mostly
>> >>>> into the cache tier. Overall safe design but care must be taken to
>> >>>> not over-
>> >> promote.
>> >>>>
>> >>>> 4) separate SSD pool. Manual and not particularly flexible, but
>> >>>> perhaps
>> >> best
>> >>>> for applications that need consistently high performance.
>> >>>
>> >>> I think it depends on the definition of performance. Currently even
>> >>> very
>> >> fast CPU's and SSD's in their own pool will still struggle to get
>> >> less than 1ms of write latency. If your performance requirements are
>> >> for large queue depths then you will probably be alright. If you
>> >> require something that mirrors the performance of traditional write
>> >> back cache, then even pure SSD Pools can start to struggle.
>> >>
>> >> Agreed. This is definitely the crux of the problem. The example
>> >> below is a great start! It'd would be fantastic if we could get more
>> >> feedback from the list on the relative importance of low latency
>> >> operations vs high IOPS through concurrency. We have general
>> >> suspicions but not a ton of actual data regarding what folks are
>> >> seeing in practice and under what scenarios.
>> >>
>> >
>> > If you have any specific questions that you think I might be able to
> answer,
>> please let me know. The only other main app that I can really think of
> where
>> these sort of write latency is critical is SQL, particularly the
> transaction logs.
>>
>> Probably the big question is what are the pain points? The most common
>> answer we get when asking folks what applications they run on top of Ceph
>> is "everything!". This is wonderful, but not helpful when trying to
> figure out
>> what performance issues matter most! :)
>
> Sort of like someone telling you their pc is broken and when asked for
> details getting "It's not working" in return.
>
> In general I think a lot of it comes down to people not appreciating the
> differences between Ceph and say a Raid array. For most things like larger
> block IO performance tends to scale with cluster size and the cost
> effectiveness of Ceph makes this a no brainer not to just add a handful of
> extra OSD's.
>
> I will try and be more precise. Here is my list of pain points / wishes that
> I have come across in the last 12 months of running Ceph.
>
> 1. Improve small IO write latency
> As discussed in depth in this thread. If it's possible just to make Ceph a
> lot faster then great, but I fear even a doubling in performance will still
> fall short compared to if you are caching writes at the client. Most things
> in Ceph tend to improve with scale, but write latency is the same with 2
> OSD's as it is with 2000. I would urge some sort of investigation into the
> possibility of some sort of persistent librbd caching. This will probably
> help across a large number of scenarios, as in the end, most things are
> effected by latency and I think will provide across the board improvements.
>
> 2. Cache Tiering
> I know a lot of work is going into this currently, but I will cover my
> experience.
> 2A)Deletion of large RBD's takes forever. It seems to have to promote all
> objects, even non-existent ones to the cache tier before it can delete them.
> Operationally this is really poor as it has a negative effect on the cache
> tier contents as well.
> 2B) Erasure Coding requires all writes to be promoted 1st. I think it should
> be pretty easy to allow proxy writes for erasure coded pools if the IO size
> = Object Size. A lot of backup applications can be configured to write out
> in static sized blocks and would be an ideal candidate for this sort of
> enhancement.
> 2C) General Performance, hopefully this will be fixed by upcoming changes.
> 2D) Don't count consecutive sequential reads to the same object as a trigger
> for promotion. I currently have problems where reading sequentially through
> a large RBD, causes it to be completely promoted because the read IO size is
> smaller than the underlying object size.
>
> 3. Kernel RBD Client
> Either implement striping or see if it's possible to configure readahead
> +max_sectors_kb size to be larger than the object size. I started a thread
> about this a few days ago if you are interested in more details.
>
> 4. Disk based OSD with SSD Journal performance
> As I touched on above earlier, I would expect a disk based OSD with SSD
> journal to have similar performance to a pure SSD OSD when dealing with
> sequential small IO's. Currently the levelDB sync and potentially other
> things slow this down.
>
> 5. iSCSI
> I know Mike Christie is doing a lot of good work in getting LIO to work with
> Ceph, but currently it feels like a bit of a amateur affair getting it
> going.
>
> 6. Slow xattr problem
> I've a weird problem a couple of times, where RBD's with data that hasn't
> been written to for a while seem to start performing reads very slowly. With
> the help of Somnath in a thread here we managed to track it down to a xattr
> taking very long to be retrieved, but no idea why. Overwriting the RBD with
> fresh data seemed to stop it happening. Hopefully Newstore might stop this
> happening in the future.
>
>>
>> IE, should we be focusing on IOPS? Latency? Finding a way to avoid
> journal
>> overhead for large writes? Are there specific use cases where we should
>> specifically be focusing attention? general iscsi? S3?
>> databases directly on RBD? etc. There's tons of different areas that we
> can
>> work on (general OSD threading improvements, different messenger
>> implementations, newstore, client side bottlenecks, etc) but all of those
>> things tackle different kinds of problems.
>>
>> Mark
Nick Fisk
2015-08-18 21:24:38 UTC
Permalink
Hi Sam,

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf Of
> Samuel Just
> Sent: 18 August 2015 21:38
> To: Nick Fisk <***@fisk.me.uk>
> Cc: ceph-***@lists.ceph.com
> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>
> 1. We've kicked this around a bit. What kind of failure semantics would
you
> be comfortable with here (that is, what would be reasonable behavior if
the
> client side cache fails)?

I would either expect to provide the cache with a redundant block device (i.e.
RAID1 SSDs) or for the cache to allow itself to be configured to mirror across
two SSDs. Of course single SSDs can be used if the user accepts the risk.
If the cache did the mirroring then you could do fancy stuff like mirror the
writes but leave the read cache blocks as single copies to increase the
cache capacity (a rough sketch of the RAID1 approach is below).
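
As a rough sketch of the RAID1 variant I mean (device and image names are examples, and the flashcache syntax is from memory, so verify against flashcache_create --help before trusting it):

  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
  rbd map rbd/vmimage                              # example image to be cached
  flashcache_create -p back rbd_cache /dev/md0 /dev/rbd0
  # the guest / iSCSI target would then sit on /dev/mapper/rbd_cache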

In either case, although an outage is undesirable, it's only data loss that
would be unacceptable, and that would hopefully be avoided by the mirroring. As
part of this, there would need to be a way to make sure a "dirty" RBD can't be
accessed unless the corresponding cache is also attached.

I guess as it is caching the RBD and not the pool or the entire cluster, the cache
only needs to match the failure requirements of the application it's caching.
If I need to cache an RBD that is on a single server, there is no
requirement to make the cache redundant across racks/PDUs/servers, etc.

I hope I've answered your question?


> 2. We've got a branch which should merge soon (tomorrow probably) which
> actually does allow writes to be proxied, so that should alleviate some of
> these pain points somewhat. I'm not sure it is clever enough to allow
> through writefulls for an ec base tier though (but it would be a good
idea!) -

Excellent news, I shall look forward to testing it in the future. I did mention
the proxy write for writefulls to someone who was working on the proxy
write code, but I'm not sure if it ever got followed up.

> Sam
>
> On Tue, Aug 18, 2015 at 12:48 PM, Nick Fisk <***@fisk.me.uk> wrote:
> >
> >
> >
> >
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf
> >> Of Mark Nelson
> >> Sent: 18 August 2015 18:51
> >> To: Nick Fisk <***@fisk.me.uk>; 'Jan Schermer' <***@schermer.cz>
> >> Cc: ceph-***@lists.ceph.com
> >> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> >>
> >>
> >>
> >> On 08/18/2015 11:52 AM, Nick Fisk wrote:
> >> > <snip>
> >> >>>>
> >> >>>> Here's kind of how I see the field right now:
> >> >>>>
> >> >>>> 1) Cache at the client level. Likely fastest but obvious issues
> >> >>>> like
> > above.
> >> >>>> RAID1 might be an option at increased cost. Lack of barriers in
> >> >>>> some implementations scary.
> >> >>>
> >> >>> Agreed.
> >> >>>
> >> >>>>
> >> >>>> 2) Cache below the OSD. Not much recent data on this. Not
> >> >>>> likely as fast as client side cache, but likely cheaper (fewer
> >> >>>> OSD nodes than client
> >> >> nodes?).
> >> >>>> Lack of barriers in some implementations scary.
> >> >>>
> >> >>> This also has the benefit of caching the leveldb on the OSD, so
> >> >>> get a big
> >> >> performance gain from there too for small sequential writes. I
> >> >> looked at using Flashcache for this too but decided it was adding
> >> >> to much complexity and risk.
> >> >>>
> >> >>> I thought I read somewhere that RocksDB allows you to move its
> >> >>> WAL to
> >> >> SSD, is there anything in the pipeline for something like moving
> >> >> the filestore to use RocksDB?
> >> >>
> >> >> I believe you can already do this, though I haven't tested it.
> >> >> You can certainly move the monitors to rocksdb (tested) and
> >> >> newstore uses
> >> rocksdb as well.
> >> >>
> >> >
> >> > Interesting, I might have a look into this.
> >> >
> >> >>>
> >> >>>>
> >> >>>> 3) Ceph Cache Tiering. Network overhead and write amplification
> >> >>>> on promotion makes this primarily useful when workloads fit
> >> >>>> mostly into the cache tier. Overall safe design but care must
> >> >>>> be taken to not over-
> >> >> promote.
> >> >>>>
> >> >>>> 4) separate SSD pool. Manual and not particularly flexible, but
> >> >>>> perhaps
> >> >> best
> >> >>>> for applications that need consistently high performance.
> >> >>>
> >> >>> I think it depends on the definition of performance. Currently
> >> >>> even very
> >> >> fast CPU's and SSD's in their own pool will still struggle to get
> >> >> less than 1ms of write latency. If your performance requirements
> >> >> are for large queue depths then you will probably be alright. If
> >> >> you require something that mirrors the performance of traditional
> >> >> write back cache, then even pure SSD Pools can start to struggle.
> >> >>
> >> >> Agreed. This is definitely the crux of the problem. The example
> >> >> below is a great start! It'd would be fantastic if we could get
> >> >> more feedback from the list on the relative importance of low
> >> >> latency operations vs high IOPS through concurrency. We have
> >> >> general suspicions but not a ton of actual data regarding what
> >> >> folks are seeing in practice and under what scenarios.
> >> >>
> >> >
> >> > If you have any specific questions that you think I might be able
> >> > to
> > answer,
> >> please let me know. The only other main app that I can really think
> >> of
> > where
> >> these sort of write latency is critical is SQL, particularly the
> > transaction logs.
> >>
> >> Probably the big question is what are the pain points? The most
> >> common answer we get when asking folks what applications they run on
> >> top of Ceph is "everything!". This is wonderful, but not helpful
> >> when trying to
> > figure out
> >> what performance issues matter most! :)
> >
> > Sort of like someone telling you their pc is broken and when asked for
> > details getting "It's not working" in return.
> >
> > In general I think a lot of it comes down to people not appreciating
> > the differences between Ceph and say a Raid array. For most things
> > like larger block IO performance tends to scale with cluster size and
> > the cost effectiveness of Ceph makes this a no brainer not to just add
> > a handful of extra OSD's.
> >
> > I will try and be more precise. Here is my list of pain points /
> > wishes that I have come across in the last 12 months of running Ceph.
> >
> > 1. Improve small IO write latency
> > As discussed in depth in this thread. If it's possible just to make
> > Ceph a lot faster then great, but I fear even a doubling in
> > performance will still fall short compared to if you are caching
> > writes at the client. Most things in Ceph tend to improve with scale,
> > but write latency is the same with 2 OSD's as it is with 2000. I would
> > urge some sort of investigation into the possibility of some sort of
> > persistent librbd caching. This will probably help across a large
> > number of scenarios, as in the end, most things are effected by latency
and
> I think will provide across the board improvements.
> >
> > 2. Cache Tiering
> > I know a lot of work is going into this currently, but I will cover my
> > experience.
> > 2A)Deletion of large RBD's takes forever. It seems to have to promote
> > all objects, even non-existent ones to the cache tier before it can
delete
> them.
> > Operationally this is really poor as it has a negative effect on the
> > cache tier contents as well.
> > 2B) Erasure Coding requires all writes to be promoted 1st. I think it
> > should be pretty easy to allow proxy writes for erasure coded pools if
> > the IO size = Object Size. A lot of backup applications can be
> > configured to write out in static sized blocks and would be an ideal
> > candidate for this sort of enhancement.
> > 2C) General Performance, hopefully this will be fixed by upcoming
changes.
> > 2D) Don't count consecutive sequential reads to the same object as a
> > trigger for promotion. I currently have problems where reading
> > sequentially through a large RBD, causes it to be completely promoted
> > because the read IO size is smaller than the underlying object size.
> >
> > 3. Kernel RBD Client
> > Either implement striping or see if it's possible to configure
> > readahead
> > +max_sectors_kb size to be larger than the object size. I started a
> > +thread
> > about this a few days ago if you are interested in more details.
> >
> > 4. Disk based OSD with SSD Journal performance As I touched on above
> > earlier, I would expect a disk based OSD with SSD journal to have
> > similar performance to a pure SSD OSD when dealing with sequential
> > small IO's. Currently the levelDB sync and potentially other things
> > slow this down.
> >
> > 5. iSCSI
> > I know Mike Christie is doing a lot of good work in getting LIO to
> > work with Ceph, but currently it feels like a bit of a amateur affair
> > getting it going.
> >
> > 6. Slow xattr problem
> > I've a weird problem a couple of times, where RBD's with data that
> > hasn't been written to for a while seem to start performing reads very
> > slowly. With the help of Somnath in a thread here we managed to track
> > it down to a xattr taking very long to be retrieved, but no idea why.
> > Overwriting the RBD with fresh data seemed to stop it happening.
> > Hopefully Newstore might stop this happening in the future.
> >
> >>
> >> IE, should we be focusing on IOPS? Latency? Finding a way to avoid
> > journal
> >> overhead for large writes? Are there specific use cases where we
> >> should specifically be focusing attention? general iscsi? S3?
> >> databases directly on RBD? etc. There's tons of different areas that
> >> we
> > can
> >> work on (general OSD threading improvements, different messenger
> >> implementations, newstore, client side bottlenecks, etc) but all of
> >> those things tackle different kinds of problems.
> >>
> >> Mark
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-***@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-***@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Wang, Zhiqiang
2015-09-01 01:47:42 UTC
Permalink
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf Of
> Nick Fisk
> Sent: Wednesday, August 19, 2015 5:25 AM
> To: 'Samuel Just'
> Cc: ceph-***@lists.ceph.com
> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>
> Hi Sam,
>
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf
> > Of Samuel Just
> > Sent: 18 August 2015 21:38
> > To: Nick Fisk <***@fisk.me.uk>
> > Cc: ceph-***@lists.ceph.com
> > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> >
> > 1. We've kicked this around a bit. What kind of failure semantics
> > would
> you
> > be comfortable with here (that is, what would be reasonable behavior
> > if
> the
> > client side cache fails)?
>
> I would either expect to provide the cache with a redundant block device (ie
> RAID1 SSD's) or the cache to allow itself to be configured to mirror across two
> SSD's. Of course single SSD's can be used if the user accepts the risk.
> If the cache did the mirroring then you could do fancy stuff like mirror the
> writes, but leave the read cache blocks as single copies to increase the cache
> capacity.
>
> In either case although an outage is undesirable, its only data loss which would
> be unacceptable, which would hopefully be avoided by the mirroring. As part of
> this, it would need to be a way to make sure a "dirty" RBD can't be accessed
> unless the corresponding cache is also attached.
>
> I guess as it caching the RBD and not the pool or entire cluster, the cache only
> needs to match the failure requirements of the application its caching.
> If I need to cache a RBD that is on a single server, there is no requirement to
> make the cache redundant across racks/PDU's/servers...etc.
>
> I hope I've answered your question?
>
>
> > 2. We've got a branch which should merge soon (tomorrow probably)
> > which actually does allow writes to be proxied, so that should
> > alleviate some of these pain points somewhat. I'm not sure it is
> > clever enough to allow through writefulls for an ec base tier though
> > (but it would be a good
> idea!) -
>
> Excellent news, I shall look forward to testing in the future. I did mention the
> proxy write for write fulls to someone who was working on the proxy write code,
> but I'm not sure if it ever got followed up.

I think that someone is me. In the current code, for an EC base tier, a writefull can be proxied to the base.
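
For reference, this is roughly what the two cases look like from a client
using the Python rados bindings (just a sketch; the pool name, object names
and the 4 MiB object size are made up for illustration):

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    # 'cachepool' is a made-up name for the cache-tiered pool sitting in
    # front of the EC base tier.
    ioctx = cluster.open_ioctx('cachepool')
    obj_size = 4 * 1024 * 1024  # assumed 4 MiB objects

    # Partial write: only touches part of the object, so the cache tier
    # still has to promote the object before applying it.
    ioctx.write('obj.0000', b'x' * 4096, 0)

    # Full-object write: replaces the whole object in one operation, which
    # is the case that can be proxied straight through to the EC base tier.
    ioctx.write_full('obj.0001', b'y' * obj_size)

    ioctx.close()
finally:
    cluster.shutdown()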

>
> > Sam
> >
> > On Tue, Aug 18, 2015 at 12:48 PM, Nick Fisk <***@fisk.me.uk> wrote:
> > >
> > >
> > >
> > >
> > >> -----Original Message-----
> > >> From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On
> > >> Behalf Of Mark Nelson
> > >> Sent: 18 August 2015 18:51
> > >> To: Nick Fisk <***@fisk.me.uk>; 'Jan Schermer' <***@schermer.cz>
> > >> Cc: ceph-***@lists.ceph.com
> > >> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> > >>
> > >>
> > >>
> > >> On 08/18/2015 11:52 AM, Nick Fisk wrote:
> > >> > <snip>
> > >> >>>>
> > >> >>>> Here's kind of how I see the field right now:
> > >> >>>>
> > >> >>>> 1) Cache at the client level. Likely fastest but obvious
> > >> >>>> issues like
> > > above.
> > >> >>>> RAID1 might be an option at increased cost. Lack of barriers
> > >> >>>> in some implementations scary.
> > >> >>>
> > >> >>> Agreed.
> > >> >>>
> > >> >>>>
> > >> >>>> 2) Cache below the OSD. Not much recent data on this. Not
> > >> >>>> likely as fast as client side cache, but likely cheaper (fewer
> > >> >>>> OSD nodes than client
> > >> >> nodes?).
> > >> >>>> Lack of barriers in some implementations scary.
> > >> >>>
> > >> >>> This also has the benefit of caching the leveldb on the OSD, so
> > >> >>> get a big
> > >> >> performance gain from there too for small sequential writes. I
> > >> >> looked at using Flashcache for this too but decided it was
> > >> >> adding to much complexity and risk.
> > >> >>>
> > >> >>> I thought I read somewhere that RocksDB allows you to move its
> > >> >>> WAL to
> > >> >> SSD, is there anything in the pipeline for something like moving
> > >> >> the filestore to use RocksDB?
> > >> >>
> > >> >> I believe you can already do this, though I haven't tested it.
> > >> >> You can certainly move the monitors to rocksdb (tested) and
> > >> >> newstore uses
> > >> rocksdb as well.
> > >> >>
> > >> >
> > >> > Interesting, I might have a look into this.
> > >> >
> > >> >>>
> > >> >>>>
> > >> >>>> 3) Ceph Cache Tiering. Network overhead and write
> > >> >>>> amplification on promotion makes this primarily useful when
> > >> >>>> workloads fit mostly into the cache tier. Overall safe design
> > >> >>>> but care must be taken to not over-
> > >> >> promote.
> > >> >>>>
> > >> >>>> 4) separate SSD pool. Manual and not particularly flexible,
> > >> >>>> but perhaps
> > >> >> best
> > >> >>>> for applications that need consistently high performance.
> > >> >>>
> > >> >>> I think it depends on the definition of performance. Currently
> > >> >>> even very
> > >> >> fast CPU's and SSD's in their own pool will still struggle to
> > >> >> get less than 1ms of write latency. If your performance
> > >> >> requirements are for large queue depths then you will probably
> > >> >> be alright. If you require something that mirrors the
> > >> >> performance of traditional write back cache, then even pure SSD Pools
> can start to struggle.
> > >> >>
> > >> >> Agreed. This is definitely the crux of the problem. The
> > >> >> example below is a great start! It'd would be fantastic if we
> > >> >> could get more feedback from the list on the relative importance
> > >> >> of low latency operations vs high IOPS through concurrency. We
> > >> >> have general suspicions but not a ton of actual data regarding
> > >> >> what folks are seeing in practice and under what scenarios.
> > >> >>
> > >> >
> > >> > If you have any specific questions that you think I might be able
> > >> > to
> > > answer,
> > >> please let me know. The only other main app that I can really think
> > >> of
> > > where
> > >> these sort of write latency is critical is SQL, particularly the
> > > transaction logs.
> > >>
> > >> Probably the big question is what are the pain points? The most
> > >> common answer we get when asking folks what applications they run
> > >> on top of Ceph is "everything!". This is wonderful, but not
> > >> helpful when trying to
> > > figure out
> > >> what performance issues matter most! :)
> > >
> > > Sort of like someone telling you their pc is broken and when asked
> > > for details getting "It's not working" in return.
> > >
> > > In general I think a lot of it comes down to people not appreciating
> > > the differences between Ceph and say a Raid array. For most things
> > > like larger block IO performance tends to scale with cluster size
> > > and the cost effectiveness of Ceph makes this a no brainer not to
> > > just add a handful of extra OSD's.
> > >
> > > I will try and be more precise. Here is my list of pain points /
> > > wishes that I have come across in the last 12 months of running Ceph.
> > >
> > > 1. Improve small IO write latency
> > > As discussed in depth in this thread. If it's possible just to make
> > > Ceph a lot faster then great, but I fear even a doubling in
> > > performance will still fall short compared to if you are caching
> > > writes at the client. Most things in Ceph tend to improve with
> > > scale, but write latency is the same with 2 OSD's as it is with
> > > 2000. I would urge some sort of investigation into the possibility
> > > of some sort of persistent librbd caching. This will probably help
> > > across a large number of scenarios, as in the end, most things are
> > > effected by latency
> and
> > I think will provide across the board improvements.
> > >
> > > 2. Cache Tiering
> > > I know a lot of work is going into this currently, but I will cover
> > > my experience.
> > > 2A)Deletion of large RBD's takes forever. It seems to have to
> > > promote all objects, even non-existent ones to the cache tier before
> > > it can
> delete
> > them.
> > > Operationally this is really poor as it has a negative effect on the
> > > cache tier contents as well.
> > > 2B) Erasure Coding requires all writes to be promoted 1st. I think
> > > it should be pretty easy to allow proxy writes for erasure coded
> > > pools if the IO size = Object Size. A lot of backup applications can
> > > be configured to write out in static sized blocks and would be an
> > > ideal candidate for this sort of enhancement.
> > > 2C) General Performance, hopefully this will be fixed by upcoming
> changes.
> > > 2D) Don't count consecutive sequential reads to the same object as a
> > > trigger for promotion. I currently have problems where reading
> > > sequentially through a large RBD, causes it to be completely
> > > promoted because the read IO size is smaller than the underlying object
> size.
> > >
> > > 3. Kernel RBD Client
> > > Either implement striping or see if it's possible to configure
> > > readahead
> > > +max_sectors_kb size to be larger than the object size. I started a
> > > +thread
> > > about this a few days ago if you are interested in more details.
> > >
> > > 4. Disk based OSD with SSD Journal performance As I touched on above
> > > earlier, I would expect a disk based OSD with SSD journal to have
> > > similar performance to a pure SSD OSD when dealing with sequential
> > > small IO's. Currently the levelDB sync and potentially other things
> > > slow this down.
> > >
> > > 5. iSCSI
> > > I know Mike Christie is doing a lot of good work in getting LIO to
> > > work with Ceph, but currently it feels like a bit of a amateur
> > > affair getting it going.
> > >
> > > 6. Slow xattr problem
> > > I've a weird problem a couple of times, where RBD's with data that
> > > hasn't been written to for a while seem to start performing reads
> > > very slowly. With the help of Somnath in a thread here we managed to
> > > track it down to a xattr taking very long to be retrieved, but no idea why.
> > > Overwriting the RBD with fresh data seemed to stop it happening.
> > > Hopefully Newstore might stop this happening in the future.
> > >
> > >>
> > >> IE, should we be focusing on IOPS? Latency? Finding a way to
> > >> avoid
> > > journal
> > >> overhead for large writes? Are there specific use cases where we
> > >> should specifically be focusing attention? general iscsi? S3?
> > >> databases directly on RBD? etc. There's tons of different areas
> > >> that we
> > > can
> > >> work on (general OSD threading improvements, different messenger
> > >> implementations, newstore, client side bottlenecks, etc) but all of
> > >> those things tackle different kinds of problems.
> > >>
> > >> Mark
> > >> _______________________________________________
> > >> ceph-users mailing list
> > >> ceph-***@lists.ceph.com
> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > >
> > >
> > >
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-***@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > _______________________________________________
> > ceph-users mailing list
> > ceph-***@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Nick Fisk
2015-09-01 07:54:46 UTC
Permalink
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf Of
> Wang, Zhiqiang
> Sent: 01 September 2015 02:48
> To: Nick Fisk <***@fisk.me.uk>; 'Samuel Just' <***@redhat.com>
> Cc: ceph-***@lists.ceph.com
> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf
> > Of Nick Fisk
> > Sent: Wednesday, August 19, 2015 5:25 AM
> > To: 'Samuel Just'
> > Cc: ceph-***@lists.ceph.com
> > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> >
> > Hi Sam,
> >
> > > -----Original Message-----
> > > From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On
> > > Behalf Of Samuel Just
> > > Sent: 18 August 2015 21:38
> > > To: Nick Fisk <***@fisk.me.uk>
> > > Cc: ceph-***@lists.ceph.com
> > > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> > >
> > > 1. We've kicked this around a bit. What kind of failure semantics
> > > would
> > you
> > > be comfortable with here (that is, what would be reasonable behavior
> > > if
> > the
> > > client side cache fails)?
> >
> > I would either expect to provide the cache with a redundant block
> > device (ie
> > RAID1 SSD's) or the cache to allow itself to be configured to mirror
> > across two SSD's. Of course single SSD's can be used if the user accepts
the
> risk.
> > If the cache did the mirroring then you could do fancy stuff like
> > mirror the writes, but leave the read cache blocks as single copies to
> > increase the cache capacity.
> >
> > In either case although an outage is undesirable, its only data loss
> > which would be unacceptable, which would hopefully be avoided by the
> > mirroring. As part of this, it would need to be a way to make sure a
> > "dirty" RBD can't be accessed unless the corresponding cache is also
> attached.
> >
> > I guess as it caching the RBD and not the pool or entire cluster, the
> > cache only needs to match the failure requirements of the application
its
> caching.
> > If I need to cache a RBD that is on a single server, there is no
> > requirement to make the cache redundant across
> racks/PDU's/servers...etc.
> >
> > I hope I've answered your question?
> >
> >
> > > 2. We've got a branch which should merge soon (tomorrow probably)
> > > which actually does allow writes to be proxied, so that should
> > > alleviate some of these pain points somewhat. I'm not sure it is
> > > clever enough to allow through writefulls for an ec base tier though
> > > (but it would be a good
> > idea!) -
> >
> > Excellent news, I shall look forward to testing in the future. I did
> > mention the proxy write for write fulls to someone who was working on
> > the proxy write code, but I'm not sure if it ever got followed up.
>
> I think someone here is me. In the current code, for an ec base tier,
writefull
> can be proxied to the base.

Excellent news. Is this intelligent enough to determine when a normal write
IO from an RBD is equal to the underlying object size, and then turn that
normal write into an effective writefull?
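
Something like this is what I have in mind (conceptual sketch only, not
actual Ceph code; it assumes default striping, so a write that starts on an
object boundary and is exactly one object long covers a single whole
object):

def is_effectively_writefull(offset, length, obj_size):
    # A plain write is equivalent to a writefull when it starts on an
    # object boundary and covers the entire object.
    return offset % obj_size == 0 and length == obj_size

OBJ = 4 << 20  # assumed 4 MiB object size
assert is_effectively_writefull(0, OBJ, OBJ)           # whole object -> could proxy as writefull
assert not is_effectively_writefull(0, 64 << 10, OBJ)  # small IO -> normal path
assert is_effectively_writefull(8 * OBJ, OBJ, OBJ)     # later object, still aligned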

>
> >
> > > Sam
> > >
> > > On Tue, Aug 18, 2015 at 12:48 PM, Nick Fisk <***@fisk.me.uk> wrote:
> > > >
> > > >
> > > >
> > > >
> > > >> -----Original Message-----
> > > >> From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On
> > > >> Behalf Of Mark Nelson
> > > >> Sent: 18 August 2015 18:51
> > > >> To: Nick Fisk <***@fisk.me.uk>; 'Jan Schermer' <***@schermer.cz>
> > > >> Cc: ceph-***@lists.ceph.com
> > > >> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> > > >>
> > > >>
> > > >>
> > > >> On 08/18/2015 11:52 AM, Nick Fisk wrote:
> > > >> > <snip>
> > > >> >>>>
> > > >> >>>> Here's kind of how I see the field right now:
> > > >> >>>>
> > > >> >>>> 1) Cache at the client level. Likely fastest but obvious
> > > >> >>>> issues like
> > > > above.
> > > >> >>>> RAID1 might be an option at increased cost. Lack of
> > > >> >>>> barriers in some implementations scary.
> > > >> >>>
> > > >> >>> Agreed.
> > > >> >>>
> > > >> >>>>
> > > >> >>>> 2) Cache below the OSD. Not much recent data on this. Not
> > > >> >>>> likely as fast as client side cache, but likely cheaper
> > > >> >>>> (fewer OSD nodes than client
> > > >> >> nodes?).
> > > >> >>>> Lack of barriers in some implementations scary.
> > > >> >>>
> > > >> >>> This also has the benefit of caching the leveldb on the OSD,
> > > >> >>> so get a big
> > > >> >> performance gain from there too for small sequential writes. I
> > > >> >> looked at using Flashcache for this too but decided it was
> > > >> >> adding to much complexity and risk.
> > > >> >>>
> > > >> >>> I thought I read somewhere that RocksDB allows you to move
> > > >> >>> its WAL to
> > > >> >> SSD, is there anything in the pipeline for something like
> > > >> >> moving the filestore to use RocksDB?
> > > >> >>
> > > >> >> I believe you can already do this, though I haven't tested it.
> > > >> >> You can certainly move the monitors to rocksdb (tested) and
> > > >> >> newstore uses
> > > >> rocksdb as well.
> > > >> >>
> > > >> >
> > > >> > Interesting, I might have a look into this.
> > > >> >
> > > >> >>>
> > > >> >>>>
> > > >> >>>> 3) Ceph Cache Tiering. Network overhead and write
> > > >> >>>> amplification on promotion makes this primarily useful when
> > > >> >>>> workloads fit mostly into the cache tier. Overall safe
> > > >> >>>> design but care must be taken to not over-
> > > >> >> promote.
> > > >> >>>>
> > > >> >>>> 4) separate SSD pool. Manual and not particularly flexible,
> > > >> >>>> but perhaps
> > > >> >> best
> > > >> >>>> for applications that need consistently high performance.
> > > >> >>>
> > > >> >>> I think it depends on the definition of performance.
> > > >> >>> Currently even very
> > > >> >> fast CPU's and SSD's in their own pool will still struggle to
> > > >> >> get less than 1ms of write latency. If your performance
> > > >> >> requirements are for large queue depths then you will probably
> > > >> >> be alright. If you require something that mirrors the
> > > >> >> performance of traditional write back cache, then even pure
> > > >> >> SSD Pools
> > can start to struggle.
> > > >> >>
> > > >> >> Agreed. This is definitely the crux of the problem. The
> > > >> >> example below is a great start! It'd would be fantastic if we
> > > >> >> could get more feedback from the list on the relative
> > > >> >> importance of low latency operations vs high IOPS through
> > > >> >> concurrency. We have general suspicions but not a ton of
> > > >> >> actual data regarding what folks are seeing in practice and
under
> what scenarios.
> > > >> >>
> > > >> >
> > > >> > If you have any specific questions that you think I might be
> > > >> > able to
> > > > answer,
> > > >> please let me know. The only other main app that I can really
> > > >> think of
> > > > where
> > > >> these sort of write latency is critical is SQL, particularly the
> > > > transaction logs.
> > > >>
> > > >> Probably the big question is what are the pain points? The most
> > > >> common answer we get when asking folks what applications they run
> > > >> on top of Ceph is "everything!". This is wonderful, but not
> > > >> helpful when trying to
> > > > figure out
> > > >> what performance issues matter most! :)
> > > >
> > > > Sort of like someone telling you their pc is broken and when asked
> > > > for details getting "It's not working" in return.
> > > >
> > > > In general I think a lot of it comes down to people not
> > > > appreciating the differences between Ceph and say a Raid array.
> > > > For most things like larger block IO performance tends to scale
> > > > with cluster size and the cost effectiveness of Ceph makes this a
> > > > no brainer not to just add a handful of extra OSD's.
> > > >
> > > > I will try and be more precise. Here is my list of pain points /
> > > > wishes that I have come across in the last 12 months of running
Ceph.
> > > >
> > > > 1. Improve small IO write latency
> > > > As discussed in depth in this thread. If it's possible just to
> > > > make Ceph a lot faster then great, but I fear even a doubling in
> > > > performance will still fall short compared to if you are caching
> > > > writes at the client. Most things in Ceph tend to improve with
> > > > scale, but write latency is the same with 2 OSD's as it is with
> > > > 2000. I would urge some sort of investigation into the possibility
> > > > of some sort of persistent librbd caching. This will probably help
> > > > across a large number of scenarios, as in the end, most things are
> > > > effected by latency
> > and
> > > I think will provide across the board improvements.
> > > >
> > > > 2. Cache Tiering
> > > > I know a lot of work is going into this currently, but I will
> > > > cover my experience.
> > > > 2A)Deletion of large RBD's takes forever. It seems to have to
> > > > promote all objects, even non-existent ones to the cache tier
> > > > before it can
> > delete
> > > them.
> > > > Operationally this is really poor as it has a negative effect on
> > > > the cache tier contents as well.
> > > > 2B) Erasure Coding requires all writes to be promoted 1st. I think
> > > > it should be pretty easy to allow proxy writes for erasure coded
> > > > pools if the IO size = Object Size. A lot of backup applications
> > > > can be configured to write out in static sized blocks and would be
> > > > an ideal candidate for this sort of enhancement.
> > > > 2C) General Performance, hopefully this will be fixed by upcoming
> > changes.
> > > > 2D) Don't count consecutive sequential reads to the same object as
> > > > a trigger for promotion. I currently have problems where reading
> > > > sequentially through a large RBD, causes it to be completely
> > > > promoted because the read IO size is smaller than the underlying
> > > > object
> > size.
> > > >
> > > > 3. Kernel RBD Client
> > > > Either implement striping or see if it's possible to configure
> > > > readahead
> > > > +max_sectors_kb size to be larger than the object size. I started
> > > > +a thread
> > > > about this a few days ago if you are interested in more details.
> > > >
> > > > 4. Disk based OSD with SSD Journal performance As I touched on
> > > > above earlier, I would expect a disk based OSD with SSD journal to
> > > > have similar performance to a pure SSD OSD when dealing with
> > > > sequential small IO's. Currently the levelDB sync and potentially
> > > > other things slow this down.
> > > >
> > > > 5. iSCSI
> > > > I know Mike Christie is doing a lot of good work in getting LIO to
> > > > work with Ceph, but currently it feels like a bit of a amateur
> > > > affair getting it going.
> > > >
> > > > 6. Slow xattr problem
> > > > I've a weird problem a couple of times, where RBD's with data that
> > > > hasn't been written to for a while seem to start performing reads
> > > > very slowly. With the help of Somnath in a thread here we managed
> > > > to track it down to a xattr taking very long to be retrieved, but no
idea
> why.
> > > > Overwriting the RBD with fresh data seemed to stop it happening.
> > > > Hopefully Newstore might stop this happening in the future.
> > > >
> > > >>
> > > >> IE, should we be focusing on IOPS? Latency? Finding a way to
> > > >> avoid
> > > > journal
> > > >> overhead for large writes? Are there specific use cases where we
> > > >> should specifically be focusing attention? general iscsi? S3?
> > > >> databases directly on RBD? etc. There's tons of different areas
> > > >> that we
> > > > can
> > > >> work on (general OSD threading improvements, different messenger
> > > >> implementations, newstore, client side bottlenecks, etc) but all
> > > >> of those things tackle different kinds of problems.
> > > >>
> > > >> Mark
> > > >> _______________________________________________
> > > >> ceph-users mailing list
> > > >> ceph-***@lists.ceph.com
> > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > >
> > > >
> > > >
> > > >
> > > > _______________________________________________
> > > > ceph-users mailing list
> > > > ceph-***@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-***@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-***@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Wang, Zhiqiang
2015-09-01 08:17:35 UTC
Permalink
> -----Original Message-----
> From: Nick Fisk [mailto:***@fisk.me.uk]
> Sent: Tuesday, September 1, 2015 3:55 PM
> To: Wang, Zhiqiang; 'Nick Fisk'; 'Samuel Just'
> Cc: ceph-***@lists.ceph.com
> Subject: RE: [ceph-users] any recommendation of using EnhanceIO?
>
>
>
>
>
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf
> > Of Wang, Zhiqiang
> > Sent: 01 September 2015 02:48
> > To: Nick Fisk <***@fisk.me.uk>; 'Samuel Just' <***@redhat.com>
> > Cc: ceph-***@lists.ceph.com
> > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> >
> > > -----Original Message-----
> > > From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On
> > > Behalf Of Nick Fisk
> > > Sent: Wednesday, August 19, 2015 5:25 AM
> > > To: 'Samuel Just'
> > > Cc: ceph-***@lists.ceph.com
> > > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> > >
> > > Hi Sam,
> > >
> > > > -----Original Message-----
> > > > From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On
> > > > Behalf Of Samuel Just
> > > > Sent: 18 August 2015 21:38
> > > > To: Nick Fisk <***@fisk.me.uk>
> > > > Cc: ceph-***@lists.ceph.com
> > > > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> > > >
> > > > 1. We've kicked this around a bit. What kind of failure
> > > > semantics would
> > > you
> > > > be comfortable with here (that is, what would be reasonable
> > > > behavior if
> > > the
> > > > client side cache fails)?
> > >
> > > I would either expect to provide the cache with a redundant block
> > > device (ie
> > > RAID1 SSD's) or the cache to allow itself to be configured to mirror
> > > across two SSD's. Of course single SSD's can be used if the user
> > > accepts
> the
> > risk.
> > > If the cache did the mirroring then you could do fancy stuff like
> > > mirror the writes, but leave the read cache blocks as single copies
> > > to increase the cache capacity.
> > >
> > > In either case although an outage is undesirable, its only data loss
> > > which would be unacceptable, which would hopefully be avoided by the
> > > mirroring. As part of this, it would need to be a way to make sure a
> > > "dirty" RBD can't be accessed unless the corresponding cache is also
> > attached.
> > >
> > > I guess as it caching the RBD and not the pool or entire cluster,
> > > the cache only needs to match the failure requirements of the
> > > application
> its
> > caching.
> > > If I need to cache a RBD that is on a single server, there is no
> > > requirement to make the cache redundant across
> > racks/PDU's/servers...etc.
> > >
> > > I hope I've answered your question?
> > >
> > >
> > > > 2. We've got a branch which should merge soon (tomorrow probably)
> > > > which actually does allow writes to be proxied, so that should
> > > > alleviate some of these pain points somewhat. I'm not sure it is
> > > > clever enough to allow through writefulls for an ec base tier
> > > > though (but it would be a good
> > > idea!) -
> > >
> > > Excellent news, I shall look forward to testing in the future. I did
> > > mention the proxy write for write fulls to someone who was working
> > > on the proxy write code, but I'm not sure if it ever got followed up.
> >
> > I think someone here is me. In the current code, for an ec base tier,
> writefull
> > can be proxied to the base.
>
> Excellent news. Is this intelligent enough to determine when say a normal write
> IO from a RBD is equal to the underlying object size and then turn this normal
> write effectively into a write full?

I checked the code; it seems we don't do this right now... Would this be much help? I think we can do it if the answer is yes.

>
> >
> > >
> > > > Sam
> > > >
> > > > On Tue, Aug 18, 2015 at 12:48 PM, Nick Fisk <***@fisk.me.uk> wrote:
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >> -----Original Message-----
> > > > >> From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On
> > > > >> Behalf Of Mark Nelson
> > > > >> Sent: 18 August 2015 18:51
> > > > >> To: Nick Fisk <***@fisk.me.uk>; 'Jan Schermer'
> > > > >> <***@schermer.cz>
> > > > >> Cc: ceph-***@lists.ceph.com
> > > > >> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> > > > >>
> > > > >>
> > > > >>
> > > > >> On 08/18/2015 11:52 AM, Nick Fisk wrote:
> > > > >> > <snip>
> > > > >> >>>>
> > > > >> >>>> Here's kind of how I see the field right now:
> > > > >> >>>>
> > > > >> >>>> 1) Cache at the client level. Likely fastest but obvious
> > > > >> >>>> issues like
> > > > > above.
> > > > >> >>>> RAID1 might be an option at increased cost. Lack of
> > > > >> >>>> barriers in some implementations scary.
> > > > >> >>>
> > > > >> >>> Agreed.
> > > > >> >>>
> > > > >> >>>>
> > > > >> >>>> 2) Cache below the OSD. Not much recent data on this.
> > > > >> >>>> Not likely as fast as client side cache, but likely
> > > > >> >>>> cheaper (fewer OSD nodes than client
> > > > >> >> nodes?).
> > > > >> >>>> Lack of barriers in some implementations scary.
> > > > >> >>>
> > > > >> >>> This also has the benefit of caching the leveldb on the
> > > > >> >>> OSD, so get a big
> > > > >> >> performance gain from there too for small sequential writes.
> > > > >> >> I looked at using Flashcache for this too but decided it was
> > > > >> >> adding to much complexity and risk.
> > > > >> >>>
> > > > >> >>> I thought I read somewhere that RocksDB allows you to move
> > > > >> >>> its WAL to
> > > > >> >> SSD, is there anything in the pipeline for something like
> > > > >> >> moving the filestore to use RocksDB?
> > > > >> >>
> > > > >> >> I believe you can already do this, though I haven't tested it.
> > > > >> >> You can certainly move the monitors to rocksdb (tested) and
> > > > >> >> newstore uses
> > > > >> rocksdb as well.
> > > > >> >>
> > > > >> >
> > > > >> > Interesting, I might have a look into this.
> > > > >> >
> > > > >> >>>
> > > > >> >>>>
> > > > >> >>>> 3) Ceph Cache Tiering. Network overhead and write
> > > > >> >>>> amplification on promotion makes this primarily useful
> > > > >> >>>> when workloads fit mostly into the cache tier. Overall
> > > > >> >>>> safe design but care must be taken to not over-
> > > > >> >> promote.
> > > > >> >>>>
> > > > >> >>>> 4) separate SSD pool. Manual and not particularly
> > > > >> >>>> flexible, but perhaps
> > > > >> >> best
> > > > >> >>>> for applications that need consistently high performance.
> > > > >> >>>
> > > > >> >>> I think it depends on the definition of performance.
> > > > >> >>> Currently even very
> > > > >> >> fast CPU's and SSD's in their own pool will still struggle
> > > > >> >> to get less than 1ms of write latency. If your performance
> > > > >> >> requirements are for large queue depths then you will
> > > > >> >> probably be alright. If you require something that mirrors
> > > > >> >> the performance of traditional write back cache, then even
> > > > >> >> pure SSD Pools
> > > can start to struggle.
> > > > >> >>
> > > > >> >> Agreed. This is definitely the crux of the problem. The
> > > > >> >> example below is a great start! It'd would be fantastic if
> > > > >> >> we could get more feedback from the list on the relative
> > > > >> >> importance of low latency operations vs high IOPS through
> > > > >> >> concurrency. We have general suspicions but not a ton of
> > > > >> >> actual data regarding what folks are seeing in practice and
> under
> > what scenarios.
> > > > >> >>
> > > > >> >
> > > > >> > If you have any specific questions that you think I might be
> > > > >> > able to
> > > > > answer,
> > > > >> please let me know. The only other main app that I can really
> > > > >> think of
> > > > > where
> > > > >> these sort of write latency is critical is SQL, particularly
> > > > >> the
> > > > > transaction logs.
> > > > >>
> > > > >> Probably the big question is what are the pain points? The
> > > > >> most common answer we get when asking folks what applications
> > > > >> they run on top of Ceph is "everything!". This is wonderful,
> > > > >> but not helpful when trying to
> > > > > figure out
> > > > >> what performance issues matter most! :)
> > > > >
> > > > > Sort of like someone telling you their pc is broken and when
> > > > > asked for details getting "It's not working" in return.
> > > > >
> > > > > In general I think a lot of it comes down to people not
> > > > > appreciating the differences between Ceph and say a Raid array.
> > > > > For most things like larger block IO performance tends to scale
> > > > > with cluster size and the cost effectiveness of Ceph makes this
> > > > > a no brainer not to just add a handful of extra OSD's.
> > > > >
> > > > > I will try and be more precise. Here is my list of pain points /
> > > > > wishes that I have come across in the last 12 months of running
> Ceph.
> > > > >
> > > > > 1. Improve small IO write latency As discussed in depth in this
> > > > > thread. If it's possible just to make Ceph a lot faster then
> > > > > great, but I fear even a doubling in performance will still fall
> > > > > short compared to if you are caching writes at the client. Most
> > > > > things in Ceph tend to improve with scale, but write latency is
> > > > > the same with 2 OSD's as it is with 2000. I would urge some sort
> > > > > of investigation into the possibility of some sort of persistent
> > > > > librbd caching. This will probably help across a large number of
> > > > > scenarios, as in the end, most things are effected by latency
> > > and
> > > > I think will provide across the board improvements.
> > > > >
> > > > > 2. Cache Tiering
> > > > > I know a lot of work is going into this currently, but I will
> > > > > cover my experience.
> > > > > 2A)Deletion of large RBD's takes forever. It seems to have to
> > > > > promote all objects, even non-existent ones to the cache tier
> > > > > before it can
> > > delete
> > > > them.
> > > > > Operationally this is really poor as it has a negative effect on
> > > > > the cache tier contents as well.
> > > > > 2B) Erasure Coding requires all writes to be promoted 1st. I
> > > > > think it should be pretty easy to allow proxy writes for erasure
> > > > > coded pools if the IO size = Object Size. A lot of backup
> > > > > applications can be configured to write out in static sized
> > > > > blocks and would be an ideal candidate for this sort of enhancement.
> > > > > 2C) General Performance, hopefully this will be fixed by
> > > > > upcoming
> > > changes.
> > > > > 2D) Don't count consecutive sequential reads to the same object
> > > > > as a trigger for promotion. I currently have problems where
> > > > > reading sequentially through a large RBD, causes it to be
> > > > > completely promoted because the read IO size is smaller than the
> > > > > underlying object
> > > size.
> > > > >
> > > > > 3. Kernel RBD Client
> > > > > Either implement striping or see if it's possible to configure
> > > > > readahead
> > > > > +max_sectors_kb size to be larger than the object size. I
> > > > > +started a thread
> > > > > about this a few days ago if you are interested in more details.
> > > > >
> > > > > 4. Disk based OSD with SSD Journal performance As I touched on
> > > > > above earlier, I would expect a disk based OSD with SSD journal
> > > > > to have similar performance to a pure SSD OSD when dealing with
> > > > > sequential small IO's. Currently the levelDB sync and
> > > > > potentially other things slow this down.
> > > > >
> > > > > 5. iSCSI
> > > > > I know Mike Christie is doing a lot of good work in getting LIO
> > > > > to work with Ceph, but currently it feels like a bit of a
> > > > > amateur affair getting it going.
> > > > >
> > > > > 6. Slow xattr problem
> > > > > I've a weird problem a couple of times, where RBD's with data
> > > > > that hasn't been written to for a while seem to start performing
> > > > > reads very slowly. With the help of Somnath in a thread here we
> > > > > managed to track it down to a xattr taking very long to be
> > > > > retrieved, but no
> idea
> > why.
> > > > > Overwriting the RBD with fresh data seemed to stop it happening.
> > > > > Hopefully Newstore might stop this happening in the future.
> > > > >
> > > > >>
> > > > >> IE, should we be focusing on IOPS? Latency? Finding a way to
> > > > >> avoid
> > > > > journal
> > > > >> overhead for large writes? Are there specific use cases where
> > > > >> we should specifically be focusing attention? general iscsi? S3?
> > > > >> databases directly on RBD? etc. There's tons of different
> > > > >> areas that we
> > > > > can
> > > > >> work on (general OSD threading improvements, different
> > > > >> messenger implementations, newstore, client side bottlenecks,
> > > > >> etc) but all of those things tackle different kinds of problems.
> > > > >>
> > > > >> Mark
> > > > >> _______________________________________________
> > > > >> ceph-users mailing list
> > > > >> ceph-***@lists.ceph.com
> > > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > ceph-users mailing list
> > > > > ceph-***@lists.ceph.com
> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > _______________________________________________
> > > > ceph-users mailing list
> > > > ceph-***@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > >
> > >
> > >
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-***@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > _______________________________________________
> > ceph-users mailing list
> > ceph-***@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
Nick Fisk
2015-09-01 08:37:00 UTC
Permalink
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf Of
> Wang, Zhiqiang
> Sent: 01 September 2015 09:18
> To: Nick Fisk <***@fisk.me.uk>; 'Samuel Just' <***@redhat.com>
> Cc: ceph-***@lists.ceph.com
> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>
> > -----Original Message-----
> > From: Nick Fisk [mailto:***@fisk.me.uk]
> > Sent: Tuesday, September 1, 2015 3:55 PM
> > To: Wang, Zhiqiang; 'Nick Fisk'; 'Samuel Just'
> > Cc: ceph-***@lists.ceph.com
> > Subject: RE: [ceph-users] any recommendation of using EnhanceIO?
> >
> >
> >
> >
> >
> > > -----Original Message-----
> > > From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On
> > > Behalf Of Wang, Zhiqiang
> > > Sent: 01 September 2015 02:48
> > > To: Nick Fisk <***@fisk.me.uk>; 'Samuel Just' <***@redhat.com>
> > > Cc: ceph-***@lists.ceph.com
> > > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> > >
> > > > -----Original Message-----
> > > > From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On
> > > > Behalf Of Nick Fisk
> > > > Sent: Wednesday, August 19, 2015 5:25 AM
> > > > To: 'Samuel Just'
> > > > Cc: ceph-***@lists.ceph.com
> > > > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> > > >
> > > > Hi Sam,
> > > >
> > > > > -----Original Message-----
> > > > > From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On
> > > > > Behalf Of Samuel Just
> > > > > Sent: 18 August 2015 21:38
> > > > > To: Nick Fisk <***@fisk.me.uk>
> > > > > Cc: ceph-***@lists.ceph.com
> > > > > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> > > > >
> > > > > 1. We've kicked this around a bit. What kind of failure
> > > > > semantics would
> > > > you
> > > > > be comfortable with here (that is, what would be reasonable
> > > > > behavior if
> > > > the
> > > > > client side cache fails)?
> > > >
> > > > I would either expect to provide the cache with a redundant block
> > > > device (ie
> > > > RAID1 SSD's) or the cache to allow itself to be configured to
> > > > mirror across two SSD's. Of course single SSD's can be used if the
> > > > user accepts
> > the
> > > risk.
> > > > If the cache did the mirroring then you could do fancy stuff like
> > > > mirror the writes, but leave the read cache blocks as single
> > > > copies to increase the cache capacity.
> > > >
> > > > In either case although an outage is undesirable, its only data
> > > > loss which would be unacceptable, which would hopefully be avoided
> > > > by the mirroring. As part of this, it would need to be a way to
> > > > make sure a "dirty" RBD can't be accessed unless the corresponding
> > > > cache is also
> > > attached.
> > > >
> > > > I guess as it caching the RBD and not the pool or entire cluster,
> > > > the cache only needs to match the failure requirements of the
> > > > application
> > its
> > > caching.
> > > > If I need to cache a RBD that is on a single server, there is no
> > > > requirement to make the cache redundant across
> > > racks/PDU's/servers...etc.
> > > >
> > > > I hope I've answered your question?
> > > >
> > > >
> > > > > 2. We've got a branch which should merge soon (tomorrow
> > > > > probably) which actually does allow writes to be proxied, so
> > > > > that should alleviate some of these pain points somewhat. I'm
> > > > > not sure it is clever enough to allow through writefulls for an
> > > > > ec base tier though (but it would be a good
> > > > idea!) -
> > > >
> > > > Excellent news, I shall look forward to testing in the future. I
> > > > did mention the proxy write for write fulls to someone who was
> > > > working on the proxy write code, but I'm not sure if it ever got
followed
> up.
> > >
> > > I think someone here is me. In the current code, for an ec base
> > > tier,
> > writefull
> > > can be proxied to the base.
> >
> > Excellent news. Is this intelligent enough to determine when say a
> > normal write IO from a RBD is equal to the underlying object size and
> > then turn this normal write effectively into a write full?
>
> Checked the code, seems we don't do this right now... Would this be much
> helpful? I think we can do this if the answer is yes.

Hopefully yes. Erasure coding is well suited to storing backups capacity-wise,
and a lot of backup software can be configured to write in fixed-size blocks,
which could be set to the object size. With the current tiering code you end
up with a lot of IO amplification and poor performance; if the above feature
were possible, it should perform a lot better.

Does that make sense?
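
To illustrate the backup pattern I mean, here is a rough sketch with the
Python rbd bindings: write in fixed blocks equal to the image's object size
so that every write fully covers one object. The pool name, image name and
source file path are made up, and padding the tail with zeros is just to
keep the last write a full object (a real backup tool would record the true
length):

import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('backups')        # hypothetical cache-tiered pool
image = rbd.Image(ioctx, 'nightly-backup')   # hypothetical RBD image

obj_size = image.stat()['obj_size']          # e.g. 4 MiB with the default order
offset = 0
with open('/var/backups/dump.bin', 'rb') as src:
    while True:
        block = src.read(obj_size)
        if not block:
            break
        if len(block) < obj_size:
            block += b'\0' * (obj_size - len(block))
        image.write(block, offset)           # each write covers one whole object
        offset += obj_size

image.close()
ioctx.close()
cluster.shutdown()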

If you are also caching the RBD through some sort of block cache, as
mentioned in this thread, then small sequential writes could be assembled in
the cache and flushed straight through to the erasure-coded tier as proxied
full-object writes. This is probably less appealing than the backup case, but
it gives the same advantage as RAID5/6 equipped with a battery-backed cache,
which also sees massive performance gains when it can write a full stripe.
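
As a very rough sketch of that coalescing idea (not an existing Ceph
feature; the object naming scheme and 4 MiB object size are made up):

class ObjectCoalescer:
    def __init__(self, ioctx, prefix, obj_size=4 << 20):
        self.ioctx = ioctx        # a rados ioctx for the backing pool
        self.prefix = prefix      # hypothetical object name prefix
        self.obj_size = obj_size
        self.buf = bytearray()
        self.obj_no = 0

    def write(self, data):
        # Gather small sequential writes; flush each completed object as a
        # single full-object write, analogous to a full-stripe write on a
        # battery-backed RAID controller.
        self.buf += data
        while len(self.buf) >= self.obj_size:
            chunk = bytes(self.buf[:self.obj_size])
            self.ioctx.write_full('%s.%016x' % (self.prefix, self.obj_no), chunk)
            del self.buf[:self.obj_size]
            self.obj_no += 1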

>
> >
> > >
> > > >
> > > > > Sam
> > > > >
> > > > > On Tue, Aug 18, 2015 at 12:48 PM, Nick Fisk <***@fisk.me.uk>
wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >> -----Original Message-----
> > > > > >> From: ceph-users [mailto:ceph-users-***@lists.ceph.com]
> > > > > >> On Behalf Of Mark Nelson
> > > > > >> Sent: 18 August 2015 18:51
> > > > > >> To: Nick Fisk <***@fisk.me.uk>; 'Jan Schermer'
> > > > > >> <***@schermer.cz>
> > > > > >> Cc: ceph-***@lists.ceph.com
> > > > > >> Subject: Re: [ceph-users] any recommendation of using
> EnhanceIO?
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> On 08/18/2015 11:52 AM, Nick Fisk wrote:
> > > > > >> > <snip>
> > > > > >> >>>>
> > > > > >> >>>> Here's kind of how I see the field right now:
> > > > > >> >>>>
> > > > > >> >>>> 1) Cache at the client level. Likely fastest but
> > > > > >> >>>> obvious issues like
> > > > > > above.
> > > > > >> >>>> RAID1 might be an option at increased cost. Lack of
> > > > > >> >>>> barriers in some implementations scary.
> > > > > >> >>>
> > > > > >> >>> Agreed.
> > > > > >> >>>
> > > > > >> >>>>
> > > > > >> >>>> 2) Cache below the OSD. Not much recent data on this.
> > > > > >> >>>> Not likely as fast as client side cache, but likely
> > > > > >> >>>> cheaper (fewer OSD nodes than client
> > > > > >> >> nodes?).
> > > > > >> >>>> Lack of barriers in some implementations scary.
> > > > > >> >>>
> > > > > >> >>> This also has the benefit of caching the leveldb on the
> > > > > >> >>> OSD, so get a big
> > > > > >> >> performance gain from there too for small sequential writes.
> > > > > >> >> I looked at using Flashcache for this too but decided it
> > > > > >> >> was adding to much complexity and risk.
> > > > > >> >>>
> > > > > >> >>> I thought I read somewhere that RocksDB allows you to
> > > > > >> >>> move its WAL to
> > > > > >> >> SSD, is there anything in the pipeline for something like
> > > > > >> >> moving the filestore to use RocksDB?
> > > > > >> >>
> > > > > >> >> I believe you can already do this, though I haven't tested
it.
> > > > > >> >> You can certainly move the monitors to rocksdb (tested)
> > > > > >> >> and newstore uses
> > > > > >> rocksdb as well.
> > > > > >> >>
> > > > > >> >
> > > > > >> > Interesting, I might have a look into this.
> > > > > >> >
> > > > > >> >>>
> > > > > >> >>>>
> > > > > >> >>>> 3) Ceph Cache Tiering. Network overhead and write
> > > > > >> >>>> amplification on promotion makes this primarily useful
> > > > > >> >>>> when workloads fit mostly into the cache tier. Overall
> > > > > >> >>>> safe design but care must be taken to not over-
> > > > > >> >> promote.
> > > > > >> >>>>
> > > > > >> >>>> 4) separate SSD pool. Manual and not particularly
> > > > > >> >>>> flexible, but perhaps
> > > > > >> >> best
> > > > > >> >>>> for applications that need consistently high performance.
> > > > > >> >>>
> > > > > >> >>> I think it depends on the definition of performance.
> > > > > >> >>> Currently even very
> > > > > >> >> fast CPU's and SSD's in their own pool will still struggle
> > > > > >> >> to get less than 1ms of write latency. If your performance
> > > > > >> >> requirements are for large queue depths then you will
> > > > > >> >> probably be alright. If you require something that mirrors
> > > > > >> >> the performance of traditional write back cache, then even
> > > > > >> >> pure SSD Pools
> > > > can start to struggle.
> > > > > >> >>
> > > > > >> >> Agreed. This is definitely the crux of the problem. The
> > > > > >> >> example below is a great start! It'd would be fantastic
> > > > > >> >> if we could get more feedback from the list on the
> > > > > >> >> relative importance of low latency operations vs high IOPS
> > > > > >> >> through concurrency. We have general suspicions but not a
> > > > > >> >> ton of actual data regarding what folks are seeing in
> > > > > >> >> practice and
> > under
> > > what scenarios.
> > > > > >> >>
> > > > > >> >
> > > > > >> > If you have any specific questions that you think I might
> > > > > >> > be able to
> > > > > > answer,
> > > > > >> please let me know. The only other main app that I can really
> > > > > >> think of
> > > > > > where
> > > > > >> these sort of write latency is critical is SQL, particularly
> > > > > >> the
> > > > > > transaction logs.
> > > > > >>
> > > > > >> Probably the big question is what are the pain points? The
> > > > > >> most common answer we get when asking folks what applications
> > > > > >> they run on top of Ceph is "everything!". This is wonderful,
> > > > > >> but not helpful when trying to
> > > > > > figure out
> > > > > >> what performance issues matter most! :)
> > > > > >
> > > > > > Sort of like someone telling you their pc is broken and when
> > > > > > asked for details getting "It's not working" in return.
> > > > > >
> > > > > > In general I think a lot of it comes down to people not
> > > > > > appreciating the differences between Ceph and say a Raid array.
> > > > > > For most things like larger block IO performance tends to
> > > > > > scale with cluster size and the cost effectiveness of Ceph
> > > > > > makes this a no brainer not to just add a handful of extra
OSD's.
> > > > > >
> > > > > > I will try and be more precise. Here is my list of pain points
> > > > > > / wishes that I have come across in the last 12 months of
> > > > > > running
> > Ceph.
> > > > > >
> > > > > > 1. Improve small IO write latency As discussed in depth in
> > > > > > this thread. If it's possible just to make Ceph a lot faster
> > > > > > then great, but I fear even a doubling in performance will
> > > > > > still fall short compared to if you are caching writes at the
> > > > > > client. Most things in Ceph tend to improve with scale, but
> > > > > > write latency is the same with 2 OSD's as it is with 2000. I
> > > > > > would urge some sort of investigation into the possibility of
> > > > > > some sort of persistent librbd caching. This will probably
> > > > > > help across a large number of scenarios, as in the end, most
> > > > > > things are effected by latency
> > > > and
> > > > > I think will provide across the board improvements.
> > > > > >
> > > > > > 2. Cache Tiering
> > > > > > I know a lot of work is going into this currently, but I will
> > > > > > cover my experience.
> > > > > > 2A)Deletion of large RBD's takes forever. It seems to have to
> > > > > > promote all objects, even non-existent ones to the cache tier
> > > > > > before it can
> > > > delete
> > > > > them.
> > > > > > Operationally this is really poor as it has a negative effect
> > > > > > on the cache tier contents as well.
> > > > > > 2B) Erasure Coding requires all writes to be promoted 1st. I
> > > > > > think it should be pretty easy to allow proxy writes for
> > > > > > erasure coded pools if the IO size = Object Size. A lot of
> > > > > > backup applications can be configured to write out in static
> > > > > > sized blocks and would be an ideal candidate for this sort of
> enhancement.
> > > > > > 2C) General Performance, hopefully this will be fixed by
> > > > > > upcoming
> > > > changes.
> > > > > > 2D) Don't count consecutive sequential reads to the same
> > > > > > object as a trigger for promotion. I currently have problems
> > > > > > where reading sequentially through a large RBD, causes it to
> > > > > > be completely promoted because the read IO size is smaller
> > > > > > than the underlying object
> > > > size.
> > > > > >
> > > > > > 3. Kernel RBD Client
> > > > > > Either implement striping or see if it's possible to configure
> > > > > > readahead
> > > > > > +max_sectors_kb size to be larger than the object size. I
> > > > > > +started a thread
> > > > > > about this a few days ago if you are interested in more details.
> > > > > >
> > > > > > 4. Disk based OSD with SSD Journal performance As I touched on
> > > > > > above earlier, I would expect a disk based OSD with SSD
> > > > > > journal to have similar performance to a pure SSD OSD when
> > > > > > dealing with sequential small IO's. Currently the levelDB sync
> > > > > > and potentially other things slow this down.
> > > > > >
> > > > > > 5. iSCSI
> > > > > > I know Mike Christie is doing a lot of good work in getting
> > > > > > LIO to work with Ceph, but currently it feels like a bit of a
> > > > > > amateur affair getting it going.
> > > > > >
> > > > > > 6. Slow xattr problem
> > > > > > I've a weird problem a couple of times, where RBD's with data
> > > > > > that hasn't been written to for a while seem to start
> > > > > > performing reads very slowly. With the help of Somnath in a
> > > > > > thread here we managed to track it down to a xattr taking very
> > > > > > long to be retrieved, but no
> > idea
> > > why.
> > > > > > Overwriting the RBD with fresh data seemed to stop it happening.
> > > > > > Hopefully Newstore might stop this happening in the future.
> > > > > >
> > > > > >>
> > > > > >> IE, should we be focusing on IOPS? Latency? Finding a way
> > > > > >> to avoid
> > > > > > journal
> > > > > >> overhead for large writes? Are there specific use cases
> > > > > >> where we should specifically be focusing attention? general
iscsi?
> S3?
> > > > > >> databases directly on RBD? etc. There's tons of different
> > > > > >> areas that we
> > > > > > can
> > > > > >> work on (general OSD threading improvements, different
> > > > > >> messenger implementations, newstore, client side bottlenecks,
> > > > > >> etc) but all of those things tackle different kinds of
problems.
> > > > > >>
> > > > > >> Mark
> > > > > >> _______________________________________________
> > > > > >> ceph-users mailing list
> > > > > >> ceph-***@lists.ceph.com
> > > > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > _______________________________________________
> > > > > > ceph-users mailing list
> > > > > > ceph-***@lists.ceph.com
> > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > _______________________________________________
> > > > > ceph-users mailing list
> > > > > ceph-***@lists.ceph.com
> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > >
> > > >
> > > >
> > > >
> > > > _______________________________________________
> > > > ceph-users mailing list
> > > > ceph-***@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-***@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
>
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Wang, Zhiqiang
2015-09-01 08:48:10 UTC
Permalink
> -----Original Message-----
> From: Nick Fisk [mailto:***@fisk.me.uk]
> Sent: Tuesday, September 1, 2015 4:37 PM
> To: Wang, Zhiqiang; 'Samuel Just'
> Cc: ceph-***@lists.ceph.com
> Subject: RE: [ceph-users] any recommendation of using EnhanceIO?
>
>
>
>
>
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf
> > Of Wang, Zhiqiang
> > Sent: 01 September 2015 09:18
> > To: Nick Fisk <***@fisk.me.uk>; 'Samuel Just' <***@redhat.com>
> > Cc: ceph-***@lists.ceph.com
> > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> >
> > > -----Original Message-----
> > > From: Nick Fisk [mailto:***@fisk.me.uk]
> > > Sent: Tuesday, September 1, 2015 3:55 PM
> > > To: Wang, Zhiqiang; 'Nick Fisk'; 'Samuel Just'
> > > Cc: ceph-***@lists.ceph.com
> > > Subject: RE: [ceph-users] any recommendation of using EnhanceIO?
> > >
> > >
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On
> > > > Behalf Of Wang, Zhiqiang
> > > > Sent: 01 September 2015 02:48
> > > > To: Nick Fisk <***@fisk.me.uk>; 'Samuel Just' <***@redhat.com>
> > > > Cc: ceph-***@lists.ceph.com
> > > > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> > > >
> > > > > -----Original Message-----
> > > > > From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On
> > > > > Behalf Of Nick Fisk
> > > > > Sent: Wednesday, August 19, 2015 5:25 AM
> > > > > To: 'Samuel Just'
> > > > > Cc: ceph-***@lists.ceph.com
> > > > > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> > > > >
> > > > > Hi Sam,
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On
> > > > > > Behalf Of Samuel Just
> > > > > > Sent: 18 August 2015 21:38
> > > > > > To: Nick Fisk <***@fisk.me.uk>
> > > > > > Cc: ceph-***@lists.ceph.com
> > > > > > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> > > > > >
> > > > > > 1. We've kicked this around a bit. What kind of failure
> > > > > > semantics would
> > > > > you
> > > > > > be comfortable with here (that is, what would be reasonable
> > > > > > behavior if
> > > > > the
> > > > > > client side cache fails)?
> > > > >
> > > > > I would either expect to provide the cache with a redundant
> > > > > block device (ie
> > > > > RAID1 SSD's) or the cache to allow itself to be configured to
> > > > > mirror across two SSD's. Of course single SSD's can be used if
> > > > > the user accepts
> > > the
> > > > risk.
> > > > > If the cache did the mirroring then you could do fancy stuff
> > > > > like mirror the writes, but leave the read cache blocks as
> > > > > single copies to increase the cache capacity.
> > > > >
> > > > > In either case although an outage is undesirable, its only data
> > > > > loss which would be unacceptable, which would hopefully be
> > > > > avoided by the mirroring. As part of this, it would need to be a
> > > > > way to make sure a "dirty" RBD can't be accessed unless the
> > > > > corresponding cache is also
> > > > attached.
> > > > >
> > > > > I guess as it caching the RBD and not the pool or entire
> > > > > cluster, the cache only needs to match the failure requirements
> > > > > of the application
> > > its
> > > > caching.
> > > > > If I need to cache a RBD that is on a single server, there is
> > > > > no requirement to make the cache redundant across
> > > > racks/PDU's/servers...etc.
> > > > >
> > > > > I hope I've answered your question?
> > > > >
> > > > >
> > > > > > 2. We've got a branch which should merge soon (tomorrow
> > > > > > probably) which actually does allow writes to be proxied, so
> > > > > > that should alleviate some of these pain points somewhat. I'm
> > > > > > not sure it is clever enough to allow through writefulls for
> > > > > > an ec base tier though (but it would be a good
> > > > > idea!) -
> > > > >
> > > > > Excellent news, I shall look forward to testing in the future. I
> > > > > did mention the proxy write for write fulls to someone who was
> > > > > working on the proxy write code, but I'm not sure if it ever got
> followed
> > up.
> > > >
> > > > I think someone here is me. In the current code, for an ec base
> > > > tier,
> > > writefull
> > > > can be proxied to the base.
> > >
> > > Excellent news. Is this intelligent enough to determine when say a
> > > normal write IO from a RBD is equal to the underlying object size
> > > and then turn this normal write effectively into a write full?
> >
> > Checked the code, seems we don't do this right now... Would this be
> > much helpful? I think we can do this if the answer is yes.
>
> Hopefully yes. Erasure code is very suited to storing backups capacity wise and
> in a lot of backup software you can configure it to write in static size blocks,
> which could be set to the object size. With the current tiering code you end up
> with a lot of IO amplification and poor performance, if the above feature was
> possible, it should perform a lot better.
>
> Does that make sense?

Yep, it makes sense in this case. Actually, the backup software doesn't need to write in units of the object size. As long as a write spans a full object, that object can be written as a writefull. I'll see if I can come up with an implementation of this.
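
For reference, from the client's point of view the distinction is just which op gets issued. A minimal sketch with the Python librados bindings (the pool and object names are made up, and this only illustrates the two op types, not the tiering logic itself):

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('backup-pool')    # hypothetical pool name

obj = 'backup.chunk.0001'                    # hypothetical object name
data = b'x' * (4 * 1024 * 1024)              # one full 4MB object worth of data

# Full-object write: replaces the whole object, so a cache tier could in
# principle proxy it straight down to an EC base tier as a writefull op.
ioctx.write_full(obj, data)

# Partial write: only touches a byte range, so the object would still need
# to be promoted / read-modify-written in the cache tier.
ioctx.write(obj, b'y' * 4096, offset=0)

ioctx.close()
cluster.shutdown()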

>
> If you are also caching the RBD, through some sort of block cache like
> mentioned in this thread, then small sequential writes could also be assembled
> in cache and then flushed straight through to the erasure tier as proxy full
> writes. This is probably less appealing than the backup case but gives the same
> advantages as RAID5/6 when equipped with a battery backed cache, which
> also has massive performance gains when able to write a full stripe.
>
> >
> > >
> > > >
> > > > >
> > > > > > Sam
> > > > > >
> > > > > > On Tue, Aug 18, 2015 at 12:48 PM, Nick Fisk <***@fisk.me.uk>
> wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >> -----Original Message-----
> > > > > > >> From: ceph-users [mailto:ceph-users-***@lists.ceph.com]
> > > > > > >> On Behalf Of Mark Nelson
> > > > > > >> Sent: 18 August 2015 18:51
> > > > > > >> To: Nick Fisk <***@fisk.me.uk>; 'Jan Schermer'
> > > > > > >> <***@schermer.cz>
> > > > > > >> Cc: ceph-***@lists.ceph.com
> > > > > > >> Subject: Re: [ceph-users] any recommendation of using
> > EnhanceIO?
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> On 08/18/2015 11:52 AM, Nick Fisk wrote:
> > > > > > >> > <snip>
> > > > > > >> >>>>
> > > > > > >> >>>> Here's kind of how I see the field right now:
> > > > > > >> >>>>
> > > > > > >> >>>> 1) Cache at the client level. Likely fastest but
> > > > > > >> >>>> obvious issues like
> > > > > > > above.
> > > > > > >> >>>> RAID1 might be an option at increased cost. Lack of
> > > > > > >> >>>> barriers in some implementations scary.
> > > > > > >> >>>
> > > > > > >> >>> Agreed.
> > > > > > >> >>>
> > > > > > >> >>>>
> > > > > > >> >>>> 2) Cache below the OSD. Not much recent data on this.
> > > > > > >> >>>> Not likely as fast as client side cache, but likely
> > > > > > >> >>>> cheaper (fewer OSD nodes than client
> > > > > > >> >> nodes?).
> > > > > > >> >>>> Lack of barriers in some implementations scary.
> > > > > > >> >>>
> > > > > > >> >>> This also has the benefit of caching the leveldb on the
> > > > > > >> >>> OSD, so get a big
> > > > > > >> >> performance gain from there too for small sequential writes.
> > > > > > >> >> I looked at using Flashcache for this too but decided it
> > > > > > >> >> was adding to much complexity and risk.
> > > > > > >> >>>
> > > > > > >> >>> I thought I read somewhere that RocksDB allows you to
> > > > > > >> >>> move its WAL to
> > > > > > >> >> SSD, is there anything in the pipeline for something
> > > > > > >> >> like moving the filestore to use RocksDB?
> > > > > > >> >>
> > > > > > >> >> I believe you can already do this, though I haven't
> > > > > > >> >> tested
> it.
> > > > > > >> >> You can certainly move the monitors to rocksdb (tested)
> > > > > > >> >> and newstore uses
> > > > > > >> rocksdb as well.
> > > > > > >> >>
> > > > > > >> >
> > > > > > >> > Interesting, I might have a look into this.
> > > > > > >> >
> > > > > > >> >>>
> > > > > > >> >>>>
> > > > > > >> >>>> 3) Ceph Cache Tiering. Network overhead and write
> > > > > > >> >>>> amplification on promotion makes this primarily useful
> > > > > > >> >>>> when workloads fit mostly into the cache tier.
> > > > > > >> >>>> Overall safe design but care must be taken to not
> > > > > > >> >>>> over-
> > > > > > >> >> promote.
> > > > > > >> >>>>
> > > > > > >> >>>> 4) separate SSD pool. Manual and not particularly
> > > > > > >> >>>> flexible, but perhaps
> > > > > > >> >> best
> > > > > > >> >>>> for applications that need consistently high performance.
> > > > > > >> >>>
> > > > > > >> >>> I think it depends on the definition of performance.
> > > > > > >> >>> Currently even very
> > > > > > >> >> fast CPU's and SSD's in their own pool will still
> > > > > > >> >> struggle to get less than 1ms of write latency. If your
> > > > > > >> >> performance requirements are for large queue depths then
> > > > > > >> >> you will probably be alright. If you require something
> > > > > > >> >> that mirrors the performance of traditional write back
> > > > > > >> >> cache, then even pure SSD Pools
> > > > > can start to struggle.
> > > > > > >> >>
> > > > > > >> >> Agreed. This is definitely the crux of the problem.
> > > > > > >> >> The example below is a great start! It'd would be
> > > > > > >> >> fantastic if we could get more feedback from the list on
> > > > > > >> >> the relative importance of low latency operations vs
> > > > > > >> >> high IOPS through concurrency. We have general
> > > > > > >> >> suspicions but not a ton of actual data regarding what
> > > > > > >> >> folks are seeing in practice and
> > > under
> > > > what scenarios.
> > > > > > >> >>
> > > > > > >> >
> > > > > > >> > If you have any specific questions that you think I might
> > > > > > >> > be able to
> > > > > > > answer,
> > > > > > >> please let me know. The only other main app that I can
> > > > > > >> really think of
> > > > > > > where
> > > > > > >> these sort of write latency is critical is SQL,
> > > > > > >> particularly the
> > > > > > > transaction logs.
> > > > > > >>
> > > > > > >> Probably the big question is what are the pain points? The
> > > > > > >> most common answer we get when asking folks what
> > > > > > >> applications they run on top of Ceph is "everything!".
> > > > > > >> This is wonderful, but not helpful when trying to
> > > > > > > figure out
> > > > > > >> what performance issues matter most! :)
> > > > > > >
> > > > > > > Sort of like someone telling you their pc is broken and when
> > > > > > > asked for details getting "It's not working" in return.
> > > > > > >
> > > > > > > In general I think a lot of it comes down to people not
> > > > > > > appreciating the differences between Ceph and say a Raid array.
> > > > > > > For most things like larger block IO performance tends to
> > > > > > > scale with cluster size and the cost effectiveness of Ceph
> > > > > > > makes this a no brainer not to just add a handful of extra
> OSD's.
> > > > > > >
> > > > > > > I will try and be more precise. Here is my list of pain
> > > > > > > points / wishes that I have come across in the last 12
> > > > > > > months of running
> > > Ceph.
> > > > > > >
> > > > > > > 1. Improve small IO write latency As discussed in depth in
> > > > > > > this thread. If it's possible just to make Ceph a lot faster
> > > > > > > then great, but I fear even a doubling in performance will
> > > > > > > still fall short compared to if you are caching writes at
> > > > > > > the client. Most things in Ceph tend to improve with scale,
> > > > > > > but write latency is the same with 2 OSD's as it is with
> > > > > > > 2000. I would urge some sort of investigation into the
> > > > > > > possibility of some sort of persistent librbd caching. This
> > > > > > > will probably help across a large number of scenarios, as in
> > > > > > > the end, most things are effected by latency
> > > > > and
> > > > > > I think will provide across the board improvements.
> > > > > > >
> > > > > > > 2. Cache Tiering
> > > > > > > I know a lot of work is going into this currently, but I
> > > > > > > will cover my experience.
> > > > > > > 2A)Deletion of large RBD's takes forever. It seems to have
> > > > > > > to promote all objects, even non-existent ones to the cache
> > > > > > > tier before it can
> > > > > delete
> > > > > > them.
> > > > > > > Operationally this is really poor as it has a negative
> > > > > > > effect on the cache tier contents as well.
> > > > > > > 2B) Erasure Coding requires all writes to be promoted 1st. I
> > > > > > > think it should be pretty easy to allow proxy writes for
> > > > > > > erasure coded pools if the IO size = Object Size. A lot of
> > > > > > > backup applications can be configured to write out in static
> > > > > > > sized blocks and would be an ideal candidate for this sort
> > > > > > > of
> > enhancement.
> > > > > > > 2C) General Performance, hopefully this will be fixed by
> > > > > > > upcoming
> > > > > changes.
> > > > > > > 2D) Don't count consecutive sequential reads to the same
> > > > > > > object as a trigger for promotion. I currently have problems
> > > > > > > where reading sequentially through a large RBD, causes it to
> > > > > > > be completely promoted because the read IO size is smaller
> > > > > > > than the underlying object
> > > > > size.
> > > > > > >
> > > > > > > 3. Kernel RBD Client
> > > > > > > Either implement striping or see if it's possible to
> > > > > > > configure readahead
> > > > > > > +max_sectors_kb size to be larger than the object size. I
> > > > > > > +started a thread
> > > > > > > about this a few days ago if you are interested in more details.
> > > > > > >
> > > > > > > 4. Disk based OSD with SSD Journal performance As I touched
> > > > > > > on above earlier, I would expect a disk based OSD with SSD
> > > > > > > journal to have similar performance to a pure SSD OSD when
> > > > > > > dealing with sequential small IO's. Currently the levelDB
> > > > > > > sync and potentially other things slow this down.
> > > > > > >
> > > > > > > 5. iSCSI
> > > > > > > I know Mike Christie is doing a lot of good work in getting
> > > > > > > LIO to work with Ceph, but currently it feels like a bit of
> > > > > > > a amateur affair getting it going.
> > > > > > >
> > > > > > > 6. Slow xattr problem
> > > > > > > I've a weird problem a couple of times, where RBD's with
> > > > > > > data that hasn't been written to for a while seem to start
> > > > > > > performing reads very slowly. With the help of Somnath in a
> > > > > > > thread here we managed to track it down to a xattr taking
> > > > > > > very long to be retrieved, but no
> > > idea
> > > > why.
> > > > > > > Overwriting the RBD with fresh data seemed to stop it happening.
> > > > > > > Hopefully Newstore might stop this happening in the future.
> > > > > > >
> > > > > > >>
> > > > > > >> IE, should we be focusing on IOPS? Latency? Finding a way
> > > > > > >> to avoid
> > > > > > > journal
> > > > > > >> overhead for large writes? Are there specific use cases
> > > > > > >> where we should specifically be focusing attention? general
> iscsi?
> > S3?
> > > > > > >> databases directly on RBD? etc. There's tons of different
> > > > > > >> areas that we
> > > > > > > can
> > > > > > >> work on (general OSD threading improvements, different
> > > > > > >> messenger implementations, newstore, client side
> > > > > > >> bottlenecks,
> > > > > > >> etc) but all of those things tackle different kinds of
> problems.
> > > > > > >>
> > > > > > >> Mark
> > > > > > >> _______________________________________________
> > > > > > >> ceph-users mailing list
> > > > > > >> ceph-***@lists.ceph.com
> > > > > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > _______________________________________________
> > > > > > > ceph-users mailing list
> > > > > > > ceph-***@lists.ceph.com
> > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > > _______________________________________________
> > > > > > ceph-users mailing list
> > > > > > ceph-***@lists.ceph.com
> > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > ceph-users mailing list
> > > > > ceph-***@lists.ceph.com
> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > _______________________________________________
> > > > ceph-users mailing list
> > > > ceph-***@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > >
> > >
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-***@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
Nick Fisk
2015-09-01 09:07:43 UTC
Permalink
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf Of
> Wang, Zhiqiang
> Sent: 01 September 2015 09:48
> To: Nick Fisk <***@fisk.me.uk>; 'Samuel Just' <***@redhat.com>
> Cc: ceph-***@lists.ceph.com
> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>
> > -----Original Message-----
> > From: Nick Fisk [mailto:***@fisk.me.uk]
> > Sent: Tuesday, September 1, 2015 4:37 PM
> > To: Wang, Zhiqiang; 'Samuel Just'
> > Cc: ceph-***@lists.ceph.com
> > Subject: RE: [ceph-users] any recommendation of using EnhanceIO?
> >
> >
> >
> >
> >
> > > -----Original Message-----
> > > From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On
> > > Behalf Of Wang, Zhiqiang
> > > Sent: 01 September 2015 09:18
> > > To: Nick Fisk <***@fisk.me.uk>; 'Samuel Just' <***@redhat.com>
> > > Cc: ceph-***@lists.ceph.com
> > > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> > >
> > > > -----Original Message-----
> > > > From: Nick Fisk [mailto:***@fisk.me.uk]
> > > > Sent: Tuesday, September 1, 2015 3:55 PM
> > > > To: Wang, Zhiqiang; 'Nick Fisk'; 'Samuel Just'
> > > > Cc: ceph-***@lists.ceph.com
> > > > Subject: RE: [ceph-users] any recommendation of using EnhanceIO?
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On
> > > > > Behalf Of Wang, Zhiqiang
> > > > > Sent: 01 September 2015 02:48
> > > > > To: Nick Fisk <***@fisk.me.uk>; 'Samuel Just'
> > > > > <***@redhat.com>
> > > > > Cc: ceph-***@lists.ceph.com
> > > > > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On
> > > > > > Behalf Of Nick Fisk
> > > > > > Sent: Wednesday, August 19, 2015 5:25 AM
> > > > > > To: 'Samuel Just'
> > > > > > Cc: ceph-***@lists.ceph.com
> > > > > > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> > > > > >
> > > > > > Hi Sam,
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: ceph-users [mailto:ceph-users-***@lists.ceph.com]
> > > > > > > On Behalf Of Samuel Just
> > > > > > > Sent: 18 August 2015 21:38
> > > > > > > To: Nick Fisk <***@fisk.me.uk>
> > > > > > > Cc: ceph-***@lists.ceph.com
> > > > > > > Subject: Re: [ceph-users] any recommendation of using
> EnhanceIO?
> > > > > > >
> > > > > > > 1. We've kicked this around a bit. What kind of failure
> > > > > > > semantics would
> > > > > > you
> > > > > > > be comfortable with here (that is, what would be reasonable
> > > > > > > behavior if
> > > > > > the
> > > > > > > client side cache fails)?
> > > > > >
> > > > > > I would either expect to provide the cache with a redundant
> > > > > > block device (ie
> > > > > > RAID1 SSD's) or the cache to allow itself to be configured to
> > > > > > mirror across two SSD's. Of course single SSD's can be used if
> > > > > > the user accepts
> > > > the
> > > > > risk.
> > > > > > If the cache did the mirroring then you could do fancy stuff
> > > > > > like mirror the writes, but leave the read cache blocks as
> > > > > > single copies to increase the cache capacity.
> > > > > >
> > > > > > In either case although an outage is undesirable, its only
> > > > > > data loss which would be unacceptable, which would hopefully
> > > > > > be avoided by the mirroring. As part of this, it would need to
> > > > > > be a way to make sure a "dirty" RBD can't be accessed unless
> > > > > > the corresponding cache is also
> > > > > attached.
> > > > > >
> > > > > > I guess as it caching the RBD and not the pool or entire
> > > > > > cluster, the cache only needs to match the failure
> > > > > > requirements of the application
> > > > its
> > > > > caching.
> > > > > > If I need to cache a RBD that is on a single server, there is
> > > > > > no requirement to make the cache redundant across
> > > > > racks/PDU's/servers...etc.
> > > > > >
> > > > > > I hope I've answered your question?
> > > > > >
> > > > > >
> > > > > > > 2. We've got a branch which should merge soon (tomorrow
> > > > > > > probably) which actually does allow writes to be proxied, so
> > > > > > > that should alleviate some of these pain points somewhat.
> > > > > > > I'm not sure it is clever enough to allow through writefulls
> > > > > > > for an ec base tier though (but it would be a good
> > > > > > idea!) -
> > > > > >
> > > > > > Excellent news, I shall look forward to testing in the future.
> > > > > > I did mention the proxy write for write fulls to someone who
> > > > > > was working on the proxy write code, but I'm not sure if it
> > > > > > ever got
> > followed
> > > up.
> > > > >
> > > > > I think someone here is me. In the current code, for an ec base
> > > > > tier,
> > > > writefull
> > > > > can be proxied to the base.
> > > >
> > > > Excellent news. Is this intelligent enough to determine when say a
> > > > normal write IO from a RBD is equal to the underlying object size
> > > > and then turn this normal write effectively into a write full?
> > >
> > > Checked the code, seems we don't do this right now... Would this be
> > > much helpful? I think we can do this if the answer is yes.
> >
> > Hopefully yes. Erasure code is very suited to storing backups capacity
> > wise and in a lot of backup software you can configure it to write in
> > static size blocks, which could be set to the object size. With the
> > current tiering code you end up with a lot of IO amplification and
> > poor performance, if the above feature was possible, it should perform a
> lot better.
> >
> > Does that make sense?
>
> Yep. It makes sense in this case. Actually, the backup software doesn't
> need to write in units of object size. As long as it spans a full object,
> then this object can be written in writefull. I'll see if I can come up
> with an implementation of this.

Awesome, thanks for your interest in this.

>
> >
> > If you are also caching the RBD, through some sort of block cache like
> > mentioned in this thread, then small sequential writes could also be
> > assembled in cache and then flushed straight through to the erasure
> > tier as proxy full writes. This is probably less appealing than the
> > backup case but gives the same advantages as RAID5/6 when equipped
> > with a battery backed cache, which also has massive performance gains
> when able to write a full stripe.
> >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > > Sam
> > > > > > >
> > > > > > > On Tue, Aug 18, 2015 at 12:48 PM, Nick Fisk
> > > > > > > <***@fisk.me.uk>
> > wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >> -----Original Message-----
> > > > > > > >> From: ceph-users
> > > > > > > >> [mailto:ceph-users-***@lists.ceph.com]
> > > > > > > >> On Behalf Of Mark Nelson
> > > > > > > >> Sent: 18 August 2015 18:51
> > > > > > > >> To: Nick Fisk <***@fisk.me.uk>; 'Jan Schermer'
> > > > > > > >> <***@schermer.cz>
> > > > > > > >> Cc: ceph-***@lists.ceph.com
> > > > > > > >> Subject: Re: [ceph-users] any recommendation of using
> > > EnhanceIO?
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> On 08/18/2015 11:52 AM, Nick Fisk wrote:
> > > > > > > >> > <snip>
> > > > > > > >> >>>>
> > > > > > > >> >>>> Here's kind of how I see the field right now:
> > > > > > > >> >>>>
> > > > > > > >> >>>> 1) Cache at the client level. Likely fastest but
> > > > > > > >> >>>> obvious issues like
> > > > > > > > above.
> > > > > > > >> >>>> RAID1 might be an option at increased cost. Lack of
> > > > > > > >> >>>> barriers in some implementations scary.
> > > > > > > >> >>>
> > > > > > > >> >>> Agreed.
> > > > > > > >> >>>
> > > > > > > >> >>>>
> > > > > > > >> >>>> 2) Cache below the OSD. Not much recent data on this.
> > > > > > > >> >>>> Not likely as fast as client side cache, but likely
> > > > > > > >> >>>> cheaper (fewer OSD nodes than client
> > > > > > > >> >> nodes?).
> > > > > > > >> >>>> Lack of barriers in some implementations scary.
> > > > > > > >> >>>
> > > > > > > >> >>> This also has the benefit of caching the leveldb on
> > > > > > > >> >>> the OSD, so get a big
> > > > > > > >> >> performance gain from there too for small sequential
> writes.
> > > > > > > >> >> I looked at using Flashcache for this too but decided
> > > > > > > >> >> it was adding to much complexity and risk.
> > > > > > > >> >>>
> > > > > > > >> >>> I thought I read somewhere that RocksDB allows you to
> > > > > > > >> >>> move its WAL to
> > > > > > > >> >> SSD, is there anything in the pipeline for something
> > > > > > > >> >> like moving the filestore to use RocksDB?
> > > > > > > >> >>
> > > > > > > >> >> I believe you can already do this, though I haven't
> > > > > > > >> >> tested
> > it.
> > > > > > > >> >> You can certainly move the monitors to rocksdb
> > > > > > > >> >> (tested) and newstore uses
> > > > > > > >> rocksdb as well.
> > > > > > > >> >>
> > > > > > > >> >
> > > > > > > >> > Interesting, I might have a look into this.
> > > > > > > >> >
> > > > > > > >> >>>
> > > > > > > >> >>>>
> > > > > > > >> >>>> 3) Ceph Cache Tiering. Network overhead and write
> > > > > > > >> >>>> amplification on promotion makes this primarily
> > > > > > > >> >>>> useful when workloads fit mostly into the cache tier.
> > > > > > > >> >>>> Overall safe design but care must be taken to not
> > > > > > > >> >>>> over-
> > > > > > > >> >> promote.
> > > > > > > >> >>>>
> > > > > > > >> >>>> 4) separate SSD pool. Manual and not particularly
> > > > > > > >> >>>> flexible, but perhaps
> > > > > > > >> >> best
> > > > > > > >> >>>> for applications that need consistently high
performance.
> > > > > > > >> >>>
> > > > > > > >> >>> I think it depends on the definition of performance.
> > > > > > > >> >>> Currently even very
> > > > > > > >> >> fast CPU's and SSD's in their own pool will still
> > > > > > > >> >> struggle to get less than 1ms of write latency. If
> > > > > > > >> >> your performance requirements are for large queue
> > > > > > > >> >> depths then you will probably be alright. If you
> > > > > > > >> >> require something that mirrors the performance of
> > > > > > > >> >> traditional write back cache, then even pure SSD Pools
> > > > > > can start to struggle.
> > > > > > > >> >>
> > > > > > > >> >> Agreed. This is definitely the crux of the problem.
> > > > > > > >> >> The example below is a great start! It'd would be
> > > > > > > >> >> fantastic if we could get more feedback from the list
> > > > > > > >> >> on the relative importance of low latency operations
> > > > > > > >> >> vs high IOPS through concurrency. We have general
> > > > > > > >> >> suspicions but not a ton of actual data regarding what
> > > > > > > >> >> folks are seeing in practice and
> > > > under
> > > > > what scenarios.
> > > > > > > >> >>
> > > > > > > >> >
> > > > > > > >> > If you have any specific questions that you think I
> > > > > > > >> > might be able to
> > > > > > > > answer,
> > > > > > > >> please let me know. The only other main app that I can
> > > > > > > >> really think of
> > > > > > > > where
> > > > > > > >> these sort of write latency is critical is SQL,
> > > > > > > >> particularly the
> > > > > > > > transaction logs.
> > > > > > > >>
> > > > > > > >> Probably the big question is what are the pain points?
> > > > > > > >> The most common answer we get when asking folks what
> > > > > > > >> applications they run on top of Ceph is "everything!".
> > > > > > > >> This is wonderful, but not helpful when trying to
> > > > > > > > figure out
> > > > > > > >> what performance issues matter most! :)
> > > > > > > >
> > > > > > > > Sort of like someone telling you their pc is broken and
> > > > > > > > when asked for details getting "It's not working" in return.
> > > > > > > >
> > > > > > > > In general I think a lot of it comes down to people not
> > > > > > > > appreciating the differences between Ceph and say a Raid
array.
> > > > > > > > For most things like larger block IO performance tends to
> > > > > > > > scale with cluster size and the cost effectiveness of Ceph
> > > > > > > > makes this a no brainer not to just add a handful of extra
> > OSD's.
> > > > > > > >
> > > > > > > > I will try and be more precise. Here is my list of pain
> > > > > > > > points / wishes that I have come across in the last 12
> > > > > > > > months of running
> > > > Ceph.
> > > > > > > >
> > > > > > > > 1. Improve small IO write latency As discussed in depth in
> > > > > > > > this thread. If it's possible just to make Ceph a lot
> > > > > > > > faster then great, but I fear even a doubling in
> > > > > > > > performance will still fall short compared to if you are
> > > > > > > > caching writes at the client. Most things in Ceph tend to
> > > > > > > > improve with scale, but write latency is the same with 2
> > > > > > > > OSD's as it is with 2000. I would urge some sort of
> > > > > > > > investigation into the possibility of some sort of
> > > > > > > > persistent librbd caching. This will probably help across
> > > > > > > > a large number of scenarios, as in the end, most things
> > > > > > > > are effected by latency
> > > > > > and
> > > > > > > I think will provide across the board improvements.
> > > > > > > >
> > > > > > > > 2. Cache Tiering
> > > > > > > > I know a lot of work is going into this currently, but I
> > > > > > > > will cover my experience.
> > > > > > > > 2A)Deletion of large RBD's takes forever. It seems to have
> > > > > > > > to promote all objects, even non-existent ones to the
> > > > > > > > cache tier before it can
> > > > > > delete
> > > > > > > them.
> > > > > > > > Operationally this is really poor as it has a negative
> > > > > > > > effect on the cache tier contents as well.
> > > > > > > > 2B) Erasure Coding requires all writes to be promoted 1st.
> > > > > > > > I think it should be pretty easy to allow proxy writes for
> > > > > > > > erasure coded pools if the IO size = Object Size. A lot of
> > > > > > > > backup applications can be configured to write out in
> > > > > > > > static sized blocks and would be an ideal candidate for
> > > > > > > > this sort of
> > > enhancement.
> > > > > > > > 2C) General Performance, hopefully this will be fixed by
> > > > > > > > upcoming
> > > > > > changes.
> > > > > > > > 2D) Don't count consecutive sequential reads to the same
> > > > > > > > object as a trigger for promotion. I currently have
> > > > > > > > problems where reading sequentially through a large RBD,
> > > > > > > > causes it to be completely promoted because the read IO
> > > > > > > > size is smaller than the underlying object
> > > > > > size.
> > > > > > > >
> > > > > > > > 3. Kernel RBD Client
> > > > > > > > Either implement striping or see if it's possible to
> > > > > > > > configure readahead
> > > > > > > > +max_sectors_kb size to be larger than the object size. I
> > > > > > > > +started a thread
> > > > > > > > about this a few days ago if you are interested in more
details.
> > > > > > > >
> > > > > > > > 4. Disk based OSD with SSD Journal performance As I
> > > > > > > > touched on above earlier, I would expect a disk based OSD
> > > > > > > > with SSD journal to have similar performance to a pure SSD
> > > > > > > > OSD when dealing with sequential small IO's. Currently the
> > > > > > > > levelDB sync and potentially other things slow this down.
> > > > > > > >
> > > > > > > > 5. iSCSI
> > > > > > > > I know Mike Christie is doing a lot of good work in
> > > > > > > > getting LIO to work with Ceph, but currently it feels like
> > > > > > > > a bit of a amateur affair getting it going.
> > > > > > > >
> > > > > > > > 6. Slow xattr problem
> > > > > > > > I've a weird problem a couple of times, where RBD's with
> > > > > > > > data that hasn't been written to for a while seem to start
> > > > > > > > performing reads very slowly. With the help of Somnath in
> > > > > > > > a thread here we managed to track it down to a xattr
> > > > > > > > taking very long to be retrieved, but no
> > > > idea
> > > > > why.
> > > > > > > > Overwriting the RBD with fresh data seemed to stop it
> happening.
> > > > > > > > Hopefully Newstore might stop this happening in the future.
> > > > > > > >
> > > > > > > >>
> > > > > > > >> IE, should we be focusing on IOPS? Latency? Finding a
> > > > > > > >> way to avoid
> > > > > > > > journal
> > > > > > > >> overhead for large writes? Are there specific use cases
> > > > > > > >> where we should specifically be focusing attention?
> > > > > > > >> general
> > iscsi?
> > > S3?
> > > > > > > >> databases directly on RBD? etc. There's tons of
> > > > > > > >> different areas that we
> > > > > > > > can
> > > > > > > >> work on (general OSD threading improvements, different
> > > > > > > >> messenger implementations, newstore, client side
> > > > > > > >> bottlenecks,
> > > > > > > >> etc) but all of those things tackle different kinds of
> > problems.
> > > > > > > >>
> > > > > > > >> Mark
> > > > > > > >> _______________________________________________
> > > > > > > >> ceph-users mailing list
> > > > > > > >> ceph-***@lists.ceph.com
> > > > > > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > _______________________________________________
> > > > > > > > ceph-users mailing list
> > > > > > > > ceph-***@lists.ceph.com
> > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > > > _______________________________________________
> > > > > > > ceph-users mailing list
> > > > > > > ceph-***@lists.ceph.com
> > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > _______________________________________________
> > > > > > ceph-users mailing list
> > > > > > ceph-***@lists.ceph.com
> > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > _______________________________________________
> > > > > ceph-users mailing list
> > > > > ceph-***@lists.ceph.com
> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > >
> > > >
> > > >
> > >
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-***@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
>
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Christian Balzer
2015-08-19 02:32:02 UTC
Permalink
On Tue, 18 Aug 2015 20:48:26 +0100 Nick Fisk wrote:

[mega snip]
> 4. Disk based OSD with SSD Journal performance
> As I touched on above earlier, I would expect a disk based OSD with SSD
> journal to have similar performance to a pure SSD OSD when dealing with
> sequential small IO's. Currently the levelDB sync and potentially other
> things slow this down.
>

Has anybody tried symlinking the omap directory to an SSD and tested whether
that makes a (significant) difference?
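
In case anyone wants to give it a quick try, a rough sketch of what I mean - completely untested, and the FileStore layout (leveldb omap under current/omap) and the paths are assumptions on my part. Stop the OSD first and keep the original directory around:

import os
import shutil

osd_dir  = '/var/lib/ceph/osd/ceph-0'                 # assumed OSD data dir
omap_dir = os.path.join(osd_dir, 'current', 'omap')   # assumed leveldb omap location
ssd_dir  = '/mnt/ssd/ceph-0-omap'                     # assumed SSD mount, must not exist yet

# With the OSD stopped:
shutil.copytree(omap_dir, ssd_dir)           # copy the leveldb files onto the SSD
shutil.move(omap_dir, omap_dir + '.bak')     # keep the original as a fallback
os.symlink(ssd_dir, omap_dir)                # point the OSD back at the SSD copy
# Then start the OSD again and benchmark small sequential writes.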

Christian
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Nick Fisk
2015-08-19 09:02:25 UTC
Permalink
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf Of
> Christian Balzer
> Sent: 19 August 2015 03:32
> To: ceph-***@lists.ceph.com
> Cc: Nick Fisk <***@fisk.me.uk>
> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>
> On Tue, 18 Aug 2015 20:48:26 +0100 Nick Fisk wrote:
>
> [mega snip]
> > 4. Disk based OSD with SSD Journal performance As I touched on above
> > earlier, I would expect a disk based OSD with SSD journal to have
> > similar performance to a pure SSD OSD when dealing with sequential
> > small IO's. Currently the levelDB sync and potentially other things
> > slow this down.
> >
>
> Has anybody tried symlinking the omap directory to a SSD and tested if hat
> makes a (significant) difference?

I seem to remember reading somewhere that all these items need to remain
on the OSD itself, so that when the OSD calls fsync it can be sure they are
all in sync at the same time.

>
> Christian
> --
> Christian Balzer Network/Systems Engineer
> ***@gol.com Global OnLine Japan/Fusion Communications
> http://www.gol.com/
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Christian Balzer
2015-08-20 00:23:51 UTC
Permalink
On Wed, 19 Aug 2015 10:02:25 +0100 Nick Fisk wrote:

>
>
>
>
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf
> > Of Christian Balzer
> > Sent: 19 August 2015 03:32
> > To: ceph-***@lists.ceph.com
> > Cc: Nick Fisk <***@fisk.me.uk>
> > Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> >
> > On Tue, 18 Aug 2015 20:48:26 +0100 Nick Fisk wrote:
> >
> > [mega snip]
> > > 4. Disk based OSD with SSD Journal performance As I touched on above
> > > earlier, I would expect a disk based OSD with SSD journal to have
> > > similar performance to a pure SSD OSD when dealing with sequential
> > > small IO's. Currently the levelDB sync and potentially other things
> > > slow this down.
> > >
> >
> > Has anybody tried symlinking the omap directory to a SSD and tested if
> > hat makes a (significant) difference?
>
> I thought I remember reading somewhere that all these items need to
> remain on the OSD itself so that when the OSD calls fsync it can be sure
> they are all in sync at the same time.
>
It would be nice to have this confirmed by the devs.
It being leveldb, you'd think it would be in sync by default.

But even if it were potentially unsafe (not crash safe) in the current
incarnation, the results of such a test might make any needed changes
attractive.
Unfortunately I don't have anything resembling an SSD in my test cluster.

Christian
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Christian Balzer
2015-08-19 02:31:54 UTC
Permalink
On Tue, 18 Aug 2015 12:50:38 -0500 Mark Nelson wrote:

[snap]
> Probably the big question is what are the pain points? The most common
> answer we get when asking folks what applications they run on top of
> Ceph is "everything!". This is wonderful, but not helpful when trying
> to figure out what performance issues matter most! :)
>
Well, the "everything" answer really is the one everybody who runs VMs
backed by RBD for internal or external customers will give.
I.e. no idea what is installed and no control over how it accesses the
Ceph cluster.

And even when you think you have a predictable use case it might not be
true.
As in, one of our Ceph installs backs a Ganeti cluster with hundreds of VMs
running two types of applications, and from past experience I know their I/O
patterns (nearly 100% write only; any reads can usually be satisfied from
the local or storage node pagecache).
Thus the Ceph cluster was configured in a way that was optimized for this
and it worked beautifully until:
a) scrubs became too heavy (generating too many read IOPS while also
invalidating page caches) and
b) somebody thought that a third type of VM, running Windows and generating
as many IOPS as dozens of the other VMs, would be a good idea.


> IE, should we be focusing on IOPS? Latency? Finding a way to avoid
> journal overhead for large writes? Are there specific use cases where
> we should specifically be focusing attention? general iscsi? S3?
> databases directly on RBD? etc. There's tons of different areas that we
> can work on (general OSD threading improvements, different messenger
> implementations, newstore, client side bottlenecks, etc) but all of
> those things tackle different kinds of problems.
>
All of these except S3 would have a positive impact in my various use
cases.
However, at the risk of sounding like a broken record, any time spent on
these improvements before Ceph can recover from a scrub error fully
autonomously (read: checksums) would be a waste in my book.

All the speed in the world is pretty insignificant when a simple
"ceph pg repair" (which is still in the Ceph docs w/o any qualification of
what it actually does) has a good chance of wiping out good data "by
imposing the primary OSD's view of the world on the replicas", to quote
Greg.

Regards,

Christian
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Robert LeBlanc
2015-08-19 20:12:58 UTC
Permalink
> Probably the big question is what are the pain points? The most common
> answer we get when asking folks what applications they run on top of Ceph is
> "everything!". This is wonderful, but not helpful when trying to figure out
> what performance issues matter most! :)
>
> IE, should we be focusing on IOPS? Latency? Finding a way to avoid journal
> overhead for large writes? Are there specific use cases where we should
> specifically be focusing attention? general iscsi? S3? databases directly
> on RBD? etc. There's tons of different areas that we can work on (general
> OSD threading improvements, different messenger implementations, newstore,
> client side bottlenecks, etc) but all of those things tackle different kinds
> of problems.

We run "everything" or it sure seems like it. I did some computation
of a large sampling of our servers and found that the average request
size was ~12K/~18K (read/write) and ~30%/70% (it looks like I didn't
save that spreadsheet to get exact numbers).

So, any optimization for smaller I/O sizes would really benefit us.
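
For anyone who wants to run the same computation on their own boxes, a rough sketch of one way to pull average request sizes out of /proc/diskstats (the device name is just an example, and the counters are cumulative since boot):

# Average read/write request size for one block device; /proc/diskstats
# counts in 512-byte sectors.
DEV = 'sda'

reads = writes = read_sectors = write_sectors = 0
with open('/proc/diskstats') as f:
    for line in f:
        fields = line.split()
        if fields[2] == DEV:
            reads, read_sectors = int(fields[3]), int(fields[5])
            writes, write_sectors = int(fields[7]), int(fields[9])
            break

if reads and writes:
    print('%s: avg read %.1fK, avg write %.1fK, %.0f%%/%.0f%% read/write ops' % (
        DEV,
        read_sectors * 512.0 / reads / 1024,
        write_sectors * 512.0 / writes / 1024,
        100.0 * reads / (reads + writes),
        100.0 * writes / (reads + writes)))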

- ----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
Nick Fisk
2015-08-18 14:44:11 UTC
Permalink
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf Of
> Mark Nelson
> Sent: 18 August 2015 14:51
> To: Nick Fisk <***@fisk.me.uk>; 'Jan Schermer' <***@schermer.cz>
> Cc: ceph-***@lists.ceph.com
> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>
>
>
> On 08/18/2015 06:47 AM, Nick Fisk wrote:
> > Just to chime in, I gave dmcache a limited test but its lack of proper
> writeback cache ruled it out for me. It only performs write back caching on
> blocks already on the SSD, whereas I need something that works like a
> Battery backed raid controller caching all writes.
> >
> > It's amazing the 100x performance increase you get with RBD's when doing
> sync writes and give it something like just 1GB write back cache with
> flashcache.
>
> For your use case, is it ok that data may live on the flashcache for some
> amount of time before making to ceph to be replicated? We've wondered
> internally if this kind of trade-off is acceptable to customers or not should the
> flashcache SSD fail.

Yes, I agree, it's not ideal. But I believe it’s the only way to get the performance required for some workloads that need write latencies of <1ms.

I'm still testing at the moment with the testing kernel that includes the blk-mq fixes for large queue depths and max IO sizes. But if we decide to put it into production, it would be using 2x dual-port SAS SSD's in RAID1 across two servers for HA. As we are currently using iSCSI from these two servers, there is no real loss of availability by doing this. Generally I think that as long as you build this around the fault domains of the application you are caching, it shouldn't impact too much.

I guess for people using OpenStack and other direct RBD interfaces it may not be such an attractive option. I've been thinking that maybe Ceph needs an additional daemon with very low overheads, run on SSD's, to provide shared persistent cache devices for librbd. There's still a trade-off, maybe not as much as with Flashcache, but for some workloads like databases, many people may decide that it's worth it. Of course I realise this would be a lot of work and everyone is really busy, but in terms of performance gained it would most likely have a dramatic effect in making Ceph look comparable to other solutions like VSAN or ScaleIO when it comes to high-IOPS/low-latency stuff.
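
To put a number on the kind of latency I keep going on about, here is a rough sketch of the sort of test I mean (Python rbd bindings; it assumes a pre-created image called 'test-image' of at least a few MB in the 'rbd' pool, and that client-side rbd caching is disabled so each write blocks until the OSDs ack):

import time
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

with rbd.Image(ioctx, 'test-image') as image:    # assumed pre-created image
    data = b'\0' * 4096                          # 4K sync writes, like a DB log
    samples = []
    for i in range(1000):
        start = time.time()
        image.write(data, i * 4096)              # blocks until the write is acked
        samples.append(time.time() - start)

samples.sort()
print('avg %.2f ms, 99th percentile %.2f ms' % (
    sum(samples) / len(samples) * 1000,
    samples[int(len(samples) * 0.99)] * 1000))

ioctx.close()
cluster.shutdown()

Numbers will obviously vary wildly with hardware and rbd cache settings, but it makes the difference with and without a write-back cache in front very visible.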

>
> >
> >
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf
> >> Of Jan Schermer
> >> Sent: 18 August 2015 12:44
> >> To: Mark Nelson <***@redhat.com>
> >> Cc: ceph-***@lists.ceph.com
> >> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> >>
> >> I did not. Not sure why now - probably for the same reason I didn't
> >> extensively test bcache.
> >> I'm not a real fan of device mapper though, so if I had to choose I'd
> >> still go for bcache :-)
> >>
> >> Jan
> >>
> >>> On 18 Aug 2015, at 13:33, Mark Nelson <***@redhat.com> wrote:
> >>>
> >>> Hi Jan,
> >>>
> >>> Out of curiosity did you ever try dm-cache? I've been meaning to
> >>> give it a
> >> spin but haven't had the spare cycles.
> >>>
> >>> Mark
> >>>
> >>> On 08/18/2015 04:00 AM, Jan Schermer wrote:
> >>>> I already evaluated EnhanceIO in combination with CentOS 6 (and
> >> backported 3.10 and 4.0 kernel-lt if I remember correctly).
> >>>> It worked fine during benchmarks and stress tests, but once we run
> >>>> DB2
> >> on it it panicked within minutes and took all the data with it
> >> (almost literally - files that werent touched, like OS binaries were
> >> b0rked and the filesystem was unsalvageable).
> >>>> If you disregard this warning - the performance gains weren't that
> >>>> great
> >> either, at least in a VM. It had problems when flushing to disk after
> >> reaching dirty watermark and the block size has some
> >> not-well-documented implications (not sure now, but I think it only
> >> cached IO _larger_than the block size, so if your database keeps
> >> incrementing an XX-byte counter it will go straight to disk).
> >>>>
> >>>> Flashcache doesn't respect barriers (or does it now?) - if that's
> >>>> ok for you
> >> than go for it, it should be stable and I used it in the past in
> >> production without problems.
> >>>>
> >>>> bcache seemed to work fine, but I needed to
> >>>> a) use it for root
> >>>> b) disable and enable it on the fly (doh)
> >>>> c) make it non-persisent (flush it) before reboot - not sure if
> >>>> that was
> >> possible either.
> >>>> d) all that in a customer's VM, and that customer didn't have a
> >>>> strong
> >> technical background to be able to fiddle with it...
> >>>> So I haven't tested it heavily.
> >>>>
> >>>> Bcache should be the obvious choice if you are in control of the
> >>>> environment. At least you can cry on LKML's shoulder when you lose
> >>>> data :-)
> >>>>
> >>>> Jan
> >>>>
> >>>>
> >>>>> On 18 Aug 2015, at 01:49, Alex Gorbachev <***@iss-integration.com>
> >> wrote:
> >>>>>
> >>>>> What about https://github.com/Frontier314/EnhanceIO? Last commit
> >>>>> 2 months ago, but no external contributors :(
> >>>>>
> >>>>> The nice thing about EnhanceIO is there is no need to change
> >>>>> device name, unlike bcache, flashcache etc.
> >>>>>
> >>>>> Best regards,
> >>>>> Alex
> >>>>>
> >>>>> On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz
> >>>>> <***@redhat.com>
> >> wrote:
> >>>>>> I did some (non-ceph) work on these, and concluded that bcache
> >>>>>> was the best supported, most stable, and fastest. This was ~1
> >>>>>> year ago, to take it with a grain of salt, but that's what I would
> recommend.
> >>>>>>
> >>>>>> Daniel
> >>>>>>
> >>>>>>
> >>>>>> ________________________________
> >>>>>> From: "Dominik Zalewski" <***@optlink.net>
> >>>>>> To: "German Anders" <***@despegar.com>
> >>>>>> Cc: "ceph-users" <ceph-***@lists.ceph.com>
> >>>>>> Sent: Wednesday, July 1, 2015 5:28:10 PM
> >>>>>> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> >>>>>>
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> I’ve asked same question last weeks or so (just search the
> >>>>>> mailing list archives for EnhanceIO :) and got some interesting
> answers.
> >>>>>>
> >>>>>> Looks like the project is pretty much dead since it was bought
> >>>>>> out by
> >> HGST.
> >>>>>> Even their website has some broken links in regards to EnhanceIO
> >>>>>>
> >>>>>> I’m keen to try flashcache or bcache (its been in the mainline
> >>>>>> kernel for some time)
> >>>>>>
> >>>>>> Dominik
> >>>>>>
> >>>>>> On 1 Jul 2015, at 21:13, German Anders <***@despegar.com>
> >> wrote:
> >>>>>>
> >>>>>> Hi cephers,
> >>>>>>
> >>>>>> Is anyone out there that implement enhanceIO in a production
> >> environment?
> >>>>>> any recommendation? any perf output to share with the diff
> >>>>>> between using it and not?
> >>>>>>
> >>>>>> Thanks in advance,
> >>>>>>
> >>>>>> German
> >>>>>> _______________________________________________
> >>>>>> ceph-users mailing list
> >>>>>> ceph-***@lists.ceph.com
> >>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> ceph-users mailing list
> >>>>>> ceph-***@lists.ceph.com
> >>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>>>
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> ceph-users mailing list
> >>>>>> ceph-***@lists.ceph.com
> >>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>>>
> >>>>> _______________________________________________
> >>>>> ceph-users mailing list
> >>>>> ceph-***@lists.ceph.com
> >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>
> >>>> _______________________________________________
> >>>> ceph-users mailing list
> >>>> ceph-***@lists.ceph.com
> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>
> >>> _______________________________________________
> >>> ceph-users mailing list
> >>> ceph-***@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-***@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
> >
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Jan Schermer
2015-08-18 16:12:37 UTC
Permalink
> On 18 Aug 2015, at 16:44, Nick Fisk <***@fisk.me.uk> wrote:
>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf Of
>> Mark Nelson
>> Sent: 18 August 2015 14:51
>> To: Nick Fisk <***@fisk.me.uk>; 'Jan Schermer' <***@schermer.cz>
>> Cc: ceph-***@lists.ceph.com
>> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>>
>>
>>
>> On 08/18/2015 06:47 AM, Nick Fisk wrote:
>>> Just to chime in, I gave dmcache a limited test but its lack of proper
>> writeback cache ruled it out for me. It only performs write back caching on
>> blocks already on the SSD, whereas I need something that works like a
>> Battery backed raid controller caching all writes.
>>>
>>> It's amazing the 100x performance increase you get with RBD's when doing
>> sync writes and give it something like just 1GB write back cache with
>> flashcache.
>>
>> For your use case, is it ok that data may live on the flashcache for some
>> amount of time before making to ceph to be replicated? We've wondered
>> internally if this kind of trade-off is acceptable to customers or not should the
>> flashcache SSD fail.
>
> Yes, I agree, it's not ideal. But I believe it’s the only way to get the performance required for some workloads that need write latency's <1ms.
>
> I'm still in testing at the moment with the testing kernel that includes blk-mq fixes for large queue depths and max io sizes. But if we decide to put into production, it would be using 2x SAS dual port SSD's in RAID1 across two servers for HA. As we are currently using iSCSI from these two servers, there is no real loss of availability by doing this. Generally I think as long as you build this around the fault domains of the application you are caching, it shouldn't impact too much.
>
> I guess for people using openstack and other direct RBD interfaces it may not be such an attractive option. I've been thinking that maybe Ceph needs to have an additional daemon with very low overheads, which is run on SSD's to provide shared persistent cache devices for librbd. There's still a trade off, maybe not as much as using Flashcache, but for some workloads like database's, many people may decide that it's worth it. Of course I realise this would be a lot of work and everyone is really busy, but in terms of performance gained it would most likely have a dramatic effect in making Ceph look comparable to other solutions like VSAN or ScaleIO when it comes to high iops/low latency stuff.
>

An additional daemon that is persistent how? Isn't that what the journal does already, just too slowly?

I think the best (and easiest!) approach is to mimic what a monolithic SAN does.

Currently
1) client issues blocking/atomic/sync IO
2) rbd client sends this IO to all OSDs
3) after all OSDs "process the IO", the IO is finished and considered persistent

That has serious implications:
* every IO is processed separately, not much coalescing
* the OSD processes add latency when processing this IO
* one OSD can be slow momentarily, IO backs up and the cluster stalls

Let me just select what "processing the IO" means with respect to my architecture and I can likely get a 100x improvement

Let me choose:

1) WHERE the IO is persisted
Do I really need all (e.g. 3) OSDs to persist the data, or is a quorum (2) sufficient?
Not waiting for one slow OSD gives me at least some SLA for planned tasks like backfilling, scrubbing and deep-scrubbing.
Hands up who can afford to leave deep-scrub enabled in production...

2) WHEN the IO is persisted
Do I really need all OSDs to flush the data to disk?
If all the nodes are in the same cabinet and on the same UPS then this makes sense.
But my nodes are actually in different buildings ~10km apart. The chances of power failing simultaneously, N+1 UPSes failing simultaneously, diesels failing simultaneously... When nukes start falling and this happens then I'll start looking for backups.
Even if your nodes are in one datacentre, there are likely redundant (2+) circuits.
And even if you have just one cabinet, you can add 3x UPS in there and gain a nice speed boost.

So the IO could be actually pretty safe and happy when it gets to a remote buffers on enough (quorum) nodes and waits for processing. It can be batched, it can be coalesced, it can be rewritten with subsequent updates...

3) WHAT amount of IO is stored
Do I need to have the last transaction or can I tolerate 1 minute of missing data?
Checkpoints, checksums on last transaction, rollback (journal already does this AFAIK)...

4) I DON'T CARE mode :-)
A qemu cache=unsafe equivalent, but set on an RBD volume/pool.
Because sometimes you just need to crunch data without really storing it persistently - how are the CERN/Hadoop/Big Data guys approaching this?
And you can't always disable flushing. Filesystems (usually) have a "nobarrier" option, but if you need a block device for a raw database tablespace, you're pretty much SOL without lots of trickery.


1) is doable eventually.

2) is doable almost immediately:
a) just ACK the IO when you get it, let the client unblock on quorum (a toy model of the latency effect is sketched below)
or
b) drop the journal, write all data asynchronously, let the filesystem handle consistency, and let me tune dirty_writeback_centisecs to get the goal I want with respect to 3)

4) is simple to do, but unusable for production (for most of us) -
flushing is expensive, so why flush just because a file's metadata changed on a QA machine?
Dev&QA often create a higher load than production itself.
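
To illustrate why 2a) matters so much for tail latency, a little toy model - pure simulation, nothing Ceph-specific, it just compares unblocking after all 3 replicas vs after the first 2:

import random

def replica_latency():
    # toy per-replica service time: ~1ms baseline, occasional slow outlier
    lat = random.gauss(1.0, 0.2)
    if random.random() < 0.02:            # 2% of ops hit a momentarily slow OSD
        lat += random.uniform(5, 50)
    return max(lat, 0.1)

N = 100000
wait_all, wait_quorum = [], []
for _ in range(N):
    lats = sorted(replica_latency() for _ in range(3))
    wait_all.append(lats[-1])        # ack only after all 3 replicas persisted
    wait_quorum.append(lats[1])      # ack after 2 of 3 (quorum)

for name, xs in (('all 3', wait_all), ('quorum', wait_quorum)):
    xs.sort()
    print('%-7s avg %.2f ms  99th %.2f ms' % (name, sum(xs) / N, xs[int(N * 0.99)]))

With the occasional slow replica in the mix, both the average and (especially) the 99th percentile drop once a single slow OSD is no longer on the critical path.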

sorry, got carried away, again....

Jan


>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf
>>>> Of Jan Schermer
>>>> Sent: 18 August 2015 12:44
>>>> To: Mark Nelson <***@redhat.com>
>>>> Cc: ceph-***@lists.ceph.com
>>>> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>>>>
>>>> I did not. Not sure why now - probably for the same reason I didn't
>>>> extensively test bcache.
>>>> I'm not a real fan of device mapper though, so if I had to choose I'd
>>>> still go for bcache :-)
>>>>
>>>> Jan
>>>>
>>>>> On 18 Aug 2015, at 13:33, Mark Nelson <***@redhat.com> wrote:
>>>>>
>>>>> Hi Jan,
>>>>>
>>>>> Out of curiosity did you ever try dm-cache? I've been meaning to
>>>>> give it a
>>>> spin but haven't had the spare cycles.
>>>>>
>>>>> Mark
>>>>>
>>>>> On 08/18/2015 04:00 AM, Jan Schermer wrote:
>>>>>> I already evaluated EnhanceIO in combination with CentOS 6 (and
>>>> backported 3.10 and 4.0 kernel-lt if I remember correctly).
>>>>>> It worked fine during benchmarks and stress tests, but once we run
>>>>>> DB2
>>>> on it it panicked within minutes and took all the data with it
>>>> (almost literally - files that werent touched, like OS binaries were
>>>> b0rked and the filesystem was unsalvageable).
>>>>>> If you disregard this warning - the performance gains weren't that
>>>>>> great
>>>> either, at least in a VM. It had problems when flushing to disk after
>>>> reaching dirty watermark and the block size has some
>>>> not-well-documented implications (not sure now, but I think it only
>>>> cached IO _larger_than the block size, so if your database keeps
>>>> incrementing an XX-byte counter it will go straight to disk).
>>>>>>
>>>>>> Flashcache doesn't respect barriers (or does it now?) - if that's
>>>>>> ok for you
>>>> than go for it, it should be stable and I used it in the past in
>>>> production without problems.
>>>>>>
>>>>>> bcache seemed to work fine, but I needed to
>>>>>> a) use it for root
>>>>>> b) disable and enable it on the fly (doh)
>>>>>> c) make it non-persisent (flush it) before reboot - not sure if
>>>>>> that was
>>>> possible either.
>>>>>> d) all that in a customer's VM, and that customer didn't have a
>>>>>> strong
>>>> technical background to be able to fiddle with it...
>>>>>> So I haven't tested it heavily.
>>>>>>
>>>>>> Bcache should be the obvious choice if you are in control of the
>>>>>> environment. At least you can cry on LKML's shoulder when you lose
>>>>>> data :-)
>>>>>>
>>>>>> Jan
>>>>>>
>>>>>>
>>>>>>> On 18 Aug 2015, at 01:49, Alex Gorbachev <***@iss-integration.com>
>>>> wrote:
>>>>>>>
>>>>>>> What about https://github.com/Frontier314/EnhanceIO? Last commit
>>>>>>> 2 months ago, but no external contributors :(
>>>>>>>
>>>>>>> The nice thing about EnhanceIO is there is no need to change
>>>>>>> device name, unlike bcache, flashcache etc.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Alex
>>>>>>>
>>>>>>> On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz
>>>>>>> <***@redhat.com>
>>>> wrote:
>>>>>>>> I did some (non-ceph) work on these, and concluded that bcache
>>>>>>>> was the best supported, most stable, and fastest. This was ~1
>>>>>>>> year ago, to take it with a grain of salt, but that's what I would
>> recommend.
>>>>>>>>
>>>>>>>> Daniel
>>>>>>>>
>>>>>>>>
>>>>>>>> ________________________________
>>>>>>>> From: "Dominik Zalewski" <***@optlink.net>
>>>>>>>> To: "German Anders" <***@despegar.com>
>>>>>>>> Cc: "ceph-users" <ceph-***@lists.ceph.com>
>>>>>>>> Sent: Wednesday, July 1, 2015 5:28:10 PM
>>>>>>>> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I’ve asked same question last weeks or so (just search the
>>>>>>>> mailing list archives for EnhanceIO :) and got some interesting
>> answers.
>>>>>>>>
>>>>>>>> Looks like the project is pretty much dead since it was bought
>>>>>>>> out by
>>>> HGST.
>>>>>>>> Even their website has some broken links in regards to EnhanceIO
>>>>>>>>
>>>>>>>> I’m keen to try flashcache or bcache (its been in the mainline
>>>>>>>> kernel for some time)
>>>>>>>>
>>>>>>>> Dominik
>>>>>>>>
>>>>>>>> On 1 Jul 2015, at 21:13, German Anders <***@despegar.com>
>>>> wrote:
>>>>>>>>
>>>>>>>> Hi cephers,
>>>>>>>>
>>>>>>>> Is anyone out there that implement enhanceIO in a production
>>>> environment?
>>>>>>>> any recommendation? any perf output to share with the diff
>>>>>>>> between using it and not?
>>>>>>>>
>>>>>>>> Thanks in advance,
>>>>>>>>
>>>>>>>> German
>>>>>>>> _______________________________________________
>>>>>>>> ceph-users mailing list
>>>>>>>> ceph-***@lists.ceph.com
>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> ceph-users mailing list
>>>>>>>> ceph-***@lists.ceph.com
>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> ceph-users mailing list
>>>>>>>> ceph-***@lists.ceph.com
>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> ceph-users mailing list
>>>>>>> ceph-***@lists.ceph.com
>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list
>>>>>> ceph-***@lists.ceph.com
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-***@lists.ceph.com
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-***@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>>
>>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-***@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
Nick Fisk
2015-08-18 17:05:45 UTC
Permalink
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf Of
> Jan Schermer
> Sent: 18 August 2015 17:13
> To: Nick Fisk <***@fisk.me.uk>
> Cc: ceph-***@lists.ceph.com
> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>
>
> > On 18 Aug 2015, at 16:44, Nick Fisk <***@fisk.me.uk> wrote:
> >
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On Behalf
> >> Of Mark Nelson
> >> Sent: 18 August 2015 14:51
> >> To: Nick Fisk <***@fisk.me.uk>; 'Jan Schermer' <***@schermer.cz>
> >> Cc: ceph-***@lists.ceph.com
> >> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> >>
> >>
> >>
> >> On 08/18/2015 06:47 AM, Nick Fisk wrote:
> >>> Just to chime in, I gave dmcache a limited test but its lack of
> >>> proper
> >> writeback cache ruled it out for me. It only performs write back
> >> caching on blocks already on the SSD, whereas I need something that
> >> works like a Battery backed raid controller caching all writes.
> >>>
> >>> It's amazing the 100x performance increase you get with RBD's when
> >>> doing
> >> sync writes and give it something like just 1GB write back cache with
> >> flashcache.
> >>
> >> For your use case, is it ok that data may live on the flashcache for
> >> some amount of time before making to ceph to be replicated? We've
> >> wondered internally if this kind of trade-off is acceptable to
> >> customers or not should the flashcache SSD fail.
> >
> > Yes, I agree, it's not ideal. But I believe it’s the only way to get the
> performance required for some workloads that need write latency's <1ms.
> >
> > I'm still in testing at the moment with the testing kernel that includes blk-
> mq fixes for large queue depths and max io sizes. But if we decide to put into
> production, it would be using 2x SAS dual port SSD's in RAID1 across two
> servers for HA. As we are currently using iSCSI from these two servers, there
> is no real loss of availability by doing this. Generally I think as long as you build
> this around the fault domains of the application you are caching, it shouldn't
> impact too much.
> >
> > I guess for people using openstack and other direct RBD interfaces it may
> not be such an attractive option. I've been thinking that maybe Ceph needs
> to have an additional daemon with very low overheads, which is run on SSD's
> to provide shared persistent cache devices for librbd. There's still a trade off,
> maybe not as much as using Flashcache, but for some workloads like
> database's, many people may decide that it's worth it. Of course I realise this
> would be a lot of work and everyone is really busy, but in terms of
> performance gained it would most likely have a dramatic effect in making
> Ceph look comparable to other solutions like VSAN or ScaleIO when it comes
> to high iops/low latency stuff.
> >
>
> Additional daemon that is persistent how? Isn't that what journal does
> already, just too slowly?

The journal is part of an OSD, and its speed is restricted by a lot of the functionality that Ceph has to provide. I was thinking more of a very lightweight "service" that acts as an interface between an SSD and librbd and is focussed on speed. For something like a standalone SQL server it might run on the SQL server with a local SSD, but in other scenarios you might have this "service" remote, where the SSDs are installed. HA for the SSD could be provided by RAID + dual-port SAS, or maybe some sort of lightweight replication could be built into the service.


This was just a random thought rather than something I have planned out.

>
> I think the best (and easiest!) approach is to mimic what a monilithic SAN
> does
>
> Currently
> 1) client issues blocking/atomic/sync IO
> 2) rbd client sends this IO to all OSDs
> 3) after all OSDs "process the IO", the IO is finished and considered persistent
>
> That has serious implications
> * every IO is processed separately, not much coalescing
> * OSD processes add the latency when processing this IO
> * one OSD can be slow momentarily, IO backs up and the cluster
> stalls
>
> Let me just select what "processing the IO" means with respect to my
> architecture and I can likely get a 100x improvement
>
> Let me choose:
>
> 1) WHERE the IO is persisted
> Do I really need all (e.g. 3) OSDs to persist the data or is quorum (2)
> sufficient?
> Not waiting for one slow OSD gives me at least some SLA for planned tasks
> like backfilling, scrubbing, deep-scrubbing Hands up who can afford to leav
> deep-scrub enabled in production...

In my testing the difference between 2 and 3 replicas wasn't that much, as once the primary OSD sends out the replica writes they happen more or less in parallel.
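
For anyone wanting to repeat that comparison, the replica count is just a pool setting (the pool name below is a placeholder):

    # switch a test pool between 2 and 3 replicas and re-run the benchmark
    ceph osd pool set testpool size 2
    ceph osd pool set testpool size 3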

>
> 2) WHEN the IO is persisted
> Do I really need all OSDs to flush the data to disk?
> If all the nodes are in the same cabinet and on the same UPS then this makes
> sense.
> But my nodes are actually in different buildings ~10km apart. The chances of
> power failing simultaneously, N+1 UPSes failing simultaneously, diesels failing
> simultaneously... When nukes start falling and this happens then I'll start
> looking for backups.
> Even if your nodes are in one datacentre, there are likely redundant (2+)
> circuits.
> And even if you have just one cabinet, you can add 3x UPS in there and gain a
> nice speed boost.
>
> So the IO could be actually pretty safe and happy when it gets to a remote
> buffers on enough (quorum) nodes and waits for processing. It can be
> batched, it can be coalesced, it can be rewritten with subsequent updates...

Agreed, it would be nice if, once the primary OSD has written its data, it returned an ACK to the client and was then responsible for ensuring the data is written to the other OSDs later on. Is this what the Async Messenger does?

>
> 3) WHAT amount of IO is stored
> Do I need to have the last transaction or can I tolerate 1 minute of missing
> data?
> Checkpoints, checksums on last transaction, rollback (journal already does
> this AFAIK)...
>
> 4) I DON'T CARE mode :-)
> qemu cache=unsafe equivalent but set on a RBD volume/pool Because
> sometimes you just need to crunch data without really storing them
> persistently - how are CERN/HADOOP/Big Data guys approcaching this?
> And you can't always disable flushing. Filesystems have "nobarriers" (usually)
> but if you need a block device for raw database tablespace, you're pretty
> much SOL without lots of trickery
>
>
> 1) is doable eventually.
>
> 2) is doable almost immediately
> a) just ACK the IO when you get it, let the client unblock on quorum
> or
> b) drop the journal, write all data asynchronously, let the filesystem
> handle consistency and let me tune dirty_writeback_centisecs to get the goal
> i want in respect to 3)
>
> 4) simple to do, unusable for production (for most of us)
> but flushing is expensive so why flush because a file metadata
> changed on a QA machine?
> Dev&QA often create a higher load than production itself..
>
> sorry, got carried away, again....
>
> Jan
>
>
> >>
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: ceph-users [mailto:ceph-users-***@lists.ceph.com] On
> >>>> Behalf Of Jan Schermer
> >>>> Sent: 18 August 2015 12:44
> >>>> To: Mark Nelson <***@redhat.com>
> >>>> Cc: ceph-***@lists.ceph.com
> >>>> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> >>>>
> >>>> I did not. Not sure why now - probably for the same reason I didn't
> >>>> extensively test bcache.
> >>>> I'm not a real fan of device mapper though, so if I had to choose
> >>>> I'd still go for bcache :-)
> >>>>
> >>>> Jan
> >>>>
> >>>>> On 18 Aug 2015, at 13:33, Mark Nelson <***@redhat.com>
> wrote:
> >>>>>
> >>>>> Hi Jan,
> >>>>>
> >>>>> Out of curiosity did you ever try dm-cache? I've been meaning to
> >>>>> give it a
> >>>> spin but haven't had the spare cycles.
> >>>>>
> >>>>> Mark
> >>>>>
> >>>>> On 08/18/2015 04:00 AM, Jan Schermer wrote:
> >>>>>> I already evaluated EnhanceIO in combination with CentOS 6 (and
> >>>> backported 3.10 and 4.0 kernel-lt if I remember correctly).
> >>>>>> It worked fine during benchmarks and stress tests, but once we
> >>>>>> run
> >>>>>> DB2
> >>>> on it it panicked within minutes and took all the data with it
> >>>> (almost literally - files that werent touched, like OS binaries
> >>>> were b0rked and the filesystem was unsalvageable).
> >>>>>> If you disregard this warning - the performance gains weren't
> >>>>>> that great
> >>>> either, at least in a VM. It had problems when flushing to disk
> >>>> after reaching dirty watermark and the block size has some
> >>>> not-well-documented implications (not sure now, but I think it only
> >>>> cached IO _larger_than the block size, so if your database keeps
> >>>> incrementing an XX-byte counter it will go straight to disk).
> >>>>>>
> >>>>>> Flashcache doesn't respect barriers (or does it now?) - if that's
> >>>>>> ok for you
> >>>> than go for it, it should be stable and I used it in the past in
> >>>> production without problems.
> >>>>>>
> >>>>>> bcache seemed to work fine, but I needed to
> >>>>>> a) use it for root
> >>>>>> b) disable and enable it on the fly (doh)
> >>>>>> c) make it non-persisent (flush it) before reboot - not sure if
> >>>>>> that was
> >>>> possible either.
> >>>>>> d) all that in a customer's VM, and that customer didn't have a
> >>>>>> strong
> >>>> technical background to be able to fiddle with it...
> >>>>>> So I haven't tested it heavily.
> >>>>>>
> >>>>>> Bcache should be the obvious choice if you are in control of the
> >>>>>> environment. At least you can cry on LKML's shoulder when you
> >>>>>> lose data :-)
> >>>>>>
> >>>>>> Jan
> >>>>>>
> >>>>>>
> >>>>>>> On 18 Aug 2015, at 01:49, Alex Gorbachev
> >>>>>>> <***@iss-integration.com>
> >>>> wrote:
> >>>>>>>
> >>>>>>> What about https://github.com/Frontier314/EnhanceIO? Last
> >>>>>>> commit
> >>>>>>> 2 months ago, but no external contributors :(
> >>>>>>>
> >>>>>>> The nice thing about EnhanceIO is there is no need to change
> >>>>>>> device name, unlike bcache, flashcache etc.
> >>>>>>>
> >>>>>>> Best regards,
> >>>>>>> Alex
> >>>>>>>
> >>>>>>> On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz
> >>>>>>> <***@redhat.com>
> >>>> wrote:
> >>>>>>>> I did some (non-ceph) work on these, and concluded that bcache
> >>>>>>>> was the best supported, most stable, and fastest. This was ~1
> >>>>>>>> year ago, to take it with a grain of salt, but that's what I
> >>>>>>>> would
> >> recommend.
> >>>>>>>>
> >>>>>>>> Daniel
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> ________________________________
> >>>>>>>> From: "Dominik Zalewski" <***@optlink.net>
> >>>>>>>> To: "German Anders" <***@despegar.com>
> >>>>>>>> Cc: "ceph-users" <ceph-***@lists.ceph.com>
> >>>>>>>> Sent: Wednesday, July 1, 2015 5:28:10 PM
> >>>>>>>> Subject: Re: [ceph-users] any recommendation of using
> EnhanceIO?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I’ve asked same question last weeks or so (just search the
> >>>>>>>> mailing list archives for EnhanceIO :) and got some interesting
> >> answers.
> >>>>>>>>
> >>>>>>>> Looks like the project is pretty much dead since it was bought
> >>>>>>>> out by
> >>>> HGST.
> >>>>>>>> Even their website has some broken links in regards to
> >>>>>>>> EnhanceIO
> >>>>>>>>
> >>>>>>>> I’m keen to try flashcache or bcache (its been in the mainline
> >>>>>>>> kernel for some time)
> >>>>>>>>
> >>>>>>>> Dominik
> >>>>>>>>
> >>>>>>>> On 1 Jul 2015, at 21:13, German Anders
> <***@despegar.com>
> >>>> wrote:
> >>>>>>>>
> >>>>>>>> Hi cephers,
> >>>>>>>>
> >>>>>>>> Is anyone out there that implement enhanceIO in a production
> >>>> environment?
> >>>>>>>> any recommendation? any perf output to share with the diff
> >>>>>>>> between using it and not?
> >>>>>>>>
> >>>>>>>> Thanks in advance,
> >>>>>>>>
> >>>>>>>> German
> >>>>>>>> _______________________________________________
> >>>>>>>> ceph-users mailing list
> >>>>>>>> ceph-***@lists.ceph.com
> >>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> _______________________________________________
> >>>>>>>> ceph-users mailing list
> >>>>>>>> ceph-***@lists.ceph.com
> >>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> _______________________________________________
> >>>>>>>> ceph-users mailing list
> >>>>>>>> ceph-***@lists.ceph.com
> >>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>>>>>
> >>>>>>> _______________________________________________
> >>>>>>> ceph-users mailing list
> >>>>>>> ceph-***@lists.ceph.com
> >>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> ceph-users mailing list
> >>>>>> ceph-***@lists.ceph.com
> >>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>>>
> >>>>> _______________________________________________
> >>>>> ceph-users mailing list
> >>>>> ceph-***@lists.ceph.com
> >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>
> >>>> _______________________________________________
> >>>> ceph-users mailing list
> >>>> ceph-***@lists.ceph.com
> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>
> >>>
> >>>
> >>>
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-***@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
> >
>
> _______________________________________________
> ceph-users mailing list
> ceph-***@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Alex Gorbachev
2015-08-18 16:30:15 UTC
Permalink
Hi Jan,

On Tue, Aug 18, 2015 at 5:00 AM, Jan Schermer <***@schermer.cz> wrote:
> I already evaluated EnhanceIO in combination with CentOS 6 (and backported 3.10 and 4.0 kernel-lt if I remember correctly).
> It worked fine during benchmarks and stress tests, but once we run DB2 on it it panicked within minutes and took all the data with it (almost literally - files that werent touched, like OS binaries were b0rked and the filesystem was unsalvageable).

Out of curiosity, were you using EnhanceIO in writeback mode? I
assume so, as a read cache should not hurt anything.

Thanks,
Alex

> If you disregard this warning - the performance gains weren't that great either, at least in a VM. It had problems when flushing to disk after reaching dirty watermark and the block size has some not-well-documented implications (not sure now, but I think it only cached IO _larger_than the block size, so if your database keeps incrementing an XX-byte counter it will go straight to disk).
>
> Flashcache doesn't respect barriers (or does it now?) - if that's ok for you than go for it, it should be stable and I used it in the past in production without problems.
>
> bcache seemed to work fine, but I needed to
> a) use it for root
> b) disable and enable it on the fly (doh)
> c) make it non-persisent (flush it) before reboot - not sure if that was possible either.
> d) all that in a customer's VM, and that customer didn't have a strong technical background to be able to fiddle with it...
> So I haven't tested it heavily.
>
> Bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-)
>
> Jan
>
>
>> On 18 Aug 2015, at 01:49, Alex Gorbachev <***@iss-integration.com> wrote:
>>
>> What about https://github.com/Frontier314/EnhanceIO? Last commit 2
>> months ago, but no external contributors :(
>>
>> The nice thing about EnhanceIO is there is no need to change device
>> name, unlike bcache, flashcache etc.
>>
>> Best regards,
>> Alex
>>
>> On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz <***@redhat.com> wrote:
>>> I did some (non-ceph) work on these, and concluded that bcache was the best
>>> supported, most stable, and fastest. This was ~1 year ago, to take it with
>>> a grain of salt, but that's what I would recommend.
>>>
>>> Daniel
>>>
>>>
>>> ________________________________
>>> From: "Dominik Zalewski" <***@optlink.net>
>>> To: "German Anders" <***@despegar.com>
>>> Cc: "ceph-users" <ceph-***@lists.ceph.com>
>>> Sent: Wednesday, July 1, 2015 5:28:10 PM
>>> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>>>
>>>
>>> Hi,
>>>
>>> I’ve asked same question last weeks or so (just search the mailing list
>>> archives for EnhanceIO :) and got some interesting answers.
>>>
>>> Looks like the project is pretty much dead since it was bought out by HGST.
>>> Even their website has some broken links in regards to EnhanceIO
>>>
>>> I’m keen to try flashcache or bcache (its been in the mainline kernel for
>>> some time)
>>>
>>> Dominik
>>>
>>> On 1 Jul 2015, at 21:13, German Anders <***@despegar.com> wrote:
>>>
>>> Hi cephers,
>>>
>>> Is anyone out there that implement enhanceIO in a production environment?
>>> any recommendation? any perf output to share with the diff between using it
>>> and not?
>>>
>>> Thanks in advance,
>>>
>>> German
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-***@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-***@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-***@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-***@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
Jan Schermer
2015-08-18 16:58:51 UTC
Permalink
Yes, writeback mode. I didn't try anything else.
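
For reference, writeback is what you get from an eio_cli invocation along these lines - the device names are placeholders and I'm recalling the flags from memory, so double-check them against your EnhanceIO build:

    # create a write-back (wb) cache; wt (write-through) or ro (read-only)
    # would have been the safer modes per this thread
    eio_cli create -d /dev/vdb -s /dev/sdc -p lru -m wb -b 4096 -c db2cache
    # flush and remove the cache before detaching the SSD
    eio_cli delete -c db2cache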

Jan

> On 18 Aug 2015, at 18:30, Alex Gorbachev <***@iss-integration.com> wrote:
>
> HI Jan,
>
> On Tue, Aug 18, 2015 at 5:00 AM, Jan Schermer <***@schermer.cz> wrote:
>> I already evaluated EnhanceIO in combination with CentOS 6 (and backported 3.10 and 4.0 kernel-lt if I remember correctly).
>> It worked fine during benchmarks and stress tests, but once we run DB2 on it it panicked within minutes and took all the data with it (almost literally - files that werent touched, like OS binaries were b0rked and the filesystem was unsalvageable).
>
> Out of curiosity, were you using EnhanceIO in writeback mode? I
> assume so, as a read cache should not hurt anything.
>
> Thanks,
> Alex
>
>> If you disregard this warning - the performance gains weren't that great either, at least in a VM. It had problems when flushing to disk after reaching dirty watermark and the block size has some not-well-documented implications (not sure now, but I think it only cached IO _larger_than the block size, so if your database keeps incrementing an XX-byte counter it will go straight to disk).
>>
>> Flashcache doesn't respect barriers (or does it now?) - if that's ok for you than go for it, it should be stable and I used it in the past in production without problems.
>>
>> bcache seemed to work fine, but I needed to
>> a) use it for root
>> b) disable and enable it on the fly (doh)
>> c) make it non-persisent (flush it) before reboot - not sure if that was possible either.
>> d) all that in a customer's VM, and that customer didn't have a strong technical background to be able to fiddle with it...
>> So I haven't tested it heavily.
>>
>> Bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-)
>>
>> Jan
>>
>>
>>> On 18 Aug 2015, at 01:49, Alex Gorbachev <***@iss-integration.com> wrote:
>>>
>>> What about https://github.com/Frontier314/EnhanceIO? Last commit 2
>>> months ago, but no external contributors :(
>>>
>>> The nice thing about EnhanceIO is there is no need to change device
>>> name, unlike bcache, flashcache etc.
>>>
>>> Best regards,
>>> Alex
>>>
>>> On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz <***@redhat.com> wrote:
>>>> I did some (non-ceph) work on these, and concluded that bcache was the best
>>>> supported, most stable, and fastest. This was ~1 year ago, to take it with
>>>> a grain of salt, but that's what I would recommend.
>>>>
>>>> Daniel
>>>>
>>>>
>>>> ________________________________
>>>> From: "Dominik Zalewski" <***@optlink.net>
>>>> To: "German Anders" <***@despegar.com>
>>>> Cc: "ceph-users" <ceph-***@lists.ceph.com>
>>>> Sent: Wednesday, July 1, 2015 5:28:10 PM
>>>> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>>>>
>>>>
>>>> Hi,
>>>>
>>>> I’ve asked same question last weeks or so (just search the mailing list
>>>> archives for EnhanceIO :) and got some interesting answers.
>>>>
>>>> Looks like the project is pretty much dead since it was bought out by HGST.
>>>> Even their website has some broken links in regards to EnhanceIO
>>>>
>>>> I’m keen to try flashcache or bcache (its been in the mainline kernel for
>>>> some time)
>>>>
>>>> Dominik
>>>>
>>>> On 1 Jul 2015, at 21:13, German Anders <***@despegar.com> wrote:
>>>>
>>>> Hi cephers,
>>>>
>>>> Is anyone out there that implement enhanceIO in a production environment?
>>>> any recommendation? any perf output to share with the diff between using it
>>>> and not?
>>>>
>>>> Thanks in advance,
>>>>
>>>> German
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-***@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-***@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-***@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-***@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>