Discussion:
[ceph-users] Slow IOPS on RBD compared to journal and backing devices
Christian Balzer
2014-05-08 00:57:46 UTC
Hello,

ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The journals
are on (separate) DC 3700s, the actual OSDs are RAID6 behind an Areca 1882
with 4GB of cache.

Running this fio:

fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128

results in:

30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise there)
3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
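
For reference, the kernelspace number is the same fio job pointed at a
kernel-mapped RBD, i.e. roughly this (image name and device node made up):

rbd create fiotest --size 4096
rbd map fiotest
fio --filename=/dev/rbd0 --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128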

When running the fio from the VM RBD the utilization of the journals is
about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after some
obvious merging).
The OSD processes are quite busy, showing well over 200% CPU in atop, but
the system is not CPU or otherwise resource starved at that moment.

Running multiple instances of this test from several VMs on different hosts
changes nothing; the aggregate IOPS for the whole cluster still ends up
at around 3200.

Now clearly RBD has to deal with latency here, but the network is IPoIB
with the associated low latency, and the journal SSDs are (consistently)
the fastest ones around.

I guess what I am wondering is whether this is normal and to be expected,
and if not, where all that potential performance got lost.

Regards,

Christian
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Gregory Farnum
2014-05-08 01:37:48 UTC
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The journals
are on (separate) DC 3700s, the actual OSDs are RAID6 behind an Areca 1882
with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise there)
3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the journals is
about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after some
obvious merging).
The OSD processes are quite busy, reading well over 200% on atop, but
the system is not CPU or otherwise resource starved at that moment.
Running multiple instances of this test from several VMs on different hosts
changes nothing, as in the aggregated IOPS for the whole cluster will
still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is IPoIB
with the associated low latency and the journal SSDs are the
(consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be expected
or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
that's about 40ms of latency per op (for userspace RBD), which seems
awfully long. You should check what your client-side objecter settings
are; it might be limiting you to fewer outstanding ops than that. If
it's available to you, testing with Firefly or even master would be
interesting; there's some performance work that should reduce
latencies.
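(For reference, 128 outstanding ops at ~3200 IOPS works out to 128 / 3200
= 0.04 s, i.e. the ~40ms above.) The throttles I mean live in the [client]
section; if I remember the option names right they're something like

[client]
objecter inflight ops = 1024
objecter inflight op bytes = 104857600

(those should be the defaults), but double-check the names against the
config reference for your version.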

But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a wider
array of experiments to try and figure out where it's coming from.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
Christian Balzer
2014-05-08 02:49:16 UTC
Post by Gregory Farnum
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6 behind
an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the journals is
about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after
some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop, but
the system is not CPU or otherwise resource starved at that moment.
Running multiple instances of this test from several VMs on different
hosts changes nothing, as in the aggregated IOPS for the whole cluster
will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is IPoIB
with the associated low latency and the journal SSDs are the
(consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would be 256
IOs at a time, coming from different hosts over different links and
nothing changes.
Post by Gregory Farnum
that's about 40ms of latency per op (for userspace RBD), which seems
awfully long. You should check what your client-side objecter settings
are; it might be limiting you to fewer outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel and bugs
and nothing at all as far as configuration options are concerned.
Care to enlighten me where one can find those?

Also note the kernelspace (3.13 if it matters) speed, which is very much
in the same (junior league) ballpark.
Post by Gregory Farnum
If
it's available to you, testing with Firefly or even master would be
interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
Post by Gregory Farnum
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a wider
array of experiments to try and figure out where it's coming from.
I think we can rule out the network, NPtcp gives me:
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---

For comparison at about 512KB it reaches maximum throughput and still
isn't that laggy:
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---

So with the network performing as well as my lengthy experience with IPoIB
led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local fio
tests.

That pretty much leaves only Ceph/RBD to look at and I'm not really sure
what experiments I should run on that. ^o^

Regards,

Christian
Post by Gregory Farnum
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
xan.peng
2014-05-15 06:02:36 UTC
Post by Gregory Farnum
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
that's about 40ms of latency per op (for userspace RBD), which seems
awfully long.
Maybe this is off topic, but AFAIK "--iodepth=128" doesn't submit 128
IOs at a time.
There is an option in fio, "iodepth_batch_submit=int", which defaults
to 1 and makes fio submit each IO as soon as it is available.

See more: http://www.bluestop.org/fio/HOWTO.txt
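
If you want all 128 actually batched up, something like this should do it
(untested, just going by the HOWTO):

fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128 --iodepth_batch_submit=128 --iodepth_batch_complete=128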
Alexandre DERUMIER
2014-05-08 04:33:51 UTC
Hi Christian,

Have you tried without RAID6, to have more OSDs?
(How many disks do you have behind the RAID6?)


Also, I know that direct I/O can be quite slow with Ceph,

maybe you can try without --direct=1

and also enable rbd_cache

ceph.conf
[client]
rbd cache = true
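
(Note that with qemu/librbd the cache only really takes effect if the
drive itself is set to writeback, I think, e.g. in the libvirt XML:
<driver name='qemu' type='raw' cache='writeback'/>
or cache=writeback on the -drive line.)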




----- Mail original -----

De: "Christian Balzer" <chibi at gol.com>
?: "Gregory Farnum" <greg at inktank.com>, ceph-users at lists.ceph.com
Envoy?: Jeudi 8 Mai 2014 04:49:16
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Post by Gregory Farnum
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6 behind
an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the journals is
about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after
some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop, but
the system is not CPU or otherwise resource starved at that moment.
Running multiple instances of this test from several VMs on different
hosts changes nothing, as in the aggregated IOPS for the whole cluster
will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is IPoIB
with the associated low latency and the journal SSDs are the
(consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would be 256
IOs at a time, coming from different hosts over different links and
nothing changes.
Post by Gregory Farnum
that's about 40ms of latency per op (for userspace RBD), which seems
awfully long. You should check what your client-side objecter settings
are; it might be limiting you to fewer outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel and bugs
and nothing at all as far as configuration options are concerned.
Care to enlighten me where one can find those?

Also note the kernelspace (3.13 if it matters) speed, which is very much
in the same (junior league) ballpark.
Post by Gregory Farnum
If
it's available to you, testing with Firefly or even master would be
interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
Post by Gregory Farnum
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a wider
array of experiments to try and figure out where it's coming from.
I think we can rule out the network, NPtcp gives me:
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---

For comparison at about 512KB it reaches maximum throughput and still
isn't that laggy:
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---

So with the network performing as well as my lengthy experience with IPoIB
led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local fio
tests.

That pretty much leaves only Ceph/RBD to look at and I'm not really sure
what experiments I should run on that. ^o^

Regards,

Christian
Post by Gregory Farnum
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Gregory Farnum
2014-05-08 05:13:53 UTC
Oh, I didn't notice that. I bet you aren't getting the expected throughput
on the RAID array with OSD access patterns, and that's applying back
pressure on the journal.

When I suggested other tests, I meant with and without Ceph. One particular
one is OSD bench. That should be interesting to try at a variety of block
sizes. You could also try running RADOS bench and smalliobench at a few
different sizes.
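Something along these lines, at a few different block sizes (numbers here
are only examples):
ceph tell osd.0 bench 1073741824 4096
rados bench -p rbd 30 write -b 4096 -t 128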
-Greg
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com <javascript:;>>
?: "Gregory Farnum" <greg at inktank.com <javascript:;>>,
ceph-users at lists.ceph.com <javascript:;>
Envoy?: Jeudi 8 Mai 2014 04:49:16
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer <chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6 behind
an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the journals is
about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after
some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop, but
the system is not CPU or otherwise resource starved at that moment.
Running multiple instances of this test from several VMs on different
hosts changes nothing, as in the aggregated IOPS for the whole cluster
will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is IPoIB
with the associated low latency and the journal SSDs are the
(consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would be 256
IOs at a time, coming from different hosts over different links and
nothing changes.
that's about 40ms of latency per op (for userspace RBD), which seems
awfully long. You should check what your client-side objecter settings
are; it might be limiting you to fewer outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel and bugs
and nothing at all as far as configuration options are concerned.
Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very much
in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would be
interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a wider
array of experiments to try and figure out where it's coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and still
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with IPoIB
led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local fio
tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really sure
what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Software Engineer #42 @ http://inktank.com | http://ceph.com
Christian Balzer
2014-05-08 06:26:33 UTC
Hello,
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
I doubt that based on what I see in terms of local performance and actual
utilization figures according to iostat and atop during the tests.

But if that were to be true, how would one see if that's the case, as in
where in the plethora of data from:

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump

is the data I'd be looking for?
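
My guess would be the various latency counters, i.e. something like

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump | grep -i latency
ceph osd perf

(if the latter already exists in 0.72), but a pointer to the relevant
counters would be appreciated.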
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try runnin RADOS bench and
smalliobench at a few different sizes.
I already did the local tests, as in w/o Ceph, see the original mail below.

And you might recall me doing rados benches as well in another thread 2
weeks ago or so.

In either case, osd benching gives me:
---
# time ceph tell osd.0 bench
{ "bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": "247102026.000000"}


real 0m4.483s
---
This is quite a bit slower than this particular SSD (200GB DC 3700) should
be able to write, but I will let that slide.
Note that it is the journal SSD that comes under pressure here (nearly 900%
util) while the OSD is bored at around 15%, which is no surprise, as the
backing RAID can write data at up to 1600MB/s.

at 4k blocks we see:
---
# time ceph tell osd.0 bench 1073741824 4096
{ "bytes_written": 1073741824,
"blocksize": 4096,
"bytes_per_sec": "9004316.000000"}


real 1m59.368s
---
Here we get a more balanced picture between journal and storage
utilization, hovering around 40-50%.
So clearly not overtaxing either component.
And yet, this looks like roughly 2200 IOPS to me, if my math is half right.

Rados at 4k gives us this:
---
Total time run: 30.912786
Total writes made: 44490
Write size: 4096
Bandwidth (MB/sec): 5.622

Stddev Bandwidth: 3.31452
Max bandwidth (MB/sec): 9.92578
Min bandwidth (MB/sec): 0
Average Latency: 0.0444653
Stddev Latency: 0.121887
Max latency: 2.80917
Min latency: 0.001958
---
So this is even worse, just under 1500 IOPS.
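
(For the record: 9004316 B/s divided by 4096 B is about 2200 IOPS for the
OSD bench, and 44490 writes in 30.9 seconds is about 1440 IOPS for the
rados bench, so those two numbers are at least consistent with each other.)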

Regards,

Christian
Post by Gregory Farnum
-Greg
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com <javascript:;>>
?: "Gregory Farnum" <greg at inktank.com <javascript:;>>,
ceph-users at lists.ceph.com <javascript:;>
Envoy?: Jeudi 8 Mai 2014 04:49:16
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6
behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop,
but the system is not CPU or otherwise resource starved at that
moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for the
whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is
IPoIB with the associated low latency and the journal SSDs are the
(consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would be
256 IOs at a time, coming from different hosts over different links and
nothing changes.
that's about 40ms of latency per op (for userspace RBD), which seems
awfully long. You should check what your client-side objecter
settings are; it might be limiting you to fewer outstanding ops than
that.
Googling for client-side objecter gives a few hits on ceph devel and
bugs and nothing at all as far as configuration options are concerned.
Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very
much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would be
interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a wider
array of experiments to try and figure out where it's coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and still
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with
IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local fio
tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really
sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Christian Balzer
2014-05-09 02:01:26 UTC
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
In the a "picture" being worth a thousand words tradition, I give you
this iostat -x output taken during a fio run:

avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 51.50 0.00 1633.50 0.00 7460.00 9.13 0.18 0.11 0.00 0.11 0.01 1.40
sdb 0.00 0.00 0.00 1240.50 0.00 5244.00 8.45 0.30 0.25 0.00 0.25 0.02 2.00
sdc 0.00 5.00 0.00 2468.50 0.00 13419.00 10.87 0.24 0.10 0.00 0.10 0.09 22.00
sdd 0.00 6.50 0.00 1913.00 0.00 10313.00 10.78 0.20 0.10 0.00 0.10 0.09 16.60

The %user CPU utilization is pretty much entirely the 2 OSD processes;
note the nearly complete absence of iowait.

sda and sdb are the OSD RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers: the lack of queues, the low wait and service
times (in ms), plus the overall utilization.

The only conclusion I can draw from these numbers and the network results
below is that the latency happens within the OSD processes.
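
If somebody knows how to see where inside the OSD the time goes, I'm all
ears; I suppose the admin socket op dumps would be the place to start,
something like

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops

assuming 0.72 already has those.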

Regards,

Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try runnin RADOS bench and
smalliobench at a few different sizes.
-Greg
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com <javascript:;>>
?: "Gregory Farnum" <greg at inktank.com <javascript:;>>,
ceph-users at lists.ceph.com <javascript:;>
Envoy?: Jeudi 8 Mai 2014 04:49:16
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6
behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop,
but the system is not CPU or otherwise resource starved at that
moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for the
whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is
IPoIB with the associated low latency and the journal SSDs are the
(consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would be
256 IOs at a time, coming from different hosts over different links and
nothing changes.
that's about 40ms of latency per op (for userspace RBD), which seems
awfully long. You should check what your client-side objecter
settings are; it might be limiting you to fewer outstanding ops than
that.
Googling for client-side objecter gives a few hits on ceph devel and
bugs and nothing at all as far as configuration options are concerned.
Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very
much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would be
interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a wider
array of experiments to try and figure out where it's coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and still
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with
IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local fio
tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really
sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Christian Balzer
2014-05-13 09:03:47 UTC
I'm clearly talking to myself, but whatever.

For Greg, I've played with all the pertinent journal and filestore options
and TCP nodelay, no changes at all.
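
For reference, the sort of options I mean, with the names from memory and
the values purely as examples, not a recommendation:

[osd]
journal max write entries = 500
journal queue max ops = 3000
filestore queue max ops = 500
filestore max sync interval = 10
ms tcp nodelay = true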

Is there anybody on this ML who's running a Ceph cluster with a fast
network and a FAST filestore, so, like me, with a big HW cache in front of
RAIDs/JBODs, or using SSDs for final storage?

If so, what results do you get out of the fio statement below per OSD?
In my case, with 4 OSDs and 3200 IOPS, that's about 800 IOPS per OSD, which
is of course vastly more than the individual HDDs could normally do.

So I'm wondering if I'm hitting some inherent limitation of how fast a
single OSD (as in the software) can handle IOPS, given that everything else
has been ruled out from where I stand.

This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while in my
test case the only time the caching becomes noticeable is if I increase
the cache size to something larger than the test data size. ^o^

On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.

Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
In the a "picture" being worth a thousand words tradition, I give you
avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 51.50 0.00 1633.50 0.00 7460.00 9.13 0.18 0.11 0.00 0.11 0.01 1.40
sdb 0.00 0.00 0.00 1240.50 0.00 5244.00 8.45 0.30 0.25 0.00 0.25 0.02 2.00
sdc 0.00 5.00 0.00 2468.50 0.00 13419.00 10.87 0.24 0.10 0.00 0.10 0.09 22.00
sdd 0.00 6.50 0.00 1913.00 0.00 10313.00 10.78 0.20 0.10 0.00 0.10 0.09 16.60
The %user CPU utilization is pretty much entirely the 2 OSD processes,
note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and service
times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network results
below is that the latency happens within the OSD processes.
Regards,
Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try runnin RADOS bench and
smalliobench at a few different sizes.
-Greg
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com <javascript:;>>
?: "Gregory Farnum" <greg at inktank.com <javascript:;>>,
ceph-users at lists.ceph.com <javascript:;>
Envoy?: Jeudi 8 Mai 2014 04:49:16
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6
behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop,
but the system is not CPU or otherwise resource starved at that
moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for
the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is
IPoIB with the associated low latency and the journal SSDs are
the (consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would be
256 IOs at a time, coming from different hosts over different links
and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side objecter
settings are; it might be limiting you to fewer outstanding ops
than that.
Googling for client-side objecter gives a few hits on ceph devel and
bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very
much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would be
interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a
wider array of experiments to try and figure out where it's coming
from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and still
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with
IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local
fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really
sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Josef Johansson
2014-05-14 09:29:47 UTC
Hi Christian,

I missed this thread; I haven't been reading the list that well these last
few weeks.

You already know my setup, since we discussed it in an earlier thread. I
don't have a fast backing store, but I see the slow IOPS when doing
randwrite inside the VM, with rbd cache. Still running dumpling here though.

A thought struck me that I could test with a pool that consists of OSDs
that have tmpfs-based disks. I think I have a bit more latency than your
IPoIB, but I've pushed 100k IOPS with the same network devices before.
This would verify whether the problem is with the journal disks. I'll also
try to run the journal devices in tmpfs as well, as that would test
Ceph itself in isolation.
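
Roughly what I have in mind for the journal part, from memory and
obviously not for production use (paths made up):

mount -t tmpfs -o size=2G tmpfs /mnt/journal-tmpfs
/etc/init.d/ceph stop osd.0
ceph-osd -i 0 --flush-journal
# point "osd journal" for osd.0 at /mnt/journal-tmpfs/journal-0 in ceph.conf
ceph-osd -i 0 --mkjournal
/etc/init.d/ceph start osd.0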

I'll get back to you with the results; hopefully I'll manage to get them
done tonight.

Cheers,
Josef
Post by Christian Balzer
I'm clearly talking to myself, but whatever.
For Greg, I've played with all the pertinent journal and filestore options
and TCP nodelay, no changes at all.
Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front of a
RAID/JBODs or using SSDs for final storage?
If so, what results do you get out of the fio statement below per OSD?
In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD, which
is of course vastly faster than the normal indvidual HDDs could do.
So I'm wondering if I'm hitting some inherent limitation of how fast a
single OSD (as in the software) can handle IOPS, given that everything else
has been ruled out from where I stand.
This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while in my
test case the only time the caching becomes noticeable is if I increase
the cache size to something larger than the test data size. ^o^
On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.
Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
In the a "picture" being worth a thousand words tradition, I give you
avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 51.50 0.00 1633.50 0.00 7460.00 9.13 0.18 0.11 0.00 0.11 0.01 1.40
sdb 0.00 0.00 0.00 1240.50 0.00 5244.00 8.45 0.30 0.25 0.00 0.25 0.02 2.00
sdc 0.00 5.00 0.00 2468.50 0.00 13419.00 10.87 0.24 0.10 0.00 0.10 0.09 22.00
sdd 0.00 6.50 0.00 1913.00 0.00 10313.00 10.78 0.20 0.10 0.00 0.10 0.09 16.60
The %user CPU utilization is pretty much entirely the 2 OSD processes,
note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and service
times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network results
below is that the latency happens within the OSD processes.
Regards,
Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try runnin RADOS bench and
smalliobench at a few different sizes.
-Greg
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com <javascript:;>>
?: "Gregory Farnum" <greg at inktank.com <javascript:;>>,
ceph-users at lists.ceph.com <javascript:;>
Envoy?: Jeudi 8 Mai 2014 04:49:16
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6
behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop,
but the system is not CPU or otherwise resource starved at that
moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for
the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is
IPoIB with the associated low latency and the journal SSDs are
the (consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would be
256 IOs at a time, coming from different hosts over different links
and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side objecter
settings are; it might be limiting you to fewer outstanding ops
than that.
Googling for client-side objecter gives a few hits on ceph devel and
bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very
much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would be
interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a
wider array of experiments to try and figure out where it's coming
from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and still
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with
IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local
fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really
sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Stefan Priebe - Profihost AG
2014-05-14 09:37:39 UTC
Post by Josef Johansson
Hi Christian,
I missed this thread, haven't been reading the list that well the last
weeks.
You already know my setup, since we discussed it in an earlier thread. I
don't have a fast backing store, but I see the slow IOPS when doing
randwrite inside the VM, with rbd cache. Still running dumpling here though.
A thought struck me that I could test with a pool that consists of OSDs
that have tempfs-based disks, think I have a bit more latency than your
IPoIB but I've pushed 100k IOPS with the same network devices before.
This would verify if the problem is with the journal disks. I'll also
try to run the journal devices in tempfs as well, as it would test
purely Ceph itself.
I did the same with Bobtail a year ago and was still limited to nearly
the same values. No idea what Firefly will say. I'm pretty sure the
limit is in the Ceph code itself.

There was a short discussion here:
http://www.spinics.net/lists/ceph-devel/msg18731.html

Stefan
Post by Josef Johansson
I'll get back to you with the results, hopefully I'll manage to get them
done during this night.
Cheers,
Josef
Post by Christian Balzer
I'm clearly talking to myself, but whatever.
For Greg, I've played with all the pertinent journal and filestore options
and TCP nodelay, no changes at all.
Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front of a
RAID/JBODs or using SSDs for final storage?
If so, what results do you get out of the fio statement below per OSD?
In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD, which
is of course vastly faster than the normal indvidual HDDs could do.
So I'm wondering if I'm hitting some inherent limitation of how fast a
single OSD (as in the software) can handle IOPS, given that everything else
has been ruled out from where I stand.
This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while in my
test case the only time the caching becomes noticeable is if I increase
the cache size to something larger than the test data size. ^o^
On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.
Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
In the a "picture" being worth a thousand words tradition, I give you
avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 51.50 0.00 1633.50 0.00 7460.00 9.13 0.18 0.11 0.00 0.11 0.01 1.40
sdb 0.00 0.00 0.00 1240.50 0.00 5244.00 8.45 0.30 0.25 0.00 0.25 0.02 2.00
sdc 0.00 5.00 0.00 2468.50 0.00 13419.00 10.87 0.24 0.10 0.00 0.10 0.09 22.00
sdd 0.00 6.50 0.00 1913.00 0.00 10313.00 10.78 0.20 0.10 0.00 0.10 0.09 16.60
The %user CPU utilization is pretty much entirely the 2 OSD processes,
note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and service
times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network results
below is that the latency happens within the OSD processes.
Regards,
Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try runnin RADOS bench and
smalliobench at a few different sizes.
-Greg
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com <javascript:;>>
?: "Gregory Farnum" <greg at inktank.com <javascript:;>>,
ceph-users at lists.ceph.com <javascript:;>
Envoy?: Jeudi 8 Mai 2014 04:49:16
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com<javascript:;>>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6
behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop,
but the system is not CPU or otherwise resource starved at that
moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for
the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is
IPoIB with the associated low latency and the journal SSDs are
the (consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would be
256 IOs at a time, coming from different hosts over different links
and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side objecter
settings are; it might be limiting you to fewer outstanding ops
than that.
Googling for client-side objecter gives a few hits on ceph devel and
bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very
much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would be
interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a
wider array of experiments to try and figure out where it's coming
from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and still
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with
IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local
fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really
sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com <javascript:;> Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com <javascript:;>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Christian Balzer
2014-05-14 12:33:06 UTC
Permalink
Hello!
Post by Alexandre DERUMIER
Hi Christian,
I missed this thread, haven't been reading the list that well the last
weeks.
You already know my setup, since we discussed it in an earlier thread. I
don't have a fast backing store, but I see the slow IOPS when doing
randwrite inside the VM, with rbd cache. Still running dumpling here though.
Nods, I do recall that thread.
Post by Alexandre DERUMIER
A thought struck me that I could test with a pool that consists of OSDs
that have tmpfs-based disks, think I have a bit more latency than your
IPoIB but I've pushed 100k IOPS with the same network devices before.
This would verify if the problem is with the journal disks. I'll also
try to run the journal devices in tmpfs as well, as it would test
purely Ceph itself.
That would be interesting indeed.
Given what I've seen (with the journal at 20% utilization and the actual
filestore at around 5%) I'd expect Ceph to be the culprit.
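
A quick way to see where that time goes inside an OSD, as a rough sketch
assuming the default admin socket path and osd.0 as the example, is to ask
the daemon itself:

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops

The first dumps the internal latency counters, the second the slowest
recent ops with a timestamp for each stage, which should show whether the
wait is in the OSD op queue, the journal commit or the filestore apply.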
Post by Alexandre DERUMIER
I'll get back to you with the results, hopefully I'll manage to get them
done during this night.
Looking forward to that. ^^


Christian
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Josef Johansson
2014-05-14 22:26:30 UTC
Permalink
Hi,

So, apparently tmpfs does not support non-root (user) xattrs, due to a
possible DoS vector. The relevant kernel configuration appears to be set,
as far as I can see:

CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y

Anyone know a way around it? Saw that there's a patch for enabling it,
but recompiling my kernel is out of reach right now ;)
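
A quick sanity check for whether a given mount accepts user xattrs at all,
as a sketch assuming the attr tools (setfattr) are installed, would be:

touch /dev/shm/xattr-test
setfattr -n user.test -v 1 /dev/shm/xattr-test

On a tmpfs without user xattr support the setfattr fails with "Operation
not supported", while the trusted/security namespaces that
CONFIG_TMPFS_XATTR enables still work for root.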

Created the OSD with the following:

root@osd1:/# dd seek=6G if=/dev/zero of=/dev/shm/test-osd/img bs=1 count=1
root@osd1:/# losetup /dev/loop0 /dev/shm/test-osd/img
root@osd1:/# mkfs.xfs /dev/loop0
root@osd1:/# ceph osd create
50
root@osd1:/# mkdir /var/lib/ceph/osd/ceph-50
root@osd1:/# mount -t xfs /dev/loop0 /var/lib/ceph/osd/ceph-50
root@osd1:/# ceph-osd --debug_ms 50 -i 50 --mkfs --mkkey
--osd-journal=/dev/sdc7 --mkjournal
2014-05-15 00:20:29.796822 7f40063bb780 -1 journal FileJournal::_open:
aio not supported without directio; disabling aio
2014-05-15 00:20:29.798583 7f40063bb780 -1 journal check: ondisk fsid
bc14ff30-e016-4e0d-9672-96262ee5f07e doesn't match expected
b3f5b98b-e024-4153-875d-5c758a6060eb, invalid (someone else's?) journal
2014-05-15 00:20:29.802155 7f40063bb780 -1 journal FileJournal::_open:
aio not supported without directio; disabling aio
2014-05-15 00:20:29.807237 7f40063bb780 -1
filestore(/var/lib/ceph/osd/ceph-50) could not find
23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2014-05-15 00:20:29.809083 7f40063bb780 -1 created object store
/var/lib/ceph/osd/ceph-50 journal /dev/sdc7 for osd.50 fsid
c51a2683-55dc-4634-9d9d-f0fec9a6f389
2014-05-15 00:20:29.809121 7f40063bb780 -1 auth: error reading file:
/var/lib/ceph/osd/ceph-50/keyring: can't open
/var/lib/ceph/osd/ceph-50/keyring: (2) No such file or directory
2014-05-15 00:20:29.809179 7f40063bb780 -1 created new key in keyring
/var/lib/ceph/osd/ceph-50/keyring
root@osd1:/# ceph-osd --debug_ms 50 -i 50 --mkfs --mkkey
--osd-journal=/dev/sdc7 --mkjournal
2014-05-15 00:20:51.122716 7ff813ba4780 -1 journal FileJournal::_open:
aio not supported without directio; disabling aio
2014-05-15 00:20:51.126275 7ff813ba4780 -1 journal FileJournal::_open:
aio not supported without directio; disabling aio
2014-05-15 00:20:51.129532 7ff813ba4780 -1 provided osd id 50 !=
superblock's -1
2014-05-15 00:20:51.129845 7ff813ba4780 -1 ** ERROR: error creating
empty object store in /var/lib/ceph/osd/ceph-50: (22) Invalid argument
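
For what it's worth, it looks like the second --mkfs ran over the store the
first run had already (at least partially) created, and /dev/sdc7 still
carried a journal header from a previous OSD (hence the fsid mismatch). A
minimal retry sketch, assuming neither the loop image nor /dev/sdc7 holds
anything worth keeping and reusing the names from the log above, would be
to wipe both and run --mkfs exactly once:

umount /var/lib/ceph/osd/ceph-50
mkfs.xfs -f /dev/loop0
dd if=/dev/zero of=/dev/sdc7 bs=1M count=100   # clear the stale journal header
mount -t xfs /dev/loop0 /var/lib/ceph/osd/ceph-50
ceph-osd -i 50 --mkfs --mkkey --osd-journal=/dev/sdc7 --mkjournal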

Cheers,
Josef
Stefan Priebe - Profihost AG
2014-05-15 07:11:43 UTC
Permalink
Post by Josef Johansson
Hi,
So, apparently tmpfs does not support non-root xattr due to a possible
DoS-vector. There's configuration set for enabling it as far as I can see.
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y
Anyone know a way around it? Saw that there's a patch for enabling it,
but recompiling my kernel is out of reach right now ;)
I would create an empty file in tmpfs and then format that file as a
block device.
Josef Johansson
2014-05-15 07:56:30 UTC
Permalink
Post by Stefan Priebe - Profihost AG
Post by Josef Johansson
Hi,
So, apparently tmpfs does not support non-root xattr due to a possible
DoS-vector. There's configuration set for enabling it as far as I can see.
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y
Anyone know a way around it? Saw that there's a patch for enabling it,
but recompiling my kernel is out of reach right now ;)
I would create an empty file in tmpfs and then format that file as a
block device.
How do you mean exactly? Creating with dd and mounting with losetup?

Cheers,
Josef
Stefan Priebe - Profihost AG
2014-05-15 07:58:51 UTC
Permalink
Post by Josef Johansson
Post by Stefan Priebe - Profihost AG
Post by Josef Johansson
Hi,
So, apparently tmpfs does not support non-root xattr due to a possible
DoS-vector. There's configuration set for enabling it as far as I can see.
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y
Anyone know a way around it? Saw that there's a patch for enabling it,
but recompiling my kernel is out of reach right now ;)
I would create an empty file in tmpfs and then format that file as a
block device.
How do you mean exactly? Creating with dd and mounting with losetup?
mount -t tmpfs -o size=4G /mnt /mnt
dd if=/dev/zero of=/mnt/blockdev_a bs=1M count=4000
mkfs.xfs -f /mnt/blockdev_a
mount /mnt/blockdev_a /ceph/osd.X

Then use /mnt/blockdev_a as the OSD device.
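
One note: mounting a regular file directly usually needs a loop device, so
either -o loop on the mount or a losetup step as in the earlier mail. A
slightly fuller sketch of the same idea, assuming osd id 50 and a journal
file on the same tmpfs (names and sizes are only examples), would be:

mount -t tmpfs -o size=6G tmpfs /mnt
dd if=/dev/zero of=/mnt/blockdev_a bs=1M count=4000
mkfs.xfs -f /mnt/blockdev_a
mkdir -p /var/lib/ceph/osd/ceph-50
mount -o loop /mnt/blockdev_a /var/lib/ceph/osd/ceph-50
dd if=/dev/zero of=/mnt/journal_a bs=1M count=1024
ceph-osd -i 50 --mkfs --mkkey --osd-journal=/mnt/journal_a --mkjournal

With both the filestore and the journal in RAM the disks are completely out
of the picture, which is the "purely Ceph itself" test mentioned earlier
(tmpfs has no O_DIRECT, so "journal dio = false" may be needed in
ceph.conf, but for this comparison that should not matter).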
Josef Johansson
2014-06-13 18:19:06 UTC
Permalink
Hey,

I did try this, but it didn't work, so I think I still have to patch
the kernel, as user_xattr is not allowed on tmpfs.

Thanks for the description though.

I think the next step in this is to do it all virtual, maybe on the same
hardware to avoid the network.
Any problems with doing it all virtual? If it's just memory and the same
machine, we should see the pure Ceph performance, right?

Anyone done this?
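
Along those lines, running rados bench directly on one of the storage nodes
takes the VM and the RBD layer out of the path; a sketch, assuming a
throwaway pool (name and pg count are only examples):

ceph osd pool create benchtest 128
rados -p benchtest bench 30 write -b 4096 -t 128

Comparing the ops/sec from that with the ~3200 IOPS seen through RBD should
show how much of the limit sits in the OSDs themselves and how much in the
client path. (Delete the test pool again afterwards.)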

Cheers,
Josef
Post by Stefan Priebe - Profihost AG
Post by Josef Johansson
Post by Stefan Priebe - Profihost AG
Post by Josef Johansson
Hi,
So, apparently tmpfs does not support non-root xattr due to a possible
DoS-vector. There's configuration set for enabling it as far as I can see.
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y
Anyone know a way around it? Saw that there's a patch for enabling it,
but recompiling my kernel is out of reach right now ;)
I would create an empty file in tmpfs and then format that file as a
block device.
How do you mean exactly? Creating with dd and mounting with losetup?
mount -t tmpfs -o size=4G /mnt /mnt
dd if=/dev/zero of=/mnt/blockdev_a bs=1M count=4000
mkfs.xfs -f /mnt/blockdev_a
mount /mnt/blockdev_a /ceph/osd.X
Dann /mnt/blockdev_a als OSD device nutzen.
Post by Josef Johansson
Cheers,
Josef
Post by Stefan Priebe - Profihost AG
Post by Josef Johansson
root@osd1:/# dd seek=6G if=/dev/zero of=/dev/shm/test-osd/img bs=1 count=1
root@osd1:/# losetup /dev/loop0 /dev/shm/test-osd/img
root@osd1:/# mkfs.xfs /dev/loop0
root@osd1:/# ceph osd create
50
root@osd1:/# mkdir /var/lib/ceph/osd/ceph-50
root@osd1:/# mount -t xfs /dev/loop0 /var/lib/ceph/osd/ceph-50
root@osd1:/# ceph-osd --debug_ms 50 -i 50 --mkfs --mkkey
--osd-journal=/dev/sdc7 --mkjournal
aio not supported without directio; disabling aio
2014-05-15 00:20:29.798583 7f40063bb780 -1 journal check: ondisk fsid
bc14ff30-e016-4e0d-9672-96262ee5f07e doesn't match expected
b3f5b98b-e024-4153-875d-5c758a6060eb, invalid (someone else's?) journal
aio not supported without directio; disabling aio
2014-05-15 00:20:29.807237 7f40063bb780 -1
filestore(/var/lib/ceph/osd/ceph-50) could not find
23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2014-05-15 00:20:29.809083 7f40063bb780 -1 created object store
/var/lib/ceph/osd/ceph-50 journal /dev/sdc7 for osd.50 fsid
c51a2683-55dc-4634-9d9d-f0fec9a6f389
/var/lib/ceph/osd/ceph-50/keyring: can't open
/var/lib/ceph/osd/ceph-50/keyring: (2) No such file or directory
2014-05-15 00:20:29.809179 7f40063bb780 -1 created new key in keyring
/var/lib/ceph/osd/ceph-50/keyring
root at osd1:/# ceph-osd --debug_ms 50 -i 50 --mkfs --mkkey
--osd-journal=/dev/sdc7 --mkjournal
aio not supported without directio; disabling aio
aio not supported without directio; disabling aio
2014-05-15 00:20:51.129532 7ff813ba4780 -1 provided osd id 50 !=
superblock's -1
2014-05-15 00:20:51.129845 7ff813ba4780 -1 ** ERROR: error creating
empty object store in /var/lib/ceph/osd/ceph-50: (22) Invalid argument
Cheers,
Josef
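(For what it's worth, the second --mkfs run above appears to fail simply because the first run already initialised an object store in that directory, and the earlier "ondisk fsid ... doesn't match" message just means the journal partition carried a stale header, which --mkjournal rewrites. A hedged retry sketch, destructive and only for a throwaway OSD:)
---
rm -rf /var/lib/ceph/osd/ceph-50/*            # clear the half-initialised data dir
dd if=/dev/zero of=/dev/sdc7 bs=1M count=16   # wipe the stale journal header
ceph-osd -i 50 --mkfs --mkkey --osd-journal=/dev/sdc7 --mkjournal
---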
Post by Christian Balzer
Hello!
Post by Alexandre DERUMIER
Hi Christian,
I missed this thread, haven't been reading the list that well the last
weeks.
You already know my setup, since we discussed it in an earlier thread. I
don't have a fast backing store, but I see the slow IOPS when doing
randwrite inside the VM, with rbd cache. Still running dumpling here though.
Nods, I do recall that thread.
Post by Alexandre DERUMIER
A thought struck me that I could test with a pool that consists of OSDs
that have tempfs-based disks, think I have a bit more latency than your
IPoIB but I've pushed 100k IOPS with the same network devices before.
This would verify if the problem is with the journal disks. I'll also
try to run the journal devices in tempfs as well, as it would test
purely Ceph itself.
That would be interesting indeed.
Given what I've seen (with the journal at 20% utilization and the actual
filestore at around 5%) I'd expect Ceph to be the culprit.
Post by Alexandre DERUMIER
I'll get back to you with the results, hopefully I'll manage to get them
done during this night.
Looking forward to that. ^^
Christian
Post by Alexandre DERUMIER
Cheers,
Josef
Post by Christian Balzer
I'm clearly talking to myself, but whatever.
For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.
Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front of
a RAID/JBODs or using SSDs for final storage?
If so, what results do you get out of the fio statement below per OSD?
In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD,
which is of course vastly faster than the normal individual HDDs could
do.
So I'm wondering if I'm hitting some inherent limitation of how fast a
single OSD (as in the software) can handle IOPS, given that everything
else has been ruled out from where I stand.
This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while in
my test case the only time the caching becomes noticeable is if I
increase the cache size to something larger than the test data size.
^o^
On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.
Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
In the a "picture" being worth a thousand words tradition, I give you
avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58
Device:         rrqm/s   wrqm/s     r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    51.50    0.00  1633.50     0.00   7460.00     9.13     0.18    0.11    0.00    0.11   0.01   1.40
sdb               0.00     0.00    0.00  1240.50     0.00   5244.00     8.45     0.30    0.25    0.00    0.25   0.02   2.00
sdc               0.00     5.00    0.00  2468.50     0.00  13419.00    10.87     0.24    0.10    0.00    0.10   0.09  22.00
sdd               0.00     6.50    0.00  1913.00     0.00  10313.00    10.78     0.20    0.10    0.00    0.10   0.09  16.60
The %user CPU utilization is pretty much entirely the 2 OSD processes,
note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and service
times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network
results below is that the latency happens within the OSD processes.
Regards,
Christian
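(One way to see where that time goes inside the OSD process is the admin socket; a hedged sketch, since the exact fields and the availability of "ceph osd perf" vary by release:)
---
# commit/apply latency per OSD, as reported by the cluster
ceph osd perf
# slowest recent ops on one OSD, with per-stage timestamps
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops
---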
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try running RADOS bench and
smalliobench at a few different sizes.
-Greg
On Wednesday, May 7, 2014, Alexandre DERUMIER <aderumier at odiso.com>
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Gregory Farnum" <greg at inktank.com>,
ceph-users at lists.ceph.com
Sent: Thursday, 8 May 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6
behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop,
but the system is not CPU or otherwise resource starved at that
moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for
the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is
IPoIB with the associated low latency and the journal SSDs are
the (consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would
be 256 IOs at a time, coming from different hosts over different
links and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side objecter
settings are; it might be limiting you to fewer outstanding ops
than that.
Googling for client-side objecter gives a few hits on ceph devel and
bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very
much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would be
interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a
wider array of experiments to try and figure out where it's coming
from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with
IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local
fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really
sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion
Communications http://www.gol.com/
Christian Balzer
2014-05-23 15:57:50 UTC
Permalink
For what it's worth (very little in my case)...

Since the cluster wasn't in production yet and Firefly (0.80.1) did hit
Debian Jessie today I upgraded it.

Big mistake...

I did the recommended upgrade song and dance, MONs first, OSDs after that.

Then applied "ceph osd crush tunables default" as per the update
instructions and since "ceph -s" was whining about it.

Lastly I did a "ceph osd pool set rbd hashpspool true" and after that was
finished (people with either a big cluster or slow network probably should
avoid this like the plague) I re-ran the below fio from a VM (old or new
client libraries made no difference) again.

The result, 2800 write IOPS instead of 3200 with Emperor.

So much for improved latency and whatnot...

Christian
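(If the regression is to be pinned down, the tunables half of that upgrade can at least be inspected and, if need be, rolled back; a hedged sketch, and note that switching profiles shuffles data around:)
---
ceph osd crush show-tunables        # what the cluster is using now
ceph osd crush tunables legacy      # back to the pre-Firefly profile (causes rebalancing)
---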
Post by Christian Balzer
Hello!
Post by Alexandre DERUMIER
Hi Christian,
I missed this thread, haven't been reading the list that well the last
weeks.
You already know my setup, since we discussed it in an earlier thread.
I don't have a fast backing store, but I see the slow IOPS when doing
randwrite inside the VM, with rbd cache. Still running dumpling here though.
Nods, I do recall that thread.
Post by Alexandre DERUMIER
A thought struck me that I could test with a pool that consists of OSDs
that have tempfs-based disks, think I have a bit more latency than your
IPoIB but I've pushed 100k IOPS with the same network devices before.
This would verify if the problem is with the journal disks. I'll also
try to run the journal devices in tempfs as well, as it would test
purely Ceph itself.
That would be interesting indeed.
Given what I've seen (with the journal at 20% utilization and the actual
filestore at around 5%) I'd expect Ceph to be the culprit.
Post by Alexandre DERUMIER
I'll get back to you with the results, hopefully I'll manage to get
them done during this night.
Looking forward to that. ^^
Christian
Post by Alexandre DERUMIER
Cheers,
Josef
Post by Christian Balzer
I'm clearly talking to myself, but whatever.
For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.
Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front
of a RAID/JBODs or using SSDs for final storage?
If so, what results do you get out of the fio statement below per
OSD? In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per
OSD, which is of course vastly faster than the normal individual HDDs
could do.
So I'm wondering if I'm hitting some inherent limitation of how fast
a single OSD (as in the software) can handle IOPS, given that
everything else has been ruled out from where I stand.
This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while in
my test case the only time the caching becomes noticeable is if I
increase the cache size to something larger than the test data size.
^o^
On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.
Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
In the a "picture" being worth a thousand words tradition, I give
avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58
Device:         rrqm/s   wrqm/s     r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    51.50    0.00  1633.50     0.00   7460.00     9.13     0.18    0.11    0.00    0.11   0.01   1.40
sdb               0.00     0.00    0.00  1240.50     0.00   5244.00     8.45     0.30    0.25    0.00    0.25   0.02   2.00
sdc               0.00     5.00    0.00  2468.50     0.00  13419.00    10.87     0.24    0.10    0.00    0.10   0.09  22.00
sdd               0.00     6.50    0.00  1913.00     0.00  10313.00    10.78     0.20    0.10    0.00    0.10   0.09  16.60
The %user CPU utilization is pretty much entirely the 2 OSD
processes, note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and service
times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network
results below is that the latency happens within the OSD processes.
Regards,
Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try running RADOS bench and
smalliobench at a few different sizes.
-Greg
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Gregory Farnum" <greg at inktank.com>,
ceph-users at lists.ceph.com
Sent: Thursday, 8 May 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each.
The journals are on (separate) DC 3700s, the actual OSDs are
RAID6 behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on
atop, but the system is not CPU or otherwise resource starved
at that moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for
the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network
is IPoIB with the associated low latency and the journal SSDs
are the (consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would
be 256 IOs at a time, coming from different hosts over different
links and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side
objecter settings are; it might be limiting you to fewer
outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel
and bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is
very much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would
be interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a
wider array of experiments to try and figure out where it's
coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience
with IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local
fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not
really sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion
Communications http://www.gol.com/
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Josef Johansson
2014-06-13 18:22:03 UTC
Permalink
Hey,

That sounds awful. Have you had any luck in increasing the performance?

Cheers,
Josef
Post by Christian Balzer
For what it's worth (very little in my case)...
Since the cluster wasn't in production yet and Firefly (0.80.1) did hit
Debian Jessie today I upgraded it.
Big mistake...
I did the recommended upgrade song and dance, MONs first, OSDs after that.
Then applied "ceph osd crush tunables default" as per the update
instructions and since "ceph -s" was whining about it.
Lastly I did a "ceph osd pool set rbd hashpspool true" and after that was
finished (people with either a big cluster or slow network probably should
avoid this like the plague) I re-ran the below fio from a VM (old or new
client libraries made no difference) again.
The result, 2800 write IOPS instead of 3200 with Emperor.
So much for improved latency and whatnot...
Christian
Post by Christian Balzer
Hello!
Post by Alexandre DERUMIER
Hi Christian,
I missed this thread, haven't been reading the list that well the last
weeks.
You already know my setup, since we discussed it in an earlier thread.
I don't have a fast backing store, but I see the slow IOPS when doing
randwrite inside the VM, with rbd cache. Still running dumpling here though.
Nods, I do recall that thread.
Post by Alexandre DERUMIER
A thought struck me that I could test with a pool that consists of OSDs
that have tempfs-based disks, think I have a bit more latency than your
IPoIB but I've pushed 100k IOPS with the same network devices before.
This would verify if the problem is with the journal disks. I'll also
try to run the journal devices in tempfs as well, as it would test
purely Ceph itself.
That would be interesting indeed.
Given what I've seen (with the journal at 20% utilization and the actual
filestore at around 5%) I'd expect Ceph to be the culprit.
Post by Alexandre DERUMIER
I'll get back to you with the results, hopefully I'll manage to get
them done during this night.
Looking forward to that. ^^
Christian
Post by Alexandre DERUMIER
Cheers,
Josef
Post by Christian Balzer
I'm clearly talking to myself, but whatever.
For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.
Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front
of a RAID/JBODs or using SSDs for final storage?
If so, what results do you get out of the fio statement below per
OSD? In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per
OSD, which is of course vastly faster than the normal individual HDDs
could do.
So I'm wondering if I'm hitting some inherent limitation of how fast
a single OSD (as in the software) can handle IOPS, given that
everything else has been ruled out from where I stand.
This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while in
my test case the only time the caching becomes noticeable is if I
increase the cache size to something larger than the test data size.
^o^
On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.
Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
In the a "picture" being worth a thousand words tradition, I give
avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58
Device:         rrqm/s   wrqm/s     r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    51.50    0.00  1633.50     0.00   7460.00     9.13     0.18    0.11    0.00    0.11   0.01   1.40
sdb               0.00     0.00    0.00  1240.50     0.00   5244.00     8.45     0.30    0.25    0.00    0.25   0.02   2.00
sdc               0.00     5.00    0.00  2468.50     0.00  13419.00    10.87     0.24    0.10    0.00    0.10   0.09  22.00
sdd               0.00     6.50    0.00  1913.00     0.00  10313.00    10.78     0.20    0.10    0.00    0.10   0.09  16.60
The %user CPU utilization is pretty much entirely the 2 OSD
processes, note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and service
times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network
results below is that the latency happens within the OSD processes.
Regards,
Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try running RADOS bench and
smalliobench at a few different sizes.
-Greg
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Gregory Farnum" <greg at inktank.com>,
ceph-users at lists.ceph.com
Sent: Thursday, 8 May 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each.
The journals are on (separate) DC 3700s, the actual OSDs are
RAID6 behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on
atop, but the system is not CPU or otherwise resource starved
at that moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for
the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network
is IPoIB with the associated low latency and the journal SSDs
are the (consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that
correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would
be 256 IOs at a time, coming from different hosts over different
links and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side
objecter settings are; it might be limiting you to fewer
outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel
and bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is
very much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would
be interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a
wider array of experiments to try and figure out where it's
coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience
with IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local
fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not
really sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion
Communications http://www.gol.com/
Christian Balzer
2014-05-08 05:44:49 UTC
Permalink
Hello,
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
No and that is neither an option nor the reason for any performance issues
here.
If you re-read my original mail it clearly states that the same fio can
achieve 110000 IOPS on that raid and that it is not busy at all during the
test.
Post by Alexandre DERUMIER
(how many disks do you have begin the raid6 ?)
11 per OSD.
This will affect the amount of sustainable IOPS of course, but in this
test case every last bit should (and does) fit into the caches.

From the RBD client the transaction should be finished once the primary
and secondary OSD for the PG in question have ACK'ed things.
Post by Alexandre DERUMIER
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
I can, but that is not the test case here.
For the record that pushes it to 12k IOPS, with the journal SSDs reaching
about 30% utilization and the actual OSDs up to 5%.
So much better, but still quite some capacity for improvement.
Post by Alexandre DERUMIER
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
I have that set of course, as well as specifically "writeback" for the KVM
instance in question.

Interestingly I see no difference at all with a KVM instance that is set
explicitly to "none", but that's not part of this particular inquiry
either.

Christian
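(For completeness, the qemu side of that "writeback" setting, as a hedged example with a made-up pool/image name; with libvirt the same thing is cache='writeback' on the disk's <driver> element:)
---
-drive file=rbd:rbd/vm-disk:id=admin,format=raw,if=virtio,cache=writeback
---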
Post by Alexandre DERUMIER
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Gregory Farnum" <greg at inktank.com>, ceph-users at lists.ceph.com
Sent: Thursday, 8 May 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Post by Gregory Farnum
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6
behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the journals
is about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS
after some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop,
but the system is not CPU or otherwise resource starved at that
moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for the
whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is
IPoIB with the associated low latency and the journal SSDs are the
(consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would be
256 IOs at a time, coming from different hosts over different links and
nothing changes.
Post by Gregory Farnum
that's about 40ms of latency per op (for userspace RBD), which seems
awfully long. You should check what your client-side objecter settings
are; it might be limiting you to fewer outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel and
bugs and nothing at all as far as configuration options are concerned.
Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very much
in the same (junior league) ballpark.
Post by Gregory Farnum
If
it's available to you, testing with Firefly or even master would be
interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
Post by Gregory Farnum
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a wider
array of experiments to try and figure out where it's coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and still
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with
IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local fio
tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really sure
what experiments I should run on that. ^o^
Regards,
Christian
Post by Gregory Farnum
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Gregory Farnum
2014-05-08 05:13:53 UTC
Permalink
Oh, I didn't notice that. I bet you aren't getting the expected throughput
on the RAID array with OSD access patterns, and that's applying back
pressure on the journal.

When I suggested other tests, I meant with and without Ceph. One particular
one is OSD bench. That should be interesting to try at a variety of block
sizes. You could also try running RADOS bench and smalliobench at a few
different sizes.
-Greg
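(Hedged examples of those runs, with the pool name and sizes being illustrative:)
---
ceph tell osd.0 bench 1073741824 4096         # write 1 GB to a single OSD in 4 KB chunks
rados -p rbd bench 30 write -b 4096 -t 128    # 4 KB writes, 128 in flight, for 30 seconds
---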
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Gregory Farnum" <greg at inktank.com>,
ceph-users at lists.ceph.com
Sent: Thursday, 8 May 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer <chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6 behind
an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the journals is
about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after
some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop, but
the system is not CPU or otherwise resource starved at that moment.
Running multiple instances of this test from several VMs on different
hosts changes nothing, as in the aggregated IOPS for the whole cluster
will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is IPoIB
with the associated low latency and the journal SSDs are the
(consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would be 256
IOs at a time, coming from different hosts over different links and
nothing changes.
that's about 40ms of latency per op (for userspace RBD), which seems
awfully long. You should check what your client-side objecter settings
are; it might be limiting you to fewer outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel and bugs
and nothing at all as far as configuration options are concerned.
Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very much
in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would be
interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a wider
array of experiments to try and figure out where it's coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and still
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with IPoIB
led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local fio
tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really sure
what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
--
Software Engineer #42 @ http://inktank.com | http://ceph.com
Alexandre DERUMIER
2014-05-08 06:41:54 UTC
Permalink
Stupid question : Is your areca 4GB cache shared between ssd journal and osd ?

or only use by osds ?



----- Original Message -----

From: "Christian Balzer" <chibi at gol.com>
To: ceph-users at lists.ceph.com
Sent: Thursday, 8 May 2014 08:26:33
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices


Hello,
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
I doubt that based on what I see in terms of local performance and actual
utilization figures according to iostat and atop during the tests.

But if that were to be true, how would one see if that's the case, as in
where in the plethora of data from:

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump

is the data I'd be looking for?
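(A hedged way to narrow it down: pretty-print the dump and look at the latency counters; names differ a little between releases, but the filestore journal_latency / apply_latency and the osd op_w_latency / subop_w_latency entries are the interesting ones if present:)
---
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump \
  | python -mjson.tool | grep -A 2 latency
---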
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try runnin RADOS bench and
smalliobench at a few different sizes.
I already did the local tests, as in w/o Ceph, see the original mail below.

And you might recall me doing rados benches as well in another thread 2
weeks ago or so.

In either case, osd benching gives me:
---
# time ceph tell osd.0 bench
{ "bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": "247102026.000000"}


real 0m4.483s
---
This is quite a bit slower than this particular SSD (200GB DC 3700) should
be able to write, but I will let that slide.
Note that it is the journal SSD that gets under pressure here (nearly 900%
util) while the OSD is bored at around 15%. Which is no surprise, as it
can write data at up to 1600MB/s.

at 4k blocks we see:
---
# time ceph tell osd.0 bench 1073741824 4096
{ "bytes_written": 1073741824,
"blocksize": 4096,
"bytes_per_sec": "9004316.000000"}


real 1m59.368s
---
Here we get a more balanced picture between journal and storage
utilization, hovering around 40-50%.
So clearly not overtaxing either component.
But yet, this looks like 2100 IOPS to me, if my math is half right.

Rados at 4k gives us this:
---
Total time run: 30.912786
Total writes made: 44490
Write size: 4096
Bandwidth (MB/sec): 5.622

Stddev Bandwidth: 3.31452
Max bandwidth (MB/sec): 9.92578
Min bandwidth (MB/sec): 0
Average Latency: 0.0444653
Stddev Latency: 0.121887
Max latency: 2.80917
Min latency: 0.001958
---
So this is even worse, just about 1500 IOPS.
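(The arithmetic behind those two estimates, for the record:)
---
echo $((9004316 / 4096))   # osd bench at 4 KB: ~2198 IOPS
echo $((44490 / 31))       # rados bench: 44490 writes in ~31 s, ~1435 IOPS
---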

Regards,

Christian
Post by Gregory Farnum
-Greg
On Wednesday, May 7, 2014, Alexandre DERUMIER <aderumier at odiso.com>
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Gregory Farnum" <greg at inktank.com>,
ceph-users at lists.ceph.com
Sent: Thursday, 8 May 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6
behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop,
but the system is not CPU or otherwise resource starved at that
moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for the
whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is
IPoIB with the associated low latency and the journal SSDs are the
(consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would be
256 IOs at a time, coming from different hosts over different links and
nothing changes.
that's about 40ms of latency per op (for userspace RBD), which seems
awfully long. You should check what your client-side objecter
settings are; it might be limiting you to fewer outstanding ops than
that.
Googling for client-side objecter gives a few hits on ceph devel and
bugs and nothing at all as far as configuration options are concerned.
Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very
much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would be
interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a wider
array of experiments to try and figure out where it's coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and still
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with
IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local fio
tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really
sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Christian Balzer
2014-05-08 06:52:15 UTC
Permalink
Post by Alexandre DERUMIER
Stupid question : Is your areca 4GB cache shared between ssd journal and osd ?
Not a stupid question.
I made that mistake about 3 years ago in a DRBD setup, OS and activity log
SSDs on the same controller as the storage disks.
Post by Alexandre DERUMIER
or only use by osds ?
Only used by the OSDs (2 in total, 11x3TB HDD in RAID6).
I keep repeating myself, neither the journal devices nor the OSDs seem to
be under any particular load or pressure (utilization) according to iostat
and atop during the tests.

Christian
Post by Alexandre DERUMIER
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: ceph-users at lists.ceph.com
Sent: Thursday, 8 May 2014 08:26:33
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Hello,
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
I doubt that based on what I see in terms of local performance and
actual utilization figures according to iostat and atop during the
tests.
But if that were to be true, how would one see if that's the case, as in
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
is the data I'd be looking for?
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try running RADOS bench and
smalliobench at a few different sizes.
I already did the local tests, as in w/o Ceph, see the original mail below.
And you might recall me doing rados benches as well in another thread 2
weeks ago or so.
---
# time ceph tell osd.0 bench
{ "bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": "247102026.000000"}
real 0m4.483s
---
This is quite a bit slower than this particular SSD (200GB DC 3700)
should be able to write, but I will let that slide.
Note that it is the journal SSD that gets under pressure here (nearly
900% util) while the OSD is bored at around 15%. Which is no surprise,
as it can write data at up to 1600MB/s.
---
# time ceph tell osd.0 bench 1073741824 4096
{ "bytes_written": 1073741824,
"blocksize": 4096,
"bytes_per_sec": "9004316.000000"}
real 1m59.368s
---
Here we get a more balanced picture between journal and storage
utilization, hovering around 40-50%.
So clearly not overtaxing either component.
But yet, this looks like 2100 IOPS to me, if my math is half right.
---
Total time run: 30.912786
Total writes made: 44490
Write size: 4096
Bandwidth (MB/sec): 5.622
Stddev Bandwidth: 3.31452
Max bandwidth (MB/sec): 9.92578
Min bandwidth (MB/sec): 0
Average Latency: 0.0444653
Stddev Latency: 0.121887
Max latency: 2.80917
Min latency: 0.001958
---
So this is even worse, just about 1500 IOPS.
Regards,
Christian
Post by Gregory Farnum
-Greg
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Gregory Farnum" <greg at inktank.com>,
ceph-users at lists.ceph.com
Sent: Thursday, 8 May 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each.
The journals are on (separate) DC 3700s, the actual OSDs are
RAID6 behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on
atop, but the system is not CPU or otherwise resource starved at
that moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for
the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network
is IPoIB with the associated low latency and the journal SSDs
are the (consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that
correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would
be 256 IOs at a time, coming from different hosts over different
links and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side
objecter settings are; it might be limiting you to fewer
outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel and
bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very
much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would
be interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a
wider array of experiments to try and figure out where it's coming
from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with
IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local
fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really
sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion
Communications http://www.gol.com/
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Christian Balzer
2014-05-08 06:52:15 UTC
Permalink
Post by Alexandre DERUMIER
Stupid question : Is your areca 4GB cache shared between ssd journal and osd ?
Not a stupid question.
I made that mistake about 3 years ago in a DRBD setup, OS and activity log
SSDs on the same controller as the storage disks.
Post by Alexandre DERUMIER
or only used by the OSDs?
Only used by the OSDs (2 in total, 11x3TB HDD in RAID6).
I keep repeating myself, neither the journal devices nor the OSDs seem to
be under any particular load or pressure (utilization) according to iostat
and atop during the tests.
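For reference, the utilization figures above come from plain iostat sampling;
something like the following (device names as used later in this thread, so
treat them as an example) reproduces that view:
---
iostat -x 2 sda sdb sdc sdd
---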

Christian
Post by Alexandre DERUMIER
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: ceph-users at lists.ceph.com
Sent: Thursday, May 8, 2014 08:26:33
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Hello,
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
I doubt that based on what I see in terms of local performance and
actual utilization figures according to iostat and atop during the
tests.
But if that were to be true, how would one see if that's the case, as in
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
is the data I'd be looking for?
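A minimal way to narrow that dump down, assuming only stock tooling; the
journal/filestore counter names vary by release, so the grep pattern is just
a starting point:
---
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump \
    | python -m json.tool > osd0-perf.json
grep -i -A 2 'journal\|throttle' osd0-perf.json | less
---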
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try running RADOS bench and
smalliobench at a few different sizes.
I already did the local tests, as in w/o Ceph, see the original mail below.
And you might recall me doing rados benches as well in another thread 2
weeks ago or so.
---
# time ceph tell osd.0 bench
{ "bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": "247102026.000000"}
real 0m4.483s
---
This is quite a bit slower than this particular SSD (200GB DC 3700)
should be able to write, but I will let that slide.
Note that it is the journal SSD that gets under pressure here (nearly
900% util) while the OSD is bored at around 15%. Which is no surprise,
as it can write data at up to 1600MB/s.
---
# time ceph tell osd.0 bench 1073741824 4096
{ "bytes_written": 1073741824,
"blocksize": 4096,
"bytes_per_sec": "9004316.000000"}
real 1m59.368s
---
Here we get a more balanced picture between journal and storage
utilization, hovering around 40-50%.
So clearly not overtaxing either component.
But yet, this looks like 2100 IOPS to me, if my math is half right.
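For the record, that is simply the reported bytes_per_sec divided by the
block size (nothing Ceph-specific):
---
$ echo $((9004316 / 4096))
2198
---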
---
Total time run: 30.912786
Total writes made: 44490
Write size: 4096
Bandwidth (MB/sec): 5.622
Stddev Bandwidth: 3.31452
Max bandwidth (MB/sec): 9.92578
Min bandwidth (MB/sec): 0
Average Latency: 0.0444653
Stddev Latency: 0.121887
Max latency: 2.80917
Min latency: 0.001958
---
So this is even worse, just about 1500 IOPS.
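The same back-of-the-envelope check for the rados bench run, either from
total writes over runtime or from the reported bandwidth:
---
$ echo "44490 / 30.912786" | bc -l      # writes / seconds
1439.21...
$ echo "5.622 * 1024 / 4" | bc -l       # MB/s converted to 4k IOPS
1439.23...
---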
Regards,
Christian
Post by Gregory Farnum
-Greg
Post by Alexandre DERUMIER
Hi Christian,
Have you tried without RAID6, to have more OSDs?
(how many disks do you have behind the RAID6?)
Also, I know that direct I/O can be quite slow with Ceph,
maybe you can try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Gregory Farnum" <greg at inktank.com>,
ceph-users at lists.ceph.com
Sent: Thursday, May 8, 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each.
The journals are on (separate) DC 3700s, the actual OSDs are
RAID6 behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on
atop, but the system is not CPU or otherwise resource starved at
that moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for
the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network
is IPoIB with the associated low latency and the journal SSDs
are the (consistently) fastest ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that
correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would
be 256 IOs at a time, coming from different hosts over different
links and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side
objecter settings are; it might be limiting you to fewer
outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel and
bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very
much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would
be interesting; there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a
wider array of experiments to try and figure out where it's coming
from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and still isn't that laggy:
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with
IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local
fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really
sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Alexandre DERUMIER
2014-05-08 09:31:54 UTC
Permalink
Post by Christian Balzer
The OSD processes are quite busy, reading well over 200% on atop, but
the system is not CPU or otherwise resource starved at that moment.
The OSD daemon uses 2 op threads by default (which could explain the 200%).

Maybe you can try to put in ceph.conf:

osd op threads = 8


(I don't know how many cores you have)
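A sketch of how that could be applied without a restart, assuming the
injectargs syntax of this release (option name unchanged from ceph.conf):
---
# ceph.conf, [osd] section:
#     osd op threads = 8
# or on a running cluster:
ceph tell osd.* injectargs '--osd-op-threads 8'
---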



----- Original Message -----

From: "Christian Balzer" <chibi at gol.com>
To: ceph-users at lists.ceph.com
Cc: "Alexandre DERUMIER" <aderumier at odiso.com>
Sent: Thursday, May 8, 2014 08:52:15
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Post by Christian Balzer
Stupid question : Is your areca 4GB cache shared between ssd journal and
osd ?
Not a stupid question.
I made that mistake about 3 years ago in a DRBD setup, OS and activity log
SSDs on the same controller as the storage disks.
Post by Christian Balzer
or only used by the OSDs?
Only used by the OSDs (2 in total, 11x3TB HDD in RAID6).
I keep repeating myself, neither the journal devices nor the OSDs seem to
be under any particular load or pressure (utilization) according to iostat
and atop during the tests.

Christian
Post by Christian Balzer
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: ceph-users at lists.ceph.com
Sent: Thursday, May 8, 2014 08:26:33
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing
devices
Hello,
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
I doubt that based on what I see in terms of local performance and
actual utilization figures according to iostat and atop during the
tests.
But if that were to be true, how would one see if that's the case, as in
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
is the data I'd be looking for?
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try running RADOS bench and
smalliobench at a few different sizes.
I already did the local tests, as in w/o Ceph, see the original mail
below.
And you might recall me doing rados benches as well in another thread 2
weeks ago or so.
---
# time ceph tell osd.0 bench
{ "bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": "247102026.000000"}
real 0m4.483s
---
This is quite a bit slower than this particular SSD (200GB DC 3700)
should be able to write, but I will let that slide.
Note that it is the journal SSD that gets under pressure here (nearly
900% util) while the OSD is bored at around 15%. Which is no surprise,
as it can write data at up to 1600MB/s.
---
# time ceph tell osd.0 bench 1073741824 4096
{ "bytes_written": 1073741824,
"blocksize": 4096,
"bytes_per_sec": "9004316.000000"}
real 1m59.368s
---
Here we get a more balanced picture between journal and storage
utilization, hovering around 40-50%.
So clearly not overtaxing either component.
But yet, this looks like 2100 IOPS to me, if my math is half right.
---
Total time run: 30.912786
Total writes made: 44490
Write size: 4096
Bandwidth (MB/sec): 5.622
Stddev Bandwidth: 3.31452
Max bandwidth (MB/sec): 9.92578
Min bandwidth (MB/sec): 0
Average Latency: 0.0444653
Stddev Latency: 0.121887
Max latency: 2.80917
Min latency: 0.001958
---
So this is even worse, just about 1500 IOPS.
Regards,
Christian
Post by Gregory Farnum
-Greg
On Wednesday, May 7, 2014, Alexandre DERUMIER <aderumier at odiso.com>
Post by Alexandre DERUMIER
Hi Christian,
Have you tried without RAID6, to have more OSDs?
(how many disks do you have behind the RAID6?)
Also, I know that direct I/O can be quite slow with Ceph,
maybe you can try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Gregory Farnum" <greg at inktank.com>,
ceph-users at lists.ceph.com
Sent: Thursday, May 8, 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each.
The journals are on (separate) DC 3700s, the actual OSDs are
RAID6 behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on
atop, but the system is not CPU or otherwise resource starved at
that moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for
the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network
is IPoIB with the associated low latency and the journal SSDs
are the (consistently) fastest ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got
lost.
Hmm, with 128 IOs at a time (I believe I'm reading that
correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would
be 256 IOs at a time, coming from different hosts over different
links and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side
objecter settings are; it might be limiting you to fewer
outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel and
bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very
much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would
be interesting; there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a
wider array of experiments to try and figure out where it's coming
from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and still isn't that laggy:
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with
IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local
fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really
sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Christian Balzer
2014-05-08 09:42:12 UTC
Permalink
Hello,
Post by Alexandre DERUMIER
Post by Christian Balzer
The OSD processes are quite busy, reading well over 200% on atop, but
the system is not CPU or otherwise resource starved at that moment.
The OSD daemon uses 2 op threads by default (which could explain the 200%).
Maybe you can try to put in ceph.conf:
osd op threads = 8
Already at 10 (for some weeks now). ^o^

How that setting relates to the actual 220 threads per OSD process is a
mystery for another day.
Post by Alexandre DERUMIER
(I don't know how many cores you have)
6.
The OSDs get busy (CPU, not IOWAIT), but there still are 1-2 cores idle at
that point.
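For what it's worth, a quick way to count those threads per OSD process
(plain procfs, nothing Ceph-specific; pidof returns every ceph-osd on the
node):
---
for p in $(pidof ceph-osd); do
    echo "pid $p: $(grep Threads /proc/$p/status)"
done
---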
Post by Alexandre DERUMIER
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: ceph-users at lists.ceph.com
Cc: "Alexandre DERUMIER" <aderumier at odiso.com>
Sent: Thursday, May 8, 2014 08:52:15
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Post by Christian Balzer
Stupid question : Is your areca 4GB cache shared between ssd journal and osd ?
Not a stupid question.
I made that mistake about 3 years ago in a DRBD setup, OS and activity
log SSDs on the same controller as the storage disks.
Post by Christian Balzer
or only used by the OSDs?
Only used by the OSDs (2 in total, 11x3TB HDD in RAID6).
I keep repeating myself, neither the journal devices nor the OSDs seem
to be under any particular load or pressure (utilization) according to
iostat and atop during the tests.
Christian
Post by Christian Balzer
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: ceph-users at lists.ceph.com
Sent: Thursday, May 8, 2014 08:26:33
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Hello,
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
I doubt that based on what I see in terms of local performance and
actual utilization figures according to iostat and atop during the
tests.
But if that were to be true, how would one see if that's the case, as
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
is the data I'd be looking for?
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try running RADOS bench and
smalliobench at a few different sizes.
I already did the local tests, as in w/o Ceph, see the original mail below.
And you might recall me doing rados benches as well in another thread
2 weeks ago or so.
---
# time ceph tell osd.0 bench
{ "bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": "247102026.000000"}
real 0m4.483s
---
This is quite a bit slower than this particular SSD (200GB DC 3700)
should be able to write, but I will let that slide.
Note that it is the journal SSD that gets under pressure here (nearly
900% util) while the OSD is bored at around 15%. Which is no surprise,
as it can write data at up to 1600MB/s.
---
# time ceph tell osd.0 bench 1073741824 4096
{ "bytes_written": 1073741824,
"blocksize": 4096,
"bytes_per_sec": "9004316.000000"}
real 1m59.368s
---
Here we get a more balanced picture between journal and storage
utilization, hovering around 40-50%.
So clearly not overtaxing either component.
But yet, this looks like 2100 IOPS to me, if my math is half right.
---
Total time run: 30.912786
Total writes made: 44490
Write size: 4096
Bandwidth (MB/sec): 5.622
Stddev Bandwidth: 3.31452
Max bandwidth (MB/sec): 9.92578
Min bandwidth (MB/sec): 0
Average Latency: 0.0444653
Stddev Latency: 0.121887
Max latency: 2.80917
Min latency: 0.001958
---
So this is even worse, just about 1500 IOPS.
Regards,
Christian
Post by Gregory Farnum
-Greg
Post by Alexandre DERUMIER
Hi Christian,
Have you tried without RAID6, to have more OSDs?
(how many disks do you have behind the RAID6?)
Also, I know that direct I/O can be quite slow with Ceph,
maybe you can try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Gregory Farnum" <greg at inktank.com>,
ceph-users at lists.ceph.com
Sent: Thursday, May 8, 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each.
The journals are on (separate) DC 3700s, the actual OSDs are
RAID6 behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no
surprise there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on
atop, but the system is not CPU or otherwise resource starved
at that moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for
the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network
is IPoIB with the associated low latency and the journal SSDs
are the (consistently) fastest ones around.
I guess what I am wondering about is if this is normal and to
be expected or if not where all that potential performance got
lost.
Hmm, with 128 IOs at a time (I believe I'm reading that
correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would
be 256 IOs at a time, coming from different hosts over different
links and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side
objecter settings are; it might be limiting you to fewer
outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel
and bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is
very much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would
be interesting; there's some performance work that should
reduce latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a
wider array of experiments to try and figure out where it's
coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and still isn't that laggy:
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience
with IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local
fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not
really sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Alexandre DERUMIER
2014-05-08 04:33:51 UTC
Permalink
Hi Christian,

Have you tried without RAID6, to have more OSDs?
(how many disks do you have behind the RAID6?)


Also, I know that direct I/O can be quite slow with Ceph,

maybe you can try without --direct=1

and also enable rbd_cache

ceph.conf
[client]
rbd cache = true




----- Original Message -----

From: "Christian Balzer" <chibi at gol.com>
To: "Gregory Farnum" <greg at inktank.com>, ceph-users at lists.ceph.com
Sent: Thursday, May 8, 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Post by Gregory Farnum
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6 behind
an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the journals is
about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after
some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop, but
the system is not CPU or otherwise resource starved at that moment.
Running multiple instances of this test from several VMs on different
hosts changes nothing, as in the aggregated IOPS for the whole cluster
will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is IPoIB
with the associated low latency and the journal SSDs are the
(consistently) fastest ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would be 256
IOs at a time, coming from different hosts over different links and
nothing changes.
Post by Gregory Farnum
that's about 40ms of latency per op (for userspace RBD), which seems
awfully long. You should check what your client-side objecter settings
are; it might be limiting you to fewer outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel and bugs
and nothing at all as far as configuration options are concerned.
Care to enlighten me where one can find those?

Also note the kernelspace (3.13 if it matters) speed, which is very much
in the same (junior league) ballpark.
Post by Gregory Farnum
If
it's available to you, testing with Firefly or even master would be
interesting; there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
Post by Gregory Farnum
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a wider
array of experiments to try and figure out where it's coming from.
I think we can rule out the network, NPtcp gives me:
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---

For comparison at about 512KB it reaches maximum throughput and still
isn't that laggy:
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
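For anyone wanting to reproduce that: the numbers are from NetPIPE's NPtcp,
which is typically started without arguments on the receiving node and
pointed at it from the sender (the hostname here is a placeholder):
---
# on the receiver
NPtcp
# on the sender
NPtcp -h <receiver-hostname>
---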

So with the network performing as well as my lengthy experience with IPoIB
led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local fio
tests.

That pretty much leaves only Ceph/RBD to look at and I'm not really sure
what experiments I should run on that. ^o^

Regards,

Christian
Post by Gregory Farnum
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Alexandre DERUMIER
2014-05-08 06:41:54 UTC
Permalink
Stupid question: is your Areca 4GB cache shared between the SSD journals and the OSDs?

or only used by the OSDs?



----- Original Message -----

From: "Christian Balzer" <chibi at gol.com>
To: ceph-users at lists.ceph.com
Sent: Thursday, May 8, 2014 08:26:33
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices


Hello,
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
I doubt that based on what I see in terms of local performance and actual
utilization figures according to iostat and atop during the tests.

But if that were to be true, how would one see if that's the case, as in
where in the plethora of data from:

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump

is the data I'd be looking for?
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try running RADOS bench and
smalliobench at a few different sizes.
I already did the local tests, as in w/o Ceph, see the original mail below.

And you might recall me doing rados benches as well in another thread 2
weeks ago or so.

In either case, osd benching gives me:
---
# time ceph tell osd.0 bench
{ "bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": "247102026.000000"}


real 0m4.483s
---
This is quite a bit slower than this particular SSD (200GB DC 3700) should
be able to write, but I will let that slide.
Note that it is the journal SSD that gets under pressure here (nearly 900%
util) while the OSD is bored at around 15%. Which is no surprise, as it
can write data at up to 1600MB/s.

at 4k blocks we see:
---
# time ceph tell osd.0 bench 1073741824 4096
{ "bytes_written": 1073741824,
"blocksize": 4096,
"bytes_per_sec": "9004316.000000"}


real 1m59.368s
---
Here we get a more balanced picture between journal and storage
utilization, hovering around 40-50%.
So clearly not overtaxing either component.
But yet, this looks like 2100 IOPS to me, if my math is half right.

Rados at 4k gives us this:
---
Total time run: 30.912786
Total writes made: 44490
Write size: 4096
Bandwidth (MB/sec): 5.622

Stddev Bandwidth: 3.31452
Max bandwidth (MB/sec): 9.92578
Min bandwidth (MB/sec): 0
Average Latency: 0.0444653
Stddev Latency: 0.121887
Max latency: 2.80917
Min latency: 0.001958
---
So this is even worse, just about 1500 IOPS.

Regards,

Christian
Post by Gregory Farnum
-Greg
On Wednesday, May 7, 2014, Alexandre DERUMIER <aderumier at odiso.com>
Post by Alexandre DERUMIER
Hi Christian,
Have you tried without RAID6, to have more OSDs?
(how many disks do you have behind the RAID6?)
Also, I know that direct I/O can be quite slow with Ceph,
maybe you can try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Gregory Farnum" <greg at inktank.com>,
ceph-users at lists.ceph.com
Sent: Thursday, May 8, 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6
behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop,
but the system is not CPU or otherwise resource starved at that
moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for the
whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is
IPoIB with the associated low latency and the journal SSDs are the
(consistently) fastest ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would be
256 IOs at a time, coming from different hosts over different links and
nothing changes.
that's about 40ms of latency per op (for userspace RBD), which seems
awfully long. You should check what your client-side objecter
settings are; it might be limiting you to fewer outstanding ops than
that.
Googling for client-side objecter gives a few hits on ceph devel and
bugs and nothing at all as far as configuration options are concerned.
Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very
much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would be
interesting; there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a wider
array of experiments to try and figure out where it's coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and still isn't that laggy:
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with
IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local fio
tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really
sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Udo Lembke
2014-05-08 15:20:59 UTC
Permalink
Hi,
I don't think that's related, but how full is your Ceph cluster? Perhaps
it has something to do with fragmentation on the XFS filesystem
(xfs_db -c frag -r device)?

Udo
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The journals
are on (separate) DC 3700s, the actual OSDs are RAID6 behind an Areca 1882
with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise there)
3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the journals is
about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after some
obvious merging).
The OSD processes are quite busy, reading well over 200% on atop, but
the system is not CPU or otherwise resource starved at that moment.
Running multiple instances of this test from several VMs on different hosts
changes nothing, as in the aggregated IOPS for the whole cluster will
still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is IPoIB
with the associated low latency and the journal SSDs are the
(consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be expected
or if not where all that potential performance got lost.
Regards,
Christian
Udo Lembke
2014-05-08 15:27:18 UTC
Permalink
Hi again,
sorry, too fast - but this can't be a problem due to your 4GB cache...

Udo
Post by Udo Lembke
Hi,
I don't think that's related, but how full is your Ceph cluster? Perhaps
it has something to do with fragmentation on the XFS filesystem
(xfs_db -c frag -r device)?
Udo
Christian Balzer
2014-05-08 15:34:29 UTC
Permalink
Hello,
Post by Udo Lembke
Hi,
I don't think that's related, but how full is your Ceph cluster? Perhaps
it has something to do with fragmentation on the XFS filesystem
(xfs_db -c frag -r device)?
As I wrote, this cluster will go into production next week, so it's
neither full nor fragmented.
I'd also think any severe fragmentation would show up in high device
utilization, something I stated that's not present.

In fact after all the initial testing I did defrag the OSDs a few days ago,
not that they actually needed it.
Because for starters it is ext4, not xfs, see:
https://www.mail-archive.com/ceph-users at lists.ceph.com/msg08619.html
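For ext4 the rough equivalent check would be something like the following,
assuming e2fsprogs is installed; the mount point and device are illustrative
and need to match the actual OSD layout:
---
e4defrag -c /var/lib/ceph/osd/ceph-0   # report-only fragmentation score
e2freefrag /dev/sda1                   # free-space fragmentation of the device
---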

For what it's worth, I never got an answer to the actual question in that
mail.

Christian
Post by Udo Lembke
Udo
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6 behind
an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the journals is
about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after
some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop, but
the system is not CPU or otherwise resource starved at that moment.
Running multiple instances of this test from several VMs on different
hosts changes nothing, as in the aggregated IOPS for the whole cluster
will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is IPoIB
with the associated low latency and the journal SSDs are the
(consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Regards,
Christian
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Alexandre DERUMIER
2014-05-13 09:33:27 UTC
Permalink
Hi Christian,

I'm going to test a full SSD cluster in the coming months;
I'll send the results to the list.


Have you tried using 1 OSD per physical disk (without RAID6)?

Maybe there is a bottleneck in the OSD daemon,
and using one OSD daemon per disk could help.
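(I mean something roughly along these lines, one OSD per data disk with the
journal on an SSD partition - host and device names are of course made up:)
---
ceph-deploy osd create node1:sdb:/dev/sdc1
ceph-deploy osd create node1:sdd:/dev/sdc2
---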




----- Mail original -----

De: "Christian Balzer" <chibi at gol.com>
À: ceph-users at lists.ceph.com
Envoyé: Mardi 13 Mai 2014 11:03:47
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices


I'm clearly talking to myself, but whatever.

For Greg, I've played with all the pertinent journal and filestore options
and TCP nodelay, no changes at all.
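(For reference, the kind of knobs I mean - the values shown are just examples
of what I tried, not recommendations:)
---
[osd]
journal max write entries = 10000
journal queue max ops = 50000
filestore queue max ops = 5000
filestore op threads = 8
osd op threads = 8
ms tcp nodelay = true   # the default, toggled for good measure
---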

Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front of a
RAID/JBODs or using SSDs for final storage?

If so, what results do you get out of the fio statement below per OSD?
In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD, which
is of course vastly faster than the normal individual HDDs could do.

So I'm wondering if I'm hitting some inherent limitation of how fast a
single OSD (as in the software) can handle IOPS, given that everything else
has been ruled out from where I stand.

This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while in my
test case the only time the caching becomes noticeable is if I increase
the cache size to something larger than the test data size. ^o^
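(That is, something along these lines on the client side - the 512MB value is
only an illustration, i.e. larger than the 400MB test file:)
---
[client]
rbd cache = true
rbd cache size = 536870912
---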

On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.

Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
In the a "picture" being worth a thousand words tradition, I give you
avg-cpu:  %user  %nice %system %iowait %steal  %idle
          50.82   0.00   19.43    0.17   0.00  29.58
Device: rrqm/s wrqm/s  r/s     w/s    rkB/s    wkB/s  avgrq-sz avgqu-sz await r_await w_await svctm %util
sda       0.00  51.50 0.00  1633.50    0.00  7460.00      9.13     0.18  0.11    0.00    0.11  0.01  1.40
sdb       0.00   0.00 0.00  1240.50    0.00  5244.00      8.45     0.30  0.25    0.00    0.25  0.02  2.00
sdc       0.00   5.00 0.00  2468.50    0.00 13419.00     10.87     0.24  0.10    0.00    0.10  0.09 22.00
sdd       0.00   6.50 0.00  1913.00    0.00 10313.00     10.78     0.20  0.10    0.00    0.10  0.09 16.60
The %user CPU utilization is pretty much entirely the 2 OSD processes,
note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and service
times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network results
below is that the latency happens within the OSD processes.
Regards,
Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try runnin RADOS bench and
smalliobench at a few different sizes.
-Greg
On Wednesday, May 7, 2014, Alexandre DERUMIER <aderumier at odiso.com>
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com <javascript:;>>
?: "Gregory Farnum" <greg at inktank.com <javascript:;>>,
ceph-users at lists.ceph.com <javascript:;>
Envoy?: Jeudi 8 Mai 2014 04:49:16
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com<javascript:;>>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6
behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop,
but the system is not CPU or otherwise resource starved at that
moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for
the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is
IPoIB with the associated low latency and the journal SSDs are
the (consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would be
256 IOs at a time, coming from different hosts over different links
and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side objecter
settings are; it might be limiting you to fewer outstanding ops
than that.
Googling for client-side objecter gives a few hits on ceph devel and
bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very
much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would be
interesting; there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a
wider array of experiments to try and figure out where it's coming
from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and still
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with
IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local
fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really
sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com <javascript:;> Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com <javascript:;>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Christian Balzer
2014-05-13 09:51:37 UTC
Permalink
Hello,
Post by Alexandre DERUMIER
Hi Christian,
I'm going to test a full ssd cluster in coming months,
I'll send result on the mailing.
Looking forward to that.
Post by Alexandre DERUMIER
Do you have tried to use 1 osd by physical disk ? (without raid6)
No; if you look back to last December's "Sanity check..." thread
by me, it gives the reasons.
In short: highest density (thus a replication of 2, made safe by basing it
on RAID6) and operational maintainability (it is a remote data
center, so replacing broken disks is a pain).

That cluster is fast enough for my purposes and that fio test isn't a
typical load for it when it goes into production.
But for designing a general purpose or high performance Ceph cluster in
the future I'd really love to have this mystery solved.
Post by Alexandre DERUMIER
Maybe they are bottleneck in osd daemon,
and using osd daemon by disk could help.
It might, but at the IOPS I'm seeing, anybody using SSDs for file storage
should have screamed out already.
Also, given the CPU usage I'm seeing during that test run, such a setup
would probably require 32+ cores.

Christian
Post by Alexandre DERUMIER
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com>
À: ceph-users at lists.ceph.com
Envoyé: Mardi 13 Mai 2014 11:03:47
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
I'm clearly talking to myself, but whatever.
For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.
Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front of a
RAID/JBODs or using SSDs for final storage?
If so, what results do you get out of the fio statement below per OSD?
In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD,
which is of course vastly faster than the normal indvidual HDDs could
do.
So I'm wondering if I'm hitting some inherent limitation of how fast a
single OSD (as in the software) can handle IOPS, given that everything
else has been ruled out from where I stand.
This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while in my
test case the only time the caching becomes noticeable is if I increase
the cache size to something larger than the test data size. ^o^
On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.
Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
In the a "picture" being worth a thousand words tradition, I give you
avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 51.50 0.00 1633.50 0.00 7460.00
9.13 0.18 0.11 0.00 0.11 0.01 1.40 sdb
0.00 0.00 0.00 1240.50 0.00 5244.00 8.45 0.30
0.25 0.00 0.25 0.02 2.00 sdc 0.00 5.00
0.00 2468.50 0.00 13419.00 10.87 0.24 0.10 0.00
0.10 0.09 22.00 sdd 0.00 6.50 0.00 1913.00
0.00 10313.00 10.78 0.20 0.10 0.00 0.10 0.09 16.60
The %user CPU utilization is pretty much entirely the 2 OSD processes,
note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and service
times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network
results below is that the latency happens within the OSD processes.
Regards,
Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try runnin RADOS bench and
smalliobench at a few different sizes.
-Greg
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com <javascript:;>>
?: "Gregory Farnum" <greg at inktank.com <javascript:;>>,
ceph-users at lists.ceph.com <javascript:;>
Envoy?: Jeudi 8 Mai 2014 04:49:16
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com<javascript:;>>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each.
The journals are on (separate) DC 3700s, the actual OSDs are
RAID6 behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no
surprise there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on
atop, but the system is not CPU or otherwise resource starved
at that moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for
the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network
is IPoIB with the associated low latency and the journal SSDs
are the (consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to
be expected or if not where all that potential performance got
lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would
be 256 IOs at a time, coming from different hosts over different
links and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side
objecter settings are; it might be limiting you to fewer
outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel
and bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is
very much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would
be interesting ? there's some performance work that should
reduce latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a
wider array of experiments to try and figure out where it's
coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience
with IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local
fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not
really sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com <javascript:;> Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com <javascript:;>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Alexandre DERUMIER
2014-05-13 11:36:49 UTC
Permalink
Post by Alexandre DERUMIER
Post by Christian Balzer
It might, but at the IOPS I'm seeing anybody using SSD for file storage
should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a setup
would probably require 32+ cores.
Just found this:

https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf

page12:

" Note: As of Ceph Dumpling release (10/2013), a per-OSD read performance is approximately 4,000 IOPS and a per node limit of around
35,000 IOPS when doing reads directly from pagecache. This appears to indicate that Ceph can make good use of spinning disks for data
storage and may benefit from SSD backed OSDs, though may also be limited on high performance SSDs."


Maybe Inktank could comment on the 4000 IOPS per OSD?
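(Maybe someone could also measure it directly; something like this should show
the per-OSD limit - total size, block size and pool are just examples:)
---
# single-OSD write bench: 1GB total in 4KB writes
ceph tell osd.0 bench 1073741824 4096
# 4KB writes against a pool, 64 in flight
rados bench -p rbd 30 write -b 4096 -t 64 --no-cleanup
---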


----- Mail original -----

De: "Christian Balzer" <chibi at gol.com>
À: ceph-users at lists.ceph.com
Cc: "Alexandre DERUMIER" <aderumier at odiso.com>
Envoyé: Mardi 13 Mai 2014 11:51:37
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices


Hello,
Post by Alexandre DERUMIER
Hi Christian,
I'm going to test a full ssd cluster in coming months,
I'll send result on the mailing.
Looking forward to that.
Post by Alexandre DERUMIER
Do you have tried to use 1 osd by physical disk ? (without raid6)
No, if you look back to the last year December "Sanity check..." thread
by me, it gives the reasons.
In short, highest density (thus replication of 2 and to make that safe
based on RAID6) and operational maintainability (it is a remote data
center, so replacing broken disks is a pain).

That cluster is fast enough for my purposes and that fio test isn't a
typical load for it when it goes into production.
But for designing a general purpose or high performance Ceph cluster in
the future I'd really love to have this mystery solved.
Post by Alexandre DERUMIER
Maybe they are bottleneck in osd daemon,
and using osd daemon by disk could help.
It might, but at the IOPS I'm seeing anybody using SSD for file storage
should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a setup
would probably require 32+ cores.

Christian
Post by Alexandre DERUMIER
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com>
?: ceph-users at lists.ceph.com
Envoy?: Mardi 13 Mai 2014 11:03:47
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing
devices
I'm clearly talking to myself, but whatever.
For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.
Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front of a
RAID/JBODs or using SSDs for final storage?
If so, what results do you get out of the fio statement below per OSD?
In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD,
which is of course vastly faster than the normal indvidual HDDs could
do.
So I'm wondering if I'm hitting some inherent limitation of how fast a
single OSD (as in the software) can handle IOPS, given that everything
else has been ruled out from where I stand.
This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while in my
test case the only time the caching becomes noticeable is if I increase
the cache size to something larger than the test data size. ^o^
On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.
Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
In the a "picture" being worth a thousand words tradition, I give you
avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 51.50 0.00 1633.50 0.00 7460.00
9.13 0.18 0.11 0.00 0.11 0.01 1.40 sdb
0.00 0.00 0.00 1240.50 0.00 5244.00 8.45 0.30
0.25 0.00 0.25 0.02 2.00 sdc 0.00 5.00
0.00 2468.50 0.00 13419.00 10.87 0.24 0.10 0.00
0.10 0.09 22.00 sdd 0.00 6.50 0.00 1913.00
0.00 10313.00 10.78 0.20 0.10 0.00 0.10 0.09 16.60
The %user CPU utilization is pretty much entirely the 2 OSD processes,
note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and service
times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network
results below is that the latency happens within the OSD processes.
Regards,
Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try runnin RADOS bench and
smalliobench at a few different sizes.
-Greg
On Wednesday, May 7, 2014, Alexandre DERUMIER <aderumier at odiso.com>
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com <javascript:;>>
?: "Gregory Farnum" <greg at inktank.com <javascript:;>>,
ceph-users at lists.ceph.com <javascript:;>
Envoy?: Jeudi 8 Mai 2014 04:49:16
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com<javascript:;>>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each.
The journals are on (separate) DC 3700s, the actual OSDs are
RAID6 behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no
surprise there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on
atop, but the system is not CPU or otherwise resource starved
at that moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for
the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network
is IPoIB with the associated low latency and the journal SSDs
are the (consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to
be expected or if not where all that potential performance got
lost.
Hmm, with 128 IOs at a time (I believe I'm reading that
correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would
be 256 IOs at a time, coming from different hosts over different
links and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side
objecter settings are; it might be limiting you to fewer
outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel
and bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is
very much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would
be interesting ? there's some performance work that should
reduce latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a
wider array of experiments to try and figure out where it's
coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience
with IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local
fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not
really sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com <javascript:;> Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com <javascript:;>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Christian Balzer
2014-05-13 12:38:57 UTC
Permalink
Hello,
Post by Alexandre DERUMIER
Post by Alexandre DERUMIER
Post by Christian Balzer
It might, but at the IOPS I'm seeing anybody using SSD for file
storage should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a setup
would probably require 32+ cores.
https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf
That's an interesting find indeed.

The CPU to OSD chart clearly assumes the OSD to be backed by spinning rust
or doing 4MB block transactions.
As stated before, at the 4KB blocksize below one OSD eats up slightly over
2 cores on the 4332HE at full speed.
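(Easy enough to see by just watching the OSD processes during the run, e.g.:)
---
top -p $(pgrep -d, ceph-osd)
---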
Post by Alexandre DERUMIER
" Note: As of Ceph Dumpling release (10/2013), a per-OSD read
performance is approximately 4,000 IOPS and a per node limit of around
35,000 IOPS when doing reads directly from pagecache. This appears to
indicate that Ceph can make good use of spinning disks for data storage
and may benefit from SSD backed OSDs, though may also be limited on high
performance SSDs."
Note that this is a read test and, like nearly all IOPS statements, utterly
worthless unless qualified by things such as block size, working set size and
type of I/O (random or sequential).

For what it's worth, my cluster gives me 4100 IOPS with the sequential fio
run below and 7200 when doing random reads (go figure). Of course I made
sure these came from the pagecache of the storage nodes, no disk I/O
reported at all and the CPUs used just 1 core per OSD.
---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64
---


Christian
Post by Alexandre DERUMIER
Maybe Intank could comment about the 4000iops by osd ?
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com>
À: ceph-users at lists.ceph.com
Cc: "Alexandre DERUMIER" <aderumier at odiso.com>
Envoyé: Mardi 13 Mai 2014 11:51:37
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Hello,
Post by Alexandre DERUMIER
Hi Christian,
I'm going to test a full ssd cluster in coming months,
I'll send result on the mailing.
Looking forward to that.
Post by Alexandre DERUMIER
Do you have tried to use 1 osd by physical disk ? (without raid6)
No, if you look back to the last year December "Sanity check..." thread
by me, it gives the reasons.
In short, highest density (thus replication of 2 and to make that safe
based on RAID6) and operational maintainability (it is a remote data
center, so replacing broken disks is a pain).
That cluster is fast enough for my purposes and that fio test isn't a
typical load for it when it goes into production.
But for designing a general purpose or high performance Ceph cluster in
the future I'd really love to have this mystery solved.
Post by Alexandre DERUMIER
Maybe they are bottleneck in osd daemon,
and using osd daemon by disk could help.
It might, but at the IOPS I'm seeing anybody using SSD for file storage
should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a setup
would probably require 32+ cores.
Christian
Post by Alexandre DERUMIER
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com>
?: ceph-users at lists.ceph.com
Envoy?: Mardi 13 Mai 2014 11:03:47
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
I'm clearly talking to myself, but whatever.
For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.
Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front of
a RAID/JBODs or using SSDs for final storage?
If so, what results do you get out of the fio statement below per OSD?
In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD,
which is of course vastly faster than the normal indvidual HDDs could
do.
So I'm wondering if I'm hitting some inherent limitation of how fast a
single OSD (as in the software) can handle IOPS, given that everything
else has been ruled out from where I stand.
This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while in
my test case the only time the caching becomes noticeable is if I
increase the cache size to something larger than the test data size.
^o^
On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.
Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
In the a "picture" being worth a thousand words tradition, I give
avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 51.50 0.00 1633.50 0.00 7460.00
9.13 0.18 0.11 0.00 0.11 0.01 1.40 sdb
0.00 0.00 0.00 1240.50 0.00 5244.00 8.45 0.30
0.25 0.00 0.25 0.02 2.00 sdc 0.00 5.00
0.00 2468.50 0.00 13419.00 10.87 0.24 0.10 0.00
0.10 0.09 22.00 sdd 0.00 6.50 0.00 1913.00
0.00 10313.00 10.78 0.20 0.10 0.00 0.10 0.09 16.60
The %user CPU utilization is pretty much entirely the 2 OSD
processes, note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and service
times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network
results below is that the latency happens within the OSD processes.
Regards,
Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at
a variety of block sizes. You could also try runnin RADOS bench
and smalliobench at a few different sizes.
-Greg
On Wednesday, May 7, 2014, Alexandre DERUMIER
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com <javascript:;>>
?: "Gregory Farnum" <greg at inktank.com <javascript:;>>,
ceph-users at lists.ceph.com <javascript:;>
Envoy?: Jeudi 8 Mai 2014 04:49:16
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com<javascript:;>>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs
each. The journals are on (separate) DC 3700s, the actual
OSDs are RAID6 behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no
surprise there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at
2% (1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on
atop, but the system is not CPU or otherwise resource
starved at that moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS
for the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the
network is IPoIB with the associated low latency and the
journal SSDs are the (consistently) fasted ones around.
I guess what I am wondering about is if this is normal and
to be expected or if not where all that potential
performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that
would be 256 IOs at a time, coming from different hosts over
different links and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side
objecter settings are; it might be limiting you to fewer
outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel
and bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is
very much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master
would be interesting ? there's some performance work that
should reduce latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph
cluster certainly doesn't require 40ms/op, so you should
probably run a wider array of experiments to try and figure
out where it's coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience
with IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the
local fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not
really sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com <javascript:;> Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com <javascript:;>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Alexandre DERUMIER
2014-05-13 14:09:28 UTC
Permalink
Post by Alexandre DERUMIER
Post by Christian Balzer
For what it's worth, my cluster gives me 4100 IOPS with the sequential fio
run below and 7200 when doing random reads (go figure). Of course I made
sure these came come the pagecache of the storage nodes, no disk I/O
reported at all and the CPUs used just 1 core per OSD.
---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64
---
This seems pretty low;

I can get around 6000 IOPS seq or rand read
with a pretty old cluster:

3-node cluster (replication x3), firefly, kernel 3.10, xfs, no tuning in ceph.conf

each node:
----------
-2x quad-core Xeon E5430 @ 2.66GHz
-4 OSDs, Seagate 7.2k SAS (with 512MB cache on the controller), journal on the same disk as the OSD, no dedicated SSD
-2 gigabit link (lacp)
-switch cisco 2960



each OSD process is at around 30% of one core during the benchmark,
no disk access (pagecache on the ceph nodes)



sequential
----------
# fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64 --filename=/dev/vdb
fiojob: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
2.0.8
Starting 1 process
Jobs: 1 (f=1): [R] [100.0% done] [23968K/0K /s] [5992 /0 iops] [eta 00m:00s]
fiojob: (groupid=0, jobs=1): err= 0: pid=4158
read : io=409600KB, bw=22256KB/s, iops=5564 , runt= 18404msec
slat (usec): min=3 , max=1124 , avg=12.03, stdev=12.72
clat (msec): min=1 , max=405 , avg=11.48, stdev=12.10
lat (msec): min=1 , max=405 , avg=11.50, stdev=12.10
clat percentiles (msec):
| 1.00th=[ 5], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10],
| 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 15],
| 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 206], 99.95th=[ 404],
| 99.99th=[ 404]
bw (KB/s) : min= 7542, max=24720, per=100.00%, avg=22321.06, stdev=3341.21
lat (msec) : 2=0.04%, 4=0.60%, 10=21.40%, 20=77.54%, 50=0.23%
lat (msec) : 250=0.13%, 500=0.06%
cpu : usr=3.76%, sys=10.32%, ctx=45280, majf=0, minf=88
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0


Run status group 0 (all jobs):
READ: io=409600KB, aggrb=22256KB/s, minb=22256KB/s, maxb=22256KB/s, mint=18404msec, maxt=18404msec


Disk stats (read/write):
vdb: ios=101076/0, merge=0/0, ticks=1157172/0, in_queue=1157380, util=99.58%


random read
-----------
# fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=rand-read --name=fiojob --blocksize=4k --iodepth=64 --filename=/dev/vdb
valid values: read Sequential read
: write Sequential write
: randread Random read
: randwrite Random write
: rw Sequential read and write mix
: readwrite Sequential read and write mix
: randrw Random read and write mix


fio: failed parsing rw=rand-read
fiojob: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
2.0.8
Starting 1 process
Jobs: 1 (f=1): [R] [94.7% done] [23752K/0K /s] [5938 /0 iops] [eta 00m:01s]
fiojob: (groupid=0, jobs=1): err= 0: pid=4172
read : io=409600KB, bw=22887KB/s, iops=5721 , runt= 17897msec
slat (usec): min=3 , max=929 , avg=11.75, stdev=11.38
clat (msec): min=1 , max=407 , avg=11.17, stdev= 9.24
lat (msec): min=1 , max=407 , avg=11.18, stdev= 9.24
clat percentiles (msec):
| 1.00th=[ 6], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10],
| 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 14],
| 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 60], 99.95th=[ 359],
| 99.99th=[ 404]
bw (KB/s) : min= 8112, max=25120, per=100.00%, avg=22967.77, stdev=2657.48
lat (msec) : 2=0.05%, 4=0.46%, 10=22.83%, 20=76.34%, 50=0.21%
lat (msec) : 100=0.05%, 250=0.01%, 500=0.06%
cpu : usr=4.14%, sys=10.01%, ctx=44760, majf=0, minf=88
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0


Run status group 0 (all jobs):
READ: io=409600KB, aggrb=22886KB/s, minb=22886KB/s, maxb=22886KB/s, mint=17897msec, maxt=17897msec


Disk stats (read/write):
vdb: ios=100981/0, merge=0/0, ticks=1124768/0, in_queue=1125492, util=99.57%






MonSiteEstLent.com - Blog dédié à la webperformance et la gestion de pics de trafic

----- Mail original -----

De: "Christian Balzer" <chibi at gol.com>
?: "Alexandre DERUMIER" <aderumier at odiso.com>
Cc: ceph-users at lists.ceph.com
Envoy?: Mardi 13 Mai 2014 14:38:57
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices


Hello,
Post by Alexandre DERUMIER
Post by Christian Balzer
Post by Christian Balzer
It might, but at the IOPS I'm seeing anybody using SSD for file
storage should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a setup
would probably require 32+ cores.
https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf
That's and interesting find indeed.

The CPU to OSD chart clearly assumes the OSD to be backed by spinning rust
or doing 4MB block transactions.
As stated before, at the 4KB blocksize below one OSD eats up slightly over
2 cores on the 4332HE at full speed.
Post by Alexandre DERUMIER
" Note: As of Ceph Dumpling release (10/2013), a per-OSD read
performance is approximately 4,000 IOPS and a per node limit of around
35,000 IOPS when doing reads directly from pagecache. This appears to
indicate that Ceph can make good use of spinning disks for data storage
and may benefit from SSD backed OSDs, though may also be limited on high
performance SSDs."
Node that this a read test and like nearly all IOPS statements utterly
worthless unless qualified by things as block size, working set size, type
of I/O (random or sequential).

For what it's worth, my cluster gives me 4100 IOPS with the sequential fio
run below and 7200 when doing random reads (go figure). Of course I made
sure these came come the pagecache of the storage nodes, no disk I/O
reported at all and the CPUs used just 1 core per OSD.
---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64
---


Christian
Post by Alexandre DERUMIER
Maybe Intank could comment about the 4000iops by osd ?
Alexandre Derumier Ingénieur système et stockage Fixe : 03 20 68 90 88 Fax : 03 20 68 90 81 45 Bvd du Général Leclerc 59100 Roubaix 12 rue Marivaux 75002 Paris MonSiteEstLent.com - Blog dédié à la webperformance et la gestion de pics de trafic ----- Mail original -----
Post by Alexandre DERUMIER
De: "Christian Balzer" <chibi at gol.com>
?: ceph-users at lists.ceph.com
Cc: "Alexandre DERUMIER" <aderumier at odiso.com>
Envoy?: Mardi 13 Mai 2014 11:51:37
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing
devices
Hello,
Post by Christian Balzer
Hi Christian,
I'm going to test a full ssd cluster in coming months,
I'll send result on the mailing.
Looking forward to that.
Post by Christian Balzer
Do you have tried to use 1 osd by physical disk ? (without raid6)
No, if you look back to the last year December "Sanity check..." thread
by me, it gives the reasons.
In short, highest density (thus replication of 2 and to make that safe
based on RAID6) and operational maintainability (it is a remote data
center, so replacing broken disks is a pain).
That cluster is fast enough for my purposes and that fio test isn't a
typical load for it when it goes into production.
But for designing a general purpose or high performance Ceph cluster in
the future I'd really love to have this mystery solved.
Post by Christian Balzer
Maybe they are bottleneck in osd daemon,
and using osd daemon by disk could help.
It might, but at the IOPS I'm seeing anybody using SSD for file storage
should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a setup
would probably require 32+ cores.
Christian
Post by Christian Balzer
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com>
?: ceph-users at lists.ceph.com
Envoy?: Mardi 13 Mai 2014 11:03:47
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
I'm clearly talking to myself, but whatever.
For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.
Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front of
a RAID/JBODs or using SSDs for final storage?
If so, what results do you get out of the fio statement below per OSD?
In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD,
which is of course vastly faster than the normal indvidual HDDs could
do.
So I'm wondering if I'm hitting some inherent limitation of how fast a
single OSD (as in the software) can handle IOPS, given that everything
else has been ruled out from where I stand.
This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while in
my test case the only time the caching becomes noticeable is if I
increase the cache size to something larger than the test data size.
^o^
On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.
Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
In the a "picture" being worth a thousand words tradition, I give
avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 51.50 0.00 1633.50 0.00 7460.00
9.13 0.18 0.11 0.00 0.11 0.01 1.40 sdb
0.00 0.00 0.00 1240.50 0.00 5244.00 8.45 0.30
0.25 0.00 0.25 0.02 2.00 sdc 0.00 5.00
0.00 2468.50 0.00 13419.00 10.87 0.24 0.10 0.00
0.10 0.09 22.00 sdd 0.00 6.50 0.00 1913.00
0.00 10313.00 10.78 0.20 0.10 0.00 0.10 0.09 16.60
The %user CPU utilization is pretty much entirely the 2 OSD
processes, note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and service
times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network
results below is that the latency happens within the OSD processes.
Regards,
Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at
a variety of block sizes. You could also try runnin RADOS bench
and smalliobench at a few different sizes.
-Greg
On Wednesday, May 7, 2014, Alexandre DERUMIER
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com <javascript:;>>
?: "Gregory Farnum" <greg at inktank.com <javascript:;>>,
ceph-users at lists.ceph.com <javascript:;>
Envoy?: Jeudi 8 Mai 2014 04:49:16
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com<javascript:;>>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs
each. The journals are on (separate) DC 3700s, the actual
OSDs are RAID6 behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no
surprise there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at
2% (1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on
atop, but the system is not CPU or otherwise resource
starved at that moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS
for the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the
network is IPoIB with the associated low latency and the
journal SSDs are the (consistently) fasted ones around.
I guess what I am wondering about is if this is normal and
to be expected or if not where all that potential
performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that
correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that
would be 256 IOs at a time, coming from different hosts over
different links and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side
objecter settings are; it might be limiting you to fewer
outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel
and bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is
very much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master
would be interesting ? there's some performance work that
should reduce latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph
cluster certainly doesn't require 40ms/op, so you should
probably run a wider array of experiments to try and figure
out where it's coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience
with IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the
local fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not
really sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com <javascript:;> Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com <javascript:;>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Christian Balzer
2014-05-13 14:39:58 UTC
Permalink
Post by Alexandre DERUMIER
Post by Christian Balzer
For what it's worth, my cluster gives me 4100 IOPS with the sequential
fio run below and 7200 when doing random reads (go figure). Of course
I made sure these came come the pagecache of the storage nodes, no
disk I/O reported at all and the CPUs used just 1 core per OSD.
---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64 ---
This seem pretty low,
I can get around 6000iops seq or rand read,
Actually, check your random read output again; you gave it the wrong
parameter, it needs to be randread, not rand-read.
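I.e. for the random case the run should have been:
---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randread --name=fiojob --blocksize=4k --iodepth=64 --filename=/dev/vdb
---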
Post by Alexandre DERUMIER
with a pretty old cluster
Your cluster isn't that old (the CPUs are in the same ballpark) and has 12
OSDs instead of my 4. Plus it has the supposedly faster firefly. ^o^

Remember, all this is coming from RAM, so what it boils down to is CPU
(memory and bus transfer speeds) and of course your network.
Which is probably why your cluster isn't all that much faster than mine.

Either way, that number isn't anywhere near 4000 read IOPS per OSD either:
yours is about 500 (6000/12), mine about 1000 (4100/4)...

Christian
Post by Alexandre DERUMIER
3 nodes cluster (replication x3), firefly, kernel 3.10, xfs, no tuning in ceph.conf
----------
-4 osd, seageate 7,2k sas (with 512MB cache on controller). (journal
on same disk than osd, no dedicated ssd) -2 gigabit link (lacp)
-switch cisco 2960
each osd process are around 30% 1core during benchmark
no disk access (pagecache on ceph nodes)
sequential
----------
# fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64
--filename=/dev/vdb fiojob: (g=0): rw=read, bs=4K-4K/4K-4K,
ioengine=libaio, iodepth=64 2.0.8 Starting 1 process
Jobs: 1 (f=1): [R] [100.0% done] [23968K/0K /s] [5992 /0 iops] [eta
00m:00s] fiojob: (groupid=0, jobs=1): err= 0: pid=4158
read : io=409600KB, bw=22256KB/s, iops=5564 , runt= 18404msec
slat (usec): min=3 , max=1124 , avg=12.03, stdev=12.72
clat (msec): min=1 , max=405 , avg=11.48, stdev=12.10
lat (msec): min=1 , max=405 , avg=11.50, stdev=12.10
| 1.00th=[ 5], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10],
| 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 15],
| 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 206], 99.95th=[ 404],
| 99.99th=[ 404]
bw (KB/s) : min= 7542, max=24720, per=100.00%, avg=22321.06,
stdev=3341.21 lat (msec) : 2=0.04%, 4=0.60%, 10=21.40%, 20=77.54%,
50=0.23% lat (msec) : 250=0.13%, 500=0.06%
cpu : usr=3.76%, sys=10.32%, ctx=45280, majf=0, minf=88
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%,
=64=0.0% issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0
READ: io=409600KB, aggrb=22256KB/s, minb=22256KB/s, maxb=22256KB/s,
mint=18404msec, maxt=18404msec
vdb: ios=101076/0, merge=0/0, ticks=1157172/0, in_queue=1157380, util=99.58%
random read
-----------
# fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=rand-read --name=fiojob --blocksize=4k --iodepth=64
--filename=/dev/vdb valid values: read Sequential read : write
Sequential write : randread Random read
: randwrite Random write
: rw Sequential read and write mix
: readwrite Sequential read and write mix
: randrw Random read and write mix
fio: failed parsing rw=rand-read
fiojob: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
2.0.8
Starting 1 process
Jobs: 1 (f=1): [R] [94.7% done] [23752K/0K /s] [5938 /0 iops] [eta
00m:01s] fiojob: (groupid=0, jobs=1): err= 0: pid=4172
read : io=409600KB, bw=22887KB/s, iops=5721 , runt= 17897msec
slat (usec): min=3 , max=929 , avg=11.75, stdev=11.38
clat (msec): min=1 , max=407 , avg=11.17, stdev= 9.24
lat (msec): min=1 , max=407 , avg=11.18, stdev= 9.24
| 1.00th=[ 6], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10],
| 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 14],
| 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 60], 99.95th=[ 359],
| 99.99th=[ 404]
bw (KB/s) : min= 8112, max=25120, per=100.00%, avg=22967.77,
stdev=2657.48 lat (msec) : 2=0.05%, 4=0.46%, 10=22.83%, 20=76.34%,
50=0.21% lat (msec) : 100=0.05%, 250=0.01%, 500=0.06%
cpu : usr=4.14%, sys=10.01%, ctx=44760, majf=0, minf=88
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%,
=64=0.0% issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0
READ: io=409600KB, aggrb=22886KB/s, minb=22886KB/s, maxb=22886KB/s,
mint=17897msec, maxt=17897msec
vdb: ios=100981/0, merge=0/0, ticks=1124768/0, in_queue=1125492, util=99.57%
MonSiteEstLent.com - Blog d?di? ? la webperformance et la gestion de pics de trafic
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Alexandre DERUMIER" <aderumier at odiso.com>
Cc: ceph-users at lists.ceph.com
Sent: Tuesday, 13 May 2014 14:38:57
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Hello,
Post by Christian Balzer
Post by Christian Balzer
It might, but at the IOPS I'm seeing anybody using SSD for file
storage should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a
setup would probably require 32+ cores.
https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf
That's an interesting find indeed.
The CPU to OSD chart clearly assumes the OSD to be backed by spinning
rust or doing 4MB block transactions.
As stated before, at the 4KB blocksize below one OSD eats up slightly
over 2 cores on the 4332HE at full speed.
" Note: As of Ceph Dumpling release (10/2013), a per-OSD read
performance is approximately 4,000 IOPS and a per node limit of around
35,000 IOPS when doing reads directly from pagecache. This appears to
indicate that Ceph can make good use of spinning disks for data
storage and may benefit from SSD backed OSDs, though may also be
limited on high performance SSDs."
Note that this is a read test and, like nearly all IOPS statements, utterly
worthless unless qualified by things such as block size, working set size,
and type of I/O (random or sequential).
For what it's worth, my cluster gives me 4100 IOPS with the sequential
fio run below and 7200 when doing random reads (go figure). Of course I
made sure these came from the pagecache of the storage nodes, no disk
I/O reported at all and the CPUs used just 1 core per OSD.
---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
--rw=read --name=fiojob --blocksize=4k --iodepth=64 ---
Christian
Maybe Inktank could comment about the 4000 IOPS per OSD?
Alexandre Derumier, Systems and Storage Engineer. Tel: 03 20 68 90 88,
Fax: 03 20 68 90 81. 45 Bvd du Général Leclerc, 59100 Roubaix / 12 rue
Marivaux, 75002 Paris. MonSiteEstLent.com - A blog dedicated to web
performance and handling traffic spikes. ----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: ceph-users at lists.ceph.com
Cc: "Alexandre DERUMIER" <aderumier at odiso.com>
Sent: Tuesday, 13 May 2014 11:51:37
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Hello,
Post by Christian Balzer
Hi Christian,
I'm going to test a full SSD cluster in the coming months;
I'll send the results to the mailing list.
Looking forward to that.
Post by Christian Balzer
Have you tried using 1 OSD per physical disk (without RAID6)?
No, if you look back to the last year December "Sanity check..."
thread by me, it gives the reasons.
In short, highest density (thus replication of 2 and to make that safe
based on RAID6) and operational maintainability (it is a remote data
center, so replacing broken disks is a pain).
That cluster is fast enough for my purposes and that fio test isn't a
typical load for it when it goes into production.
But for designing a general purpose or high performance Ceph cluster
in the future I'd really love to have this mystery solved.
Post by Christian Balzer
Maybe there is a bottleneck in the OSD daemon,
and using one OSD daemon per disk could help.
It might, but at the IOPS I'm seeing anybody using SSD for file
storage should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a setup
would probably require 32+ cores.
Christian
Post by Christian Balzer
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: ceph-users at lists.ceph.com
Sent: Tuesday, 13 May 2014 11:03:47
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
I'm clearly talking to myself, but whatever.
For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.
Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front
of a RAID/JBODs or using SSDs for final storage?
If so, what results do you get out of the fio statement below per
OSD? In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per
OSD, which is of course vastly faster than the normal individual HDDs
could do.
So I'm wondering if I'm hitting some inherent limitation of how fast
a single OSD (as in the software) can handle IOPS, given that
everything else has been ruled out from where I stand.
This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while
in my test case the only time the caching becomes noticeable is if I
increase the cache size to something larger than the test data size.
^o^
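For reference, increasing the cache beyond the 400MB working set means a
client-side ceph.conf section along these lines; the value here is only an
illustration of that experiment, not the exact one used:
---
[client]
rbd cache = true
# default rbd cache size is 32MB; set it larger than the fio working set
rbd cache size = 536870912
---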
On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.
Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and
that's applying back pressure on the journal.
In the "a picture is worth a thousand words" tradition, I give you this iostat snapshot:
avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 51.50 0.00 1633.50 0.00 7460.00 9.13 0.18 0.11 0.00 0.11 0.01 1.40
sdb 0.00 0.00 0.00 1240.50 0.00 5244.00 8.45 0.30 0.25 0.00 0.25 0.02 2.00
sdc 0.00 5.00 0.00 2468.50 0.00 13419.00 10.87 0.24 0.10 0.00 0.10 0.09 22.00
sdd 0.00 6.50 0.00 1913.00 0.00 10313.00 10.78 0.20 0.10 0.00 0.10 0.09 16.60
The %user CPU utilization is pretty much entirely the 2 OSD
processes, note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and
service times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network
results below is that the latency happens within the OSD
processes.
Regards,
Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try
at a variety of block sizes. You could also try running RADOS
bench and smalliobench at a few different sizes.
-Greg
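For reference, those benchmarks would presumably be invoked along these
lines (pool name, sizes and thread counts are only placeholders):
---
# per-OSD write bench: 1GB total in 4KB writes
ceph tell osd.0 bench 1073741824 4096

# cluster-wide 4KB writes for 60 seconds, 64 concurrent ops,
# keeping the objects so a read bench can follow
rados bench -p rbd 60 write -b 4096 -t 64 --no-cleanup
rados bench -p rbd 60 rand -t 64
---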
On Wednesday, May 7, 2014, Alexandre DERUMIER
Post by Alexandre DERUMIER
Hi Christian,
Have you tried without RAID6, to have more OSDs?
(How many disks do you have behind the RAID6?)
Also, I know that direct I/O can be quite slow with Ceph;
maybe you can try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Gregory Farnum" <greg at inktank.com>,
ceph-users at lists.ceph.com
Sent: Thursday, 8 May 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal
and backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com<javascript:;>>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs
each. The journals are on (separate) DC 3700s, the actual
OSDs are RAID6 behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1
--direct=1 --numjobs=1 --rw=randwrite --name=fiojob
--blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no
surprise there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of
the journals is about 20% (2400 IOPS) and the OSDs are
bored at 2% (1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200%
on atop, but the system is not CPU or otherwise resource
starved at that moment.
Running multiple instances of this test from several VMs
on different hosts changes nothing, as in the aggregated
IOPS for the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the
network is IPoIB with the associated low latency and the
journal SSDs are the (consistently) fastest ones around.
I guess what I am wondering about is if this is normal and
to be expected or if not where all that potential
performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that
would be 256 IOs at a time, coming from different hosts over
different links and nothing changes.
that's about 40ms of latency per op (for userspace RBD),
which seems awfully long. You should check what your
client-side objecter settings are; it might be limiting you
to fewer outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph
devel and bugs and nothing at all as far as configuration
options are concerned. Care to enlighten me where one can find
those?
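For what it's worth, the settings in question appear to be the objecter
throttles, which would live in the [client] section of ceph.conf; the
names and defaults below are from memory and untested here:
---
[client]
# maximum number of ops the client keeps in flight (default 1024)
objecter inflight ops = 2048
# maximum bytes of in-flight data (default 100MB)
objecter inflight op bytes = 209715200
---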
Also note the kernelspace (3.13 if it matters) speed, which is
very much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master
would be interesting; there's some performance work that
should reduce latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph
cluster certainly doesn't require 40ms/op, so you should
probably run a wider array of experiments to try and figure
out where it's coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy
experience with IPoIB led me to believe, what else is there to
look at? The storage nodes perform just as expected, indicated
by the local fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not
really sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Alexandre DERUMIER
2014-05-13 15:16:25 UTC
Permalink
Post by Alexandre DERUMIER
Post by Christian Balzer
Actually check your random read output again, you gave it the wrong
parameter, it needs to be randread, not rand-read.
Oops, sorry. I got around 7500 IOPS with randread.
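The corrected invocation would then be:
---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 \
    --numjobs=1 --rw=randread --name=fiojob --blocksize=4k --iodepth=64 \
    --filename=/dev/vdb
---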
Post by Alexandre DERUMIER
Post by Christian Balzer
Your cluster isn't that old (the CPUs are in the same ballpark)
Yes, this is a 6-7 year old server (these Xeons were released in 2007...).

So it misses some features like CRC32 and SSE4, for example, which can help Ceph a lot.



I'll try to do some OSD tuning (threads, ...) to see if I can improve performance.
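For reference, the thread-related options I have in mind go into the [osd]
section of ceph.conf, along these lines (values are only examples to
experiment with, not tested recommendations):
---
[osd]
# defaults are 2 / 2 / 1
osd op threads = 8
filestore op threads = 4
osd disk threads = 2
---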


----- Original Message -----

From: "Christian Balzer" <chibi at gol.com>
To: "Alexandre DERUMIER" <aderumier at odiso.com>
Cc: ceph-users at lists.ceph.com
Sent: Tuesday, 13 May 2014 16:39:58
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Post by Alexandre DERUMIER
Post by Christian Balzer
Post by Christian Balzer
For what it's worth, my cluster gives me 4100 IOPS with the sequential
fio run below and 7200 when doing random reads (go figure). Of course
I made sure these came come the pagecache of the storage nodes, no
disk I/O reported at all and the CPUs used just 1 core per OSD.
---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64 ---
This seems pretty low;
I can get around 6000 IOPS seq or rand read,
Actually check your random read output again, you gave it the wrong
parameter, it needs to be randread, not rand-read.
Post by Alexandre DERUMIER
with a pretty old cluster
Your cluster isn't that old (the CPUs are in the same ballpark) and has 12
OSDs instead of my 4. Plus it has the supposedly faster firefly. ^o^

Remember, all this is coming from RAM, so what it boils down to is CPU
(memory and bus transfer speeds) and of course your network.
Which is probably why your cluster isn't that much faster than mine.

Either way, that number isn't anywhere near 4000 read IOPS per OSD either,
yours is about 500, mine about 1000...

Christian
Post by Alexandre DERUMIER
3-node cluster (replication x3), firefly, kernel 3.10, XFS, no tuning
in ceph.conf
----------
-4 OSDs, Seagate 7.2k SAS (with 512MB cache on the controller), journal
on the same disk as the OSD, no dedicated SSD -2 gigabit links (LACP)
-switch Cisco 2960
each OSD process is at around 30% of one core during the benchmark
no disk access (data is served from the pagecache on the Ceph nodes)
sequential
----------
# fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64
--filename=/dev/vdb fiojob: (g=0): rw=read, bs=4K-4K/4K-4K,
ioengine=libaio, iodepth=64 2.0.8 Starting 1 process
Jobs: 1 (f=1): [R] [100.0% done] [23968K/0K /s] [5992 /0 iops] [eta
00m:00s] fiojob: (groupid=0, jobs=1): err= 0: pid=4158
read : io=409600KB, bw=22256KB/s, iops=5564 , runt= 18404msec
slat (usec): min=3 , max=1124 , avg=12.03, stdev=12.72
clat (msec): min=1 , max=405 , avg=11.48, stdev=12.10
lat (msec): min=1 , max=405 , avg=11.50, stdev=12.10
| 1.00th=[ 5], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10],
| 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 15],
| 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 206], 99.95th=[ 404],
| 99.99th=[ 404]
bw (KB/s) : min= 7542, max=24720, per=100.00%, avg=22321.06,
stdev=3341.21 lat (msec) : 2=0.04%, 4=0.60%, 10=21.40%, 20=77.54%,
50=0.23% lat (msec) : 250=0.13%, 500=0.06%
cpu : usr=3.76%, sys=10.32%, ctx=45280, majf=0, minf=88
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%,
Post by Christian Balzer
=64=0.0% issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0
READ: io=409600KB, aggrb=22256KB/s, minb=22256KB/s, maxb=22256KB/s,
mint=18404msec, maxt=18404msec
vdb: ios=101076/0, merge=0/0, ticks=1157172/0, in_queue=1157380,
util=99.58%
random read
-----------
# fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=rand-read --name=fiojob --blocksize=4k --iodepth=64
--filename=/dev/vdb valid values: read Sequential read : write
Sequential write : randread Random read
: randwrite Random write
: rw Sequential read and write mix
: readwrite Sequential read and write mix
: randrw Random read and write mix
fio: failed parsing rw=rand-read
fiojob: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
2.0.8
Starting 1 process
Jobs: 1 (f=1): [R] [94.7% done] [23752K/0K /s] [5938 /0 iops] [eta
00m:01s] fiojob: (groupid=0, jobs=1): err= 0: pid=4172
read : io=409600KB, bw=22887KB/s, iops=5721 , runt= 17897msec
slat (usec): min=3 , max=929 , avg=11.75, stdev=11.38
clat (msec): min=1 , max=407 , avg=11.17, stdev= 9.24
lat (msec): min=1 , max=407 , avg=11.18, stdev= 9.24
| 1.00th=[ 6], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10],
| 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 14],
| 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 60], 99.95th=[ 359],
| 99.99th=[ 404]
bw (KB/s) : min= 8112, max=25120, per=100.00%, avg=22967.77,
stdev=2657.48 lat (msec) : 2=0.05%, 4=0.46%, 10=22.83%, 20=76.34%,
50=0.21% lat (msec) : 100=0.05%, 250=0.01%, 500=0.06%
cpu : usr=4.14%, sys=10.01%, ctx=44760, majf=0, minf=88
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%,
Post by Christian Balzer
=64=0.0% issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0
READ: io=409600KB, aggrb=22886KB/s, minb=22886KB/s, maxb=22886KB/s,
mint=17897msec, maxt=17897msec
vdb: ios=100981/0, merge=0/0, ticks=1124768/0, in_queue=1125492,
util=99.57%
MonSiteEstLent.com - A blog dedicated to web performance and handling
traffic spikes
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Alexandre DERUMIER" <aderumier at odiso.com>
Cc: ceph-users at lists.ceph.com
Sent: Tuesday, 13 May 2014 14:38:57
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing
devices
Hello,
Post by Christian Balzer
Post by Christian Balzer
Post by Christian Balzer
It might, but at the IOPS I'm seeing anybody using SSD for file
storage should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a
setup would probably require 32+ cores.
https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf
That's and interesting find indeed.
The CPU to OSD chart clearly assumes the OSD to be backed by spinning
rust or doing 4MB block transactions.
As stated before, at the 4KB blocksize below one OSD eats up slightly
over 2 cores on the 4332HE at full speed.
Post by Christian Balzer
" Note: As of Ceph Dumpling release (10/2013), a per-OSD read
performance is approximately 4,000 IOPS and a per node limit of around
35,000 IOPS when doing reads directly from pagecache. This appears to
indicate that Ceph can make good use of spinning disks for data
storage and may benefit from SSD backed OSDs, though may also be
limited on high performance SSDs."
Node that this a read test and like nearly all IOPS statements utterly
worthless unless qualified by things as block size, working set size,
type of I/O (random or sequential).
For what it's worth, my cluster gives me 4100 IOPS with the sequential
fio run below and 7200 when doing random reads (go figure). Of course I
made sure these came come the pagecache of the storage nodes, no disk
I/O reported at all and the CPUs used just 1 core per OSD.
---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
--rw=read --name=fiojob --blocksize=4k --iodepth=64 ---
Christian
Post by Christian Balzer
Maybe Intank could comment about the 4000iops by osd ?
Alexandre Derumier, Systems and Storage Engineer. Tel: 03 20 68 90 88,
Fax: 03 20 68 90 81. 45 Bvd du Général Leclerc, 59100 Roubaix / 12 rue
Marivaux, 75002 Paris. MonSiteEstLent.com - A blog dedicated to web
performance and handling traffic spikes. ----- Original Message -----
Post by Christian Balzer
From: "Christian Balzer" <chibi at gol.com>
To: ceph-users at lists.ceph.com
Cc: "Alexandre DERUMIER" <aderumier at odiso.com>
Sent: Tuesday, 13 May 2014 11:51:37
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
Hello,
Post by Christian Balzer
Hi Christian,
I'm going to test a full ssd cluster in coming months,
I'll send result on the mailing.
Looking forward to that.
Post by Christian Balzer
Do you have tried to use 1 osd by physical disk ? (without raid6)
No, if you look back to the last year December "Sanity check..."
thread by me, it gives the reasons.
In short, highest density (thus replication of 2 and to make that safe
based on RAID6) and operational maintainability (it is a remote data
center, so replacing broken disks is a pain).
That cluster is fast enough for my purposes and that fio test isn't a
typical load for it when it goes into production.
But for designing a general purpose or high performance Ceph cluster
in the future I'd really love to have this mystery solved.
Post by Christian Balzer
Maybe they are bottleneck in osd daemon,
and using osd daemon by disk could help.
It might, but at the IOPS I'm seeing anybody using SSD for file
storage should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a setup
would probably require 32+ cores.
Christian
Post by Christian Balzer
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: ceph-users at lists.ceph.com
Sent: Tuesday, 13 May 2014 11:03:47
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
I'm clearly talking to myself, but whatever.
For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.
Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front
of a RAID/JBODs or using SSDs for final storage?
If so, what results do you get out of the fio statement below per
OSD? In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per
OSD, which is of course vastly faster than the normal indvidual HDDs
could do.
So I'm wondering if I'm hitting some inherent limitation of how fast
a single OSD (as in the software) can handle IOPS, given that
everything else has been ruled out from where I stand.
This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while
in my test case the only time the caching becomes noticeable is if I
increase the cache size to something larger than the test data size.
^o^
On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.
Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and
that's applying back pressure on the journal.
In the a "picture" being worth a thousand words tradition, I give
avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 51.50 0.00 1633.50 0.00 7460.00
9.13 0.18 0.11 0.00 0.11 0.01 1.40 sdb
0.00 0.00 0.00 1240.50 0.00 5244.00 8.45 0.30
0.25 0.00 0.25 0.02 2.00 sdc 0.00 5.00
0.00 2468.50 0.00 13419.00 10.87 0.24 0.10 0.00
0.10 0.09 22.00 sdd 0.00 6.50 0.00 1913.00
0.00 10313.00 10.78 0.20 0.10 0.00 0.10 0.09 16.60
The %user CPU utilization is pretty much entirely the 2 OSD
processes, note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and
service times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network
results below is that the latency happens within the OSD
processes.
Regards,
Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try
at a variety of block sizes. You could also try runnin RADOS
bench and smalliobench at a few different sizes.
-Greg
On Wednesday, May 7, 2014, Alexandre DERUMIER
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com <javascript:;>>
?: "Gregory Farnum" <greg at inktank.com <javascript:;>>,
ceph-users at lists.ceph.com <javascript:;>
Envoy?: Jeudi 8 Mai 2014 04:49:16
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal
and backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com<javascript:;>>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs
each. The journals are on (separate) DC 3700s, the actual
OSDs are RAID6 behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1
--direct=1 --numjobs=1 --rw=randwrite --name=fiojob
--blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no
surprise there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of
the journals is about 20% (2400 IOPS) and the OSDs are
bored at 2% (1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200%
on atop, but the system is not CPU or otherwise resource
starved at that moment.
Running multiple instances of this test from several VMs
on different hosts changes nothing, as in the aggregated
IOPS for the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the
network is IPoIB with the associated low latency and the
journal SSDs are the (consistently) fasted ones around.
I guess what I am wondering about is if this is normal and
to be expected or if not where all that potential
performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that
correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that
would be 256 IOs at a time, coming from different hosts over
different links and nothing changes.
that's about 40ms of latency per op (for userspace RBD),
which seems awfully long. You should check what your
client-side objecter settings are; it might be limiting you
to fewer outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph
devel and bugs and nothing at all as far as configuration
options are concerned. Care to enlighten me where one can find
those?
Also note the kernelspace (3.13 if it matters) speed, which is
very much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master
would be interesting ? there's some performance work that
should reduce latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph
cluster certainly doesn't require 40ms/op, so you should
probably run a wider array of experiments to try and figure
out where it's coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy
experience with IPoIB led me to believe, what else is there to
look at? The storage nodes perform just as expected, indicated
by the local fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not
really sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com <javascript:;> Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com <javascript:;>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Alexandre DERUMIER
2014-05-13 16:10:25 UTC
Permalink
I have just done some tests,

with fio-rbd,
(http://telekomcloud.github.io/ceph/2014/02/26/ceph-performance-analysis_fio_rbd.html)

directly from the KVM host (not from the VM).


1 fio job: around 8000 IOPS
2 different parallel fio jobs (on different rbd volumes): around 8000 IOPS per fio job!

CPU on the client is at 100%
CPU of the OSDs is around 70% of 1 core now.


So, there seems to be a bottleneck client-side somewhere.

(I remember some tests from Stefan Priebe on this mailing list, with a full SSD cluster,
having almost the same results.)
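For reference, the fio rbd-engine job from the post linked above looks
roughly like this (pool, image and client names are placeholders; the
image has to exist beforehand, e.g. via "rbd create fio_test --size 2048"):
---
[rbd-randread]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio_test
invalidate=0
rw=randread
bs=4k
iodepth=64
size=400m
---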



----- Original Message -----

From: "Alexandre DERUMIER" <aderumier at odiso.com>
To: "Christian Balzer" <chibi at gol.com>
Cc: ceph-users at lists.ceph.com
Sent: Tuesday, 13 May 2014 17:16:25
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Post by Alexandre DERUMIER
Post by Christian Balzer
Actually check your random read output again, you gave it the wrong
parameter, it needs to be randread, not rand-read.
oops, sorry. I got around 7500iops with randread.
Post by Alexandre DERUMIER
Post by Christian Balzer
Your cluster isn't that old (the CPUs are in the same ballpark)
Yes, this is 6-7 year old server. (this xeons were released in 2007...)

So, it miss some features like crc32 and sse4 for examples, which can help a lot ceph



(I'll try to do some osd tuning (threads,...) to see if I can improve performance.


----- Original Message -----

From: "Christian Balzer" <chibi at gol.com>
To: "Alexandre DERUMIER" <aderumier at odiso.com>
Cc: ceph-users at lists.ceph.com
Sent: Tuesday, 13 May 2014 16:39:58
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Post by Alexandre DERUMIER
Post by Christian Balzer
Post by Christian Balzer
For what it's worth, my cluster gives me 4100 IOPS with the sequential
fio run below and 7200 when doing random reads (go figure). Of course
I made sure these came come the pagecache of the storage nodes, no
disk I/O reported at all and the CPUs used just 1 core per OSD.
---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64 ---
This seem pretty low,
I can get around 6000iops seq or rand read,
Actually check your random read output again, you gave it the wrong
parameter, it needs to be randread, not rand-read.
Post by Alexandre DERUMIER
with a pretty old cluster
Your cluster isn't that old (the CPUs are in the same ballpark) and has 12
OSDs instead of my 4. Plus it has the supposedly faster firefly. ^o^

Remember, all this is coming from RAM, so what it boils down is CPU
(memory and bus transfer speeds) and of course your network.
Which is probably why your cluster isn't even more faster than mine.

Either way, that number isn't anywhere near 4000 read IOPS per OSD either,
yours is about 500, mine about 1000...

Christian
Post by Alexandre DERUMIER
3 nodes cluster (replication x3), firefly, kernel 3.10, xfs, no tuning
in ceph.conf
----------
-4 osd, seageate 7,2k sas (with 512MB cache on controller). (journal
on same disk than osd, no dedicated ssd) -2 gigabit link (lacp)
-switch cisco 2960
each osd process are around 30% 1core during benchmark
no disk access (pagecache on ceph nodes)
sequential
----------
# fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64
--filename=/dev/vdb fiojob: (g=0): rw=read, bs=4K-4K/4K-4K,
ioengine=libaio, iodepth=64 2.0.8 Starting 1 process
Jobs: 1 (f=1): [R] [100.0% done] [23968K/0K /s] [5992 /0 iops] [eta
00m:00s] fiojob: (groupid=0, jobs=1): err= 0: pid=4158
read : io=409600KB, bw=22256KB/s, iops=5564 , runt= 18404msec
slat (usec): min=3 , max=1124 , avg=12.03, stdev=12.72
clat (msec): min=1 , max=405 , avg=11.48, stdev=12.10
lat (msec): min=1 , max=405 , avg=11.50, stdev=12.10
| 1.00th=[ 5], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10],
| 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 15],
| 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 206], 99.95th=[ 404],
| 99.99th=[ 404]
bw (KB/s) : min= 7542, max=24720, per=100.00%, avg=22321.06,
stdev=3341.21 lat (msec) : 2=0.04%, 4=0.60%, 10=21.40%, 20=77.54%,
50=0.23% lat (msec) : 250=0.13%, 500=0.06%
cpu : usr=3.76%, sys=10.32%, ctx=45280, majf=0, minf=88
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%,
Post by Christian Balzer
=64=0.0% issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0
READ: io=409600KB, aggrb=22256KB/s, minb=22256KB/s, maxb=22256KB/s,
mint=18404msec, maxt=18404msec
vdb: ios=101076/0, merge=0/0, ticks=1157172/0, in_queue=1157380,
util=99.58%
random read
-----------
# fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=rand-read --name=fiojob --blocksize=4k --iodepth=64
--filename=/dev/vdb valid values: read Sequential read : write
Sequential write : randread Random read
: randwrite Random write
: rw Sequential read and write mix
: readwrite Sequential read and write mix
: randrw Random read and write mix
fio: failed parsing rw=rand-read
fiojob: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
2.0.8
Starting 1 process
Jobs: 1 (f=1): [R] [94.7% done] [23752K/0K /s] [5938 /0 iops] [eta
00m:01s] fiojob: (groupid=0, jobs=1): err= 0: pid=4172
read : io=409600KB, bw=22887KB/s, iops=5721 , runt= 17897msec
slat (usec): min=3 , max=929 , avg=11.75, stdev=11.38
clat (msec): min=1 , max=407 , avg=11.17, stdev= 9.24
lat (msec): min=1 , max=407 , avg=11.18, stdev= 9.24
| 1.00th=[ 6], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10],
| 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 14],
| 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 60], 99.95th=[ 359],
| 99.99th=[ 404]
bw (KB/s) : min= 8112, max=25120, per=100.00%, avg=22967.77,
stdev=2657.48 lat (msec) : 2=0.05%, 4=0.46%, 10=22.83%, 20=76.34%,
50=0.21% lat (msec) : 100=0.05%, 250=0.01%, 500=0.06%
cpu : usr=4.14%, sys=10.01%, ctx=44760, majf=0, minf=88
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%,
Post by Christian Balzer
=64=0.0% issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0
READ: io=409600KB, aggrb=22886KB/s, minb=22886KB/s, maxb=22886KB/s,
mint=17897msec, maxt=17897msec
vdb: ios=100981/0, merge=0/0, ticks=1124768/0, in_queue=1125492,
util=99.57%
MonSiteEstLent.com - Blog d?di? ? la webperformance et la gestion de
pics de trafic
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com>
?: "Alexandre DERUMIER" <aderumier at odiso.com>
Cc: ceph-users at lists.ceph.com
Envoy?: Mardi 13 Mai 2014 14:38:57
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing
devices
Hello,
Post by Christian Balzer
Post by Christian Balzer
Post by Christian Balzer
It might, but at the IOPS I'm seeing anybody using SSD for file
storage should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a
setup would probably require 32+ cores.
https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf
That's and interesting find indeed.
The CPU to OSD chart clearly assumes the OSD to be backed by spinning
rust or doing 4MB block transactions.
As stated before, at the 4KB blocksize below one OSD eats up slightly
over 2 cores on the 4332HE at full speed.
Post by Christian Balzer
" Note: As of Ceph Dumpling release (10/2013), a per-OSD read
performance is approximately 4,000 IOPS and a per node limit of around
35,000 IOPS when doing reads directly from pagecache. This appears to
indicate that Ceph can make good use of spinning disks for data
storage and may benefit from SSD backed OSDs, though may also be
limited on high performance SSDs."
Node that this a read test and like nearly all IOPS statements utterly
worthless unless qualified by things as block size, working set size,
type of I/O (random or sequential).
For what it's worth, my cluster gives me 4100 IOPS with the sequential
fio run below and 7200 when doing random reads (go figure). Of course I
made sure these came come the pagecache of the storage nodes, no disk
I/O reported at all and the CPUs used just 1 core per OSD.
---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
--rw=read --name=fiojob --blocksize=4k --iodepth=64 ---
Christian
Post by Christian Balzer
Maybe Intank could comment about the 4000iops by osd ?
Alexandre Derumier Ing?nieur syst?me et stockage Fixe : 03 20 68 90 88
Fax : 03 20 68 90 81 45 Bvd du G?n?ral Leclerc 59100 Roubaix 12 rue
Marivaux 75002 Paris MonSiteEstLent.com - Blog d?di? ? la webperformance
et la gestion de pics de trafic ----- Mail original -----
Post by Christian Balzer
De: "Christian Balzer" <chibi at gol.com>
?: ceph-users at lists.ceph.com
Cc: "Alexandre DERUMIER" <aderumier at odiso.com>
Envoy?: Mardi 13 Mai 2014 11:51:37
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
Hello,
Post by Christian Balzer
Hi Christian,
I'm going to test a full ssd cluster in coming months,
I'll send result on the mailing.
Looking forward to that.
Post by Christian Balzer
Do you have tried to use 1 osd by physical disk ? (without raid6)
No, if you look back to the last year December "Sanity check..."
thread by me, it gives the reasons.
In short, highest density (thus replication of 2 and to make that safe
based on RAID6) and operational maintainability (it is a remote data
center, so replacing broken disks is a pain).
That cluster is fast enough for my purposes and that fio test isn't a
typical load for it when it goes into production.
But for designing a general purpose or high performance Ceph cluster
in the future I'd really love to have this mystery solved.
Post by Christian Balzer
Maybe they are bottleneck in osd daemon,
and using osd daemon by disk could help.
It might, but at the IOPS I'm seeing anybody using SSD for file
storage should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a setup
would probably require 32+ cores.
Christian
Post by Christian Balzer
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com>
?: ceph-users at lists.ceph.com
Envoy?: Mardi 13 Mai 2014 11:03:47
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
I'm clearly talking to myself, but whatever.
For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.
Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front
of a RAID/JBODs or using SSDs for final storage?
If so, what results do you get out of the fio statement below per
OSD? In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per
OSD, which is of course vastly faster than the normal indvidual HDDs
could do.
So I'm wondering if I'm hitting some inherent limitation of how fast
a single OSD (as in the software) can handle IOPS, given that
everything else has been ruled out from where I stand.
This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while
in my test case the only time the caching becomes noticeable is if I
increase the cache size to something larger than the test data size.
^o^
On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.
Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and
that's applying back pressure on the journal.
In the a "picture" being worth a thousand words tradition, I give
avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 51.50 0.00 1633.50 0.00 7460.00
9.13 0.18 0.11 0.00 0.11 0.01 1.40 sdb
0.00 0.00 0.00 1240.50 0.00 5244.00 8.45 0.30
0.25 0.00 0.25 0.02 2.00 sdc 0.00 5.00
0.00 2468.50 0.00 13419.00 10.87 0.24 0.10 0.00
0.10 0.09 22.00 sdd 0.00 6.50 0.00 1913.00
0.00 10313.00 10.78 0.20 0.10 0.00 0.10 0.09 16.60
The %user CPU utilization is pretty much entirely the 2 OSD
processes, note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and
service times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network
results below is that the latency happens within the OSD
processes.
Regards,
Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try
at a variety of block sizes. You could also try runnin RADOS
bench and smalliobench at a few different sizes.
-Greg
On Wednesday, May 7, 2014, Alexandre DERUMIER
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com <javascript:;>>
?: "Gregory Farnum" <greg at inktank.com <javascript:;>>,
ceph-users at lists.ceph.com <javascript:;>
Envoy?: Jeudi 8 Mai 2014 04:49:16
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal
and backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com<javascript:;>>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs
each. The journals are on (separate) DC 3700s, the actual
OSDs are RAID6 behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1
--direct=1 --numjobs=1 --rw=randwrite --name=fiojob
--blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no
surprise there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of
the journals is about 20% (2400 IOPS) and the OSDs are
bored at 2% (1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200%
on atop, but the system is not CPU or otherwise resource
starved at that moment.
Running multiple instances of this test from several VMs
on different hosts changes nothing, as in the aggregated
IOPS for the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the
network is IPoIB with the associated low latency and the
journal SSDs are the (consistently) fasted ones around.
I guess what I am wondering about is if this is normal and
to be expected or if not where all that potential
performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that
correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that
would be 256 IOs at a time, coming from different hosts over
different links and nothing changes.
that's about 40ms of latency per op (for userspace RBD),
which seems awfully long. You should check what your
client-side objecter settings are; it might be limiting you
to fewer outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph
devel and bugs and nothing at all as far as configuration
options are concerned. Care to enlighten me where one can find
those?
Also note the kernelspace (3.13 if it matters) speed, which is
very much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master
would be interesting ? there's some performance work that
should reduce latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph
cluster certainly doesn't require 40ms/op, so you should
probably run a wider array of experiments to try and figure
out where it's coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy
experience with IPoIB led me to believe, what else is there to
look at? The storage nodes perform just as expected, indicated
by the local fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not
really sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com <javascript:;> Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com <javascript:;>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Christian Balzer
2014-05-13 16:31:18 UTC
Permalink
Post by Alexandre DERUMIER
I have just done some test,
with fio-rbd,
(http://telekomcloud.github.io/ceph/2014/02/26/ceph-performance-analysis_fio_rbd.html)
directly from the kvm host,(not from the vm).
1 fio job: around 8000iops
2 differents parralel fio job (on different rbd volume) : around 8000iops by fio job !
cpu on client is at 100%
cpu of osd are around 70%/1core now.
So, seem to have a bottleneck client side somewhere.
You didn't specify what you did, but I assume you did a read test.
Those scale, as in running fio in multiple VMs in parallel gives me about
6200 IOPS each, so in aggregate much better than the 7200 for a single one.
And yes, the client CPU is quite busy.

However my real, original question is about writes. And they are stuck at
3200 IOPS, cluster wide, no matter how many parallel VMs are running fio...

Christian
Post by Alexandre DERUMIER
(I remember some tests from Stefan Priebe on this mailing, with a full
ssd cluster, having almost same results)
----- Original Message -----
From: "Alexandre DERUMIER" <aderumier at odiso.com>
To: "Christian Balzer" <chibi at gol.com>
Cc: ceph-users at lists.ceph.com
Sent: Tuesday, 13 May 2014 17:16:25
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Post by Alexandre DERUMIER
Post by Christian Balzer
Actually check your random read output again, you gave it the wrong
parameter, it needs to be randread, not rand-read.
oops, sorry. I got around 7500iops with randread.
Post by Alexandre DERUMIER
Post by Christian Balzer
Your cluster isn't that old (the CPUs are in the same ballpark)
Yes, this is 6-7 year old server. (this xeons were released in 2007...)
So, it miss some features like crc32 and sse4 for examples, which can help a lot ceph
(I'll try to do some osd tuning (threads,...) to see if I can improve performance.
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Alexandre DERUMIER" <aderumier at odiso.com>
Cc: ceph-users at lists.ceph.com
Sent: Tuesday, 13 May 2014 16:39:58
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Post by Alexandre DERUMIER
Post by Christian Balzer
Post by Christian Balzer
For what it's worth, my cluster gives me 4100 IOPS with the
sequential fio run below and 7200 when doing random reads (go
figure). Of course I made sure these came come the pagecache of the
storage nodes, no disk I/O reported at all and the CPUs used just 1
core per OSD. ---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64 ---
This seem pretty low,
I can get around 6000iops seq or rand read,
Actually check your random read output again, you gave it the wrong
parameter, it needs to be randread, not rand-read.
Post by Alexandre DERUMIER
with a pretty old cluster
Your cluster isn't that old (the CPUs are in the same ballpark) and has
12 OSDs instead of my 4. Plus it has the supposedly faster firefly. ^o^
Remember, all this is coming from RAM, so what it boils down is CPU
(memory and bus transfer speeds) and of course your network.
Which is probably why your cluster isn't even more faster than mine.
Either way, that number isn't anywhere near 4000 read IOPS per OSD
either, yours is about 500, mine about 1000...
Christian
Post by Alexandre DERUMIER
3 nodes cluster (replication x3), firefly, kernel 3.10, xfs, no tuning
in ceph.conf
----------
-4 osd, seageate 7,2k sas (with 512MB cache on controller). (journal
on same disk than osd, no dedicated ssd) -2 gigabit link (lacp)
-switch cisco 2960
each osd process are around 30% 1core during benchmark
no disk access (pagecache on ceph nodes)
sequential
----------
# fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64
--filename=/dev/vdb fiojob: (g=0): rw=read, bs=4K-4K/4K-4K,
ioengine=libaio, iodepth=64 2.0.8 Starting 1 process
Jobs: 1 (f=1): [R] [100.0% done] [23968K/0K /s] [5992 /0 iops] [eta
00m:00s] fiojob: (groupid=0, jobs=1): err= 0: pid=4158
read : io=409600KB, bw=22256KB/s, iops=5564 , runt= 18404msec
slat (usec): min=3 , max=1124 , avg=12.03, stdev=12.72
clat (msec): min=1 , max=405 , avg=11.48, stdev=12.10
lat (msec): min=1 , max=405 , avg=11.50, stdev=12.10
| 1.00th=[ 5], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10],
| 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 15],
| 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 206], 99.95th=[ 404],
| 99.99th=[ 404]
bw (KB/s) : min= 7542, max=24720, per=100.00%, avg=22321.06,
stdev=3341.21 lat (msec) : 2=0.04%, 4=0.60%, 10=21.40%, 20=77.54%,
50=0.23% lat (msec) : 250=0.13%, 500=0.06%
cpu : usr=3.76%, sys=10.32%, ctx=45280, majf=0, minf=88
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
Post by Christian Balzer
=64=99.9% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%,
Post by Christian Balzer
=64=0.0% issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0
READ: io=409600KB, aggrb=22256KB/s, minb=22256KB/s, maxb=22256KB/s,
mint=18404msec, maxt=18404msec
vdb: ios=101076/0, merge=0/0, ticks=1157172/0, in_queue=1157380,
util=99.58%
random read
-----------
# fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=rand-read --name=fiojob --blocksize=4k --iodepth=64
--filename=/dev/vdb valid values: read Sequential read : write
Sequential write : randread Random read
: randwrite Random write
: rw Sequential read and write mix
: readwrite Sequential read and write mix
: randrw Random read and write mix
fio: failed parsing rw=rand-read
fiojob: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
2.0.8
Starting 1 process
Jobs: 1 (f=1): [R] [94.7% done] [23752K/0K /s] [5938 /0 iops] [eta
00m:01s] fiojob: (groupid=0, jobs=1): err= 0: pid=4172
read : io=409600KB, bw=22887KB/s, iops=5721 , runt= 17897msec
slat (usec): min=3 , max=929 , avg=11.75, stdev=11.38
clat (msec): min=1 , max=407 , avg=11.17, stdev= 9.24
lat (msec): min=1 , max=407 , avg=11.18, stdev= 9.24
| 1.00th=[ 6], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10],
| 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 14],
| 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 60], 99.95th=[ 359],
| 99.99th=[ 404]
bw (KB/s) : min= 8112, max=25120, per=100.00%, avg=22967.77,
stdev=2657.48 lat (msec) : 2=0.05%, 4=0.46%, 10=22.83%, 20=76.34%,
50=0.21% lat (msec) : 100=0.05%, 250=0.01%, 500=0.06%
cpu : usr=4.14%, sys=10.01%, ctx=44760, majf=0, minf=88
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
Post by Christian Balzer
=64=99.9% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%,
Post by Christian Balzer
=64=0.0% issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0
READ: io=409600KB, aggrb=22886KB/s, minb=22886KB/s, maxb=22886KB/s,
mint=17897msec, maxt=17897msec
vdb: ios=100981/0, merge=0/0, ticks=1124768/0, in_queue=1125492,
util=99.57%
MonSiteEstLent.com - Blog d?di? ? la webperformance et la gestion de
pics de trafic
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com>
?: "Alexandre DERUMIER" <aderumier at odiso.com>
Cc: ceph-users at lists.ceph.com
Envoy?: Mardi 13 Mai 2014 14:38:57
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Hello,
Post by Christian Balzer
Post by Christian Balzer
Post by Christian Balzer
It might, but at the IOPS I'm seeing anybody using SSD for file
storage should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a
setup would probably require 32+ cores.
https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf
That's and interesting find indeed.
The CPU to OSD chart clearly assumes the OSD to be backed by spinning
rust or doing 4MB block transactions.
As stated before, at the 4KB blocksize below one OSD eats up slightly
over 2 cores on the 4332HE at full speed.
Post by Christian Balzer
" Note: As of Ceph Dumpling release (10/2013), a per-OSD read
performance is approximately 4,000 IOPS and a per node limit of
around 35,000 IOPS when doing reads directly from pagecache. This
appears to indicate that Ceph can make good use of spinning disks
for data storage and may benefit from SSD backed OSDs, though may
also be limited on high performance SSDs."
Node that this a read test and like nearly all IOPS statements utterly
worthless unless qualified by things as block size, working set size,
type of I/O (random or sequential).
For what it's worth, my cluster gives me 4100 IOPS with the sequential
fio run below and 7200 when doing random reads (go figure). Of course
I made sure these came come the pagecache of the storage nodes, no
disk I/O reported at all and the CPUs used just 1 core per OSD.
---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64 ---
Christian
Post by Christian Balzer
Maybe Intank could comment about the 4000iops by osd ?
Alexandre Derumier Ing?nieur syst?me et stockage Fixe : 03 20 68 90 88
Fax : 03 20 68 90 81 45 Bvd du G?n?ral Leclerc 59100 Roubaix 12 rue
Marivaux 75002 Paris MonSiteEstLent.com - Blog d?di? ? la
webperformance et la gestion de pics de trafic ----- Mail original
-----
Post by Christian Balzer
De: "Christian Balzer" <chibi at gol.com>
?: ceph-users at lists.ceph.com
Cc: "Alexandre DERUMIER" <aderumier at odiso.com>
Envoy?: Mardi 13 Mai 2014 11:51:37
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Hello,
Post by Christian Balzer
Hi Christian,
I'm going to test a full ssd cluster in coming months,
I'll send result on the mailing.
Looking forward to that.
Post by Christian Balzer
Do you have tried to use 1 osd by physical disk ? (without raid6)
No, if you look back to the last year December "Sanity check..."
thread by me, it gives the reasons.
In short, highest density (thus replication of 2 and to make that
safe based on RAID6) and operational maintainability (it is a remote
data center, so replacing broken disks is a pain).
That cluster is fast enough for my purposes and that fio test isn't
a typical load for it when it goes into production.
But for designing a general purpose or high performance Ceph cluster
in the future I'd really love to have this mystery solved.
Post by Christian Balzer
Maybe they are bottleneck in osd daemon,
and using osd daemon by disk could help.
It might, but at the IOPS I'm seeing anybody using SSD for file
storage should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a
setup would probably require 32+ cores.
Christian
Post by Christian Balzer
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com>
?: ceph-users at lists.ceph.com
Envoy?: Mardi 13 Mai 2014 11:03:47
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
I'm clearly talking to myself, but whatever.
For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.
Is there anybody on this ML who's running a Ceph cluster with a
fast network and FAST filestore, so like me with a big HW cache in
front of a RAID/JBODs or using SSDs for final storage?
If so, what results do you get out of the fio statement below per
OSD? In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS
per OSD, which is of course vastly faster than the normal
indvidual HDDs could do.
So I'm wondering if I'm hitting some inherent limitation of how
fast a single OSD (as in the software) can handle IOPS, given that
everything else has been ruled out from where I stand.
This would also explain why none of the option changes or the use
of RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk
would clearly benefit from even the small 32MB standard RBD cache,
while in my test case the only time the caching becomes noticeable
is if I increase the cache size to something larger than the test
data size. ^o^
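For anyone wanting to reproduce that observation, the client-side cache settings
involved would look roughly like this; a sketch only, with 512MB picked simply
because it exceeds the 400MB fio working set, not as a recommendation:
ceph.conf
[client]
rbd cache = true
rbd cache size = 536870912       # 512MB, i.e. larger than the 400MB test data
rbd cache max dirty = 402653184  # related knob (bytes); must stay below the cache size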
On the other hand if people here regularly get thousands or tens
of thousands IOPS per OSD with the appropriate HW I'm stumped.
Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the
expected throughput on the RAID array with OSD access
patterns, and that's applying back pressure on the journal.
In the "a picture being worth a thousand words" tradition, here is the iostat
output on one storage node during the test:
avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58
Device:   rrqm/s  wrqm/s    r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sda         0.00   51.50   0.00  1633.50     0.00   7460.00     9.13     0.18   0.11    0.00    0.11   0.01   1.40
sdb         0.00    0.00   0.00  1240.50     0.00   5244.00     8.45     0.30   0.25    0.00    0.25   0.02   2.00
sdc         0.00    5.00   0.00  2468.50     0.00  13419.00    10.87     0.24   0.10    0.00    0.10   0.09  22.00
sdd         0.00    6.50   0.00  1913.00     0.00  10313.00    10.78     0.20   0.10    0.00    0.10   0.09  16.60
The %user CPU utilization is pretty much entirely the 2 OSD
processes, note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal
SSDs. Look at these numbers, the lack of queues, the low wait
and service times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the
network results below is that the latency happens within the OSD
processes.
Regards,
Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph.
One particular one is OSD bench. That should be interesting to
try at a variety of block sizes. You could also try running
RADOS bench and smalliobench at a few different sizes.
-Greg
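For reference, those suggestions map onto invocations roughly like the following;
a sketch only, with the pool name and byte counts as placeholders:
---
# per-OSD bench: write 1GB to osd.0 in 4KB chunks
ceph tell osd.0 bench 1073741824 4096
# pool-level 4KB writes for 60 seconds with 16 concurrent ops, keeping the
# objects around so a subsequent 'seq' read run has something to read back
rados bench -p rbd 60 write -b 4096 -t 16 --no-cleanup
rados bench -p rbd 60 seq -t 16
---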
On Wednesday, May 7, 2014, Alexandre DERUMIER
Post by Alexandre DERUMIER
Hi Christian,
Have you tried without RAID6, to have more OSDs?
(How many disks do you have behind the RAID6?)
Also, I know that direct IOs can be quite slow with Ceph;
maybe you can try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Original Mail -----
From: "Christian Balzer" <chibi at gol.com>
To: "Gregory Farnum" <greg at inktank.com>,
ceph-users at lists.ceph.com
Sent: Thursday, 8 May 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal
and backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs
each. The journals are on (separate) DC 3700s, the
actual OSDs are RAID6 behind an Areca 1882 with 4GB of
cache.
fio --size=400m --ioengine=libaio --invalidate=1
--direct=1 --numjobs=1 --rw=randwrite --name=fiojob
--blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no
surprise there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of
the journals is about 20% (2400 IOPS) and the OSDs are
bored at 2% (1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200%
on atop, but the system is not CPU or otherwise resource
starved at that moment.
Running multiple instances of this test from several VMs
on different hosts changes nothing, as in the aggregated
IOPS for the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the
network is IPoIB with the associated low latency and the
journal SSDs are the (consistently) fastest ones around.
I guess what I am wondering about is if this is normal
and to be expected or if not where all that potential
performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that
correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that
would be 256 IOs at a time, coming from different hosts over
different links and nothing changes.
that's about 40ms of latency per op (for userspace RBD),
which seems awfully long. You should check what your
client-side objecter settings are; it might be limiting
you to fewer outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph
devel and bugs and nothing at all as far as configuration
options are concerned. Care to enlighten me where one can
find those?
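For what it's worth, the settings most likely meant there are the client-side
objecter throttles; a sketch of where they would go, with what I believe are the
defaults of this era shown purely for illustration:
ceph.conf
[client]
objecter inflight ops = 1024            # cap on outstanding ops per client
objecter inflight op bytes = 104857600  # cap on in-flight data (100MB)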
Also note the kernelspace (3.13 if it matters) speed, which
is very much in the same (junior league) ballpark.
If it's available to you, testing with Firefly or even master
would be interesting; there's some performance work that
should reduce latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph
cluster certainly doesn't require 40ms/op, so you should
probably run a wider array of experiments to try and
figure out where it's coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy
experience with IPoIB led me to believe, what else is there
to look at? The storage nodes perform just as expected,
indicated by the local fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not
really sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
http://ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Alexandre DERUMIER
2014-05-13 16:46:36 UTC
Permalink
Post by Alexandre DERUMIER
Post by Christian Balzer
You didn't specify what you did, but I assume you did a read test.
yes, indeed
Post by Alexandre DERUMIER
Post by Christian Balzer
Those scale, as in running fio in multiple VMs in parallel gives me about
6200 IOPS each, so in aggregate much better than the 7200 for a single one.
And yes, the client CPU is quite busy.
oh ok !
Post by Alexandre DERUMIER
Post by Christian Balzer
However my real, original question is about writes. And they are stuck at
3200 IOPS, cluster wide, no matter how many parallel VMs are running fio...
Sorry, I can't test writes, I don't have an SSD journal for now.
I'll try to send results when I have my SSD cluster.
(But I remember some talk from Sage saying that indeed small direct writes can
be pretty slow; that's why rbd_cache is recommended, to aggregate small writes
into bigger ones.)
----- Original Mail -----
From: "Christian Balzer" <chibi at gol.com>
To: "Alexandre DERUMIER" <aderumier at odiso.com>
Cc: ceph-users at lists.ceph.com
Sent: Tuesday, 13 May 2014 18:31:18
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Post by Alexandre DERUMIER
I have just done some tests
with fio-rbd
(http://telekomcloud.github.io/ceph/2014/02/26/ceph-performance-analysis_fio_rbd.html)
directly from the KVM host (not from the VM).
1 fio job: around 8000 IOPS
2 different parallel fio jobs (on different RBD volumes): around
8000 IOPS per fio job!
CPU on the client is at 100%,
CPU of the OSDs is around 70% of 1 core now.
So there seems to be a bottleneck client-side somewhere.
You didn't specify what you did, but I assume you did a read test.
Those scale, as in running fio in multiple VMs in parallel gives me about
6200 IOPS each, so in aggregate much better than the 7200 for a single one.
And yes, the client CPU is quite busy.
However my real, original question is about writes. And they are stuck at
3200 IOPS, cluster wide, no matter how many parallel VMs are running fio...
Christian
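The fio-rbd runs mentioned above drive RBD straight from the host via fio's rbd
ioengine instead of from inside a VM; a minimal sketch of such an invocation,
with the pool, image and client names as placeholders:
---
fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testimg
--rw=randread --blocksize=4k --iodepth=32 --numjobs=1 --direct=1
--size=400m --name=fiorbd
---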
Post by Alexandre DERUMIER
(I remember some tests from Stefan Priebe on this mailing list, with a full
SSD cluster, having almost the same results.)
----- Original Mail -----
From: "Alexandre DERUMIER" <aderumier at odiso.com>
To: "Christian Balzer" <chibi at gol.com>
Cc: ceph-users at lists.ceph.com
Sent: Tuesday, 13 May 2014 17:16:25
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing
devices
Post by Christian Balzer
Post by Christian Balzer
Actually check your random read output again, you gave it the wrong
parameter, it needs to be randread, not rand-read.
Oops, sorry. I got around 7500 IOPS with randread.
Post by Christian Balzer
Post by Christian Balzer
Your cluster isn't that old (the CPUs are in the same ballpark)
Yes, these are 6-7 year old servers (these Xeons were released in 2007...),
so they miss some features like CRC32 and SSE4, for example, which can
help Ceph a lot.
(I'll try to do some OSD tuning (threads, ...) to see if I can improve
performance.)
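The thread-related OSD knobs presumably meant there would be set along these
lines; the option names are real, the values merely illustrative, not a tested
tuning:
ceph.conf
[osd]
osd op threads = 4     # default is 2; worker threads servicing client ops
osd disk threads = 2   # default is 1; threads for background disk work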
----- Original Mail -----
From: "Christian Balzer" <chibi at gol.com>
To: "Alexandre DERUMIER" <aderumier at odiso.com>
Cc: ceph-users at lists.ceph.com
Sent: Tuesday, 13 May 2014 16:39:58
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing
devices
Post by Christian Balzer
Post by Christian Balzer
Post by Christian Balzer
For what it's worth, my cluster gives me 4100 IOPS with the
sequential fio run below and 7200 when doing random reads (go
figure). Of course I made sure these came from the pagecache of the
storage nodes, no disk I/O reported at all and the CPUs used just 1
core per OSD. ---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64 ---
This seems pretty low;
I can get around 6000 IOPS seq or random read,
Actually check your random read output again, you gave it the wrong
parameter, it needs to be randread, not rand-read.
Post by Christian Balzer
with a pretty old cluster
Your cluster isn't that old (the CPUs are in the same ballpark) and has
12 OSDs instead of my 4. Plus it has the supposedly faster firefly. ^o^
Remember, all this is coming from RAM, so what it boils down to is CPU
(memory and bus transfer speeds) and of course your network.
Which is probably why your cluster isn't even faster than mine.
Either way, that number isn't anywhere near 4000 read IOPS per OSD
either, yours is about 500, mine about 1000...
Christian
Post by Christian Balzer
3-node cluster (replication x3), Firefly, kernel 3.10, XFS, no tuning
in ceph.conf
----------
- 4 OSDs, Seagate 7.2k SAS (with 512MB cache on the controller), journal
on the same disk as the OSD, no dedicated SSD
- 2 gigabit links (LACP)
- switch: Cisco 2960
Each OSD process is at around 30% of 1 core during the benchmark,
no disk access (pagecache on the Ceph nodes).
sequential
----------
# fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64
--filename=/dev/vdb fiojob: (g=0): rw=read, bs=4K-4K/4K-4K,
ioengine=libaio, iodepth=64 2.0.8 Starting 1 process
Jobs: 1 (f=1): [R] [100.0% done] [23968K/0K /s] [5992 /0 iops] [eta
00m:00s] fiojob: (groupid=0, jobs=1): err= 0: pid=4158
read : io=409600KB, bw=22256KB/s, iops=5564 , runt= 18404msec
slat (usec): min=3 , max=1124 , avg=12.03, stdev=12.72
clat (msec): min=1 , max=405 , avg=11.48, stdev=12.10
lat (msec): min=1 , max=405 , avg=11.50, stdev=12.10
| 1.00th=[ 5], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10],
| 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 15],
| 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 206], 99.95th=[ 404],
| 99.99th=[ 404]
bw (KB/s) : min= 7542, max=24720, per=100.00%, avg=22321.06,
stdev=3341.21 lat (msec) : 2=0.04%, 4=0.60%, 10=21.40%, 20=77.54%,
50=0.23% lat (msec) : 250=0.13%, 500=0.06%
cpu : usr=3.76%, sys=10.32%, ctx=45280, majf=0, minf=88
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0
READ: io=409600KB, aggrb=22256KB/s, minb=22256KB/s, maxb=22256KB/s,
mint=18404msec, maxt=18404msec
vdb: ios=101076/0, merge=0/0, ticks=1157172/0, in_queue=1157380,
util=99.58%
random read
-----------
# fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=rand-read --name=fiojob --blocksize=4k --iodepth=64
--filename=/dev/vdb
valid values: read       Sequential read
            : write      Sequential write
            : randread   Random read
            : randwrite  Random write
            : rw         Sequential read and write mix
            : readwrite  Sequential read and write mix
            : randrw     Random read and write mix
fio: failed parsing rw=rand-read
fiojob: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
2.0.8
Starting 1 process
Jobs: 1 (f=1): [R] [94.7% done] [23752K/0K /s] [5938 /0 iops] [eta
00m:01s] fiojob: (groupid=0, jobs=1): err= 0: pid=4172
read : io=409600KB, bw=22887KB/s, iops=5721 , runt= 17897msec
slat (usec): min=3 , max=929 , avg=11.75, stdev=11.38
clat (msec): min=1 , max=407 , avg=11.17, stdev= 9.24
lat (msec): min=1 , max=407 , avg=11.18, stdev= 9.24
| 1.00th=[ 6], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10],
| 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 14],
| 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 60], 99.95th=[ 359],
| 99.99th=[ 404]
bw (KB/s) : min= 8112, max=25120, per=100.00%, avg=22967.77,
stdev=2657.48 lat (msec) : 2=0.05%, 4=0.46%, 10=22.83%, 20=76.34%,
50=0.21% lat (msec) : 100=0.05%, 250=0.01%, 500=0.06%
cpu : usr=4.14%, sys=10.01%, ctx=44760, majf=0, minf=88
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0
READ: io=409600KB, aggrb=22886KB/s, minb=22886KB/s, maxb=22886KB/s,
mint=17897msec, maxt=17897msec
vdb: ios=100981/0, merge=0/0, ticks=1124768/0, in_queue=1125492,
util=99.57%
MonSiteEstLent.com - blog dedicated to web performance and handling
traffic peaks
----- Original Mail -----
From: "Christian Balzer" <chibi at gol.com>
To: "Alexandre DERUMIER" <aderumier at odiso.com>
Cc: ceph-users at lists.ceph.com
Sent: Tuesday, 13 May 2014 14:38:57
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
Hello,
Post by Christian Balzer
Post by Christian Balzer
Post by Christian Balzer
It might, but at the IOPS I'm seeing anybody using SSD for file
storage should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a
setup would probably require 32+ cores.
https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf
That's an interesting find indeed.
The CPU to OSD chart clearly assumes the OSD to be backed by spinning
rust or doing 4MB block transactions.
As stated before, at the 4KB blocksize below one OSD eats up slightly
over 2 cores on the 4332HE at full speed.