Discussion:
[ceph-users] Slow IOPS on RBD compared to journal and backing devices
Christian Balzer
2014-05-08 00:57:46 UTC
Hello,

ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The journals
are on (separate) DC 3700s, the actual OSDs are RAID6 behind an Areca 1882
with 4GB of cache.

Running this fio:

fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128

results in:

30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise there)
3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
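
For reference, the kernelspace number is the same fio job pointed at a
kernel-mapped RBD, i.e. roughly this (image name and device node made up):

rbd create fiotest --size 4096
rbd map fiotest
fio --filename=/dev/rbd0 --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128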

When running the fio from the VM RBD the utilization of the journals is
about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after some
obvious merging).
The OSD processes are quite busy, showing well over 200% CPU in atop, but
the system is not CPU or otherwise resource starved at that moment.

Running multiple instances of this test from several VMs on different hosts
changes nothing; the aggregate IOPS for the whole cluster still ends up
at around 3200.

Now clearly RBD has to deal with latency here, but the network is IPoIB
with the associated low latency, and the journal SSDs are (consistently)
the fastest ones around.

I guess what I am wondering is whether this is normal and to be expected,
and if not, where all that potential performance got lost.

Regards,

Christian
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Gregory Farnum
2014-05-08 01:37:48 UTC
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The journals
are on (separate) DC 3700s, the actual OSDs are RAID6 behind an Areca 1882
with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise there)
3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the journals is
about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after some
obvious merging).
The OSD processes are quite busy, reading well over 200% on atop, but
the system is not CPU or otherwise resource starved at that moment.
Running multiple instances of this test from several VMs on different hosts
changes nothing, as in the aggregated IOPS for the whole cluster will
still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is IPoIB
with the associated low latency and the journal SSDs are the
(consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be expected
or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
that's about 40ms of latency per op (for userspace RBD), which seems
awfully long. You should check what your client-side objecter settings
are; it might be limiting you to fewer outstanding ops than that. If
it's available to you, testing with Firefly or even master would be
interesting; there's some performance work that should reduce
latencies.
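(For reference, 128 outstanding ops at ~3200 IOPS works out to 128 / 3200
= 0.04 s, i.e. the ~40ms above.) The throttles I mean live in the [client]
section; if I remember the option names right they're something like

[client]
objecter inflight ops = 1024
objecter inflight op bytes = 104857600

(those should be the defaults), but double-check the names against the
config reference for your version.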

But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a wider
array of experiments to try and figure out where it's coming from.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
Christian Balzer
2014-05-08 02:49:16 UTC
Post by Gregory Farnum
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6 behind
an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the journals is
about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after
some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop, but
the system is not CPU or otherwise resource starved at that moment.
Running multiple instances of this test from several VMs on different
hosts changes nothing, as in the aggregated IOPS for the whole cluster
will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is IPoIB
with the associated low latency and the journal SSDs are the
(consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would be 256
IOs at a time, coming from different hosts over different links and
nothing changes.
Post by Gregory Farnum
that's about 40ms of latency per op (for userspace RBD), which seems
awfully long. You should check what your client-side objecter settings
are; it might be limiting you to fewer outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel and bugs
and nothing at all as far as configuration options are concerned.
Care to enlighten me where one can find those?

Also note the kernelspace (3.13 if it matters) speed, which is very much
in the same (junior league) ballpark.
Post by Gregory Farnum
If
it's available to you, testing with Firefly or even master would be
interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
Post by Gregory Farnum
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a wider
array of experiments to try and figure out where it's coming from.
I think we can rule out the network, NPtcp gives me:
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---

For comparison at about 512KB it reaches maximum throughput and still
isn't that laggy:
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---

So with the network performing as well as my lengthy experience with IPoIB
led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local fio
tests.

That pretty much leaves only Ceph/RBD to look at and I'm not really sure
what experiments I should run on that. ^o^

Regards,

Christian
Post by Gregory Farnum
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
xan.peng
2014-05-15 06:02:36 UTC
Post by Gregory Farnum
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
that's about 40ms of latency per op (for userspace RBD), which seems
awfully long.
Maybe this is off topic, but AFAIK "--iodepth=128" doesn't submit 128
IOs at a time.
There is an option in fio, "iodepth_batch_submit=int", which defaults
to 1 and makes fio submit each IO as soon as it is available.

See more: http://www.bluestop.org/fio/HOWTO.txt
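
If you want all 128 actually batched up, something like this should do it
(untested, just going by the HOWTO):

fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128 --iodepth_batch_submit=128 --iodepth_batch_complete=128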
Alexandre DERUMIER
2014-05-08 04:33:51 UTC
Hi Christian,

Have you tried without RAID6, to have more OSDs?
(How many disks do you have behind the RAID6?)


Also, I know that direct I/O can be quite slow with Ceph,

maybe you can try without --direct=1

and also enable rbd_cache

ceph.conf
[client]
rbd cache = true
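
(Note that with qemu/librbd the cache only really takes effect if the
drive itself is set to writeback, I think, e.g. in the libvirt XML:
<driver name='qemu' type='raw' cache='writeback'/>
or cache=writeback on the -drive line.)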




----- Mail original -----

De: "Christian Balzer" <chibi at gol.com>
?: "Gregory Farnum" <greg at inktank.com>, ceph-users at lists.ceph.com
Envoy?: Jeudi 8 Mai 2014 04:49:16
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Post by Gregory Farnum
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6 behind
an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the journals is
about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after
some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop, but
the system is not CPU or otherwise resource starved at that moment.
Running multiple instances of this test from several VMs on different
hosts changes nothing, as in the aggregated IOPS for the whole cluster
will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is IPoIB
with the associated low latency and the journal SSDs are the
(consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would be 256
IOs at a time, coming from different hosts over different links and
nothing changes.
Post by Gregory Farnum
that's about 40ms of latency per op (for userspace RBD), which seems
awfully long. You should check what your client-side objecter settings
are; it might be limiting you to fewer outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel and bugs
and nothing at all as far as configuration options are concerned.
Care to enlighten me where one can find those?

Also note the kernelspace (3.13 if it matters) speed, which is very much
in the same (junior league) ballpark.
Post by Gregory Farnum
If
it's available to you, testing with Firefly or even master would be
interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
Post by Gregory Farnum
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a wider
array of experiments to try and figure out where it's coming from.
I think we can rule out the network, NPtcp gives me:
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---

For comparison at about 512KB it reaches maximum throughput and still
isn't that laggy:
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---

So with the network performing as well as my lengthy experience with IPoIB
led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local fio
tests.

That pretty much leaves only Ceph/RBD to look at and I'm not really sure
what experiments I should run on that. ^o^

Regards,

Christian
Post by Gregory Farnum
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Gregory Farnum
2014-05-08 05:13:53 UTC
Oh, I didn't notice that. I bet you aren't getting the expected throughput
on the RAID array with OSD access patterns, and that's applying back
pressure on the journal.

When I suggested other tests, I meant with and without Ceph. One particular
one is OSD bench. That should be interesting to try at a variety of block
sizes. You could also try running RADOS bench and smalliobench at a few
different sizes.
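Something along these lines, at a few different block sizes (numbers here
are only examples):
ceph tell osd.0 bench 1073741824 4096
rados bench -p rbd 30 write -b 4096 -t 128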
-Greg
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com <javascript:;>>
?: "Gregory Farnum" <greg at inktank.com <javascript:;>>,
ceph-users at lists.ceph.com <javascript:;>
Envoy?: Jeudi 8 Mai 2014 04:49:16
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer <chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6 behind
an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the journals is
about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after
some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop, but
the system is not CPU or otherwise resource starved at that moment.
Running multiple instances of this test from several VMs on different
hosts changes nothing, as in the aggregated IOPS for the whole cluster
will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is IPoIB
with the associated low latency and the journal SSDs are the
(consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would be 256
IOs at a time, coming from different hosts over different links and
nothing changes.
that's about 40ms of latency per op (for userspace RBD), which seems
awfully long. You should check what your client-side objecter settings
are; it might be limiting you to fewer outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel and bugs
and nothing at all as far as configuration options are concerned.
Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very much
in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would be
interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a wider
array of experiments to try and figure out where it's coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and still
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with IPoIB
led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local fio
tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really sure
what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Software Engineer #42 @ http://inktank.com | http://ceph.com
Christian Balzer
2014-05-08 06:26:33 UTC
Hello,
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
I doubt that based on what I see in terms of local performance and actual
utilization figures according to iostat and atop during the tests.

But if that were to be true, how would one see if that's the case, as in
where in the plethora of data from:

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump

is the data I'd be looking for?
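
My guess would be the various latency counters, i.e. something like

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump | grep -i latency
ceph osd perf

(if the latter already exists in 0.72), but a pointer to the relevant
counters would be appreciated.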
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try runnin RADOS bench and
smalliobench at a few different sizes.
I already did the local tests, as in w/o Ceph, see the original mail below.

And you might recall me doing rados benches as well in another thread 2
weeks ago or so.

In either case, osd benching gives me:
---
# time ceph tell osd.0 bench
{ "bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": "247102026.000000"}


real 0m4.483s
---
This is quite a bit slower than this particular SSD (200GB DC 3700) should
be able to write, but I will let that slide.
Note that it is the journal SSD that comes under pressure here (nearly 900%
util) while the OSD is bored at around 15%, which is no surprise, as the
backing RAID can write data at up to 1600MB/s.

at 4k blocks we see:
---
# time ceph tell osd.0 bench 1073741824 4096
{ "bytes_written": 1073741824,
"blocksize": 4096,
"bytes_per_sec": "9004316.000000"}


real 1m59.368s
---
Here we get a more balanced picture between journal and storage
utilization, hovering around 40-50%.
So clearly not overtaxing either component.
And yet, this looks like roughly 2200 IOPS to me, if my math is half right.

Rados at 4k gives us this:
---
Total time run: 30.912786
Total writes made: 44490
Write size: 4096
Bandwidth (MB/sec): 5.622

Stddev Bandwidth: 3.31452
Max bandwidth (MB/sec): 9.92578
Min bandwidth (MB/sec): 0
Average Latency: 0.0444653
Stddev Latency: 0.121887
Max latency: 2.80917
Min latency: 0.001958
---
So this is even worse, just under 1500 IOPS.
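
(For the record: 9004316 B/s divided by 4096 B is about 2200 IOPS for the
OSD bench, and 44490 writes in 30.9 seconds is about 1440 IOPS for the
rados bench, so those two numbers are at least consistent with each other.)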

Regards,

Christian
Post by Gregory Farnum
-Greg
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com <javascript:;>>
?: "Gregory Farnum" <greg at inktank.com <javascript:;>>,
ceph-users at lists.ceph.com <javascript:;>
Envoy?: Jeudi 8 Mai 2014 04:49:16
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6
behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop,
but the system is not CPU or otherwise resource starved at that
moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for the
whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is
IPoIB with the associated low latency and the journal SSDs are the
(consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would be
256 IOs at a time, coming from different hosts over different links and
nothing changes.
that's about 40ms of latency per op (for userspace RBD), which seems
awfully long. You should check what your client-side objecter
settings are; it might be limiting you to fewer outstanding ops than
that.
Googling for client-side objecter gives a few hits on ceph devel and
bugs and nothing at all as far as configuration options are concerned.
Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very
much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would be
interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a wider
array of experiments to try and figure out where it's coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and still
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with
IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local fio
tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really
sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Christian Balzer
2014-05-09 02:01:26 UTC
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
In the a "picture" being worth a thousand words tradition, I give you
this iostat -x output taken during a fio run:

avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 51.50 0.00 1633.50 0.00 7460.00 9.13 0.18 0.11 0.00 0.11 0.01 1.40
sdb 0.00 0.00 0.00 1240.50 0.00 5244.00 8.45 0.30 0.25 0.00 0.25 0.02 2.00
sdc 0.00 5.00 0.00 2468.50 0.00 13419.00 10.87 0.24 0.10 0.00 0.10 0.09 22.00
sdd 0.00 6.50 0.00 1913.00 0.00 10313.00 10.78 0.20 0.10 0.00 0.10 0.09 16.60

The %user CPU utilization is pretty much entirely the 2 OSD processes;
note the nearly complete absence of iowait.

sda and sdb are the OSD RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers: the lack of queues, the low wait and service
times (in ms), plus the overall utilization.

The only conclusion I can draw from these numbers and the network results
below is that the latency happens within the OSD processes.
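
If somebody knows how to see where inside the OSD the time goes, I'm all
ears; I suppose the admin socket op dumps would be the place to start,
something like

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops

assuming 0.72 already has those.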

Regards,

Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try runnin RADOS bench and
smalliobench at a few different sizes.
-Greg
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com <javascript:;>>
?: "Gregory Farnum" <greg at inktank.com <javascript:;>>,
ceph-users at lists.ceph.com <javascript:;>
Envoy?: Jeudi 8 Mai 2014 04:49:16
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6
behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop,
but the system is not CPU or otherwise resource starved at that
moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for the
whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is
IPoIB with the associated low latency and the journal SSDs are the
(consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would be
256 IOs at a time, coming from different hosts over different links and
nothing changes.
that's about 40ms of latency per op (for userspace RBD), which seems
awfully long. You should check what your client-side objecter
settings are; it might be limiting you to fewer outstanding ops than
that.
Googling for client-side objecter gives a few hits on ceph devel and
bugs and nothing at all as far as configuration options are concerned.
Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very
much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would be
interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a wider
array of experiments to try and figure out where it's coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and still
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with
IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local fio
tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really
sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Christian Balzer
2014-05-13 09:03:47 UTC
I'm clearly talking to myself, but whatever.

For Greg, I've played with all the pertinent journal and filestore options
and TCP nodelay, no changes at all.
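
For reference, the sort of options I mean, with the names from memory and
the values purely as examples, not a recommendation:

[osd]
journal max write entries = 500
journal queue max ops = 3000
filestore queue max ops = 500
filestore max sync interval = 10
ms tcp nodelay = true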

Is there anybody on this ML who's running a Ceph cluster with a fast
network and a FAST filestore, so, like me, with a big HW cache in front of
RAIDs/JBODs, or using SSDs for final storage?

If so, what results do you get out of the fio statement below per OSD?
In my case, with 4 OSDs and 3200 IOPS, that's about 800 IOPS per OSD, which
is of course vastly more than the individual HDDs could normally do.

So I'm wondering if I'm hitting some inherent limitation of how fast a
single OSD (as in the software) can handle IOPS, given that everything else
has been ruled out from where I stand.

This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while in my
test case the only time the caching becomes noticeable is if I increase
the cache size to something larger than the test data size. ^o^

On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.

Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
In the a "picture" being worth a thousand words tradition, I give you
avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 51.50 0.00 1633.50 0.00 7460.00 9.13 0.18 0.11 0.00 0.11 0.01 1.40
sdb 0.00 0.00 0.00 1240.50 0.00 5244.00 8.45 0.30 0.25 0.00 0.25 0.02 2.00
sdc 0.00 5.00 0.00 2468.50 0.00 13419.00 10.87 0.24 0.10 0.00 0.10 0.09 22.00
sdd 0.00 6.50 0.00 1913.00 0.00 10313.00 10.78 0.20 0.10 0.00 0.10 0.09 16.60
The %user CPU utilization is pretty much entirely the 2 OSD processes,
note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and service
times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network results
below is that the latency happens within the OSD processes.
Regards,
Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try runnin RADOS bench and
smalliobench at a few different sizes.
-Greg
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com <javascript:;>>
?: "Gregory Farnum" <greg at inktank.com <javascript:;>>,
ceph-users at lists.ceph.com <javascript:;>
Envoy?: Jeudi 8 Mai 2014 04:49:16
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6
behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop,
but the system is not CPU or otherwise resource starved at that
moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for
the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is
IPoIB with the associated low latency and the journal SSDs are
the (consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would be
256 IOs at a time, coming from different hosts over different links
and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side objecter
settings are; it might be limiting you to fewer outstanding ops
than that.
Googling for client-side objecter gives a few hits on ceph devel and
bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very
much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would be
interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a
wider array of experiments to try and figure out where it's coming
from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and still
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with
IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local
fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really
sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Josef Johansson
2014-05-14 09:29:47 UTC
Hi Christian,

I missed this thread; I haven't been reading the list that well these last
few weeks.

You already know my setup, since we discussed it in an earlier thread. I
don't have a fast backing store, but I see the slow IOPS when doing
randwrite inside the VM, with rbd cache. Still running dumpling here though.

A thought struck me that I could test with a pool that consists of OSDs
that have tmpfs-based disks. I think I have a bit more latency than your
IPoIB, but I've pushed 100k IOPS with the same network devices before.
This would verify whether the problem is with the journal disks. I'll also
try to run the journal devices in tmpfs as well, as that would test
Ceph itself in isolation.
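
Roughly what I have in mind for the journal part, from memory and
obviously not for production use (paths made up):

mount -t tmpfs -o size=2G tmpfs /mnt/journal-tmpfs
/etc/init.d/ceph stop osd.0
ceph-osd -i 0 --flush-journal
# point "osd journal" for osd.0 at /mnt/journal-tmpfs/journal-0 in ceph.conf
ceph-osd -i 0 --mkjournal
/etc/init.d/ceph start osd.0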

I'll get back to you with the results; hopefully I'll manage to get them
done tonight.

Cheers,
Josef
Post by Christian Balzer
I'm clearly talking to myself, but whatever.
For Greg, I've played with all the pertinent journal and filestore options
and TCP nodelay, no changes at all.
Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front of a
RAID/JBODs or using SSDs for final storage?
If so, what results do you get out of the fio statement below per OSD?
In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD, which
is of course vastly faster than the normal indvidual HDDs could do.
So I'm wondering if I'm hitting some inherent limitation of how fast a
single OSD (as in the software) can handle IOPS, given that everything else
has been ruled out from where I stand.
This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while in my
test case the only time the caching becomes noticeable is if I increase
the cache size to something larger than the test data size. ^o^
On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.
Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
In the a "picture" being worth a thousand words tradition, I give you
avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 51.50 0.00 1633.50 0.00 7460.00 9.13 0.18 0.11 0.00 0.11 0.01 1.40
sdb 0.00 0.00 0.00 1240.50 0.00 5244.00 8.45 0.30 0.25 0.00 0.25 0.02 2.00
sdc 0.00 5.00 0.00 2468.50 0.00 13419.00 10.87 0.24 0.10 0.00 0.10 0.09 22.00
sdd 0.00 6.50 0.00 1913.00 0.00 10313.00 10.78 0.20 0.10 0.00 0.10 0.09 16.60
The %user CPU utilization is pretty much entirely the 2 OSD processes,
note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and service
times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network results
below is that the latency happens within the OSD processes.
Regards,
Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try runnin RADOS bench and
smalliobench at a few different sizes.
-Greg
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com <javascript:;>>
?: "Gregory Farnum" <greg at inktank.com <javascript:;>>,
ceph-users at lists.ceph.com <javascript:;>
Envoy?: Jeudi 8 Mai 2014 04:49:16
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6
behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop,
but the system is not CPU or otherwise resource starved at that
moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for
the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is
IPoIB with the associated low latency and the journal SSDs are
the (consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would be
256 IOs at a time, coming from different hosts over different links
and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side objecter
settings are; it might be limiting you to fewer outstanding ops
than that.
Googling for client-side objecter gives a few hits on ceph devel and
bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very
much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would be
interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a
wider array of experiments to try and figure out where it's coming
from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and still
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with
IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local
fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really
sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Stefan Priebe - Profihost AG
2014-05-14 09:37:39 UTC
Post by Josef Johansson
Hi Christian,
I missed this thread, haven't been reading the list that well the last
weeks.
You already know my setup, since we discussed it in an earlier thread. I
don't have a fast backing store, but I see the slow IOPS when doing
randwrite inside the VM, with rbd cache. Still running dumpling here though.
A thought struck me that I could test with a pool that consists of OSDs
that have tempfs-based disks, think I have a bit more latency than your
IPoIB but I've pushed 100k IOPS with the same network devices before.
This would verify if the problem is with the journal disks. I'll also
try to run the journal devices in tempfs as well, as it would test
purely Ceph itself.
I did the same with Bobtail a year ago and was still limited to nearly
the same values. No idea what Firefly will say. I'm pretty sure the
limit is in the Ceph code itself.

There was a short discussion here:
http://www.spinics.net/lists/ceph-devel/msg18731.html

Stefan
Post by Josef Johansson
I'll get back to you with the results, hopefully I'll manage to get them
done during this night.
Cheers,
Josef
Post by Christian Balzer
I'm clearly talking to myself, but whatever.
For Greg, I've played with all the pertinent journal and filestore options
and TCP nodelay, no changes at all.
Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front of a
RAID/JBODs or using SSDs for final storage?
If so, what results do you get out of the fio statement below per OSD?
In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD, which
is of course vastly faster than the normal indvidual HDDs could do.
So I'm wondering if I'm hitting some inherent limitation of how fast a
single OSD (as in the software) can handle IOPS, given that everything else
has been ruled out from where I stand.
This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while in my
test case the only time the caching becomes noticeable is if I increase
the cache size to something larger than the test data size. ^o^
On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.
Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
In the a "picture" being worth a thousand words tradition, I give you
avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 51.50 0.00 1633.50 0.00 7460.00 9.13 0.18 0.11 0.00 0.11 0.01 1.40
sdb 0.00 0.00 0.00 1240.50 0.00 5244.00 8.45 0.30 0.25 0.00 0.25 0.02 2.00
sdc 0.00 5.00 0.00 2468.50 0.00 13419.00 10.87 0.24 0.10 0.00 0.10 0.09 22.00
sdd 0.00 6.50 0.00 1913.00 0.00 10313.00 10.78 0.20 0.10 0.00 0.10 0.09 16.60
The %user CPU utilization is pretty much entirely the 2 OSD processes,
note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and service
times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network results
below is that the latency happens within the OSD processes.
Regards,
Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try runnin RADOS bench and
smalliobench at a few different sizes.
-Greg
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com <javascript:;>>
?: "Gregory Farnum" <greg at inktank.com <javascript:;>>,
ceph-users at lists.ceph.com <javascript:;>
Envoy?: Jeudi 8 Mai 2014 04:49:16
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com<javascript:;>>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6
behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop,
but the system is not CPU or otherwise resource starved at that
moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for
the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is
IPoIB with the associated low latency and the journal SSDs are
the (consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would be
256 IOs at a time, coming from different hosts over different links
and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side objecter
settings are; it might be limiting you to fewer outstanding ops
than that.
Googling for client-side objecter gives a few hits on ceph devel and
bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very
much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would be
interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a
wider array of experiments to try and figure out where it's coming
from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and still
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with
IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local
fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really
sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com <javascript:;> Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com <javascript:;>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Christian Balzer
2014-05-14 12:33:06 UTC
Permalink
Hello!
Post by Alexandre DERUMIER
Hi Christian,
I missed this thread, haven't been reading the list that well the last
weeks.
You already know my setup, since we discussed it in an earlier thread. I
don't have a fast backing store, but I see the slow IOPS when doing
randwrite inside the VM, with rbd cache. Still running dumpling here though.
Nods, I do recall that thread.
Post by Alexandre DERUMIER
A thought struck me that I could test with a pool that consists of OSDs
that have tmpfs-based disks, think I have a bit more latency than your
IPoIB but I've pushed 100k IOPS with the same network devices before.
This would verify if the problem is with the journal disks. I'll also
try to run the journal devices in tmpfs as well, as it would test
purely Ceph itself.
That would be interesting indeed.
Given what I've seen (with the journal at 20% utilization and the actual
filestore at around 5%) I'd expect Ceph to be the culprit.
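
A quick way to see where that time goes inside an OSD, as a rough sketch
assuming the default admin socket path and osd.0 as the example, is to ask
the daemon itself:

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops

The first dumps the internal latency counters, the second the slowest
recent ops with a timestamp for each stage, which should show whether the
wait is in the OSD op queue, the journal commit or the filestore apply.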
Post by Alexandre DERUMIER
I'll get back to you with the results, hopefully I'll manage to get them
done during this night.
Looking forward to that. ^^


Christian
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Josef Johansson
2014-05-14 22:26:30 UTC
Permalink
Hi,

So, apparently tmpfs does not support non-root (user) xattrs, due to a
possible DoS vector. The relevant kernel configuration appears to be set,
as far as I can see:

CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y

Anyone know a way around it? Saw that there's a patch for enabling it,
but recompiling my kernel is out of reach right now ;)
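
A quick sanity check for whether a given mount accepts user xattrs at all,
as a sketch assuming the attr tools (setfattr) are installed, would be:

touch /dev/shm/xattr-test
setfattr -n user.test -v 1 /dev/shm/xattr-test

On a tmpfs without user xattr support the setfattr fails with "Operation
not supported", while the trusted/security namespaces that
CONFIG_TMPFS_XATTR enables still work for root.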

Created the OSD with the following:

root@osd1:/# dd seek=6G if=/dev/zero of=/dev/shm/test-osd/img bs=1 count=1
root@osd1:/# losetup /dev/loop0 /dev/shm/test-osd/img
root@osd1:/# mkfs.xfs /dev/loop0
root@osd1:/# ceph osd create
50
root@osd1:/# mkdir /var/lib/ceph/osd/ceph-50
root@osd1:/# mount -t xfs /dev/loop0 /var/lib/ceph/osd/ceph-50
root@osd1:/# ceph-osd --debug_ms 50 -i 50 --mkfs --mkkey
--osd-journal=/dev/sdc7 --mkjournal
2014-05-15 00:20:29.796822 7f40063bb780 -1 journal FileJournal::_open:
aio not supported without directio; disabling aio
2014-05-15 00:20:29.798583 7f40063bb780 -1 journal check: ondisk fsid
bc14ff30-e016-4e0d-9672-96262ee5f07e doesn't match expected
b3f5b98b-e024-4153-875d-5c758a6060eb, invalid (someone else's?) journal
2014-05-15 00:20:29.802155 7f40063bb780 -1 journal FileJournal::_open:
aio not supported without directio; disabling aio
2014-05-15 00:20:29.807237 7f40063bb780 -1
filestore(/var/lib/ceph/osd/ceph-50) could not find
23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2014-05-15 00:20:29.809083 7f40063bb780 -1 created object store
/var/lib/ceph/osd/ceph-50 journal /dev/sdc7 for osd.50 fsid
c51a2683-55dc-4634-9d9d-f0fec9a6f389
2014-05-15 00:20:29.809121 7f40063bb780 -1 auth: error reading file:
/var/lib/ceph/osd/ceph-50/keyring: can't open
/var/lib/ceph/osd/ceph-50/keyring: (2) No such file or directory
2014-05-15 00:20:29.809179 7f40063bb780 -1 created new key in keyring
/var/lib/ceph/osd/ceph-50/keyring
root@osd1:/# ceph-osd --debug_ms 50 -i 50 --mkfs --mkkey
--osd-journal=/dev/sdc7 --mkjournal
2014-05-15 00:20:51.122716 7ff813ba4780 -1 journal FileJournal::_open:
aio not supported without directio; disabling aio
2014-05-15 00:20:51.126275 7ff813ba4780 -1 journal FileJournal::_open:
aio not supported without directio; disabling aio
2014-05-15 00:20:51.129532 7ff813ba4780 -1 provided osd id 50 !=
superblock's -1
2014-05-15 00:20:51.129845 7ff813ba4780 -1 ** ERROR: error creating
empty object store in /var/lib/ceph/osd/ceph-50: (22) Invalid argument
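
For what it's worth, it looks like the second --mkfs ran over the store the
first run had already (at least partially) created, and /dev/sdc7 still
carried a journal header from a previous OSD (hence the fsid mismatch). A
minimal retry sketch, assuming neither the loop image nor /dev/sdc7 holds
anything worth keeping and reusing the names from the log above, would be
to wipe both and run --mkfs exactly once:

umount /var/lib/ceph/osd/ceph-50
mkfs.xfs -f /dev/loop0
dd if=/dev/zero of=/dev/sdc7 bs=1M count=100   # clear the stale journal header
mount -t xfs /dev/loop0 /var/lib/ceph/osd/ceph-50
ceph-osd -i 50 --mkfs --mkkey --osd-journal=/dev/sdc7 --mkjournal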

Cheers,
Josef
Stefan Priebe - Profihost AG
2014-05-15 07:11:43 UTC
Permalink
Post by Josef Johansson
Hi,
So, apparently tmpfs does not support non-root xattr due to a possible
DoS-vector. There's configuration set for enabling it as far as I can see.
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y
Anyone know a way around it? Saw that there's a patch for enabling it,
but recompiling my kernel is out of reach right now ;)
I would create an empty file in tmpfs and then format that file as a
block device.
Josef Johansson
2014-05-15 07:56:30 UTC
Permalink
Post by Stefan Priebe - Profihost AG
Post by Josef Johansson
Hi,
So, apparently tmpfs does not support non-root xattr due to a possible
DoS-vector. There's configuration set for enabling it as far as I can see.
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y
Anyone know a way around it? Saw that there's a patch for enabling it,
but recompiling my kernel is out of reach right now ;)
I would create an empty file in tmpfs and then format that file as a
block device.
How do you mean exactly? Creating with dd and mounting with losetup?

Cheers,
Josef
Stefan Priebe - Profihost AG
2014-05-15 07:58:51 UTC
Permalink
Post by Josef Johansson
Post by Stefan Priebe - Profihost AG
Post by Josef Johansson
Hi,
So, apparently tmpfs does not support non-root xattr due to a possible
DoS-vector. There's configuration set for enabling it as far as I can see.
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y
Anyone know a way around it? Saw that there's a patch for enabling it,
but recompiling my kernel is out of reach right now ;)
I would create an empty file in tmpfs and then format that file as a
block device.
How do you mean exactly? Creating with dd and mounting with losetup?
mount -t tmpfs -o size=4G /mnt /mnt
dd if=/dev/zero of=/mnt/blockdev_a bs=1M count=4000
mkfs.xfs -f /mnt/blockdev_a
mount /mnt/blockdev_a /ceph/osd.X

Then use /mnt/blockdev_a as the OSD device.
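
One note: mounting a regular file directly usually needs a loop device, so
either -o loop on the mount or a losetup step as in the earlier mail. A
slightly fuller sketch of the same idea, assuming osd id 50 and a journal
file on the same tmpfs (names and sizes are only examples), would be:

mount -t tmpfs -o size=6G tmpfs /mnt
dd if=/dev/zero of=/mnt/blockdev_a bs=1M count=4000
mkfs.xfs -f /mnt/blockdev_a
mkdir -p /var/lib/ceph/osd/ceph-50
mount -o loop /mnt/blockdev_a /var/lib/ceph/osd/ceph-50
dd if=/dev/zero of=/mnt/journal_a bs=1M count=1024
ceph-osd -i 50 --mkfs --mkkey --osd-journal=/mnt/journal_a --mkjournal

With both the filestore and the journal in RAM the disks are completely out
of the picture, which is the "purely Ceph itself" test mentioned earlier
(tmpfs has no O_DIRECT, so "journal dio = false" may be needed in
ceph.conf, but for this comparison that should not matter).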
Josef Johansson
2014-06-13 18:19:06 UTC
Permalink
Hey,

I did try this, but it didn't work, so I think I still have to patch
the kernel, as user_xattr is not allowed on tmpfs.

Thanks for the description though.

I think the next step in this is to do it all virtual, maybe on the same
hardware to avoid the network.
Any problems with doing it all virtual? If it's just memory and the same
machine, we should see the pure Ceph performance, right?

Anyone done this?
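
Along those lines, running rados bench directly on one of the storage nodes
takes the VM and the RBD layer out of the path; a sketch, assuming a
throwaway pool (name and pg count are only examples):

ceph osd pool create benchtest 128
rados -p benchtest bench 30 write -b 4096 -t 128

Comparing the ops/sec from that with the ~3200 IOPS seen through RBD should
show how much of the limit sits in the OSDs themselves and how much in the
client path. (Delete the test pool again afterwards.)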

Cheers,
Josef
Post by Stefan Priebe - Profihost AG
Post by Josef Johansson
Post by Stefan Priebe - Profihost AG
Post by Josef Johansson
Hi,
So, apparently tmpfs does not support non-root xattr due to a possible
DoS-vector. There's configuration set for enabling it as far as I can see.
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y
Anyone know a way around it? Saw that there's a patch for enabling it,
but recompiling my kernel is out of reach right now ;)
I would create an empty file in tmpfs and then format that file as a
block device.
How do you mean exactly? Creating with dd and mounting with losetup?
mount -t tmpfs -o size=4G /mnt /mnt
dd if=/dev/zero of=/mnt/blockdev_a bs=1M count=4000
mkfs.xfs -f /mnt/blockdev_a
mount /mnt/blockdev_a /ceph/osd.X
Dann /mnt/blockdev_a als OSD device nutzen.
Post by Josef Johansson
Cheers,
Josef
Post by Stefan Priebe - Profihost AG
Post by Josef Johansson
root@osd1:/# dd seek=6G if=/dev/zero of=/dev/shm/test-osd/img bs=1 count=1
root@osd1:/# losetup /dev/loop0 /dev/shm/test-osd/img
root@osd1:/# mkfs.xfs /dev/loop0
root@osd1:/# ceph osd create
50
root@osd1:/# mkdir /var/lib/ceph/osd/ceph-50
root@osd1:/# mount -t xfs /dev/loop0 /var/lib/ceph/osd/ceph-50
root@osd1:/# ceph-osd --debug_ms 50 -i 50 --mkfs --mkkey
--osd-journal=/dev/sdc7 --mkjournal
aio not supported without directio; disabling aio
2014-05-15 00:20:29.798583 7f40063bb780 -1 journal check: ondisk fsid
bc14ff30-e016-4e0d-9672-96262ee5f07e doesn't match expected
b3f5b98b-e024-4153-875d-5c758a6060eb, invalid (someone else's?) journal
aio not supported without directio; disabling aio
2014-05-15 00:20:29.807237 7f40063bb780 -1
filestore(/var/lib/ceph/osd/ceph-50) could not find
23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2014-05-15 00:20:29.809083 7f40063bb780 -1 created object store
/var/lib/ceph/osd/ceph-50 journal /dev/sdc7 for osd.50 fsid
c51a2683-55dc-4634-9d9d-f0fec9a6f389
/var/lib/ceph/osd/ceph-50/keyring: can't open
/var/lib/ceph/osd/ceph-50/keyring: (2) No such file or directory
2014-05-15 00:20:29.809179 7f40063bb780 -1 created new key in keyring
/var/lib/ceph/osd/ceph-50/keyring
root at osd1:/# ceph-osd --debug_ms 50 -i 50 --mkfs --mkkey
--osd-journal=/dev/sdc7 --mkjournal
aio not supported without directio; disabling aio
aio not supported without directio; disabling aio
2014-05-15 00:20:51.129532 7ff813ba4780 -1 provided osd id 50 !=
superblock's -1
2014-05-15 00:20:51.129845 7ff813ba4780 -1 ** ERROR: error creating
empty object store in /var/lib/ceph/osd/ceph-50: (22) Invalid argument
Cheers,
Josef
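(For what it's worth, the second --mkfs run above appears to fail simply because the first run already initialised an object store in that directory, and the earlier "ondisk fsid ... doesn't match" message just means the journal partition carried a stale header, which --mkjournal rewrites. A hedged retry sketch, destructive and only for a throwaway OSD:)
---
rm -rf /var/lib/ceph/osd/ceph-50/*            # clear the half-initialised data dir
dd if=/dev/zero of=/dev/sdc7 bs=1M count=16   # wipe the stale journal header
ceph-osd -i 50 --mkfs --mkkey --osd-journal=/dev/sdc7 --mkjournal
---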
Post by Christian Balzer
Hello!
Post by Alexandre DERUMIER
Hi Christian,
I missed this thread, haven't been reading the list that well the last
weeks.
You already know my setup, since we discussed it in an earlier thread. I
don't have a fast backing store, but I see the slow IOPS when doing
randwrite inside the VM, with rbd cache. Still running dumpling here though.
Nods, I do recall that thread.
Post by Alexandre DERUMIER
A thought struck me that I could test with a pool that consists of OSDs
that have tempfs-based disks, think I have a bit more latency than your
IPoIB but I've pushed 100k IOPS with the same network devices before.
This would verify if the problem is with the journal disks. I'll also
try to run the journal devices in tempfs as well, as it would test
purely Ceph itself.
That would be interesting indeed.
Given what I've seen (with the journal at 20% utilization and the actual
filestore at around 5%) I'd expect Ceph to be the culprit.
Post by Alexandre DERUMIER
I'll get back to you with the results, hopefully I'll manage to get them
done during this night.
Looking forward to that. ^^
Christian
Post by Alexandre DERUMIER
Cheers,
Josef
Post by Christian Balzer
I'm clearly talking to myself, but whatever.
For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.
Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front of
a RAID/JBODs or using SSDs for final storage?
If so, what results do you get out of the fio statement below per OSD?
In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD,
which is of course vastly faster than the normal individual HDDs could
do.
So I'm wondering if I'm hitting some inherent limitation of how fast a
single OSD (as in the software) can handle IOPS, given that everything
else has been ruled out from where I stand.
This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while in
my test case the only time the caching becomes noticeable is if I
increase the cache size to something larger than the test data size.
^o^
On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.
Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
In the a "picture" being worth a thousand words tradition, I give you
avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58
Device:         rrqm/s   wrqm/s     r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    51.50    0.00  1633.50     0.00   7460.00     9.13     0.18    0.11    0.00    0.11   0.01   1.40
sdb               0.00     0.00    0.00  1240.50     0.00   5244.00     8.45     0.30    0.25    0.00    0.25   0.02   2.00
sdc               0.00     5.00    0.00  2468.50     0.00  13419.00    10.87     0.24    0.10    0.00    0.10   0.09  22.00
sdd               0.00     6.50    0.00  1913.00     0.00  10313.00    10.78     0.20    0.10    0.00    0.10   0.09  16.60
The %user CPU utilization is pretty much entirely the 2 OSD processes,
note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and service
times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network
results below is that the latency happens within the OSD processes.
Regards,
Christian
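(One way to see where that time goes inside the OSD process is the admin socket; a hedged sketch, since the exact fields and the availability of "ceph osd perf" vary by release:)
---
# commit/apply latency per OSD, as reported by the cluster
ceph osd perf
# slowest recent ops on one OSD, with per-stage timestamps
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops
---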
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try running RADOS bench and
smalliobench at a few different sizes.
-Greg
On Wednesday, May 7, 2014, Alexandre DERUMIER <aderumier at odiso.com>
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Gregory Farnum" <greg at inktank.com>,
ceph-users at lists.ceph.com
Sent: Thursday, 8 May 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6
behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop,
but the system is not CPU or otherwise resource starved at that
moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for
the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is
IPoIB with the associated low latency and the journal SSDs are
the (consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would
be 256 IOs at a time, coming from different hosts over different
links and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side objecter
settings are; it might be limiting you to fewer outstanding ops
than that.
Googling for client-side objecter gives a few hits on ceph devel and
bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very
much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would be
interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a
wider array of experiments to try and figure out where it's coming
from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with
IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local
fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really
sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion
Communications http://www.gol.com/
Christian Balzer
2014-05-23 15:57:50 UTC
Permalink
For what it's worth (very little in my case)...

Since the cluster wasn't in production yet and Firefly (0.80.1) did hit
Debian Jessie today I upgraded it.

Big mistake...

I did the recommended upgrade song and dance, MONs first, OSDs after that.

Then applied "ceph osd crush tunables default" as per the update
instructions and since "ceph -s" was whining about it.

Lastly I did a "ceph osd pool set rbd hashpspool true" and after that was
finished (people with either a big cluster or slow network probably should
avoid this like the plague) I re-ran the below fio from a VM (old or new
client libraries made no difference) again.

The result, 2800 write IOPS instead of 3200 with Emperor.

So much for improved latency and whatnot...

Christian
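(If the regression is to be pinned down, the tunables half of that upgrade can at least be inspected and, if need be, rolled back; a hedged sketch, and note that switching profiles shuffles data around:)
---
ceph osd crush show-tunables        # what the cluster is using now
ceph osd crush tunables legacy      # back to the pre-Firefly profile (causes rebalancing)
---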
Post by Christian Balzer
Hello!
Post by Alexandre DERUMIER
Hi Christian,
I missed this thread, haven't been reading the list that well the last
weeks.
You already know my setup, since we discussed it in an earlier thread.
I don't have a fast backing store, but I see the slow IOPS when doing
randwrite inside the VM, with rbd cache. Still running dumpling here though.
Nods, I do recall that thread.
Post by Alexandre DERUMIER
A thought struck me that I could test with a pool that consists of OSDs
that have tempfs-based disks, think I have a bit more latency than your
IPoIB but I've pushed 100k IOPS with the same network devices before.
This would verify if the problem is with the journal disks. I'll also
try to run the journal devices in tempfs as well, as it would test
purely Ceph itself.
That would be interesting indeed.
Given what I've seen (with the journal at 20% utilization and the actual
filestore at around 5%) I'd expect Ceph to be the culprit.
Post by Alexandre DERUMIER
I'll get back to you with the results, hopefully I'll manage to get
them done during this night.
Looking forward to that. ^^
Christian
Post by Alexandre DERUMIER
Cheers,
Josef
Post by Christian Balzer
I'm clearly talking to myself, but whatever.
For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.
Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front
of a RAID/JBODs or using SSDs for final storage?
If so, what results do you get out of the fio statement below per
OSD? In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per
OSD, which is of course vastly faster than the normal individual HDDs
could do.
So I'm wondering if I'm hitting some inherent limitation of how fast
a single OSD (as in the software) can handle IOPS, given that
everything else has been ruled out from where I stand.
This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while in
my test case the only time the caching becomes noticeable is if I
increase the cache size to something larger than the test data size.
^o^
On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.
Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
In the a "picture" being worth a thousand words tradition, I give
avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58
Device:         rrqm/s   wrqm/s     r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    51.50    0.00  1633.50     0.00   7460.00     9.13     0.18    0.11    0.00    0.11   0.01   1.40
sdb               0.00     0.00    0.00  1240.50     0.00   5244.00     8.45     0.30    0.25    0.00    0.25   0.02   2.00
sdc               0.00     5.00    0.00  2468.50     0.00  13419.00    10.87     0.24    0.10    0.00    0.10   0.09  22.00
sdd               0.00     6.50    0.00  1913.00     0.00  10313.00    10.78     0.20    0.10    0.00    0.10   0.09  16.60
The %user CPU utilization is pretty much entirely the 2 OSD
processes, note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and service
times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network
results below is that the latency happens within the OSD processes.
Regards,
Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try running RADOS bench and
smalliobench at a few different sizes.
-Greg
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Gregory Farnum" <greg at inktank.com>,
ceph-users at lists.ceph.com
Sent: Thursday, 8 May 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each.
The journals are on (separate) DC 3700s, the actual OSDs are
RAID6 behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on
atop, but the system is not CPU or otherwise resource starved
at that moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for
the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network
is IPoIB with the associated low latency and the journal SSDs
are the (consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would
be 256 IOs at a time, coming from different hosts over different
links and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side
objecter settings are; it might be limiting you to fewer
outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel
and bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is
very much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would
be interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a
wider array of experiments to try and figure out where it's
coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience
with IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local
fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not
really sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion
Communications http://www.gol.com/
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Josef Johansson
2014-06-13 18:22:03 UTC
Permalink
Hey,

That sounds awful. Have you had any luck in increasing the performance?

Cheers,
Josef
Post by Christian Balzer
For what it's worth (very little in my case)...
Since the cluster wasn't in production yet and Firefly (0.80.1) did hit
Debian Jessie today I upgraded it.
Big mistake...
I did the recommended upgrade song and dance, MONs first, OSDs after that.
Then applied "ceph osd crush tunables default" as per the update
instructions and since "ceph -s" was whining about it.
Lastly I did a "ceph osd pool set rbd hashpspool true" and after that was
finished (people with either a big cluster or slow network probably should
avoid this like the plague) I re-ran the below fio from a VM (old or new
client libraries made no difference) again.
The result, 2800 write IOPS instead of 3200 with Emperor.
So much for improved latency and whatnot...
Christian
Post by Christian Balzer
Hello!
Post by Alexandre DERUMIER
Hi Christian,
I missed this thread, haven't been reading the list that well the last
weeks.
You already know my setup, since we discussed it in an earlier thread.
I don't have a fast backing store, but I see the slow IOPS when doing
randwrite inside the VM, with rbd cache. Still running dumpling here though.
Nods, I do recall that thread.
Post by Alexandre DERUMIER
A thought struck me that I could test with a pool that consists of OSDs
that have tempfs-based disks, think I have a bit more latency than your
IPoIB but I've pushed 100k IOPS with the same network devices before.
This would verify if the problem is with the journal disks. I'll also
try to run the journal devices in tempfs as well, as it would test
purely Ceph itself.
That would be interesting indeed.
Given what I've seen (with the journal at 20% utilization and the actual
filestore at around 5%) I'd expect Ceph to be the culprit.
Post by Alexandre DERUMIER
I'll get back to you with the results, hopefully I'll manage to get
them done during this night.
Looking forward to that. ^^
Christian
Post by Alexandre DERUMIER
Cheers,
Josef
Post by Christian Balzer
I'm clearly talking to myself, but whatever.
For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.
Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front
of a RAID/JBODs or using SSDs for final storage?
If so, what results do you get out of the fio statement below per
OSD? In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per
OSD, which is of course vastly faster than the normal individual HDDs
could do.
So I'm wondering if I'm hitting some inherent limitation of how fast
a single OSD (as in the software) can handle IOPS, given that
everything else has been ruled out from where I stand.
This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while in
my test case the only time the caching becomes noticeable is if I
increase the cache size to something larger than the test data size.
^o^
On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.
Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
In the a "picture" being worth a thousand words tradition, I give
avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58
Device:         rrqm/s   wrqm/s     r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    51.50    0.00  1633.50     0.00   7460.00     9.13     0.18    0.11    0.00    0.11   0.01   1.40
sdb               0.00     0.00    0.00  1240.50     0.00   5244.00     8.45     0.30    0.25    0.00    0.25   0.02   2.00
sdc               0.00     5.00    0.00  2468.50     0.00  13419.00    10.87     0.24    0.10    0.00    0.10   0.09  22.00
sdd               0.00     6.50    0.00  1913.00     0.00  10313.00    10.78     0.20    0.10    0.00    0.10   0.09  16.60
The %user CPU utilization is pretty much entirely the 2 OSD
processes, note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and service
times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network
results below is that the latency happens within the OSD processes.
Regards,
Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try running RADOS bench and
smalliobench at a few different sizes.
-Greg
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Gregory Farnum" <greg at inktank.com>,
ceph-users at lists.ceph.com
Sent: Thursday, 8 May 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each.
The journals are on (separate) DC 3700s, the actual OSDs are
RAID6 behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on
atop, but the system is not CPU or otherwise resource starved
at that moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for
the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network
is IPoIB with the associated low latency and the journal SSDs
are the (consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that
correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would
be 256 IOs at a time, coming from different hosts over different
links and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side
objecter settings are; it might be limiting you to fewer
outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel
and bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is
very much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would
be interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a
wider array of experiments to try and figure out where it's
coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience
with IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local
fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not
really sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion
Communications http://www.gol.com/
Christian Balzer
2014-05-08 05:44:49 UTC
Permalink
Hello,
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
No and that is neither an option nor the reason for any performance issues
here.
If you re-read my original mail it clearly states that the same fio can
achieve 110000 IOPS on that raid and that it is not busy at all during the
test.
Post by Alexandre DERUMIER
(how many disks do you have begin the raid6 ?)
11 per OSD.
This will affect the amount of sustainable IOPS of course, but in this
test case every last bit should (and does) fit into the caches.

From the RBD client the transaction should be finished once the primary
and secondary OSD for the PG in question have ACK'ed things.
Post by Alexandre DERUMIER
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
I can, but that is not the test case here.
For the record that pushes it to 12k IOPS, with the journal SSDs reaching
about 30% utilization and the actual OSDs up to 5%.
So much better, but still quite some capacity for improvement.
Post by Alexandre DERUMIER
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
I have that set of course, as well as specifically "writeback" for the KVM
instance in question.

Interestingly I see no difference at all with a KVM instance that is set
explicitly to "none", but that's not part of this particular inquiry
either.

Christian
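(For completeness, the qemu side of that "writeback" setting, as a hedged example with a made-up pool/image name; with libvirt the same thing is cache='writeback' on the disk's <driver> element:)
---
-drive file=rbd:rbd/vm-disk:id=admin,format=raw,if=virtio,cache=writeback
---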
Post by Alexandre DERUMIER
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Gregory Farnum" <greg at inktank.com>, ceph-users at lists.ceph.com
Sent: Thursday, 8 May 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Post by Gregory Farnum
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6
behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the journals
is about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS
after some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop,
but the system is not CPU or otherwise resource starved at that
moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for the
whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is
IPoIB with the associated low latency and the journal SSDs are the
(consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would be
256 IOs at a time, coming from different hosts over different links and
nothing changes.
Post by Gregory Farnum
that's about 40ms of latency per op (for userspace RBD), which seems
awfully long. You should check what your client-side objecter settings
are; it might be limiting you to fewer outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel and
bugs and nothing at all as far as configuration options are concerned.
Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very much
in the same (junior league) ballpark.
Post by Gregory Farnum
If
it's available to you, testing with Firefly or even master would be
interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
Post by Gregory Farnum
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a wider
array of experiments to try and figure out where it's coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and still
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with
IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local fio
tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really sure
what experiments I should run on that. ^o^
Regards,
Christian
Post by Gregory Farnum
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Gregory Farnum
2014-05-08 05:13:53 UTC
Permalink
Oh, I didn't notice that. I bet you aren't getting the expected throughput
on the RAID array with OSD access patterns, and that's applying back
pressure on the journal.

When I suggested other tests, I meant with and without Ceph. One particular
one is OSD bench. That should be interesting to try at a variety of block
sizes. You could also try running RADOS bench and smalliobench at a few
different sizes.
-Greg
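(Hedged examples of those runs, with the pool name and sizes being illustrative:)
---
ceph tell osd.0 bench 1073741824 4096         # write 1 GB to a single OSD in 4 KB chunks
rados -p rbd bench 30 write -b 4096 -t 128    # 4 KB writes, 128 in flight, for 30 seconds
---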
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Gregory Farnum" <greg at inktank.com>,
ceph-users at lists.ceph.com
Sent: Thursday, 8 May 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer <chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6 behind
an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the journals is
about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after
some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop, but
the system is not CPU or otherwise resource starved at that moment.
Running multiple instances of this test from several VMs on different
hosts changes nothing, as in the aggregated IOPS for the whole cluster
will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is IPoIB
with the associated low latency and the journal SSDs are the
(consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would be 256
IOs at a time, coming from different hosts over different links and
nothing changes.
that's about 40ms of latency per op (for userspace RBD), which seems
awfully long. You should check what your client-side objecter settings
are; it might be limiting you to fewer outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel and bugs
and nothing at all as far as configuration options are concerned.
Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very much
in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would be
interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a wider
array of experiments to try and figure out where it's coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and still
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with IPoIB
led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local fio
tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really sure
what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
--
Software Engineer #42 @ http://inktank.com | http://ceph.com
Alexandre DERUMIER
2014-05-08 06:41:54 UTC
Permalink
Stupid question : Is your areca 4GB cache shared between ssd journal and osd ?

or only use by osds ?



----- Original Message -----

From: "Christian Balzer" <chibi at gol.com>
To: ceph-users at lists.ceph.com
Sent: Thursday, 8 May 2014 08:26:33
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices


Hello,
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
I doubt that based on what I see in terms of local performance and actual
utilization figures according to iostat and atop during the tests.

But if that were to be true, how would one see if that's the case, as in
where in the plethora of data from:

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump

is the data I'd be looking for?
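(A hedged way to narrow it down: pretty-print the dump and look at the latency counters; names differ a little between releases, but the filestore journal_latency / apply_latency and the osd op_w_latency / subop_w_latency entries are the interesting ones if present:)
---
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump \
  | python -mjson.tool | grep -A 2 latency
---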
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try runnin RADOS bench and
smalliobench at a few different sizes.
I already did the local tests, as in w/o Ceph, see the original mail below.

And you might recall me doing rados benches as well in another thread 2
weeks ago or so.

In either case, osd benching gives me:
---
# time ceph tell osd.0 bench
{ "bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": "247102026.000000"}


real 0m4.483s
---
This is quite a bit slower than this particular SSD (200GB DC 3700) should
be able to write, but I will let that slide.
Note that it is the journal SSD that gets under pressure here (nearly 900%
util) while the OSD is bored at around 15%. Which is no surprise, as it
can write data at up to 1600MB/s.

at 4k blocks we see:
---
# time ceph tell osd.0 bench 1073741824 4096
{ "bytes_written": 1073741824,
"blocksize": 4096,
"bytes_per_sec": "9004316.000000"}


real 1m59.368s
---
Here we get a more balanced picture between journal and storage
utilization, hovering around 40-50%.
So clearly not overtaxing either component.
But yet, this looks like 2100 IOPS to me, if my math is half right.

Rados at 4k gives us this:
---
Total time run: 30.912786
Total writes made: 44490
Write size: 4096
Bandwidth (MB/sec): 5.622

Stddev Bandwidth: 3.31452
Max bandwidth (MB/sec): 9.92578
Min bandwidth (MB/sec): 0
Average Latency: 0.0444653
Stddev Latency: 0.121887
Max latency: 2.80917
Min latency: 0.001958
---
So this is even worse, just about 1500 IOPS.
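(The arithmetic behind those two estimates, for the record:)
---
echo $((9004316 / 4096))   # osd bench at 4 KB: ~2198 IOPS
echo $((44490 / 31))       # rados bench: 44490 writes in ~31 s, ~1435 IOPS
---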

Regards,

Christian
Post by Gregory Farnum
-Greg
On Wednesday, May 7, 2014, Alexandre DERUMIER <aderumier at odiso.com>
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Gregory Farnum" <greg at inktank.com>,
ceph-users at lists.ceph.com
Sent: Thursday, 8 May 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6
behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop,
but the system is not CPU or otherwise resource starved at that
moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for the
whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is
IPoIB with the associated low latency and the journal SSDs are the
(consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would be
256 IOs at a time, coming from different hosts over different links and
nothing changes.
that's about 40ms of latency per op (for userspace RBD), which seems
awfully long. You should check what your client-side objecter
settings are; it might be limiting you to fewer outstanding ops than
that.
Googling for client-side objecter gives a few hits on ceph devel and
bugs and nothing at all as far as configuration options are concerned.
Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very
much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would be
interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a wider
array of experiments to try and figure out where it's coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and still
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with
IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local fio
tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really
sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Christian Balzer
2014-05-08 06:52:15 UTC
Permalink
Post by Alexandre DERUMIER
Stupid question : Is your areca 4GB cache shared between ssd journal and osd ?
Not a stupid question.
I made that mistake about 3 years ago in a DRBD setup, OS and activity log
SSDs on the same controller as the storage disks.
Post by Alexandre DERUMIER
or only use by osds ?
Only used by the OSDs (2 in total, 11x3TB HDD in RAID6).
I keep repeating myself, neither the journal devices nor the OSDs seem to
be under any particular load or pressure (utilization) according to iostat
and atop during the tests.

Christian
Post by Alexandre DERUMIER
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: ceph-users at lists.ceph.com
Sent: Thursday, 8 May 2014 08:26:33
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Hello,
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
I doubt that based on what I see in terms of local performance and
actual utilization figures according to iostat and atop during the
tests.
But if that were to be true, how would one see if that's the case, as in
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
is the data I'd be looking for?
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try running RADOS bench and
smalliobench at a few different sizes.
I already did the local tests, as in w/o Ceph, see the original mail below.
And you might recall me doing rados benches as well in another thread 2
weeks ago or so.
---
# time ceph tell osd.0 bench
{ "bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": "247102026.000000"}
real 0m4.483s
---
This is quite a bit slower than this particular SSD (200GB DC 3700)
should be able to write, but I will let that slide.
Note that it is the journal SSD that gets under pressure here (nearly
900% util) while the OSD is bored at around 15%. Which is no surprise,
as it can write data at up to 1600MB/s.
---
# time ceph tell osd.0 bench 1073741824 4096
{ "bytes_written": 1073741824,
"blocksize": 4096,
"bytes_per_sec": "9004316.000000"}
real 1m59.368s
---
Here we get a more balanced picture between journal and storage
utilization, hovering around 40-50%.
So clearly not overtaxing either component.
But yet, this looks like 2100 IOPS to me, if my math is half right.
---
Total time run: 30.912786
Total writes made: 44490
Write size: 4096
Bandwidth (MB/sec): 5.622
Stddev Bandwidth: 3.31452
Max bandwidth (MB/sec): 9.92578
Min bandwidth (MB/sec): 0
Average Latency: 0.0444653
Stddev Latency: 0.121887
Max latency: 2.80917
Min latency: 0.001958
---
So this is even worse, just about 1500 IOPS.
Regards,
Christian
Post by Gregory Farnum
-Greg
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Gregory Farnum" <greg at inktank.com>,
ceph-users at lists.ceph.com
Sent: Thursday, 8 May 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each.
The journals are on (separate) DC 3700s, the actual OSDs are
RAID6 behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on
atop, but the system is not CPU or otherwise resource starved at
that moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for
the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network
is IPoIB with the associated low latency and the journal SSDs
are the (consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that
correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would
be 256 IOs at a time, coming from different hosts over different
links and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side
objecter settings are; it might be limiting you to fewer
outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel and
bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very
much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would
be interesting ? there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a
wider array of experiments to try and figure out where it's coming
from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with
IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local
fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really
sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion
Communications http://www.gol.com/
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Christian Balzer
2014-05-08 06:52:15 UTC
Permalink
Post by Alexandre DERUMIER
Stupid question : Is your areca 4GB cache shared between ssd journal and osd ?
Not a stupid question.
I made that mistake about 3 years ago in a DRBD setup, OS and activity log
SSDs on the same controller as the storage disks.
Post by Alexandre DERUMIER
or only used by the OSDs?
Only used by the OSDs (2 in total, 11x3TB HDD in RAID6).
I keep repeating myself, neither the journal devices nor the OSDs seem to
be under any particular load or pressure (utilization) according to iostat
and atop during the tests.
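For reference, the utilization figures above come from plain iostat sampling;
something like the following (device names as used later in this thread, so
treat them as an example) reproduces that view:
---
iostat -x 2 sda sdb sdc sdd
---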

Christian
Post by Alexandre DERUMIER
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: ceph-users at lists.ceph.com
Sent: Thursday, May 8, 2014 08:26:33
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Hello,
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
I doubt that based on what I see in terms of local performance and
actual utilization figures according to iostat and atop during the
tests.
But if that were to be true, how would one see if that's the case, as in
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
is the data I'd be looking for?
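A minimal way to narrow that dump down, assuming only stock tooling; the
journal/filestore counter names vary by release, so the grep pattern is just
a starting point:
---
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump \
    | python -m json.tool > osd0-perf.json
grep -i -A 2 'journal\|throttle' osd0-perf.json | less
---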
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try running RADOS bench and
smalliobench at a few different sizes.
I already did the local tests, as in w/o Ceph, see the original mail below.
And you might recall me doing rados benches as well in another thread 2
weeks ago or so.
---
# time ceph tell osd.0 bench
{ "bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": "247102026.000000"}
real 0m4.483s
---
This is quite a bit slower than this particular SSD (200GB DC 3700)
should be able to write, but I will let that slide.
Note that it is the journal SSD that gets under pressure here (nearly
900% util) while the OSD is bored at around 15%. Which is no surprise,
as it can write data at up to 1600MB/s.
---
# time ceph tell osd.0 bench 1073741824 4096
{ "bytes_written": 1073741824,
"blocksize": 4096,
"bytes_per_sec": "9004316.000000"}
real 1m59.368s
---
Here we get a more balanced picture between journal and storage
utilization, hovering around 40-50%.
So clearly not overtaxing either component.
But yet, this looks like 2100 IOPS to me, if my math is half right.
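For the record, that is simply the reported bytes_per_sec divided by the
block size (nothing Ceph-specific):
---
$ echo $((9004316 / 4096))
2198
---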
---
Total time run: 30.912786
Total writes made: 44490
Write size: 4096
Bandwidth (MB/sec): 5.622
Stddev Bandwidth: 3.31452
Max bandwidth (MB/sec): 9.92578
Min bandwidth (MB/sec): 0
Average Latency: 0.0444653
Stddev Latency: 0.121887
Max latency: 2.80917
Min latency: 0.001958
---
So this is even worse, just about 1500 IOPS.
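The same back-of-the-envelope check for the rados bench run, either from
total writes over runtime or from the reported bandwidth:
---
$ echo "44490 / 30.912786" | bc -l      # writes / seconds
1439.21...
$ echo "5.622 * 1024 / 4" | bc -l       # MB/s converted to 4k IOPS
1439.23...
---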
Regards,
Christian
Post by Gregory Farnum
-Greg
Post by Alexandre DERUMIER
Hi Christian,
Have you tried without RAID6, to have more OSDs?
(how many disks do you have behind the RAID6?)
Also, I know that direct I/O can be quite slow with Ceph,
maybe you can try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Gregory Farnum" <greg at inktank.com>,
ceph-users at lists.ceph.com
Sent: Thursday, May 8, 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each.
The journals are on (separate) DC 3700s, the actual OSDs are
RAID6 behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on
atop, but the system is not CPU or otherwise resource starved at
that moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for
the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network
is IPoIB with the associated low latency and the journal SSDs
are the (consistently) fastest ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that
correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would
be 256 IOs at a time, coming from different hosts over different
links and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side
objecter settings are; it might be limiting you to fewer
outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel and
bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very
much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would
be interesting; there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a
wider array of experiments to try and figure out where it's coming
from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and still isn't that laggy:
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with
IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local
fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really
sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Alexandre DERUMIER
2014-05-08 09:31:54 UTC
Permalink
Post by Christian Balzer
The OSD processes are quite busy, reading well over 200% on atop, but
the system is not CPU or otherwise resource starved at that moment.
The OSD daemon uses 2 op threads by default (which could explain the 200%).

Maybe you can try to put in ceph.conf:

osd op threads = 8


(I don't know how many cores you have)
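A sketch of how that could be applied without a restart, assuming the
injectargs syntax of this release (option name unchanged from ceph.conf):
---
# ceph.conf, [osd] section:
#     osd op threads = 8
# or on a running cluster:
ceph tell osd.* injectargs '--osd-op-threads 8'
---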



----- Original Message -----

From: "Christian Balzer" <chibi at gol.com>
To: ceph-users at lists.ceph.com
Cc: "Alexandre DERUMIER" <aderumier at odiso.com>
Sent: Thursday, May 8, 2014 08:52:15
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Post by Christian Balzer
Stupid question : Is your areca 4GB cache shared between ssd journal and
osd ?
Not a stupid question.
I made that mistake about 3 years ago in a DRBD setup, OS and activity log
SSDs on the same controller as the storage disks.
Post by Christian Balzer
or only used by the OSDs?
Only used by the OSDs (2 in total, 11x3TB HDD in RAID6).
I keep repeating myself, neither the journal devices nor the OSDs seem to
be under any particular load or pressure (utilization) according to iostat
and atop during the tests.

Christian
Post by Christian Balzer
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: ceph-users at lists.ceph.com
Sent: Thursday, May 8, 2014 08:26:33
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing
devices
Hello,
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
I doubt that based on what I see in terms of local performance and
actual utilization figures according to iostat and atop during the
tests.
But if that were to be true, how would one see if that's the case, as in
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
is the data I'd be looking for?
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try running RADOS bench and
smalliobench at a few different sizes.
I already did the local tests, as in w/o Ceph, see the original mail
below.
And you might recall me doing rados benches as well in another thread 2
weeks ago or so.
---
# time ceph tell osd.0 bench
{ "bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": "247102026.000000"}
real 0m4.483s
---
This is quite a bit slower than this particular SSD (200GB DC 3700)
should be able to write, but I will let that slide.
Note that it is the journal SSD that gets under pressure here (nearly
900% util) while the OSD is bored at around 15%. Which is no surprise,
as it can write data at up to 1600MB/s.
---
# time ceph tell osd.0 bench 1073741824 4096
{ "bytes_written": 1073741824,
"blocksize": 4096,
"bytes_per_sec": "9004316.000000"}
real 1m59.368s
---
Here we get a more balanced picture between journal and storage
utilization, hovering around 40-50%.
So clearly not overtaxing either component.
But yet, this looks like 2100 IOPS to me, if my math is half right.
---
Total time run: 30.912786
Total writes made: 44490
Write size: 4096
Bandwidth (MB/sec): 5.622
Stddev Bandwidth: 3.31452
Max bandwidth (MB/sec): 9.92578
Min bandwidth (MB/sec): 0
Average Latency: 0.0444653
Stddev Latency: 0.121887
Max latency: 2.80917
Min latency: 0.001958
---
So this is even worse, just about 1500 IOPS.
Regards,
Christian
Post by Gregory Farnum
-Greg
On Wednesday, May 7, 2014, Alexandre DERUMIER <aderumier at odiso.com>
Post by Alexandre DERUMIER
Hi Christian,
Have you tried without RAID6, to have more OSDs?
(how many disks do you have behind the RAID6?)
Also, I know that direct I/O can be quite slow with Ceph,
maybe you can try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Gregory Farnum" <greg at inktank.com>,
ceph-users at lists.ceph.com
Sent: Thursday, May 8, 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each.
The journals are on (separate) DC 3700s, the actual OSDs are
RAID6 behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on
atop, but the system is not CPU or otherwise resource starved at
that moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for
the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network
is IPoIB with the associated low latency and the journal SSDs
are the (consistently) fastest ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got
lost.
Hmm, with 128 IOs at a time (I believe I'm reading that
correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would
be 256 IOs at a time, coming from different hosts over different
links and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side
objecter settings are; it might be limiting you to fewer
outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel and
bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very
much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would
be interesting; there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a
wider array of experiments to try and figure out where it's coming
from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and still isn't that laggy:
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with
IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local
fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really
sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Christian Balzer
2014-05-08 09:42:12 UTC
Permalink
Hello,
Post by Alexandre DERUMIER
Post by Christian Balzer
The OSD processes are quite busy, reading well over 200% on atop, but
the system is not CPU or otherwise resource starved at that moment.
The OSD daemon uses 2 op threads by default (which could explain the 200%).
Maybe you can try to put in ceph.conf:
osd op threads = 8
Already at 10 (for some weeks now). ^o^

How that setting relates to the actual 220 threads per OSD process is a
mystery for another day.
Post by Alexandre DERUMIER
(I don't know how many cores you have)
6.
The OSDs get busy (CPU, not IOWAIT), but there still are 1-2 cores idle at
that point.
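For what it's worth, a quick way to count those threads per OSD process
(plain procfs, nothing Ceph-specific; pidof returns every ceph-osd on the
node):
---
for p in $(pidof ceph-osd); do
    echo "pid $p: $(grep Threads /proc/$p/status)"
done
---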
Post by Alexandre DERUMIER
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: ceph-users at lists.ceph.com
Cc: "Alexandre DERUMIER" <aderumier at odiso.com>
Sent: Thursday, May 8, 2014 08:52:15
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Post by Christian Balzer
Stupid question : Is your areca 4GB cache shared between ssd journal and osd ?
Not a stupid question.
I made that mistake about 3 years ago in a DRBD setup, OS and activity
log SSDs on the same controller as the storage disks.
Post by Christian Balzer
or only used by the OSDs?
Only used by the OSDs (2 in total, 11x3TB HDD in RAID6).
I keep repeating myself, neither the journal devices nor the OSDs seem
to be under any particular load or pressure (utilization) according to
iostat and atop during the tests.
Christian
Post by Christian Balzer
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: ceph-users at lists.ceph.com
Sent: Thursday, May 8, 2014 08:26:33
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Hello,
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
I doubt that based on what I see in terms of local performance and
actual utilization figures according to iostat and atop during the
tests.
But if that were to be true, how would one see if that's the case, as
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
is the data I'd be looking for?
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try running RADOS bench and
smalliobench at a few different sizes.
I already did the local tests, as in w/o Ceph, see the original mail below.
And you might recall me doing rados benches as well in another thread
2 weeks ago or so.
---
# time ceph tell osd.0 bench
{ "bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": "247102026.000000"}
real 0m4.483s
---
This is quite a bit slower than this particular SSD (200GB DC 3700)
should be able to write, but I will let that slide.
Note that it is the journal SSD that gets under pressure here (nearly
900% util) while the OSD is bored at around 15%. Which is no surprise,
as it can write data at up to 1600MB/s.
---
# time ceph tell osd.0 bench 1073741824 4096
{ "bytes_written": 1073741824,
"blocksize": 4096,
"bytes_per_sec": "9004316.000000"}
real 1m59.368s
---
Here we get a more balanced picture between journal and storage
utilization, hovering around 40-50%.
So clearly not overtaxing either component.
But yet, this looks like 2100 IOPS to me, if my math is half right.
---
Total time run: 30.912786
Total writes made: 44490
Write size: 4096
Bandwidth (MB/sec): 5.622
Stddev Bandwidth: 3.31452
Max bandwidth (MB/sec): 9.92578
Min bandwidth (MB/sec): 0
Average Latency: 0.0444653
Stddev Latency: 0.121887
Max latency: 2.80917
Min latency: 0.001958
---
So this is even worse, just about 1500 IOPS.
Regards,
Christian
Post by Gregory Farnum
-Greg
Post by Alexandre DERUMIER
Hi Christian,
Have you tried without RAID6, to have more OSDs?
(how many disks do you have behind the RAID6?)
Also, I know that direct I/O can be quite slow with Ceph,
maybe you can try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Gregory Farnum" <greg at inktank.com>,
ceph-users at lists.ceph.com
Sent: Thursday, May 8, 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each.
The journals are on (separate) DC 3700s, the actual OSDs are
RAID6 behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no
surprise there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on
atop, but the system is not CPU or otherwise resource starved
at that moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for
the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network
is IPoIB with the associated low latency and the journal SSDs
are the (consistently) fastest ones around.
I guess what I am wondering about is if this is normal and to
be expected or if not where all that potential performance got
lost.
Hmm, with 128 IOs at a time (I believe I'm reading that
correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would
be 256 IOs at a time, coming from different hosts over different
links and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side
objecter settings are; it might be limiting you to fewer
outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel
and bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is
very much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would
be interesting; there's some performance work that should
reduce latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a
wider array of experiments to try and figure out where it's
coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and still isn't that laggy:
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience
with IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local
fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not
really sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Alexandre DERUMIER
2014-05-08 04:33:51 UTC
Permalink
Hi Christian,

Have you tried without RAID6, to have more OSDs?
(how many disks do you have behind the RAID6?)


Also, I know that direct I/O can be quite slow with Ceph,

maybe you can try without --direct=1

and also enable rbd_cache

ceph.conf
[client]
rbd cache = true




----- Original Message -----

From: "Christian Balzer" <chibi at gol.com>
To: "Gregory Farnum" <greg at inktank.com>, ceph-users at lists.ceph.com
Sent: Thursday, May 8, 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Post by Gregory Farnum
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6 behind
an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the journals is
about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after
some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop, but
the system is not CPU or otherwise resource starved at that moment.
Running multiple instances of this test from several VMs on different
hosts changes nothing, as in the aggregated IOPS for the whole cluster
will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is IPoIB
with the associated low latency and the journal SSDs are the
(consistently) fastest ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would be 256
IOs at a time, coming from different hosts over different links and
nothing changes.
Post by Gregory Farnum
that's about 40ms of latency per op (for userspace RBD), which seems
awfully long. You should check what your client-side objecter settings
are; it might be limiting you to fewer outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel and bugs
and nothing at all as far as configuration options are concerned.
Care to enlighten me where one can find those?

Also note the kernelspace (3.13 if it matters) speed, which is very much
in the same (junior league) ballpark.
Post by Gregory Farnum
If
it's available to you, testing with Firefly or even master would be
interesting; there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
Post by Gregory Farnum
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a wider
array of experiments to try and figure out where it's coming from.
I think we can rule out the network, NPtcp gives me:
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---

For comparison at about 512KB it reaches maximum throughput and still
isn't that laggy:
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
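For anyone wanting to reproduce that: the numbers are from NetPIPE's NPtcp,
which is typically started without arguments on the receiving node and
pointed at it from the sender (the hostname here is a placeholder):
---
# on the receiver
NPtcp
# on the sender
NPtcp -h <receiver-hostname>
---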

So with the network performing as well as my lengthy experience with IPoIB
led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local fio
tests.

That pretty much leaves only Ceph/RBD to look at and I'm not really sure
what experiments I should run on that. ^o^

Regards,

Christian
Post by Gregory Farnum
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Alexandre DERUMIER
2014-05-08 06:41:54 UTC
Permalink
Stupid question: is your Areca 4GB cache shared between the SSD journals and the OSDs?

or only used by the OSDs?



----- Original Message -----

From: "Christian Balzer" <chibi at gol.com>
To: ceph-users at lists.ceph.com
Sent: Thursday, May 8, 2014 08:26:33
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices


Hello,
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
I doubt that based on what I see in terms of local performance and actual
utilization figures according to iostat and atop during the tests.

But if that were to be true, how would one see if that's the case, as in
where in the plethora of data from:

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump

is the data I'd be looking for?
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try running RADOS bench and
smalliobench at a few different sizes.
I already did the local tests, as in w/o Ceph, see the original mail below.

And you might recall me doing rados benches as well in another thread 2
weeks ago or so.

In either case, osd benching gives me:
---
# time ceph tell osd.0 bench
{ "bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": "247102026.000000"}


real 0m4.483s
---
This is quite a bit slower than this particular SSD (200GB DC 3700) should
be able to write, but I will let that slide.
Note that it is the journal SSD that gets under pressure here (nearly 900%
util) while the OSD is bored at around 15%. Which is no surprise, as it
can write data at up to 1600MB/s.

at 4k blocks we see:
---
# time ceph tell osd.0 bench 1073741824 4096
{ "bytes_written": 1073741824,
"blocksize": 4096,
"bytes_per_sec": "9004316.000000"}


real 1m59.368s
---
Here we get a more balanced picture between journal and storage
utilization, hovering around 40-50%.
So clearly not overtaxing either component.
But yet, this looks like 2100 IOPS to me, if my math is half right.

Rados at 4k gives us this:
---
Total time run: 30.912786
Total writes made: 44490
Write size: 4096
Bandwidth (MB/sec): 5.622

Stddev Bandwidth: 3.31452
Max bandwidth (MB/sec): 9.92578
Min bandwidth (MB/sec): 0
Average Latency: 0.0444653
Stddev Latency: 0.121887
Max latency: 2.80917
Min latency: 0.001958
---
So this is even worse, just about 1500 IOPS.

Regards,

Christian
Post by Gregory Farnum
-Greg
On Wednesday, May 7, 2014, Alexandre DERUMIER <aderumier at odiso.com>
Post by Alexandre DERUMIER
Hi Christian,
Have you tried without RAID6, to have more OSDs?
(how many disks do you have behind the RAID6?)
Also, I know that direct I/O can be quite slow with Ceph,
maybe you can try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Gregory Farnum" <greg at inktank.com>,
ceph-users at lists.ceph.com
Sent: Thursday, May 8, 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6
behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop,
but the system is not CPU or otherwise resource starved at that
moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for the
whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is
IPoIB with the associated low latency and the journal SSDs are the
(consistently) fastest ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would be
256 IOs at a time, coming from different hosts over different links and
nothing changes.
that's about 40ms of latency per op (for userspace RBD), which seems
awfully long. You should check what your client-side objecter
settings are; it might be limiting you to fewer outstanding ops than
that.
Googling for client-side objecter gives a few hits on ceph devel and
bugs and nothing at all as far as configuration options are concerned.
Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very
much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would be
interesting; there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a wider
array of experiments to try and figure out where it's coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and still isn't that laggy:
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with
IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local fio
tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really
sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Udo Lembke
2014-05-08 15:20:59 UTC
Permalink
Hi,
I don't think that's related, but how full is your Ceph cluster? Perhaps
it has something to do with fragmentation on the XFS filesystem
(xfs_db -c frag -r device)?

Udo
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The journals
are on (separate) DC 3700s, the actual OSDs are RAID6 behind an Areca 1882
with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise there)
3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the journals is
about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after some
obvious merging).
The OSD processes are quite busy, reading well over 200% on atop, but
the system is not CPU or otherwise resource starved at that moment.
Running multiple instances of this test from several VMs on different hosts
changes nothing, as in the aggregated IOPS for the whole cluster will
still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is IPoIB
with the associated low latency and the journal SSDs are the
(consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be expected
or if not where all that potential performance got lost.
Regards,
Christian
Udo Lembke
2014-05-08 15:27:18 UTC
Permalink
Hi again,
sorry, too fast - but this can't be a problem due to your 4GB cache...

Udo
Post by Udo Lembke
Hi,
I don't think that's related, but how full is your Ceph cluster? Perhaps
it has something to do with fragmentation on the XFS filesystem
(xfs_db -c frag -r device)?
Udo
Christian Balzer
2014-05-08 15:34:29 UTC
Permalink
Hello,
Post by Udo Lembke
Hi,
I don't think that's related, but how full is your Ceph cluster? Perhaps
it has something to do with fragmentation on the XFS filesystem
(xfs_db -c frag -r device)?
As I wrote, this cluster will go into production next week, so it's
neither full nor fragmented.
I'd also think any severe fragmentation would show up in high device
utilization, something I stated that's not present.

In fact after all the initial testing I did defrag the OSDs a few days ago,
not that they actually needed it.
Because for starters it is ext4, not xfs, see:
https://www.mail-archive.com/ceph-users at lists.ceph.com/msg08619.html
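For ext4 the rough equivalent check would be something like the following,
assuming e2fsprogs is installed; the mount point and device are illustrative
and need to match the actual OSD layout:
---
e4defrag -c /var/lib/ceph/osd/ceph-0   # report-only fragmentation score
e2freefrag /dev/sda1                   # free-space fragmentation of the device
---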

For what it's worth, I never got an answer to the actual question in that
mail.

Christian
Post by Udo Lembke
Udo
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6 behind
an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the journals is
about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after
some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop, but
the system is not CPU or otherwise resource starved at that moment.
Running multiple instances of this test from several VMs on different
hosts changes nothing, as in the aggregated IOPS for the whole cluster
will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is IPoIB
with the associated low latency and the journal SSDs are the
(consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Regards,
Christian
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Alexandre DERUMIER
2014-05-13 09:33:27 UTC
Permalink
Hi Christian,

I'm going to test a full SSD cluster in the coming months;
I'll send the results to the list.


Have you tried using 1 OSD per physical disk (without RAID6)?

Maybe there is a bottleneck in the OSD daemon,
and using one OSD daemon per disk could help.
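(I mean something roughly along these lines, one OSD per data disk with the
journal on an SSD partition - host and device names are of course made up:)
---
ceph-deploy osd create node1:sdb:/dev/sdc1
ceph-deploy osd create node1:sdd:/dev/sdc2
---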




----- Mail original -----

De: "Christian Balzer" <chibi at gol.com>
À: ceph-users at lists.ceph.com
Envoyé: Mardi 13 Mai 2014 11:03:47
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices


I'm clearly talking to myself, but whatever.

For Greg, I've played with all the pertinent journal and filestore options
and TCP nodelay, no changes at all.
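(For reference, the kind of knobs I mean - the values shown are just examples
of what I tried, not recommendations:)
---
[osd]
journal max write entries = 10000
journal queue max ops = 50000
filestore queue max ops = 5000
filestore op threads = 8
osd op threads = 8
ms tcp nodelay = true   # the default, toggled for good measure
---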

Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front of a
RAID/JBODs or using SSDs for final storage?

If so, what results do you get out of the fio statement below per OSD?
In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD, which
is of course vastly faster than the normal individual HDDs could do.

So I'm wondering if I'm hitting some inherent limitation of how fast a
single OSD (as in the software) can handle IOPS, given that everything else
has been ruled out from where I stand.

This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while in my
test case the only time the caching becomes noticeable is if I increase
the cache size to something larger than the test data size. ^o^
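(That is, something along these lines on the client side - the 512MB value is
only an illustration, i.e. larger than the 400MB test file:)
---
[client]
rbd cache = true
rbd cache size = 536870912
---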

On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.

Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
In the a "picture" being worth a thousand words tradition, I give you
avg-cpu:  %user  %nice %system %iowait %steal  %idle
          50.82   0.00   19.43    0.17   0.00  29.58
Device: rrqm/s wrqm/s  r/s     w/s    rkB/s    wkB/s  avgrq-sz avgqu-sz await r_await w_await svctm %util
sda       0.00  51.50 0.00  1633.50    0.00  7460.00      9.13     0.18  0.11    0.00    0.11  0.01  1.40
sdb       0.00   0.00 0.00  1240.50    0.00  5244.00      8.45     0.30  0.25    0.00    0.25  0.02  2.00
sdc       0.00   5.00 0.00  2468.50    0.00 13419.00     10.87     0.24  0.10    0.00    0.10  0.09 22.00
sdd       0.00   6.50 0.00  1913.00    0.00 10313.00     10.78     0.20  0.10    0.00    0.10  0.09 16.60
The %user CPU utilization is pretty much entirely the 2 OSD processes,
note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and service
times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network results
below is that the latency happens within the OSD processes.
Regards,
Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try runnin RADOS bench and
smalliobench at a few different sizes.
-Greg
On Wednesday, May 7, 2014, Alexandre DERUMIER <aderumier at odiso.com>
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com <javascript:;>>
?: "Gregory Farnum" <greg at inktank.com <javascript:;>>,
ceph-users at lists.ceph.com <javascript:;>
Envoy?: Jeudi 8 Mai 2014 04:49:16
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com<javascript:;>>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6
behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise
there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop,
but the system is not CPU or otherwise resource starved at that
moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for
the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is
IPoIB with the associated low latency and the journal SSDs are
the (consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would be
256 IOs at a time, coming from different hosts over different links
and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side objecter
settings are; it might be limiting you to fewer outstanding ops
than that.
Googling for client-side objecter gives a few hits on ceph devel and
bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very
much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would be
interesting; there's some performance work that should reduce
latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a
wider array of experiments to try and figure out where it's coming
from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and still
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience with
IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local
fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really
sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com <javascript:;> Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com <javascript:;>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Christian Balzer
2014-05-13 09:51:37 UTC
Permalink
Hello,
Post by Alexandre DERUMIER
Hi Christian,
I'm going to test a full ssd cluster in coming months,
I'll send result on the mailing.
Looking forward to that.
Post by Alexandre DERUMIER
Do you have tried to use 1 osd by physical disk ? (without raid6)
No; if you look back to last December's "Sanity check..." thread
by me, it gives the reasons.
In short: highest density (thus a replication of 2, made safe by basing it
on RAID6) and operational maintainability (it is a remote data
center, so replacing broken disks is a pain).

That cluster is fast enough for my purposes and that fio test isn't a
typical load for it when it goes into production.
But for designing a general purpose or high performance Ceph cluster in
the future I'd really love to have this mystery solved.
Post by Alexandre DERUMIER
Maybe they are bottleneck in osd daemon,
and using osd daemon by disk could help.
It might, but at the IOPS I'm seeing, anybody using SSDs for file storage
should have screamed out already.
Also, given the CPU usage I'm seeing during that test run, such a setup
would probably require 32+ cores.

Christian
Post by Alexandre DERUMIER
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com>
À: ceph-users at lists.ceph.com
Envoyé: Mardi 13 Mai 2014 11:03:47
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
I'm clearly talking to myself, but whatever.
For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.
Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front of a
RAID/JBODs or using SSDs for final storage?
If so, what results do you get out of the fio statement below per OSD?
In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD,
which is of course vastly faster than the normal indvidual HDDs could
do.
So I'm wondering if I'm hitting some inherent limitation of how fast a
single OSD (as in the software) can handle IOPS, given that everything
else has been ruled out from where I stand.
This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while in my
test case the only time the caching becomes noticeable is if I increase
the cache size to something larger than the test data size. ^o^
On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.
Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
In the a "picture" being worth a thousand words tradition, I give you
avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 51.50 0.00 1633.50 0.00 7460.00
9.13 0.18 0.11 0.00 0.11 0.01 1.40 sdb
0.00 0.00 0.00 1240.50 0.00 5244.00 8.45 0.30
0.25 0.00 0.25 0.02 2.00 sdc 0.00 5.00
0.00 2468.50 0.00 13419.00 10.87 0.24 0.10 0.00
0.10 0.09 22.00 sdd 0.00 6.50 0.00 1913.00
0.00 10313.00 10.78 0.20 0.10 0.00 0.10 0.09 16.60
The %user CPU utilization is pretty much entirely the 2 OSD processes,
note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and service
times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network
results below is that the latency happens within the OSD processes.
Regards,
Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try runnin RADOS bench and
smalliobench at a few different sizes.
-Greg
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com <javascript:;>>
?: "Gregory Farnum" <greg at inktank.com <javascript:;>>,
ceph-users at lists.ceph.com <javascript:;>
Envoy?: Jeudi 8 Mai 2014 04:49:16
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com<javascript:;>>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each.
The journals are on (separate) DC 3700s, the actual OSDs are
RAID6 behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no
surprise there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on
atop, but the system is not CPU or otherwise resource starved
at that moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for
the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network
is IPoIB with the associated low latency and the journal SSDs
are the (consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to
be expected or if not where all that potential performance got
lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would
be 256 IOs at a time, coming from different hosts over different
links and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side
objecter settings are; it might be limiting you to fewer
outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel
and bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is
very much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would
be interesting ? there's some performance work that should
reduce latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a
wider array of experiments to try and figure out where it's
coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience
with IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local
fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not
really sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com <javascript:;> Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com <javascript:;>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Alexandre DERUMIER
2014-05-13 11:36:49 UTC
Permalink
Post by Alexandre DERUMIER
Post by Christian Balzer
It might, but at the IOPS I'm seeing anybody using SSD for file storage
should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a setup
would probably require 32+ cores.
Just found this:

https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf

page12:

" Note: As of Ceph Dumpling release (10/2013), a per-OSD read performance is approximately 4,000 IOPS and a per node limit of around
35,000 IOPS when doing reads directly from pagecache. This appears to indicate that Ceph can make good use of spinning disks for data
storage and may benefit from SSD backed OSDs, though may also be limited on high performance SSDs."


Maybe Inktank could comment on the 4000 IOPS per OSD?
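(Maybe someone could also measure it directly; something like this should show
the per-OSD limit - total size, block size and pool are just examples:)
---
# single-OSD write bench: 1GB total in 4KB writes
ceph tell osd.0 bench 1073741824 4096
# 4KB writes against a pool, 64 in flight
rados bench -p rbd 30 write -b 4096 -t 64 --no-cleanup
---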


----- Mail original -----

De: "Christian Balzer" <chibi at gol.com>
À: ceph-users at lists.ceph.com
Cc: "Alexandre DERUMIER" <aderumier at odiso.com>
Envoyé: Mardi 13 Mai 2014 11:51:37
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices


Hello,
Post by Alexandre DERUMIER
Hi Christian,
I'm going to test a full ssd cluster in coming months,
I'll send result on the mailing.
Looking forward to that.
Post by Alexandre DERUMIER
Do you have tried to use 1 osd by physical disk ? (without raid6)
No, if you look back to the last year December "Sanity check..." thread
by me, it gives the reasons.
In short, highest density (thus replication of 2 and to make that safe
based on RAID6) and operational maintainability (it is a remote data
center, so replacing broken disks is a pain).

That cluster is fast enough for my purposes and that fio test isn't a
typical load for it when it goes into production.
But for designing a general purpose or high performance Ceph cluster in
the future I'd really love to have this mystery solved.
Post by Alexandre DERUMIER
Maybe they are bottleneck in osd daemon,
and using osd daemon by disk could help.
It might, but at the IOPS I'm seeing anybody using SSD for file storage
should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a setup
would probably require 32+ cores.

Christian
Post by Alexandre DERUMIER
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com>
?: ceph-users at lists.ceph.com
Envoy?: Mardi 13 Mai 2014 11:03:47
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing
devices
I'm clearly talking to myself, but whatever.
For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.
Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front of a
RAID/JBODs or using SSDs for final storage?
If so, what results do you get out of the fio statement below per OSD?
In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD,
which is of course vastly faster than the normal indvidual HDDs could
do.
So I'm wondering if I'm hitting some inherent limitation of how fast a
single OSD (as in the software) can handle IOPS, given that everything
else has been ruled out from where I stand.
This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while in my
test case the only time the caching becomes noticeable is if I increase
the cache size to something larger than the test data size. ^o^
On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.
Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
In the a "picture" being worth a thousand words tradition, I give you
avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 51.50 0.00 1633.50 0.00 7460.00
9.13 0.18 0.11 0.00 0.11 0.01 1.40 sdb
0.00 0.00 0.00 1240.50 0.00 5244.00 8.45 0.30
0.25 0.00 0.25 0.02 2.00 sdc 0.00 5.00
0.00 2468.50 0.00 13419.00 10.87 0.24 0.10 0.00
0.10 0.09 22.00 sdd 0.00 6.50 0.00 1913.00
0.00 10313.00 10.78 0.20 0.10 0.00 0.10 0.09 16.60
The %user CPU utilization is pretty much entirely the 2 OSD processes,
note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and service
times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network
results below is that the latency happens within the OSD processes.
Regards,
Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try runnin RADOS bench and
smalliobench at a few different sizes.
-Greg
On Wednesday, May 7, 2014, Alexandre DERUMIER <aderumier at odiso.com>
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com <javascript:;>>
?: "Gregory Farnum" <greg at inktank.com <javascript:;>>,
ceph-users at lists.ceph.com <javascript:;>
Envoy?: Jeudi 8 Mai 2014 04:49:16
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com<javascript:;>>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each.
The journals are on (separate) DC 3700s, the actual OSDs are
RAID6 behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no
surprise there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on
atop, but the system is not CPU or otherwise resource starved
at that moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for
the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network
is IPoIB with the associated low latency and the journal SSDs
are the (consistently) fasted ones around.
I guess what I am wondering about is if this is normal and to
be expected or if not where all that potential performance got
lost.
Hmm, with 128 IOs at a time (I believe I'm reading that
correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would
be 256 IOs at a time, coming from different hosts over different
links and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side
objecter settings are; it might be limiting you to fewer
outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel
and bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is
very much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master would
be interesting ? there's some performance work that should
reduce latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a
wider array of experiments to try and figure out where it's
coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience
with IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local
fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not
really sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com <javascript:;> Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com <javascript:;>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Christian Balzer
2014-05-13 12:38:57 UTC
Permalink
Hello,
Post by Alexandre DERUMIER
Post by Alexandre DERUMIER
Post by Christian Balzer
It might, but at the IOPS I'm seeing anybody using SSD for file
storage should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a setup
would probably require 32+ cores.
https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf
That's an interesting find indeed.

The CPU to OSD chart clearly assumes the OSD to be backed by spinning rust
or doing 4MB block transactions.
As stated before, at the 4KB blocksize below one OSD eats up slightly over
2 cores on the 4332HE at full speed.
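(Easy enough to see by just watching the OSD processes during the run, e.g.:)
---
top -p $(pgrep -d, ceph-osd)
---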
Post by Alexandre DERUMIER
" Note: As of Ceph Dumpling release (10/2013), a per-OSD read
performance is approximately 4,000 IOPS and a per node limit of around
35,000 IOPS when doing reads directly from pagecache. This appears to
indicate that Ceph can make good use of spinning disks for data storage
and may benefit from SSD backed OSDs, though may also be limited on high
performance SSDs."
Note that this is a read test and, like nearly all IOPS statements, utterly
worthless unless qualified by things such as block size, working set size and
type of I/O (random or sequential).

For what it's worth, my cluster gives me 4100 IOPS with the sequential fio
run below and 7200 when doing random reads (go figure). Of course I made
sure these came from the pagecache of the storage nodes, no disk I/O
reported at all and the CPUs used just 1 core per OSD.
---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64
---


Christian
Post by Alexandre DERUMIER
Maybe Intank could comment about the 4000iops by osd ?
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com>
À: ceph-users at lists.ceph.com
Cc: "Alexandre DERUMIER" <aderumier at odiso.com>
Envoyé: Mardi 13 Mai 2014 11:51:37
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Hello,
Post by Alexandre DERUMIER
Hi Christian,
I'm going to test a full ssd cluster in coming months,
I'll send result on the mailing.
Looking forward to that.
Post by Alexandre DERUMIER
Do you have tried to use 1 osd by physical disk ? (without raid6)
No, if you look back to the last year December "Sanity check..." thread
by me, it gives the reasons.
In short, highest density (thus replication of 2 and to make that safe
based on RAID6) and operational maintainability (it is a remote data
center, so replacing broken disks is a pain).
That cluster is fast enough for my purposes and that fio test isn't a
typical load for it when it goes into production.
But for designing a general purpose or high performance Ceph cluster in
the future I'd really love to have this mystery solved.
Post by Alexandre DERUMIER
Maybe they are bottleneck in osd daemon,
and using osd daemon by disk could help.
It might, but at the IOPS I'm seeing anybody using SSD for file storage
should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a setup
would probably require 32+ cores.
Christian
Post by Alexandre DERUMIER
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com>
?: ceph-users at lists.ceph.com
Envoy?: Mardi 13 Mai 2014 11:03:47
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
I'm clearly talking to myself, but whatever.
For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.
Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front of
a RAID/JBODs or using SSDs for final storage?
If so, what results do you get out of the fio statement below per OSD?
In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD,
which is of course vastly faster than the normal indvidual HDDs could
do.
So I'm wondering if I'm hitting some inherent limitation of how fast a
single OSD (as in the software) can handle IOPS, given that everything
else has been ruled out from where I stand.
This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while in
my test case the only time the caching becomes noticeable is if I
increase the cache size to something larger than the test data size.
^o^
On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.
Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
In the a "picture" being worth a thousand words tradition, I give
avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 51.50 0.00 1633.50 0.00 7460.00
9.13 0.18 0.11 0.00 0.11 0.01 1.40 sdb
0.00 0.00 0.00 1240.50 0.00 5244.00 8.45 0.30
0.25 0.00 0.25 0.02 2.00 sdc 0.00 5.00
0.00 2468.50 0.00 13419.00 10.87 0.24 0.10 0.00
0.10 0.09 22.00 sdd 0.00 6.50 0.00 1913.00
0.00 10313.00 10.78 0.20 0.10 0.00 0.10 0.09 16.60
The %user CPU utilization is pretty much entirely the 2 OSD
processes, note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and service
times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network
results below is that the latency happens within the OSD processes.
Regards,
Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at
a variety of block sizes. You could also try runnin RADOS bench
and smalliobench at a few different sizes.
-Greg
On Wednesday, May 7, 2014, Alexandre DERUMIER
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com <javascript:;>>
?: "Gregory Farnum" <greg at inktank.com <javascript:;>>,
ceph-users at lists.ceph.com <javascript:;>
Envoy?: Jeudi 8 Mai 2014 04:49:16
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com<javascript:;>>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs
each. The journals are on (separate) DC 3700s, the actual
OSDs are RAID6 behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no
surprise there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at
2% (1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on
atop, but the system is not CPU or otherwise resource
starved at that moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS
for the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the
network is IPoIB with the associated low latency and the
journal SSDs are the (consistently) fasted ones around.
I guess what I am wondering about is if this is normal and
to be expected or if not where all that potential
performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that
would be 256 IOs at a time, coming from different hosts over
different links and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side
objecter settings are; it might be limiting you to fewer
outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel
and bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is
very much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master
would be interesting ? there's some performance work that
should reduce latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph
cluster certainly doesn't require 40ms/op, so you should
probably run a wider array of experiments to try and figure
out where it's coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience
with IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the
local fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not
really sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com <javascript:;> Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com <javascript:;>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Alexandre DERUMIER
2014-05-13 14:09:28 UTC
Permalink
Post by Alexandre DERUMIER
Post by Christian Balzer
For what it's worth, my cluster gives me 4100 IOPS with the sequential fio
run below and 7200 when doing random reads (go figure). Of course I made
sure these came come the pagecache of the storage nodes, no disk I/O
reported at all and the CPUs used just 1 core per OSD.
---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64
---
This seems pretty low;

I can get around 6000 IOPS seq or rand read
with a pretty old cluster:

3-node cluster (replication x3), firefly, kernel 3.10, xfs, no tuning in ceph.conf

each node:
----------
-2x quad-core Xeon E5430 @ 2.66GHz
-4 OSDs, Seagate 7.2k SAS (with 512MB cache on the controller), journal on the same disk as the OSD, no dedicated SSD
-2 gigabit link (lacp)
-switch cisco 2960



each OSD process is at around 30% of one core during the benchmark,
no disk access (pagecache on the ceph nodes)



sequential
----------
# fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64 --filename=/dev/vdb
fiojob: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
2.0.8
Starting 1 process
Jobs: 1 (f=1): [R] [100.0% done] [23968K/0K /s] [5992 /0 iops] [eta 00m:00s]
fiojob: (groupid=0, jobs=1): err= 0: pid=4158
read : io=409600KB, bw=22256KB/s, iops=5564 , runt= 18404msec
slat (usec): min=3 , max=1124 , avg=12.03, stdev=12.72
clat (msec): min=1 , max=405 , avg=11.48, stdev=12.10
lat (msec): min=1 , max=405 , avg=11.50, stdev=12.10
clat percentiles (msec):
| 1.00th=[ 5], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10],
| 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 15],
| 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 206], 99.95th=[ 404],
| 99.99th=[ 404]
bw (KB/s) : min= 7542, max=24720, per=100.00%, avg=22321.06, stdev=3341.21
lat (msec) : 2=0.04%, 4=0.60%, 10=21.40%, 20=77.54%, 50=0.23%
lat (msec) : 250=0.13%, 500=0.06%
cpu : usr=3.76%, sys=10.32%, ctx=45280, majf=0, minf=88
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0


Run status group 0 (all jobs):
READ: io=409600KB, aggrb=22256KB/s, minb=22256KB/s, maxb=22256KB/s, mint=18404msec, maxt=18404msec


Disk stats (read/write):
vdb: ios=101076/0, merge=0/0, ticks=1157172/0, in_queue=1157380, util=99.58%


random read
-----------
# fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=rand-read --name=fiojob --blocksize=4k --iodepth=64 --filename=/dev/vdb
valid values: read Sequential read
: write Sequential write
: randread Random read
: randwrite Random write
: rw Sequential read and write mix
: readwrite Sequential read and write mix
: randrw Random read and write mix


fio: failed parsing rw=rand-read
fiojob: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
2.0.8
Starting 1 process
Jobs: 1 (f=1): [R] [94.7% done] [23752K/0K /s] [5938 /0 iops] [eta 00m:01s]
fiojob: (groupid=0, jobs=1): err= 0: pid=4172
read : io=409600KB, bw=22887KB/s, iops=5721 , runt= 17897msec
slat (usec): min=3 , max=929 , avg=11.75, stdev=11.38
clat (msec): min=1 , max=407 , avg=11.17, stdev= 9.24
lat (msec): min=1 , max=407 , avg=11.18, stdev= 9.24
clat percentiles (msec):
| 1.00th=[ 6], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10],
| 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 14],
| 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 60], 99.95th=[ 359],
| 99.99th=[ 404]
bw (KB/s) : min= 8112, max=25120, per=100.00%, avg=22967.77, stdev=2657.48
lat (msec) : 2=0.05%, 4=0.46%, 10=22.83%, 20=76.34%, 50=0.21%
lat (msec) : 100=0.05%, 250=0.01%, 500=0.06%
cpu : usr=4.14%, sys=10.01%, ctx=44760, majf=0, minf=88
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0


Run status group 0 (all jobs):
READ: io=409600KB, aggrb=22886KB/s, minb=22886KB/s, maxb=22886KB/s, mint=17897msec, maxt=17897msec


Disk stats (read/write):
vdb: ios=100981/0, merge=0/0, ticks=1124768/0, in_queue=1125492, util=99.57%






MonSiteEstLent.com - Blog dédié à la webperformance et la gestion de pics de trafic

----- Mail original -----

De: "Christian Balzer" <chibi at gol.com>
?: "Alexandre DERUMIER" <aderumier at odiso.com>
Cc: ceph-users at lists.ceph.com
Envoy?: Mardi 13 Mai 2014 14:38:57
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices


Hello,
Post by Alexandre DERUMIER
Post by Christian Balzer
Post by Christian Balzer
It might, but at the IOPS I'm seeing anybody using SSD for file
storage should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a setup
would probably require 32+ cores.
https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf
That's and interesting find indeed.

The CPU to OSD chart clearly assumes the OSD to be backed by spinning rust
or doing 4MB block transactions.
As stated before, at the 4KB blocksize below one OSD eats up slightly over
2 cores on the 4332HE at full speed.
Post by Alexandre DERUMIER
" Note: As of Ceph Dumpling release (10/2013), a per-OSD read
performance is approximately 4,000 IOPS and a per node limit of around
35,000 IOPS when doing reads directly from pagecache. This appears to
indicate that Ceph can make good use of spinning disks for data storage
and may benefit from SSD backed OSDs, though may also be limited on high
performance SSDs."
Node that this a read test and like nearly all IOPS statements utterly
worthless unless qualified by things as block size, working set size, type
of I/O (random or sequential).

For what it's worth, my cluster gives me 4100 IOPS with the sequential fio
run below and 7200 when doing random reads (go figure). Of course I made
sure these came come the pagecache of the storage nodes, no disk I/O
reported at all and the CPUs used just 1 core per OSD.
---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64
---


Christian
Post by Alexandre DERUMIER
Maybe Intank could comment about the 4000iops by osd ?
Alexandre Derumier Ingénieur système et stockage Fixe : 03 20 68 90 88 Fax : 03 20 68 90 81 45 Bvd du Général Leclerc 59100 Roubaix 12 rue Marivaux 75002 Paris MonSiteEstLent.com - Blog dédié à la webperformance et la gestion de pics de trafic ----- Mail original -----
Post by Alexandre DERUMIER
De: "Christian Balzer" <chibi at gol.com>
?: ceph-users at lists.ceph.com
Cc: "Alexandre DERUMIER" <aderumier at odiso.com>
Envoy?: Mardi 13 Mai 2014 11:51:37
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing
devices
Hello,
Post by Christian Balzer
Hi Christian,
I'm going to test a full ssd cluster in coming months,
I'll send result on the mailing.
Looking forward to that.
Post by Christian Balzer
Do you have tried to use 1 osd by physical disk ? (without raid6)
No, if you look back to the last year December "Sanity check..." thread
by me, it gives the reasons.
In short, highest density (thus replication of 2 and to make that safe
based on RAID6) and operational maintainability (it is a remote data
center, so replacing broken disks is a pain).
That cluster is fast enough for my purposes and that fio test isn't a
typical load for it when it goes into production.
But for designing a general purpose or high performance Ceph cluster in
the future I'd really love to have this mystery solved.
Post by Christian Balzer
Maybe they are bottleneck in osd daemon,
and using osd daemon by disk could help.
It might, but at the IOPS I'm seeing anybody using SSD for file storage
should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a setup
would probably require 32+ cores.
Christian
Post by Christian Balzer
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com>
?: ceph-users at lists.ceph.com
Envoy?: Mardi 13 Mai 2014 11:03:47
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
I'm clearly talking to myself, but whatever.
For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.
Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front of
a RAID/JBODs or using SSDs for final storage?
If so, what results do you get out of the fio statement below per OSD?
In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD,
which is of course vastly faster than the normal indvidual HDDs could
do.
So I'm wondering if I'm hitting some inherent limitation of how fast a
single OSD (as in the software) can handle IOPS, given that everything
else has been ruled out from where I stand.
This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while in
my test case the only time the caching becomes noticeable is if I
increase the cache size to something larger than the test data size.
^o^
On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.
Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
In the a "picture" being worth a thousand words tradition, I give
avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 51.50 0.00 1633.50 0.00 7460.00
9.13 0.18 0.11 0.00 0.11 0.01 1.40 sdb
0.00 0.00 0.00 1240.50 0.00 5244.00 8.45 0.30
0.25 0.00 0.25 0.02 2.00 sdc 0.00 5.00
0.00 2468.50 0.00 13419.00 10.87 0.24 0.10 0.00
0.10 0.09 22.00 sdd 0.00 6.50 0.00 1913.00
0.00 10313.00 10.78 0.20 0.10 0.00 0.10 0.09 16.60
The %user CPU utilization is pretty much entirely the 2 OSD
processes, note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and service
times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network
results below is that the latency happens within the OSD processes.
Regards,
Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at
a variety of block sizes. You could also try runnin RADOS bench
and smalliobench at a few different sizes.
-Greg
On Wednesday, May 7, 2014, Alexandre DERUMIER
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com <javascript:;>>
?: "Gregory Farnum" <greg at inktank.com <javascript:;>>,
ceph-users at lists.ceph.com <javascript:;>
Envoy?: Jeudi 8 Mai 2014 04:49:16
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com<javascript:;>>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs
each. The journals are on (separate) DC 3700s, the actual
OSDs are RAID6 behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no
surprise there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at
2% (1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on
atop, but the system is not CPU or otherwise resource
starved at that moment.
Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS
for the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the
network is IPoIB with the associated low latency and the
journal SSDs are the (consistently) fasted ones around.
I guess what I am wondering about is if this is normal and
to be expected or if not where all that potential
performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that
correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that
would be 256 IOs at a time, coming from different hosts over
different links and nothing changes.
that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side
objecter settings are; it might be limiting you to fewer
outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph devel
and bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is
very much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master
would be interesting ? there's some performance work that
should reduce latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph
cluster certainly doesn't require 40ms/op, so you should
probably run a wider array of experiments to try and figure
out where it's coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy experience
with IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the
local fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not
really sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com <javascript:;> Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com <javascript:;>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Christian Balzer
2014-05-13 14:39:58 UTC
Permalink
Post by Alexandre DERUMIER
Post by Christian Balzer
For what it's worth, my cluster gives me 4100 IOPS with the sequential
fio run below and 7200 when doing random reads (go figure). Of course
I made sure these came come the pagecache of the storage nodes, no
disk I/O reported at all and the CPUs used just 1 core per OSD.
---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64 ---
This seem pretty low,
I can get around 6000iops seq or rand read,
Actually, check your random read output again; you gave it the wrong
parameter, it needs to be randread, not rand-read.
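I.e. for the random case the run should have been:
---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randread --name=fiojob --blocksize=4k --iodepth=64 --filename=/dev/vdb
---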
Post by Alexandre DERUMIER
with a pretty old cluster
Your cluster isn't that old (the CPUs are in the same ballpark) and has 12
OSDs instead of my 4. Plus it has the supposedly faster firefly. ^o^

Remember, all this is coming from RAM, so what it boils down to is CPU
(memory and bus transfer speeds) and of course your network.
Which is probably why your cluster isn't all that much faster than mine.

Either way, that number isn't anywhere near 4000 read IOPS per OSD either:
yours is about 500 (6000/12), mine about 1000 (4100/4)...

Christian
Post by Alexandre DERUMIER
3 nodes cluster (replication x3), firefly, kernel 3.10, xfs, no tuning in ceph.conf
----------
-4 osd, seageate 7,2k sas (with 512MB cache on controller). (journal
on same disk than osd, no dedicated ssd) -2 gigabit link (lacp)
-switch cisco 2960
each osd process are around 30% 1core during benchmark
no disk access (pagecache on ceph nodes)
sequential
----------
# fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64
--filename=/dev/vdb fiojob: (g=0): rw=read, bs=4K-4K/4K-4K,
ioengine=libaio, iodepth=64 2.0.8 Starting 1 process
Jobs: 1 (f=1): [R] [100.0% done] [23968K/0K /s] [5992 /0 iops] [eta
00m:00s] fiojob: (groupid=0, jobs=1): err= 0: pid=4158
read : io=409600KB, bw=22256KB/s, iops=5564 , runt= 18404msec
slat (usec): min=3 , max=1124 , avg=12.03, stdev=12.72
clat (msec): min=1 , max=405 , avg=11.48, stdev=12.10
lat (msec): min=1 , max=405 , avg=11.50, stdev=12.10
| 1.00th=[ 5], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10],
| 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 15],
| 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 206], 99.95th=[ 404],
| 99.99th=[ 404]
bw (KB/s) : min= 7542, max=24720, per=100.00%, avg=22321.06,
stdev=3341.21 lat (msec) : 2=0.04%, 4=0.60%, 10=21.40%, 20=77.54%,
50=0.23% lat (msec) : 250=0.13%, 500=0.06%
cpu : usr=3.76%, sys=10.32%, ctx=45280, majf=0, minf=88
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%,
=64=0.0% issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0
READ: io=409600KB, aggrb=22256KB/s, minb=22256KB/s, maxb=22256KB/s,
mint=18404msec, maxt=18404msec
vdb: ios=101076/0, merge=0/0, ticks=1157172/0, in_queue=1157380, util=99.58%
random read
-----------
# fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=rand-read --name=fiojob --blocksize=4k --iodepth=64
--filename=/dev/vdb valid values: read Sequential read : write
Sequential write : randread Random read
: randwrite Random write
: rw Sequential read and write mix
: readwrite Sequential read and write mix
: randrw Random read and write mix
fio: failed parsing rw=rand-read
fiojob: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
2.0.8
Starting 1 process
Jobs: 1 (f=1): [R] [94.7% done] [23752K/0K /s] [5938 /0 iops] [eta
00m:01s] fiojob: (groupid=0, jobs=1): err= 0: pid=4172
read : io=409600KB, bw=22887KB/s, iops=5721 , runt= 17897msec
slat (usec): min=3 , max=929 , avg=11.75, stdev=11.38
clat (msec): min=1 , max=407 , avg=11.17, stdev= 9.24
lat (msec): min=1 , max=407 , avg=11.18, stdev= 9.24
| 1.00th=[ 6], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10],
| 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 14],
| 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 60], 99.95th=[ 359],
| 99.99th=[ 404]
bw (KB/s) : min= 8112, max=25120, per=100.00%, avg=22967.77,
stdev=2657.48 lat (msec) : 2=0.05%, 4=0.46%, 10=22.83%, 20=76.34%,
50=0.21% lat (msec) : 100=0.05%, 250=0.01%, 500=0.06%
cpu : usr=4.14%, sys=10.01%, ctx=44760, majf=0, minf=88
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%,
=64=0.0% issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0
READ: io=409600KB, aggrb=22886KB/s, minb=22886KB/s, maxb=22886KB/s,
mint=17897msec, maxt=17897msec
vdb: ios=100981/0, merge=0/0, ticks=1124768/0, in_queue=1125492, util=99.57%
MonSiteEstLent.com - Blog d?di? ? la webperformance et la gestion de pics de trafic
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Alexandre DERUMIER" <aderumier at odiso.com>
Cc: ceph-users at lists.ceph.com
Sent: Tuesday, 13 May 2014 14:38:57
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Hello,
Post by Christian Balzer
Post by Christian Balzer
It might, but at the IOPS I'm seeing anybody using SSD for file
storage should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a
setup would probably require 32+ cores.
https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf
That's an interesting find indeed.
The CPU to OSD chart clearly assumes the OSD to be backed by spinning
rust or doing 4MB block transactions.
As stated before, at the 4KB blocksize below one OSD eats up slightly
over 2 cores on the 4332HE at full speed.
" Note: As of Ceph Dumpling release (10/2013), a per-OSD read
performance is approximately 4,000 IOPS and a per node limit of around
35,000 IOPS when doing reads directly from pagecache. This appears to
indicate that Ceph can make good use of spinning disks for data
storage and may benefit from SSD backed OSDs, though may also be
limited on high performance SSDs."
Note that this is a read test and, like nearly all IOPS statements, utterly
worthless unless qualified by things such as block size, working set size,
and type of I/O (random or sequential).
For what it's worth, my cluster gives me 4100 IOPS with the sequential
fio run below and 7200 when doing random reads (go figure). Of course I
made sure these came from the pagecache of the storage nodes, no disk
I/O reported at all and the CPUs used just 1 core per OSD.
---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
--rw=read --name=fiojob --blocksize=4k --iodepth=64 ---
Christian
Maybe Inktank could comment about the 4000 IOPS per OSD?
Alexandre Derumier, Systems and Storage Engineer. Tel: 03 20 68 90 88,
Fax: 03 20 68 90 81. 45 Bvd du Général Leclerc, 59100 Roubaix / 12 rue
Marivaux, 75002 Paris. MonSiteEstLent.com - A blog dedicated to web
performance and handling traffic spikes. ----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: ceph-users at lists.ceph.com
Cc: "Alexandre DERUMIER" <aderumier at odiso.com>
Sent: Tuesday, 13 May 2014 11:51:37
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Hello,
Post by Christian Balzer
Hi Christian,
I'm going to test a full SSD cluster in the coming months;
I'll send the results to the mailing list.
Looking forward to that.
Post by Christian Balzer
Have you tried using 1 OSD per physical disk (without RAID6)?
No, if you look back to the last year December "Sanity check..."
thread by me, it gives the reasons.
In short, highest density (thus replication of 2 and to make that safe
based on RAID6) and operational maintainability (it is a remote data
center, so replacing broken disks is a pain).
That cluster is fast enough for my purposes and that fio test isn't a
typical load for it when it goes into production.
But for designing a general purpose or high performance Ceph cluster
in the future I'd really love to have this mystery solved.
Post by Christian Balzer
Maybe there is a bottleneck in the OSD daemon,
and using one OSD daemon per disk could help.
It might, but at the IOPS I'm seeing anybody using SSD for file
storage should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a setup
would probably require 32+ cores.
Christian
Post by Christian Balzer
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: ceph-users at lists.ceph.com
Sent: Tuesday, 13 May 2014 11:03:47
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
I'm clearly talking to myself, but whatever.
For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.
Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front
of a RAID/JBODs or using SSDs for final storage?
If so, what results do you get out of the fio statement below per
OSD? In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per
OSD, which is of course vastly faster than the normal individual HDDs
could do.
So I'm wondering if I'm hitting some inherent limitation of how fast
a single OSD (as in the software) can handle IOPS, given that
everything else has been ruled out from where I stand.
This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while
in my test case the only time the caching becomes noticeable is if I
increase the cache size to something larger than the test data size.
^o^
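For reference, increasing the cache beyond the 400MB working set means a
client-side ceph.conf section along these lines; the value here is only an
illustration of that experiment, not the exact one used:
---
[client]
rbd cache = true
# default rbd cache size is 32MB; set it larger than the fio working set
rbd cache size = 536870912
---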
On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.
Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and
that's applying back pressure on the journal.
In the "a picture is worth a thousand words" tradition, I give you this iostat snapshot:
avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 51.50 0.00 1633.50 0.00 7460.00 9.13 0.18 0.11 0.00 0.11 0.01 1.40
sdb 0.00 0.00 0.00 1240.50 0.00 5244.00 8.45 0.30 0.25 0.00 0.25 0.02 2.00
sdc 0.00 5.00 0.00 2468.50 0.00 13419.00 10.87 0.24 0.10 0.00 0.10 0.09 22.00
sdd 0.00 6.50 0.00 1913.00 0.00 10313.00 10.78 0.20 0.10 0.00 0.10 0.09 16.60
The %user CPU utilization is pretty much entirely the 2 OSD
processes, note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and
service times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network
results below is that the latency happens within the OSD
processes.
Regards,
Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try
at a variety of block sizes. You could also try running RADOS
bench and smalliobench at a few different sizes.
-Greg
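For reference, those benchmarks would presumably be invoked along these
lines (pool name, sizes and thread counts are only placeholders):
---
# per-OSD write bench: 1GB total in 4KB writes
ceph tell osd.0 bench 1073741824 4096

# cluster-wide 4KB writes for 60 seconds, 64 concurrent ops,
# keeping the objects so a read bench can follow
rados bench -p rbd 60 write -b 4096 -t 64 --no-cleanup
rados bench -p rbd 60 rand -t 64
---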
On Wednesday, May 7, 2014, Alexandre DERUMIER
Post by Alexandre DERUMIER
Hi Christian,
Have you tried without RAID6, to have more OSDs?
(How many disks do you have behind the RAID6?)
Also, I know that direct I/O can be quite slow with Ceph;
maybe you can try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Gregory Farnum" <greg at inktank.com>,
ceph-users at lists.ceph.com
Sent: Thursday, 8 May 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal
and backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com<javascript:;>>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs
each. The journals are on (separate) DC 3700s, the actual
OSDs are RAID6 behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1
--direct=1 --numjobs=1 --rw=randwrite --name=fiojob
--blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no
surprise there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of
the journals is about 20% (2400 IOPS) and the OSDs are
bored at 2% (1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200%
on atop, but the system is not CPU or otherwise resource
starved at that moment.
Running multiple instances of this test from several VMs
on different hosts changes nothing, as in the aggregated
IOPS for the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the
network is IPoIB with the associated low latency and the
journal SSDs are the (consistently) fastest ones around.
I guess what I am wondering about is if this is normal and
to be expected or if not where all that potential
performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that
would be 256 IOs at a time, coming from different hosts over
different links and nothing changes.
that's about 40ms of latency per op (for userspace RBD),
which seems awfully long. You should check what your
client-side objecter settings are; it might be limiting you
to fewer outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph
devel and bugs and nothing at all as far as configuration
options are concerned. Care to enlighten me where one can find
those?
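For what it's worth, the settings in question appear to be the objecter
throttles, which would live in the [client] section of ceph.conf; the
names and defaults below are from memory and untested here:
---
[client]
# maximum number of ops the client keeps in flight (default 1024)
objecter inflight ops = 2048
# maximum bytes of in-flight data (default 100MB)
objecter inflight op bytes = 209715200
---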
Also note the kernelspace (3.13 if it matters) speed, which is
very much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master
would be interesting; there's some performance work that
should reduce latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph
cluster certainly doesn't require 40ms/op, so you should
probably run a wider array of experiments to try and figure
out where it's coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy
experience with IPoIB led me to believe, what else is there to
look at? The storage nodes perform just as expected, indicated
by the local fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not
really sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Alexandre DERUMIER
2014-05-13 15:16:25 UTC
Permalink
Post by Alexandre DERUMIER
Post by Christian Balzer
Actually check your random read output again, you gave it the wrong
parameter, it needs to be randread, not rand-read.
Oops, sorry. I got around 7500 IOPS with randread.
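The corrected invocation would then be:
---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 \
    --numjobs=1 --rw=randread --name=fiojob --blocksize=4k --iodepth=64 \
    --filename=/dev/vdb
---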
Post by Alexandre DERUMIER
Post by Christian Balzer
Your cluster isn't that old (the CPUs are in the same ballpark)
Yes, this is a 6-7 year old server (these Xeons were released in 2007...).

So it misses some features like CRC32 and SSE4, for example, which can help Ceph a lot.



I'll try to do some OSD tuning (threads, ...) to see if I can improve performance.
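For reference, the thread-related options I have in mind go into the [osd]
section of ceph.conf, along these lines (values are only examples to
experiment with, not tested recommendations):
---
[osd]
# defaults are 2 / 2 / 1
osd op threads = 8
filestore op threads = 4
osd disk threads = 2
---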


----- Original Message -----

From: "Christian Balzer" <chibi at gol.com>
To: "Alexandre DERUMIER" <aderumier at odiso.com>
Cc: ceph-users at lists.ceph.com
Sent: Tuesday, 13 May 2014 16:39:58
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Post by Alexandre DERUMIER
Post by Christian Balzer
Post by Christian Balzer
For what it's worth, my cluster gives me 4100 IOPS with the sequential
fio run below and 7200 when doing random reads (go figure). Of course
I made sure these came come the pagecache of the storage nodes, no
disk I/O reported at all and the CPUs used just 1 core per OSD.
---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64 ---
This seems pretty low;
I can get around 6000 IOPS seq or rand read,
Actually check your random read output again, you gave it the wrong
parameter, it needs to be randread, not rand-read.
Post by Alexandre DERUMIER
with a pretty old cluster
Your cluster isn't that old (the CPUs are in the same ballpark) and has 12
OSDs instead of my 4. Plus it has the supposedly faster firefly. ^o^

Remember, all this is coming from RAM, so what it boils down to is CPU
(memory and bus transfer speeds) and of course your network.
Which is probably why your cluster isn't that much faster than mine.

Either way, that number isn't anywhere near 4000 read IOPS per OSD either,
yours is about 500, mine about 1000...

Christian
Post by Alexandre DERUMIER
3-node cluster (replication x3), firefly, kernel 3.10, XFS, no tuning
in ceph.conf
----------
-4 OSDs, Seagate 7.2k SAS (with 512MB cache on the controller), journal
on the same disk as the OSD, no dedicated SSD -2 gigabit links (LACP)
-switch Cisco 2960
each OSD process is at around 30% of one core during the benchmark
no disk access (data is served from the pagecache on the Ceph nodes)
sequential
----------
# fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64
--filename=/dev/vdb fiojob: (g=0): rw=read, bs=4K-4K/4K-4K,
ioengine=libaio, iodepth=64 2.0.8 Starting 1 process
Jobs: 1 (f=1): [R] [100.0% done] [23968K/0K /s] [5992 /0 iops] [eta
00m:00s] fiojob: (groupid=0, jobs=1): err= 0: pid=4158
read : io=409600KB, bw=22256KB/s, iops=5564 , runt= 18404msec
slat (usec): min=3 , max=1124 , avg=12.03, stdev=12.72
clat (msec): min=1 , max=405 , avg=11.48, stdev=12.10
lat (msec): min=1 , max=405 , avg=11.50, stdev=12.10
| 1.00th=[ 5], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10],
| 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 15],
| 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 206], 99.95th=[ 404],
| 99.99th=[ 404]
bw (KB/s) : min= 7542, max=24720, per=100.00%, avg=22321.06,
stdev=3341.21 lat (msec) : 2=0.04%, 4=0.60%, 10=21.40%, 20=77.54%,
50=0.23% lat (msec) : 250=0.13%, 500=0.06%
cpu : usr=3.76%, sys=10.32%, ctx=45280, majf=0, minf=88
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%,
Post by Christian Balzer
=64=0.0% issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0
READ: io=409600KB, aggrb=22256KB/s, minb=22256KB/s, maxb=22256KB/s,
mint=18404msec, maxt=18404msec
vdb: ios=101076/0, merge=0/0, ticks=1157172/0, in_queue=1157380,
util=99.58%
random read
-----------
# fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=rand-read --name=fiojob --blocksize=4k --iodepth=64
--filename=/dev/vdb valid values: read Sequential read : write
Sequential write : randread Random read
: randwrite Random write
: rw Sequential read and write mix
: readwrite Sequential read and write mix
: randrw Random read and write mix
fio: failed parsing rw=rand-read
fiojob: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
2.0.8
Starting 1 process
Jobs: 1 (f=1): [R] [94.7% done] [23752K/0K /s] [5938 /0 iops] [eta
00m:01s] fiojob: (groupid=0, jobs=1): err= 0: pid=4172
read : io=409600KB, bw=22887KB/s, iops=5721 , runt= 17897msec
slat (usec): min=3 , max=929 , avg=11.75, stdev=11.38
clat (msec): min=1 , max=407 , avg=11.17, stdev= 9.24
lat (msec): min=1 , max=407 , avg=11.18, stdev= 9.24
| 1.00th=[ 6], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10],
| 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 14],
| 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 60], 99.95th=[ 359],
| 99.99th=[ 404]
bw (KB/s) : min= 8112, max=25120, per=100.00%, avg=22967.77,
stdev=2657.48 lat (msec) : 2=0.05%, 4=0.46%, 10=22.83%, 20=76.34%,
50=0.21% lat (msec) : 100=0.05%, 250=0.01%, 500=0.06%
cpu : usr=4.14%, sys=10.01%, ctx=44760, majf=0, minf=88
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%,
Post by Christian Balzer
=64=0.0% issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0
READ: io=409600KB, aggrb=22886KB/s, minb=22886KB/s, maxb=22886KB/s,
mint=17897msec, maxt=17897msec
vdb: ios=100981/0, merge=0/0, ticks=1124768/0, in_queue=1125492,
util=99.57%
MonSiteEstLent.com - A blog dedicated to web performance and handling
traffic spikes
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Alexandre DERUMIER" <aderumier at odiso.com>
Cc: ceph-users at lists.ceph.com
Sent: Tuesday, 13 May 2014 14:38:57
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing
devices
Hello,
Post by Christian Balzer
Post by Christian Balzer
Post by Christian Balzer
It might, but at the IOPS I'm seeing anybody using SSD for file
storage should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a
setup would probably require 32+ cores.
https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf
That's and interesting find indeed.
The CPU to OSD chart clearly assumes the OSD to be backed by spinning
rust or doing 4MB block transactions.
As stated before, at the 4KB blocksize below one OSD eats up slightly
over 2 cores on the 4332HE at full speed.
Post by Christian Balzer
" Note: As of Ceph Dumpling release (10/2013), a per-OSD read
performance is approximately 4,000 IOPS and a per node limit of around
35,000 IOPS when doing reads directly from pagecache. This appears to
indicate that Ceph can make good use of spinning disks for data
storage and may benefit from SSD backed OSDs, though may also be
limited on high performance SSDs."
Node that this a read test and like nearly all IOPS statements utterly
worthless unless qualified by things as block size, working set size,
type of I/O (random or sequential).
For what it's worth, my cluster gives me 4100 IOPS with the sequential
fio run below and 7200 when doing random reads (go figure). Of course I
made sure these came come the pagecache of the storage nodes, no disk
I/O reported at all and the CPUs used just 1 core per OSD.
---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
--rw=read --name=fiojob --blocksize=4k --iodepth=64 ---
Christian
Post by Christian Balzer
Maybe Intank could comment about the 4000iops by osd ?
Alexandre Derumier, Systems and Storage Engineer. Tel: 03 20 68 90 88,
Fax: 03 20 68 90 81. 45 Bvd du Général Leclerc, 59100 Roubaix / 12 rue
Marivaux, 75002 Paris. MonSiteEstLent.com - A blog dedicated to web
performance and handling traffic spikes. ----- Original Message -----
Post by Christian Balzer
From: "Christian Balzer" <chibi at gol.com>
To: ceph-users at lists.ceph.com
Cc: "Alexandre DERUMIER" <aderumier at odiso.com>
Sent: Tuesday, 13 May 2014 11:51:37
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
Hello,
Post by Christian Balzer
Hi Christian,
I'm going to test a full ssd cluster in coming months,
I'll send result on the mailing.
Looking forward to that.
Post by Christian Balzer
Do you have tried to use 1 osd by physical disk ? (without raid6)
No, if you look back to the last year December "Sanity check..."
thread by me, it gives the reasons.
In short, highest density (thus replication of 2 and to make that safe
based on RAID6) and operational maintainability (it is a remote data
center, so replacing broken disks is a pain).
That cluster is fast enough for my purposes and that fio test isn't a
typical load for it when it goes into production.
But for designing a general purpose or high performance Ceph cluster
in the future I'd really love to have this mystery solved.
Post by Christian Balzer
Maybe they are bottleneck in osd daemon,
and using osd daemon by disk could help.
It might, but at the IOPS I'm seeing anybody using SSD for file
storage should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a setup
would probably require 32+ cores.
Christian
Post by Christian Balzer
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: ceph-users at lists.ceph.com
Sent: Tuesday, 13 May 2014 11:03:47
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
I'm clearly talking to myself, but whatever.
For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.
Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front
of a RAID/JBODs or using SSDs for final storage?
If so, what results do you get out of the fio statement below per
OSD? In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per
OSD, which is of course vastly faster than the normal indvidual HDDs
could do.
So I'm wondering if I'm hitting some inherent limitation of how fast
a single OSD (as in the software) can handle IOPS, given that
everything else has been ruled out from where I stand.
This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while
in my test case the only time the caching becomes noticeable is if I
increase the cache size to something larger than the test data size.
^o^
On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.
Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and
that's applying back pressure on the journal.
In the a "picture" being worth a thousand words tradition, I give
avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 51.50 0.00 1633.50 0.00 7460.00
9.13 0.18 0.11 0.00 0.11 0.01 1.40 sdb
0.00 0.00 0.00 1240.50 0.00 5244.00 8.45 0.30
0.25 0.00 0.25 0.02 2.00 sdc 0.00 5.00
0.00 2468.50 0.00 13419.00 10.87 0.24 0.10 0.00
0.10 0.09 22.00 sdd 0.00 6.50 0.00 1913.00
0.00 10313.00 10.78 0.20 0.10 0.00 0.10 0.09 16.60
The %user CPU utilization is pretty much entirely the 2 OSD
processes, note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and
service times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network
results below is that the latency happens within the OSD
processes.
Regards,
Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try
at a variety of block sizes. You could also try runnin RADOS
bench and smalliobench at a few different sizes.
-Greg
On Wednesday, May 7, 2014, Alexandre DERUMIER
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com <javascript:;>>
?: "Gregory Farnum" <greg at inktank.com <javascript:;>>,
ceph-users at lists.ceph.com <javascript:;>
Envoy?: Jeudi 8 Mai 2014 04:49:16
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal
and backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com<javascript:;>>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs
each. The journals are on (separate) DC 3700s, the actual
OSDs are RAID6 behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1
--direct=1 --numjobs=1 --rw=randwrite --name=fiojob
--blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no
surprise there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of
the journals is about 20% (2400 IOPS) and the OSDs are
bored at 2% (1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200%
on atop, but the system is not CPU or otherwise resource
starved at that moment.
Running multiple instances of this test from several VMs
on different hosts changes nothing, as in the aggregated
IOPS for the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the
network is IPoIB with the associated low latency and the
journal SSDs are the (consistently) fasted ones around.
I guess what I am wondering about is if this is normal and
to be expected or if not where all that potential
performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that
correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that
would be 256 IOs at a time, coming from different hosts over
different links and nothing changes.
that's about 40ms of latency per op (for userspace RBD),
which seems awfully long. You should check what your
client-side objecter settings are; it might be limiting you
to fewer outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph
devel and bugs and nothing at all as far as configuration
options are concerned. Care to enlighten me where one can find
those?
Also note the kernelspace (3.13 if it matters) speed, which is
very much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master
would be interesting ? there's some performance work that
should reduce latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph
cluster certainly doesn't require 40ms/op, so you should
probably run a wider array of experiments to try and figure
out where it's coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy
experience with IPoIB led me to believe, what else is there to
look at? The storage nodes perform just as expected, indicated
by the local fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not
really sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com <javascript:;> Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com <javascript:;>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Alexandre DERUMIER
2014-05-13 16:10:25 UTC
Permalink
I have just done some tests,

with fio-rbd,
(http://telekomcloud.github.io/ceph/2014/02/26/ceph-performance-analysis_fio_rbd.html)

directly from the KVM host (not from the VM).


1 fio job: around 8000 IOPS
2 different parallel fio jobs (on different rbd volumes): around 8000 IOPS per fio job!

CPU on the client is at 100%
CPU of the OSDs is around 70% of 1 core now.


So, there seems to be a bottleneck client-side somewhere.

(I remember some tests from Stefan Priebe on this mailing list, with a full SSD cluster,
having almost the same results.)
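For reference, the fio rbd-engine job from the post linked above looks
roughly like this (pool, image and client names are placeholders; the
image has to exist beforehand, e.g. via "rbd create fio_test --size 2048"):
---
[rbd-randread]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio_test
invalidate=0
rw=randread
bs=4k
iodepth=64
size=400m
---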



----- Original Message -----

From: "Alexandre DERUMIER" <aderumier at odiso.com>
To: "Christian Balzer" <chibi at gol.com>
Cc: ceph-users at lists.ceph.com
Sent: Tuesday, 13 May 2014 17:16:25
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Post by Alexandre DERUMIER
Post by Christian Balzer
Actually check your random read output again, you gave it the wrong
parameter, it needs to be randread, not rand-read.
oops, sorry. I got around 7500iops with randread.
Post by Alexandre DERUMIER
Post by Christian Balzer
Your cluster isn't that old (the CPUs are in the same ballpark)
Yes, this is 6-7 year old server. (this xeons were released in 2007...)

So, it miss some features like crc32 and sse4 for examples, which can help a lot ceph



(I'll try to do some osd tuning (threads,...) to see if I can improve performance.


----- Original Message -----

From: "Christian Balzer" <chibi at gol.com>
To: "Alexandre DERUMIER" <aderumier at odiso.com>
Cc: ceph-users at lists.ceph.com
Sent: Tuesday, 13 May 2014 16:39:58
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Post by Alexandre DERUMIER
Post by Christian Balzer
Post by Christian Balzer
For what it's worth, my cluster gives me 4100 IOPS with the sequential
fio run below and 7200 when doing random reads (go figure). Of course
I made sure these came come the pagecache of the storage nodes, no
disk I/O reported at all and the CPUs used just 1 core per OSD.
---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64 ---
This seem pretty low,
I can get around 6000iops seq or rand read,
Actually check your random read output again, you gave it the wrong
parameter, it needs to be randread, not rand-read.
Post by Alexandre DERUMIER
with a pretty old cluster
Your cluster isn't that old (the CPUs are in the same ballpark) and has 12
OSDs instead of my 4. Plus it has the supposedly faster firefly. ^o^

Remember, all this is coming from RAM, so what it boils down is CPU
(memory and bus transfer speeds) and of course your network.
Which is probably why your cluster isn't even more faster than mine.

Either way, that number isn't anywhere near 4000 read IOPS per OSD either,
yours is about 500, mine about 1000...

Christian
Post by Alexandre DERUMIER
3 nodes cluster (replication x3), firefly, kernel 3.10, xfs, no tuning
in ceph.conf
----------
-4 osd, seageate 7,2k sas (with 512MB cache on controller). (journal
on same disk than osd, no dedicated ssd) -2 gigabit link (lacp)
-switch cisco 2960
each osd process are around 30% 1core during benchmark
no disk access (pagecache on ceph nodes)
sequential
----------
# fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64
--filename=/dev/vdb fiojob: (g=0): rw=read, bs=4K-4K/4K-4K,
ioengine=libaio, iodepth=64 2.0.8 Starting 1 process
Jobs: 1 (f=1): [R] [100.0% done] [23968K/0K /s] [5992 /0 iops] [eta
00m:00s] fiojob: (groupid=0, jobs=1): err= 0: pid=4158
read : io=409600KB, bw=22256KB/s, iops=5564 , runt= 18404msec
slat (usec): min=3 , max=1124 , avg=12.03, stdev=12.72
clat (msec): min=1 , max=405 , avg=11.48, stdev=12.10
lat (msec): min=1 , max=405 , avg=11.50, stdev=12.10
| 1.00th=[ 5], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10],
| 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 15],
| 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 206], 99.95th=[ 404],
| 99.99th=[ 404]
bw (KB/s) : min= 7542, max=24720, per=100.00%, avg=22321.06,
stdev=3341.21 lat (msec) : 2=0.04%, 4=0.60%, 10=21.40%, 20=77.54%,
50=0.23% lat (msec) : 250=0.13%, 500=0.06%
cpu : usr=3.76%, sys=10.32%, ctx=45280, majf=0, minf=88
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%,
Post by Christian Balzer
=64=0.0% issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0
READ: io=409600KB, aggrb=22256KB/s, minb=22256KB/s, maxb=22256KB/s,
mint=18404msec, maxt=18404msec
vdb: ios=101076/0, merge=0/0, ticks=1157172/0, in_queue=1157380,
util=99.58%
random read
-----------
# fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=rand-read --name=fiojob --blocksize=4k --iodepth=64
--filename=/dev/vdb valid values: read Sequential read : write
Sequential write : randread Random read
: randwrite Random write
: rw Sequential read and write mix
: readwrite Sequential read and write mix
: randrw Random read and write mix
fio: failed parsing rw=rand-read
fiojob: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
2.0.8
Starting 1 process
Jobs: 1 (f=1): [R] [94.7% done] [23752K/0K /s] [5938 /0 iops] [eta
00m:01s] fiojob: (groupid=0, jobs=1): err= 0: pid=4172
read : io=409600KB, bw=22887KB/s, iops=5721 , runt= 17897msec
slat (usec): min=3 , max=929 , avg=11.75, stdev=11.38
clat (msec): min=1 , max=407 , avg=11.17, stdev= 9.24
lat (msec): min=1 , max=407 , avg=11.18, stdev= 9.24
| 1.00th=[ 6], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10],
| 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 14],
| 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 60], 99.95th=[ 359],
| 99.99th=[ 404]
bw (KB/s) : min= 8112, max=25120, per=100.00%, avg=22967.77,
stdev=2657.48 lat (msec) : 2=0.05%, 4=0.46%, 10=22.83%, 20=76.34%,
50=0.21% lat (msec) : 100=0.05%, 250=0.01%, 500=0.06%
cpu : usr=4.14%, sys=10.01%, ctx=44760, majf=0, minf=88
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%,
Post by Christian Balzer
=64=0.0% issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0
READ: io=409600KB, aggrb=22886KB/s, minb=22886KB/s, maxb=22886KB/s,
mint=17897msec, maxt=17897msec
vdb: ios=100981/0, merge=0/0, ticks=1124768/0, in_queue=1125492,
util=99.57%
MonSiteEstLent.com - Blog d?di? ? la webperformance et la gestion de
pics de trafic
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com>
?: "Alexandre DERUMIER" <aderumier at odiso.com>
Cc: ceph-users at lists.ceph.com
Envoy?: Mardi 13 Mai 2014 14:38:57
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing
devices
Hello,
Post by Christian Balzer
Post by Christian Balzer
Post by Christian Balzer
It might, but at the IOPS I'm seeing anybody using SSD for file
storage should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a
setup would probably require 32+ cores.
https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf
That's and interesting find indeed.
The CPU to OSD chart clearly assumes the OSD to be backed by spinning
rust or doing 4MB block transactions.
As stated before, at the 4KB blocksize below one OSD eats up slightly
over 2 cores on the 4332HE at full speed.
Post by Christian Balzer
" Note: As of Ceph Dumpling release (10/2013), a per-OSD read
performance is approximately 4,000 IOPS and a per node limit of around
35,000 IOPS when doing reads directly from pagecache. This appears to
indicate that Ceph can make good use of spinning disks for data
storage and may benefit from SSD backed OSDs, though may also be
limited on high performance SSDs."
Node that this a read test and like nearly all IOPS statements utterly
worthless unless qualified by things as block size, working set size,
type of I/O (random or sequential).
For what it's worth, my cluster gives me 4100 IOPS with the sequential
fio run below and 7200 when doing random reads (go figure). Of course I
made sure these came come the pagecache of the storage nodes, no disk
I/O reported at all and the CPUs used just 1 core per OSD.
---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
--rw=read --name=fiojob --blocksize=4k --iodepth=64 ---
Christian
Post by Christian Balzer
Maybe Intank could comment about the 4000iops by osd ?
Alexandre Derumier Ing?nieur syst?me et stockage Fixe : 03 20 68 90 88
Fax : 03 20 68 90 81 45 Bvd du G?n?ral Leclerc 59100 Roubaix 12 rue
Marivaux 75002 Paris MonSiteEstLent.com - Blog d?di? ? la webperformance
et la gestion de pics de trafic ----- Mail original -----
Post by Christian Balzer
De: "Christian Balzer" <chibi at gol.com>
?: ceph-users at lists.ceph.com
Cc: "Alexandre DERUMIER" <aderumier at odiso.com>
Envoy?: Mardi 13 Mai 2014 11:51:37
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
Hello,
Post by Christian Balzer
Hi Christian,
I'm going to test a full ssd cluster in coming months,
I'll send result on the mailing.
Looking forward to that.
Post by Christian Balzer
Do you have tried to use 1 osd by physical disk ? (without raid6)
No, if you look back to the last year December "Sanity check..."
thread by me, it gives the reasons.
In short, highest density (thus replication of 2 and to make that safe
based on RAID6) and operational maintainability (it is a remote data
center, so replacing broken disks is a pain).
That cluster is fast enough for my purposes and that fio test isn't a
typical load for it when it goes into production.
But for designing a general purpose or high performance Ceph cluster
in the future I'd really love to have this mystery solved.
Post by Christian Balzer
Maybe they are bottleneck in osd daemon,
and using osd daemon by disk could help.
It might, but at the IOPS I'm seeing anybody using SSD for file
storage should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a setup
would probably require 32+ cores.
Christian
Post by Christian Balzer
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com>
?: ceph-users at lists.ceph.com
Envoy?: Mardi 13 Mai 2014 11:03:47
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
I'm clearly talking to myself, but whatever.
For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.
Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front
of a RAID/JBODs or using SSDs for final storage?
If so, what results do you get out of the fio statement below per
OSD? In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per
OSD, which is of course vastly faster than the normal indvidual HDDs
could do.
So I'm wondering if I'm hitting some inherent limitation of how fast
a single OSD (as in the software) can handle IOPS, given that
everything else has been ruled out from where I stand.
This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while
in my test case the only time the caching becomes noticeable is if I
increase the cache size to something larger than the test data size.
^o^
On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.
Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and
that's applying back pressure on the journal.
In the a "picture" being worth a thousand words tradition, I give
avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 51.50 0.00 1633.50 0.00 7460.00
9.13 0.18 0.11 0.00 0.11 0.01 1.40 sdb
0.00 0.00 0.00 1240.50 0.00 5244.00 8.45 0.30
0.25 0.00 0.25 0.02 2.00 sdc 0.00 5.00
0.00 2468.50 0.00 13419.00 10.87 0.24 0.10 0.00
0.10 0.09 22.00 sdd 0.00 6.50 0.00 1913.00
0.00 10313.00 10.78 0.20 0.10 0.00 0.10 0.09 16.60
The %user CPU utilization is pretty much entirely the 2 OSD
processes, note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and
service times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network
results below is that the latency happens within the OSD
processes.
Regards,
Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try
at a variety of block sizes. You could also try runnin RADOS
bench and smalliobench at a few different sizes.
-Greg
On Wednesday, May 7, 2014, Alexandre DERUMIER
Post by Alexandre DERUMIER
Hi Christian,
Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)
Aslo, I known that direct ios can be quite slow with ceph,
maybe can you try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com <javascript:;>>
?: "Gregory Farnum" <greg at inktank.com <javascript:;>>,
ceph-users at lists.ceph.com <javascript:;>
Envoy?: Jeudi 8 Mai 2014 04:49:16
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal
and backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com<javascript:;>>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs
each. The journals are on (separate) DC 3700s, the actual
OSDs are RAID6 behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1
--direct=1 --numjobs=1 --rw=randwrite --name=fiojob
--blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no
surprise there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of
the journals is about 20% (2400 IOPS) and the OSDs are
bored at 2% (1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200%
on atop, but the system is not CPU or otherwise resource
starved at that moment.
Running multiple instances of this test from several VMs
on different hosts changes nothing, as in the aggregated
IOPS for the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the
network is IPoIB with the associated low latency and the
journal SSDs are the (consistently) fasted ones around.
I guess what I am wondering about is if this is normal and
to be expected or if not where all that potential
performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that
correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that
would be 256 IOs at a time, coming from different hosts over
different links and nothing changes.
that's about 40ms of latency per op (for userspace RBD),
which seems awfully long. You should check what your
client-side objecter settings are; it might be limiting you
to fewer outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph
devel and bugs and nothing at all as far as configuration
options are concerned. Care to enlighten me where one can find
those?
Also note the kernelspace (3.13 if it matters) speed, which is
very much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master
would be interesting ? there's some performance work that
should reduce latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph
cluster certainly doesn't require 40ms/op, so you should
probably run a wider array of experiments to try and figure
out where it's coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy
experience with IPoIB led me to believe, what else is there to
look at? The storage nodes perform just as expected, indicated
by the local fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not
really sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com <javascript:;> Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com <javascript:;>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Christian Balzer
2014-05-13 16:31:18 UTC
Permalink
Post by Alexandre DERUMIER
I have just done some test,
with fio-rbd,
(http://telekomcloud.github.io/ceph/2014/02/26/ceph-performance-analysis_fio_rbd.html)
directly from the kvm host,(not from the vm).
1 fio job: around 8000iops
2 differents parralel fio job (on different rbd volume) : around 8000iops by fio job !
cpu on client is at 100%
cpu of osd are around 70%/1core now.
So, seem to have a bottleneck client side somewhere.
You didn't specify what you did, but I assume you did a read test.
Those scale, as in running fio in multiple VMs in parallel gives me about
6200 IOPS each, so in aggregate much better than the 7200 for a single one.
And yes, the client CPU is quite busy.

However my real, original question is about writes. And they are stuck at
3200 IOPS, cluster wide, no matter how many parallel VMs are running fio...

Christian
Post by Alexandre DERUMIER
(I remember some tests from Stefan Priebe on this mailing, with a full
ssd cluster, having almost same results)
----- Original Message -----
From: "Alexandre DERUMIER" <aderumier at odiso.com>
To: "Christian Balzer" <chibi at gol.com>
Cc: ceph-users at lists.ceph.com
Sent: Tuesday, 13 May 2014 17:16:25
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Post by Alexandre DERUMIER
Post by Christian Balzer
Actually check your random read output again, you gave it the wrong
parameter, it needs to be randread, not rand-read.
oops, sorry. I got around 7500iops with randread.
Post by Alexandre DERUMIER
Post by Christian Balzer
Your cluster isn't that old (the CPUs are in the same ballpark)
Yes, this is 6-7 year old server. (this xeons were released in 2007...)
So, it miss some features like crc32 and sse4 for examples, which can help a lot ceph
(I'll try to do some osd tuning (threads,...) to see if I can improve performance.
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Alexandre DERUMIER" <aderumier at odiso.com>
Cc: ceph-users at lists.ceph.com
Sent: Tuesday, 13 May 2014 16:39:58
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Post by Alexandre DERUMIER
Post by Christian Balzer
Post by Christian Balzer
For what it's worth, my cluster gives me 4100 IOPS with the
sequential fio run below and 7200 when doing random reads (go
figure). Of course I made sure these came come the pagecache of the
storage nodes, no disk I/O reported at all and the CPUs used just 1
core per OSD. ---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64 ---
This seem pretty low,
I can get around 6000iops seq or rand read,
Actually check your random read output again, you gave it the wrong
parameter, it needs to be randread, not rand-read.
Post by Alexandre DERUMIER
with a pretty old cluster
Your cluster isn't that old (the CPUs are in the same ballpark) and has
12 OSDs instead of my 4. Plus it has the supposedly faster firefly. ^o^
Remember, all this is coming from RAM, so what it boils down is CPU
(memory and bus transfer speeds) and of course your network.
Which is probably why your cluster isn't even more faster than mine.
Either way, that number isn't anywhere near 4000 read IOPS per OSD
either, yours is about 500, mine about 1000...
Christian
Post by Alexandre DERUMIER
3 nodes cluster (replication x3), firefly, kernel 3.10, xfs, no tuning
in ceph.conf
----------
-4 osd, seageate 7,2k sas (with 512MB cache on controller). (journal
on same disk than osd, no dedicated ssd) -2 gigabit link (lacp)
-switch cisco 2960
each osd process are around 30% 1core during benchmark
no disk access (pagecache on ceph nodes)
sequential
----------
# fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64
--filename=/dev/vdb fiojob: (g=0): rw=read, bs=4K-4K/4K-4K,
ioengine=libaio, iodepth=64 2.0.8 Starting 1 process
Jobs: 1 (f=1): [R] [100.0% done] [23968K/0K /s] [5992 /0 iops] [eta
00m:00s] fiojob: (groupid=0, jobs=1): err= 0: pid=4158
read : io=409600KB, bw=22256KB/s, iops=5564 , runt= 18404msec
slat (usec): min=3 , max=1124 , avg=12.03, stdev=12.72
clat (msec): min=1 , max=405 , avg=11.48, stdev=12.10
lat (msec): min=1 , max=405 , avg=11.50, stdev=12.10
| 1.00th=[ 5], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10],
| 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 15],
| 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 206], 99.95th=[ 404],
| 99.99th=[ 404]
bw (KB/s) : min= 7542, max=24720, per=100.00%, avg=22321.06,
stdev=3341.21 lat (msec) : 2=0.04%, 4=0.60%, 10=21.40%, 20=77.54%,
50=0.23% lat (msec) : 250=0.13%, 500=0.06%
cpu : usr=3.76%, sys=10.32%, ctx=45280, majf=0, minf=88
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
Post by Christian Balzer
=64=99.9% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%,
Post by Christian Balzer
=64=0.0% issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0
READ: io=409600KB, aggrb=22256KB/s, minb=22256KB/s, maxb=22256KB/s,
mint=18404msec, maxt=18404msec
vdb: ios=101076/0, merge=0/0, ticks=1157172/0, in_queue=1157380,
util=99.58%
random read
-----------
# fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=rand-read --name=fiojob --blocksize=4k --iodepth=64
--filename=/dev/vdb valid values: read Sequential read : write
Sequential write : randread Random read
: randwrite Random write
: rw Sequential read and write mix
: readwrite Sequential read and write mix
: randrw Random read and write mix
fio: failed parsing rw=rand-read
fiojob: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
2.0.8
Starting 1 process
Jobs: 1 (f=1): [R] [94.7% done] [23752K/0K /s] [5938 /0 iops] [eta
00m:01s] fiojob: (groupid=0, jobs=1): err= 0: pid=4172
read : io=409600KB, bw=22887KB/s, iops=5721 , runt= 17897msec
slat (usec): min=3 , max=929 , avg=11.75, stdev=11.38
clat (msec): min=1 , max=407 , avg=11.17, stdev= 9.24
lat (msec): min=1 , max=407 , avg=11.18, stdev= 9.24
| 1.00th=[ 6], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10],
| 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 14],
| 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 60], 99.95th=[ 359],
| 99.99th=[ 404]
bw (KB/s) : min= 8112, max=25120, per=100.00%, avg=22967.77,
stdev=2657.48 lat (msec) : 2=0.05%, 4=0.46%, 10=22.83%, 20=76.34%,
50=0.21% lat (msec) : 100=0.05%, 250=0.01%, 500=0.06%
cpu : usr=4.14%, sys=10.01%, ctx=44760, majf=0, minf=88
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
Post by Christian Balzer
=64=99.9% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%,
Post by Christian Balzer
=64=0.0% issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0
READ: io=409600KB, aggrb=22886KB/s, minb=22886KB/s, maxb=22886KB/s,
mint=17897msec, maxt=17897msec
vdb: ios=100981/0, merge=0/0, ticks=1124768/0, in_queue=1125492,
util=99.57%
MonSiteEstLent.com - Blog d?di? ? la webperformance et la gestion de
pics de trafic
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com>
?: "Alexandre DERUMIER" <aderumier at odiso.com>
Cc: ceph-users at lists.ceph.com
Envoy?: Mardi 13 Mai 2014 14:38:57
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Hello,
Post by Christian Balzer
Post by Christian Balzer
Post by Christian Balzer
It might, but at the IOPS I'm seeing anybody using SSD for file
storage should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a
setup would probably require 32+ cores.
https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf
That's and interesting find indeed.
The CPU to OSD chart clearly assumes the OSD to be backed by spinning
rust or doing 4MB block transactions.
As stated before, at the 4KB blocksize below one OSD eats up slightly
over 2 cores on the 4332HE at full speed.
Post by Christian Balzer
" Note: As of Ceph Dumpling release (10/2013), a per-OSD read
performance is approximately 4,000 IOPS and a per node limit of
around 35,000 IOPS when doing reads directly from pagecache. This
appears to indicate that Ceph can make good use of spinning disks
for data storage and may benefit from SSD backed OSDs, though may
also be limited on high performance SSDs."
Node that this a read test and like nearly all IOPS statements utterly
worthless unless qualified by things as block size, working set size,
type of I/O (random or sequential).
For what it's worth, my cluster gives me 4100 IOPS with the sequential
fio run below and 7200 when doing random reads (go figure). Of course
I made sure these came come the pagecache of the storage nodes, no
disk I/O reported at all and the CPUs used just 1 core per OSD.
---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64 ---
Christian
Post by Christian Balzer
Maybe Intank could comment about the 4000iops by osd ?
Alexandre Derumier Ing?nieur syst?me et stockage Fixe : 03 20 68 90 88
Fax : 03 20 68 90 81 45 Bvd du G?n?ral Leclerc 59100 Roubaix 12 rue
Marivaux 75002 Paris MonSiteEstLent.com - Blog d?di? ? la
webperformance et la gestion de pics de trafic ----- Mail original
-----
Post by Christian Balzer
De: "Christian Balzer" <chibi at gol.com>
?: ceph-users at lists.ceph.com
Cc: "Alexandre DERUMIER" <aderumier at odiso.com>
Envoy?: Mardi 13 Mai 2014 11:51:37
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Hello,
Post by Christian Balzer
Hi Christian,
I'm going to test a full ssd cluster in coming months,
I'll send result on the mailing.
Looking forward to that.
Post by Christian Balzer
Do you have tried to use 1 osd by physical disk ? (without raid6)
No, if you look back to the last year December "Sanity check..."
thread by me, it gives the reasons.
In short, highest density (thus replication of 2 and to make that
safe based on RAID6) and operational maintainability (it is a remote
data center, so replacing broken disks is a pain).
That cluster is fast enough for my purposes and that fio test isn't
a typical load for it when it goes into production.
But for designing a general purpose or high performance Ceph cluster
in the future I'd really love to have this mystery solved.
Post by Christian Balzer
Maybe they are bottleneck in osd daemon,
and using osd daemon by disk could help.
It might, but at the IOPS I'm seeing anybody using SSD for file
storage should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a
setup would probably require 32+ cores.
Christian
Post by Christian Balzer
----- Mail original -----
De: "Christian Balzer" <chibi at gol.com>
?: ceph-users at lists.ceph.com
Envoy?: Mardi 13 Mai 2014 11:03:47
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
I'm clearly talking to myself, but whatever.
For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.
Is there anybody on this ML who's running a Ceph cluster with a
fast network and FAST filestore, so like me with a big HW cache in
front of a RAID/JBODs or using SSDs for final storage?
If so, what results do you get out of the fio statement below per
OSD? In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS
per OSD, which is of course vastly faster than the normal
indvidual HDDs could do.
So I'm wondering if I'm hitting some inherent limitation of how
fast a single OSD (as in the software) can handle IOPS, given that
everything else has been ruled out from where I stand.
This would also explain why none of the option changes or the use
of RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk
would clearly benefit from even the small 32MB standard RBD cache,
while in my test case the only time the caching becomes noticeable
is if I increase the cache size to something larger than the test
data size. ^o^
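For anyone wanting to reproduce that observation, the client-side cache settings
involved would look roughly like this; a sketch only, with 512MB picked simply
because it exceeds the 400MB fio working set, not as a recommendation:
ceph.conf
[client]
rbd cache = true
rbd cache size = 536870912       # 512MB, i.e. larger than the 400MB test data
rbd cache max dirty = 402653184  # related knob (bytes); must stay below the cache size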
On the other hand if people here regularly get thousands or tens
of thousands IOPS per OSD with the appropriate HW I'm stumped.
Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the
expected throughput on the RAID array with OSD access
patterns, and that's applying back pressure on the journal.
In the "a picture being worth a thousand words" tradition, here is the iostat
output on one storage node during the test:
avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58
Device:   rrqm/s  wrqm/s    r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sda         0.00   51.50   0.00  1633.50     0.00   7460.00     9.13     0.18   0.11    0.00    0.11   0.01   1.40
sdb         0.00    0.00   0.00  1240.50     0.00   5244.00     8.45     0.30   0.25    0.00    0.25   0.02   2.00
sdc         0.00    5.00   0.00  2468.50     0.00  13419.00    10.87     0.24   0.10    0.00    0.10   0.09  22.00
sdd         0.00    6.50   0.00  1913.00     0.00  10313.00    10.78     0.20   0.10    0.00    0.10   0.09  16.60
The %user CPU utilization is pretty much entirely the 2 OSD
processes, note the nearly complete absence of iowait.
sda and sdb are the OSDs RAIDs, sdc and sdd are the journal
SSDs. Look at these numbers, the lack of queues, the low wait
and service times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the
network results below is that the latency happens within the OSD
processes.
Regards,
Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph.
One particular one is OSD bench. That should be interesting to
try at a variety of block sizes. You could also try running
RADOS bench and smalliobench at a few different sizes.
-Greg
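For reference, those suggestions map onto invocations roughly like the following;
a sketch only, with the pool name and byte counts as placeholders:
---
# per-OSD bench: write 1GB to osd.0 in 4KB chunks
ceph tell osd.0 bench 1073741824 4096
# pool-level 4KB writes for 60 seconds with 16 concurrent ops, keeping the
# objects around so a subsequent 'seq' read run has something to read back
rados bench -p rbd 60 write -b 4096 -t 16 --no-cleanup
rados bench -p rbd 60 seq -t 16
---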
On Wednesday, May 7, 2014, Alexandre DERUMIER
Post by Alexandre DERUMIER
Hi Christian,
Have you tried without RAID6, to have more OSDs?
(How many disks do you have behind the RAID6?)
Also, I know that direct IOs can be quite slow with Ceph;
maybe you can try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Original Mail -----
From: "Christian Balzer" <chibi at gol.com>
To: "Gregory Farnum" <greg at inktank.com>,
ceph-users at lists.ceph.com
Sent: Thursday, 8 May 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal
and backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs
each. The journals are on (separate) DC 3700s, the
actual OSDs are RAID6 behind an Areca 1882 with 4GB of
cache.
fio --size=400m --ioengine=libaio --invalidate=1
--direct=1 --numjobs=1 --rw=randwrite --name=fiojob
--blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no
surprise there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of
the journals is about 20% (2400 IOPS) and the OSDs are
bored at 2% (1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200%
on atop, but the system is not CPU or otherwise resource
starved at that moment.
Running multiple instances of this test from several VMs
on different hosts changes nothing, as in the aggregated
IOPS for the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the
network is IPoIB with the associated low latency and the
journal SSDs are the (consistently) fastest ones around.
I guess what I am wondering about is if this is normal
and to be expected or if not where all that potential
performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that
correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that
would be 256 IOs at a time, coming from different hosts over
different links and nothing changes.
that's about 40ms of latency per op (for userspace RBD),
which seems awfully long. You should check what your
client-side objecter settings are; it might be limiting
you to fewer outstanding ops than that.
Googling for client-side objecter gives a few hits on ceph
devel and bugs and nothing at all as far as configuration
options are concerned. Care to enlighten me where one can
find those?
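For what it's worth, the settings most likely meant there are the client-side
objecter throttles; a sketch of where they would go, with what I believe are the
defaults of this era shown purely for illustration:
ceph.conf
[client]
objecter inflight ops = 1024            # cap on outstanding ops per client
objecter inflight op bytes = 104857600  # cap on in-flight data (100MB)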
Also note the kernelspace (3.13 if it matters) speed, which
is very much in the same (junior league) ballpark.
If it's available to you, testing with Firefly or even master
would be interesting; there's some performance work that
should reduce latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph
cluster certainly doesn't require 40ms/op, so you should
probably run a wider array of experiments to try and
figure out where it's coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy
experience with IPoIB led me to believe, what else is there
to look at? The storage nodes perform just as expected,
indicated by the local fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not
really sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
http://ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Alexandre DERUMIER
2014-05-13 16:46:36 UTC
Permalink
Post by Alexandre DERUMIER
Post by Christian Balzer
You didn't specify what you did, but I assume you did a read test.
yes, indeed
Post by Alexandre DERUMIER
Post by Christian Balzer
Those scale, as in running fio in multiple VMs in parallel gives me about
6200 IOPS each, so in aggregate much better than the 7200 for a single one.
And yes, the client CPU is quite busy.
oh ok !
Post by Alexandre DERUMIER
Post by Christian Balzer
However my real, original question is about writes. And they are stuck at
3200 IOPS, cluster wide, no matter how many parallel VMs are running fio...
Sorry, I can't test writes, I don't have an SSD journal for now.
I'll try to send results when I have my SSD cluster.
(But I remember some talk from Sage saying that indeed small direct writes can
be pretty slow; that's why rbd_cache is recommended, to aggregate small writes
into bigger ones.)
----- Original Mail -----
From: "Christian Balzer" <chibi at gol.com>
To: "Alexandre DERUMIER" <aderumier at odiso.com>
Cc: ceph-users at lists.ceph.com
Sent: Tuesday, 13 May 2014 18:31:18
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Post by Alexandre DERUMIER
I have just done some tests
with fio-rbd
(http://telekomcloud.github.io/ceph/2014/02/26/ceph-performance-analysis_fio_rbd.html)
directly from the KVM host (not from the VM).
1 fio job: around 8000 IOPS
2 different parallel fio jobs (on different RBD volumes): around
8000 IOPS per fio job!
CPU on the client is at 100%,
CPU of the OSDs is around 70% of 1 core now.
So there seems to be a bottleneck client-side somewhere.
You didn't specify what you did, but I assume you did a read test.
Those scale, as in running fio in multiple VMs in parallel gives me about
6200 IOPS each, so in aggregate much better than the 7200 for a single one.
And yes, the client CPU is quite busy.
However my real, original question is about writes. And they are stuck at
3200 IOPS, cluster wide, no matter how many parallel VMs are running fio...
Christian
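The fio-rbd runs mentioned above drive RBD straight from the host via fio's rbd
ioengine instead of from inside a VM; a minimal sketch of such an invocation,
with the pool, image and client names as placeholders:
---
fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testimg
--rw=randread --blocksize=4k --iodepth=32 --numjobs=1 --direct=1
--size=400m --name=fiorbd
---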
Post by Alexandre DERUMIER
(I remember some tests from Stefan Priebe on this mailing list, with a full
SSD cluster, having almost the same results.)
----- Original Mail -----
From: "Alexandre DERUMIER" <aderumier at odiso.com>
To: "Christian Balzer" <chibi at gol.com>
Cc: ceph-users at lists.ceph.com
Sent: Tuesday, 13 May 2014 17:16:25
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing
devices
Post by Christian Balzer
Post by Christian Balzer
Actually check your random read output again, you gave it the wrong
parameter, it needs to be randread, not rand-read.
Oops, sorry. I got around 7500 IOPS with randread.
Post by Christian Balzer
Post by Christian Balzer
Your cluster isn't that old (the CPUs are in the same ballpark)
Yes, these are 6-7 year old servers (these Xeons were released in 2007...),
so they miss some features like CRC32 and SSE4, for example, which can
help Ceph a lot.
(I'll try to do some OSD tuning (threads, ...) to see if I can improve
performance.)
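The thread-related OSD knobs presumably meant there would be set along these
lines; the option names are real, the values merely illustrative, not a tested
tuning:
ceph.conf
[osd]
osd op threads = 4     # default is 2; worker threads servicing client ops
osd disk threads = 2   # default is 1; threads for background disk work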
----- Original Mail -----
From: "Christian Balzer" <chibi at gol.com>
To: "Alexandre DERUMIER" <aderumier at odiso.com>
Cc: ceph-users at lists.ceph.com
Sent: Tuesday, 13 May 2014 16:39:58
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing
devices
Post by Christian Balzer
Post by Christian Balzer
Post by Christian Balzer
For what it's worth, my cluster gives me 4100 IOPS with the
sequential fio run below and 7200 when doing random reads (go
figure). Of course I made sure these came from the pagecache of the
storage nodes, no disk I/O reported at all and the CPUs used just 1
core per OSD. ---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64 ---
This seems pretty low;
I can get around 6000 IOPS seq or random read,
Actually check your random read output again, you gave it the wrong
parameter, it needs to be randread, not rand-read.
Post by Christian Balzer
with a pretty old cluster
Your cluster isn't that old (the CPUs are in the same ballpark) and has
12 OSDs instead of my 4. Plus it has the supposedly faster firefly. ^o^
Remember, all this is coming from RAM, so what it boils down to is CPU
(memory and bus transfer speeds) and of course your network.
Which is probably why your cluster isn't even faster than mine.
Either way, that number isn't anywhere near 4000 read IOPS per OSD
either, yours is about 500, mine about 1000...
Christian
Post by Christian Balzer
3-node cluster (replication x3), Firefly, kernel 3.10, XFS, no tuning
in ceph.conf
----------
- 4 OSDs, Seagate 7.2k SAS (with 512MB cache on the controller), journal
on the same disk as the OSD, no dedicated SSD
- 2 gigabit links (LACP)
- switch: Cisco 2960
Each OSD process is at around 30% of 1 core during the benchmark,
no disk access (pagecache on the Ceph nodes).
sequential
----------
# fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64
--filename=/dev/vdb fiojob: (g=0): rw=read, bs=4K-4K/4K-4K,
ioengine=libaio, iodepth=64 2.0.8 Starting 1 process
Jobs: 1 (f=1): [R] [100.0% done] [23968K/0K /s] [5992 /0 iops] [eta
00m:00s] fiojob: (groupid=0, jobs=1): err= 0: pid=4158
read : io=409600KB, bw=22256KB/s, iops=5564 , runt= 18404msec
slat (usec): min=3 , max=1124 , avg=12.03, stdev=12.72
clat (msec): min=1 , max=405 , avg=11.48, stdev=12.10
lat (msec): min=1 , max=405 , avg=11.50, stdev=12.10
| 1.00th=[ 5], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10],
| 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 15],
| 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 206], 99.95th=[ 404],
| 99.99th=[ 404]
bw (KB/s) : min= 7542, max=24720, per=100.00%, avg=22321.06,
stdev=3341.21 lat (msec) : 2=0.04%, 4=0.60%, 10=21.40%, 20=77.54%,
50=0.23% lat (msec) : 250=0.13%, 500=0.06%
cpu : usr=3.76%, sys=10.32%, ctx=45280, majf=0, minf=88
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0
READ: io=409600KB, aggrb=22256KB/s, minb=22256KB/s, maxb=22256KB/s,
mint=18404msec, maxt=18404msec
vdb: ios=101076/0, merge=0/0, ticks=1157172/0, in_queue=1157380,
util=99.58%
random read
-----------
# fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=rand-read --name=fiojob --blocksize=4k --iodepth=64
--filename=/dev/vdb
valid values: read       Sequential read
            : write      Sequential write
            : randread   Random read
            : randwrite  Random write
            : rw         Sequential read and write mix
            : readwrite  Sequential read and write mix
            : randrw     Random read and write mix
fio: failed parsing rw=rand-read
fiojob: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
2.0.8
Starting 1 process
Jobs: 1 (f=1): [R] [94.7% done] [23752K/0K /s] [5938 /0 iops] [eta
00m:01s] fiojob: (groupid=0, jobs=1): err= 0: pid=4172
read : io=409600KB, bw=22887KB/s, iops=5721 , runt= 17897msec
slat (usec): min=3 , max=929 , avg=11.75, stdev=11.38
clat (msec): min=1 , max=407 , avg=11.17, stdev= 9.24
lat (msec): min=1 , max=407 , avg=11.18, stdev= 9.24
| 1.00th=[ 6], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10],
| 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 14],
| 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 60], 99.95th=[ 359],
| 99.99th=[ 404]
bw (KB/s) : min= 8112, max=25120, per=100.00%, avg=22967.77,
stdev=2657.48 lat (msec) : 2=0.05%, 4=0.46%, 10=22.83%, 20=76.34%,
50=0.21% lat (msec) : 100=0.05%, 250=0.01%, 500=0.06%
cpu : usr=4.14%, sys=10.01%, ctx=44760, majf=0, minf=88
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0
READ: io=409600KB, aggrb=22886KB/s, minb=22886KB/s, maxb=22886KB/s,
mint=17897msec, maxt=17897msec
vdb: ios=100981/0, merge=0/0, ticks=1124768/0, in_queue=1125492,
util=99.57%
MonSiteEstLent.com - blog dedicated to web performance and handling
traffic peaks
----- Original Mail -----
From: "Christian Balzer" <chibi at gol.com>
To: "Alexandre DERUMIER" <aderumier at odiso.com>
Cc: ceph-users at lists.ceph.com
Sent: Tuesday, 13 May 2014 14:38:57
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
Hello,
Post by Christian Balzer
Post by Christian Balzer
Post by Christian Balzer
It might, but at the IOPS I'm seeing anybody using SSD for file
storage should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a
setup would probably require 32+ cores.
https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf
That's an interesting find indeed.
The CPU to OSD chart clearly assumes the OSD to be backed by spinning
rust or doing 4MB block transactions.
As stated before, at the 4KB blocksize below one OSD eats up slightly
over 2 cores on the 4332HE at full speed.