parameter, it needs to be randread, not rand-read.
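For reference, the corrected invocation (only the --rw value changes, everything else as in your run) would be:
---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randread --name=fiojob --blocksize=4k --iodepth=64 --filename=/dev/vdb
---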
OSDs instead of my 4. Plus it has the supposedly faster firefly. ^o^
(memory and bus transfer speeds) and of course your network.
Which is probably why your cluster isn't even faster than mine.
yours is about 500, mine about 1000...
Post by Alexandre DERUMIER
3-node cluster (replication x3), firefly, kernel 3.10, xfs, no tuning in ceph.conf
----------
-4 OSDs, Seagate 7.2k SAS (with 512MB cache on the controller), journal
on the same disk as the OSD, no dedicated SSD
-2 gigabit links (LACP)
-switch: Cisco 2960
each OSD process is at around 30% of one core during the benchmark
no disk access (pagecache on ceph nodes)
sequential
----------
# fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64 --filename=/dev/vdb
fiojob: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
2.0.8
Starting 1 process
Jobs: 1 (f=1): [R] [100.0% done] [23968K/0K /s] [5992 /0 iops] [eta 00m:00s]
fiojob: (groupid=0, jobs=1): err= 0: pid=4158
read : io=409600KB, bw=22256KB/s, iops=5564 , runt= 18404msec
slat (usec): min=3 , max=1124 , avg=12.03, stdev=12.72
clat (msec): min=1 , max=405 , avg=11.48, stdev=12.10
lat (msec): min=1 , max=405 , avg=11.50, stdev=12.10
| 1.00th=[ 5], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10],
| 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 15],
| 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 206], 99.95th=[ 404],
| 99.99th=[ 404]
bw (KB/s) : min= 7542, max=24720, per=100.00%, avg=22321.06, stdev=3341.21
lat (msec) : 2=0.04%, 4=0.60%, 10=21.40%, 20=77.54%, 50=0.23%
lat (msec) : 250=0.13%, 500=0.06%
cpu : usr=3.76%, sys=10.32%, ctx=45280, majf=0, minf=88
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0
READ: io=409600KB, aggrb=22256KB/s, minb=22256KB/s, maxb=22256KB/s, mint=18404msec, maxt=18404msec
vdb: ios=101076/0, merge=0/0, ticks=1157172/0, in_queue=1157380, util=99.58%
random read
-----------
# fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=rand-read --name=fiojob --blocksize=4k --iodepth=64 --filename=/dev/vdb
valid values: read       Sequential read
            : write      Sequential write
            : randread   Random read
            : randwrite  Random write
            : rw         Sequential read and write mix
            : readwrite  Sequential read and write mix
            : randrw     Random read and write mix
fio: failed parsing rw=rand-read
fiojob: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
2.0.8
Starting 1 process
Jobs: 1 (f=1): [R] [94.7% done] [23752K/0K /s] [5938 /0 iops] [eta 00m:01s]
fiojob: (groupid=0, jobs=1): err= 0: pid=4172
read : io=409600KB, bw=22887KB/s, iops=5721 , runt= 17897msec
slat (usec): min=3 , max=929 , avg=11.75, stdev=11.38
clat (msec): min=1 , max=407 , avg=11.17, stdev= 9.24
lat (msec): min=1 , max=407 , avg=11.18, stdev= 9.24
| 1.00th=[ 6], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10],
| 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12],
| 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 14],
| 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 60], 99.95th=[ 359],
| 99.99th=[ 404]
bw (KB/s) : min= 8112, max=25120, per=100.00%, avg=22967.77, stdev=2657.48
lat (msec) : 2=0.05%, 4=0.46%, 10=22.83%, 20=76.34%, 50=0.21%
lat (msec) : 100=0.05%, 250=0.01%, 500=0.06%
cpu : usr=4.14%, sys=10.01%, ctx=44760, majf=0, minf=88
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0
READ: io=409600KB, aggrb=22886KB/s, minb=22886KB/s, maxb=22886KB/s, mint=17897msec, maxt=17897msec
vdb: ios=100981/0, merge=0/0, ticks=1124768/0, in_queue=1125492, util=99.57%
MonSiteEstLent.com - Blog dedicated to web performance and handling traffic peaks
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Alexandre DERUMIER" <aderumier at odiso.com>
Cc: ceph-users at lists.ceph.com
Sent: Tuesday, May 13, 2014 14:38:57
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Hello,
Post by Christian Balzer
Post by Christian Balzer
It might, but at the IOPS I'm seeing anybody using SSD for file
storage should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a
setup would probably require 32+ cores.
https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf
That's an interesting find indeed.
The CPU to OSD chart clearly assumes the OSD to be backed by spinning
rust or doing 4MB block transactions.
As stated before, at the 4KB blocksize below one OSD eats up slightly
over 2 cores on the 4332HE at full speed.
" Note: As of Ceph Dumpling release (10/2013), a per-OSD read
performance is approximately 4,000 IOPS and a per node limit of around
35,000 IOPS when doing reads directly from pagecache. This appears to
indicate that Ceph can make good use of spinning disks for data
storage and may benefit from SSD backed OSDs, though may also be
limited on high performance SSDs."
Note that this is a read test and, like nearly all IOPS statements, utterly
worthless unless qualified by things such as block size, working set size,
and type of I/O (random or sequential).
For what it's worth, my cluster gives me 4100 IOPS with the sequential
fio run below and 7200 when doing random reads (go figure). Of course I
made sure these came from the pagecache of the storage nodes, no disk
I/O reported at all and the CPUs used just 1 core per OSD.
---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64
---
Christian
Maybe Inktank could comment on the 4,000 IOPS per OSD?
Alexandre Derumier
Systems and storage engineer
Phone: 03 20 68 90 88
Fax: 03 20 68 90 81
45 Bvd du Général Leclerc 59100 Roubaix
12 rue Marivaux 75002 Paris
MonSiteEstLent.com - Blog dedicated to web performance and handling traffic peaks
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: ceph-users at lists.ceph.com
Cc: "Alexandre DERUMIER" <aderumier at odiso.com>
Sent: Tuesday, May 13, 2014 11:51:37
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Hello,
Post by Christian Balzer
Hi Christian,
I'm going to test a full ssd cluster in coming months,
I'll send result on the mailing.
Looking forward to that.
Post by Christian Balzer
Have you tried using 1 OSD per physical disk? (without RAID6)
No, if you look back at the "Sanity check..." thread I posted last
December, it gives the reasons.
In short, highest density (thus a replication of 2, made safe by basing
it on RAID6) and operational maintainability (it is a remote data
center, so replacing broken disks is a pain).
That cluster is fast enough for my purposes and that fio test isn't a
typical load for it when it goes into production.
But for designing a general purpose or high performance Ceph cluster
in the future I'd really love to have this mystery solved.
Post by Christian Balzer
Maybe there is a bottleneck in the OSD daemon, and using one OSD
daemon per disk could help.
It might, but at the IOPS I'm seeing anybody using SSD for file
storage should have screamed out already.
Also given the CPU usage I'm seeing during that test run such a setup
would probably require 32+ cores.
Christian
Post by Christian Balzer
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: ceph-users at lists.ceph.com
Sent: Tuesday, May 13, 2014 11:03:47
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
I'm clearly talking to myself, but whatever.
For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.
Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front
of RAIDs/JBODs, or using SSDs for final storage?
If so, what results do you get out of the fio statement below per
OSD? In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per
OSD, which is of course vastly faster than the individual HDDs
could do.
So I'm wondering if I'm hitting some inherent limitation of how fast
a single OSD (as in the software) can handle IOPS, given that
everything else has been ruled out from where I stand.
This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while
in my test case the only time the caching becomes noticeable is if I
increase the cache size to something larger than the test data size.
^o^
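(For concreteness, "larger than the test data size" in ceph.conf terms would be something along these lines; the option names are the standard rbd cache ones, but the values are just my example numbers, picked to exceed the 400MB fio working set:
---
[client]
rbd cache = true
rbd cache size = 536870912       # 512MB of cache, bigger than the 400MB test file
rbd cache max dirty = 402653184  # let most of it stay dirty for the write test
---
)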
On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.
Christian
Post by Christian Balzer
Post by Gregory Farnum
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and
that's applying back pressure on the journal.
In the a "picture" being worth a thousand words tradition, I give
avg-cpu: %user %nice %system %iowait %steal %idle
50.82 0.00 19.43 0.17 0.00 29.58
Device:  rrqm/s  wrqm/s    r/s      w/s   rkB/s     wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sda        0.00   51.50   0.00  1633.50    0.00   7460.00     9.13     0.18   0.11    0.00    0.11   0.01   1.40
sdb        0.00    0.00   0.00  1240.50    0.00   5244.00     8.45     0.30   0.25    0.00    0.25   0.02   2.00
sdc        0.00    5.00   0.00  2468.50    0.00  13419.00    10.87     0.24   0.10    0.00    0.10   0.09  22.00
sdd        0.00    6.50   0.00  1913.00    0.00  10313.00    10.78     0.20   0.10    0.00    0.10   0.09  16.60
The %user CPU utilization is pretty much entirely the 2 OSD
processes, note the nearly complete absence of iowait.
sda and sdb are the OSD RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and
service times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network
results below is that the latency happens within the OSD
processes.
Regards,
Christian
Post by Gregory Farnum
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try
at a variety of block sizes. You could also try running RADOS
bench and smalliobench at a few different sizes.
-Greg
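(For reference, untested sketches of those bench invocations; osd.0, the "rbd" pool and the 4k/64-thread figures are placeholders I picked to mirror the fio runs, not anything Greg specified:
---
ceph tell osd.0 bench
rados bench -p rbd 60 write -b 4096 -t 64 --no-cleanup
rados bench -p rbd 60 seq -t 64
---
)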
On Wednesday, May 7, 2014, Alexandre DERUMIER
Post by Alexandre DERUMIER
Hi Christian,
Have you tried without RAID6, to have more OSDs?
(how many disks do you have behind the RAID6?)
Also, I know that direct IOs can be quite slow with Ceph,
so maybe you can try without --direct=1
and also enable rbd_cache
ceph.conf
[client]
rbd cache = true
----- Original Message -----
From: "Christian Balzer" <chibi at gol.com>
To: "Gregory Farnum" <greg at inktank.com>, ceph-users at lists.ceph.com
Sent: Thursday, May 8, 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal
and backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<chibi at gol.com<javascript:;>>
Post by Christian Balzer
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs
each. The journals are on (separate) DC 3700s, the actual
OSDs are RAID6 behind an Areca 1882 with 4GB of cache.
fio --size=400m --ioengine=libaio --invalidate=1
--direct=1 --numjobs=1 --rw=randwrite --name=fiojob
--blocksize=4k --iodepth=128
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no
surprise there) 3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
When running the fio from the VM RBD the utilization of
the journals is about 20% (2400 IOPS) and the OSDs are
bored at 2% (1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200%
on atop, but the system is not CPU or otherwise resource
starved at that moment.
Running multiple instances of this test from several VMs
on different hosts changes nothing, as in the aggregated
IOPS for the whole cluster will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the
network is IPoIB with the associated low latency and the
journal SSDs are the (consistently) fastest ones around.
I guess what I am wondering about is if this is normal and
to be expected or if not where all that potential
performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that
would be 256 IOs at a time, coming from different hosts over
different links and nothing changes.
that's about 40ms of latency per op (for userspace RBD),
which seems awfully long. You should check what your
client-side objecter settings are; it might be limiting you
to fewer outstanding ops than that.
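(That 40 ms presumably follows from 128 outstanding IOs / 3200 IOPS ≈ 0.04 s, i.e. roughly 40 ms per op spent queued and in flight.)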
Googling for client-side objecter gives a few hits on ceph
devel and bugs and nothing at all as far as configuration
options are concerned. Care to enlighten me where one can find
those?
Also note the kernelspace (3.13 if it matters) speed, which is
very much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or even master
would be interesting; there's some performance work that
should reduce latencies.
Not an option, this is going into production next week.
But a well-tuned (or even default-tuned, I thought) Ceph
cluster certainly doesn't require 40ms/op, so you should
probably run a wider array of experiments to try and figure
out where it's coming from.
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
So with the network performing as well as my lengthy
experience with IPoIB led me to believe, what else is there to
look at? The storage nodes perform just as expected, indicated
by the local fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not
really sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
--
Christian Balzer Network/Systems Engineer
chibi at gol.com <javascript:;> Global OnLine Japan/Fusion
Communications http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com <javascript:;>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com