Discussion:
[ceph-users] Persistent Write Back Cache
Nick Fisk
2015-03-04 08:26:52 UTC
Hi All,



Is there anything in the pipeline to add the ability to write the librbd
cache to SSD so that it can safely ignore sync requests? I saw a thread
from a few years back where Sage was discussing something similar, but I
can't find anything more recent on it.



I've been running lots of tests on our new cluster; buffered/parallel
performance is amazing (40K read / 10K write IOPS), very impressed.
However, sync writes are actually quite disappointing.



Running fio with a 128k block size and depth=1 normally only gives me about
300 IOPS or 30MB/s. I'm seeing 2-3ms latency writing to SSD OSDs, and from
what I hear that's about normal, so I don't think I have a Ceph config
problem. For applications which do a lot of syncs, like ESXi over iSCSI or
SQL databases, this has a major performance impact.
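
To put rough numbers on that, a quick back-of-envelope in Python (purely
illustrative arithmetic, assuming per-write latency is the only limiter at
queue depth 1):

    # At queue depth 1 each write must complete before the next is issued,
    # so IOPS is bounded by 1 / latency.
    block_size = 128 * 1024                      # bytes per write
    for latency_ms in (2.0, 3.0):
        iops = 1000.0 / latency_ms               # writes per second at QD=1
        mb_per_s = iops * block_size / 1e6       # resulting throughput
        print(f"{latency_ms} ms -> {iops:.0f} IOPS, {mb_per_s:.0f} MB/s")
    # ~3 ms per sync write works out to roughly 330 IOPS and ~40 MB/s,
    # the same ballpark as the ~300 IOPS / 30 MB/s I am seeing.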



Traditional storage arrays work around this problem by having a
battery-backed cache whose latency is 10-100 times lower than what you can
currently achieve with Ceph and an SSD. Whilst librbd does have a writeback
cache, from what I understand it will not cache syncs, and so in my use
case it effectively acts like a write-through cache.



To illustrate the difference a proper write-back cache can make, I put a
1GB (512MB dirty threshold) flashcache in front of my RBD and tweaked the
flush parameters to flush dirty blocks at a large queue depth. The same fio
test (128k, iodepth=1) now runs at 120MB/s and is limited by the performance
of the SSD used by flashcache, as everything is stored as 4k blocks on the
SSD. In fact, since everything is stored as 4k blocks, pretty much all IO
sizes are accelerated to the max speed of the SSD. Looking at iostat I can
see all the IOs are getting coalesced into nice large 512KB IOs at a high
queue depth, which Ceph easily swallows.
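
To sketch the mechanism in Python (an illustration of the idea only, not
flashcache's actual code): the cache acknowledges a write once it is safely
on the local SSD, tracks dirty 4k blocks, and later flushes runs of
adjacent blocks to the backing RBD as large writes:

    BLOCK = 4096  # flashcache-style 4 KiB cache blocks

    class WriteBackSketch:
        def __init__(self, backing_write, max_flush=512 * 1024):
            self.backing_write = backing_write   # callable(offset, data)
            self.max_flush = max_flush           # target size of flush IOs
            self.dirty = {}                      # block index -> data

        def write(self, offset, data):
            # A real cache persists 'data' to the SSD before acking the
            # sync.  Assumes 4 KiB-aligned writes for simplicity.
            for i in range(0, len(data), BLOCK):
                self.dirty[(offset + i) // BLOCK] = data[i:i + BLOCK]

        def flush(self):
            # Coalesce adjacent dirty blocks into large backing-store
            # writes, the way iostat shows ~512KB IOs going to the RBD.
            run, start = [], None
            for idx in sorted(self.dirty):
                adjacent = start is not None and idx == start + len(run)
                if not adjacent or len(run) * BLOCK >= self.max_flush:
                    if run:
                        self.backing_write(start * BLOCK, b"".join(run))
                    run, start = [], idx
                run.append(self.dirty[idx])
            if run:
                self.backing_write(start * BLOCK, b"".join(run))
            self.dirty.clear()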



If librbd could support writing its cache out to SSD, it would hopefully
achieve the same level of performance, and having it integrated would be
really neat.



Nick
Christian Balzer
2015-03-04 08:40:11 UTC
Hello,

If I understand you correctly, you're talking about the rbd cache on the
client side.

So assume that host, or the cache SSD in it, fails terminally.
The client thinks its sync'ed writes are on the permanent storage (the
actual Ceph storage cluster), while they are only present locally.

So restarting that service or VM on a different host now has to deal with
likely crippling data corruption.

Regards,

Christian
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Nick Fisk
2015-03-04 08:49:22 UTC
Hi Christian,

Yes, that's correct, it's on the client side. I don't see this as much
different to a battery-backed RAID controller: if you lose power, the data
stays in the cache until power resumes, at which point it is flushed.

If you are going to have the same RBD accessed by multiple servers/clients
then you need to make sure the SSD is accessible to all of them (e.g. DRBD /
dual-port SAS). But then something like Pacemaker would be responsible for
ensuring the RBD and cache device are both present before allowing client
access.

When I wrote this I was thinking more about 2 HA iSCSI servers with RBDs;
however, I can understand that this feature would prove more of a challenge
if you are using QEMU and RBD.

Nick

Sage Weil
2015-03-04 16:33:41 UTC
Hi Nick, Christian,

This is something we've discussed a bit but hasn't made it to the top of
the list.

I think having a single persistent copy on the client has *some* value,
although it's limited because it's a single point of failure. The simplest
scenario would be to use it as a write-through cache that accelerates
reads only.

Another option would be to have a shared but local device (like an SSD
that is connected to a pair of client hosts, or has fast access within a
rack--a scenario that I've heard a few vendors talk about). It
still leaves a host pair or rack as a failure zone, but there are
times where that's appropriate.

In either case, though, I think the key RBD feature that would make it
much more valuable would be if RBD (librbd presumably) could maintain the
writeback cache with some sort of checkpoints or journal internally such
that writes that get flushed back to the cluster are always *crash
consistent*. So even if you lose the client cache entirely, your disk
image is still holding a valid file system that looks like it is just a
little bit stale.

If the client-side writeback cache were structured as a data journal this
would be pretty straightforward... it might even mesh well with the RBD
mirroring?
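
A toy sketch in Python of what I mean (purely illustrative, not an existing
librbd interface): writes are appended to a local ordered journal and
acknowledged, and the flusher only ever replays a prefix of that journal to
the cluster, so the image in the cluster always corresponds to some
earlier, consistent point in time:

    from collections import deque

    class JournaledCacheSketch:
        def __init__(self, cluster_write):
            self.cluster_write = cluster_write   # callable(offset, data)
            self.journal = deque()               # ordered (offset, data)

        def write(self, offset, data):
            # A real implementation would persist this entry to the local
            # SSD journal before acknowledging the sync to the guest.
            self.journal.append((offset, data))

        def flush_some(self, max_entries=64):
            # Replay entries strictly in submission order; stopping after
            # any entry leaves the cluster image at a valid earlier state,
            # so a lost cache means a stale image rather than a corrupt one.
            for _ in range(min(max_entries, len(self.journal))):
                offset, data = self.journal.popleft()
                self.cluster_write(offset, data)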

sage
Christian Balzer
2015-03-05 00:50:44 UTC
Hello Nick,
Post by Nick Fisk
Hi Christian,
Yes that's correct, it's on the client side. I don't see this much
different to a battery backed Raid controller, if you lose power, the
data is in the cache until power resumes when it is flushed.
If you are going to have the same RBD accessed by multiple
servers/clients then you need to make sure the SSD is accessible to both
(eg DRBD / Dual Port SAS). But then something like pacemaker would be
responsible for ensuring the RBD and cache device are both present
before allowing client access.
Which is pretty much any and all use cases I can think of.
It's not only about concurrent (active/active) access; you really need to
have things consistent across all possible client hosts in case of a node
failure.

I'm no stranger to DRBD and Pacemaker (which incidentally didn't make it
into Debian Jessie, cue massive laughter and ridicule), btw.
Post by Nick Fisk
When I wrote this I was thinking more about 2 HA iSCSI servers with
RBD's, however I can understand that this feature would prove more of a
challenge if you are using Qemu and RBD.
One of the reasons I'm using Ceph/RBD instead of DRBD (which is vastly
more suited for some use cases) is that it allows me n+1 instead of n+n
redundancy when it comes to consumers (compute nodes in my case).

Now for your iSCSI head (looking forward to your results and any config
recipes), that limitation to a pair may be just as well, but as others
wrote it might be best to take this forward outside of Ceph, especially
since you're already dealing with an HA cluster/Pacemaker in that scenario.


Christian
John Spray
2015-03-04 11:34:12 UTC
Post by Nick Fisk
To illustrate the difference a proper write back cache can make, I put
a 1GB (512mb dirty threshold) flashcache in front of my RBD and
tweaked the flush parameters to flush dirty blocks at a large queue
depth. The same fio test (128k iodepth=1) now runs at 120MB/s and is
limited by the performance of SSD used by flashcache, as everything is
stored as 4k blocks on the ssd. In fact since everything is stored as
4k blocks, pretty much all IO sizes are accelerated to max speed of
the SSD. Looking at iostat I can see all the IO’s are getting
coalesced into nice large 512kb IO’s at a high queue depth, which Ceph
easily swallows.
If librbd could support writing its cache out to SSD it would
hopefully achieve the same level of performance and having it
integrated would be really neat.
What are you hoping to gain from building something into Ceph instead of
using flashcache/bcache/dm-cache on top of it? It seems like, since you
would need to handle your HA configuration anyway, setting up the actual
cache device would be the simple part.

Cheers,
John
Nick Fisk
2015-03-04 16:39:29 UTC
Post by John Spray
What are you hoping to gain from building something into ceph instead of
using flashcache/bcache/dm-cache on top of it? It seems like since you
would anyway need to handle your HA configuration, setting up the actual
cache device would be the simple part.

Cheers,
John



Hi John,



I guess it's to make things easier, rather than having to run a huge stack
of different technologies to achieve the same goal, especially when half of
the caching logic is already in Ceph. It would be really nice, and would
drive adoption, if you could add an SSD, set a config option, and suddenly
have a storage platform that performs 10x faster.



Another way of handling it might be for librbd to be pointed at a UUID
instead of a /dev/sd* device. That way librbd knows which cache device to
look for and will error out if the cache device is missing. These cache
devices could then be presented to all the necessary servers via iSCSI or
something similar if the RBD needs to move around.
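
Purely as a sketch of the idea in Python (the /dev/disk/by-uuid symlinks
are standard Linux; the librbd hook itself is hypothetical):

    import os
    import sys

    def find_cache_device(uuid):
        # Resolve the cache device by UUID via the standard Linux
        # /dev/disk/by-uuid symlinks rather than trusting a /dev/sd* name
        # that may change between hosts or reboots.
        path = os.path.join("/dev/disk/by-uuid", uuid)
        if not os.path.exists(path):
            raise RuntimeError("cache device %s not found; refusing to map RBD" % uuid)
        return os.path.realpath(path)            # e.g. /dev/sdb1 on this host

    if __name__ == "__main__":
        if len(sys.argv) != 2:
            sys.exit("usage: find_cache_device.py <uuid>")
        try:
            print("using cache device", find_cache_device(sys.argv[1]))
        except RuntimeError as err:
            sys.exit(str(err))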



Nick
Mark Nelson
2015-03-04 16:42:41 UTC
Post by John Spray
What are you hoping to gain from building something into ceph instead of
using flashcache/bcache/dm-cache on top of it? It seems like since you
would anyway need to handle your HA configuration, setting up the actual
cache device would be the simple part.
Agreed regarding flashcache/bcache/dm-cache. I suspect improving an
existing project rather than reinventing it ourselves would be the way
to go. It may also be worth looking at Luis's work, though I note that
he specifically says write-through:

http://vault2015.sched.org/event/6cc56a5b8a95ead46961697028b59c39#.VPc0uX-etWQ

https://github.com/pblcache/pblcache