Discussion:
[GIT PULL] mm: frontswap (for 3.2 window)
Dan Magenheimer
2011-10-27 18:52:22 UTC
Hi Linus --

Frontswap now has FOUR users: Two already merged in-tree (zcache
and Xen) and two still in development but in public git trees
(RAMster and KVM). Frontswap is part 2 of 2 of the core kernel
changes required to support transcendent memory; part 1 was cleancache
which you merged at 3.0 (and which now has FIVE users).

Frontswap patches have been in linux-next since June 3 (with zero
changes since Sep 22). First posted to lkml in June 2009, frontswap
is now at version 11 and has incorporated feedback from a wide range
of kernel developers. For a good overview, see
http://lwn.net/Articles/454795.
If further rationale is needed, please see the end of this email
for more info.

SO... Please pull:

git://oss.oracle.com/git/djm/tmem.git #tmem

since git commit b6fd41e29dea9c6753b1843a77e50433e6123bcb
Linus Torvalds (1):

Linux 3.1-rc6

(identical commits being pulled by sfr into linux-next since Sep22)

Note that in addition to frontswap, this commit series includes
some minor changes to cleancache, necessary for consistency with
changes to frontswap that Andrew Morton required (e.g. the
flush->invalidate name change; reporting via debugfs instead of
sysfs). As a result, a handful of cleancache-related VFS files
incur only a very small change.
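For anyone unfamiliar with the rename, it is purely a change of
naming, roughly of the following shape (illustrative prototypes
only, written from memory -- see the patches themselves for the
authoritative signatures):

    struct address_space;
    struct page;

    /* before the rename */
    void cleancache_flush_page(struct address_space *mapping,
                               struct page *page);
    /* after the rename */
    void cleancache_invalidate_page(struct address_space *mapping,
                                    struct page *page);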

Dan Magenheimer (8):
mm: frontswap: add frontswap header file
mm: frontswap: core swap subsystem hooks and headers
mm: frontswap: core frontswap functionality
mm: frontswap: config and doc files
mm: cleancache: s/flush/invalidate/
mm: frontswap/cleancache: s/flush/invalidate/
mm: cleancache: report statistics via debugfs instead of sysfs.
mm: cleancache: Use __read_mostly as appropiate.

Diffstat:
.../ABI/testing/sysfs-kernel-mm-cleancache | 11 -
Documentation/vm/cleancache.txt | 41 ++--
Documentation/vm/frontswap.txt | 210 +++++++++++++++
drivers/staging/zcache/zcache-main.c | 10 +-
drivers/xen/tmem.c | 10 +-
fs/buffer.c | 2 +-
fs/super.c | 2 +-
include/linux/cleancache.h | 24 +-
include/linux/frontswap.h | 9 +-
include/linux/swap.h | 4 +
include/linux/swapfile.h | 13 +
mm/Kconfig | 17 ++
mm/Makefile | 1 +
mm/cleancache.c | 98 +++-----
mm/filemap.c | 2 +-
mm/frontswap.c | 273 ++++++++++++++++++++
mm/page_io.c | 12 +
mm/swapfile.c | 64 ++++-
mm/truncate.c | 10 +-
19 files changed, 672 insertions(+), 141 deletions(-)

====

FURTHER RATIONALE, INFORMATION, AND LINKS:

In-kernel users (grep for CONFIG_FRONTSWAP):
- drivers/staging/zcache (since 2.6.39)
- drivers/xen/tmem.c (since 3.1)
- drivers/xen/xen-selfballoon.c (since 3.1)

Users in development in public git trees:
- "RAMster" driver, see ramster branch of
git://oss.oracle.com/git/djm/tmem.git
- KVM port now underway, see:
https://github.com/sashalevin/kvm-tmem/commits/tmem

History of frontswap code:
- code first written in Dec 2008
- previously known as "hswap" and "preswap"
- first public posting in Feb 2009
- first LKML posting on June 19, 2009
- renamed frontswap, posted on May 28, 2010
- in linux-next since June 3, 2011
- incorporated feedback from: (partial list)
Andrew Morton, Jan Beulich, Konrad Wilk,
Jeremy Fitzhardinge, Kamezawa Hiroyuki,
Seth Jennings (IBM)

Linux kernel distros incorporating frontswap:
- Oracle UEK 2.6.39 Beta:
http://oss.oracle.com/git/?p=linux-2.6-unbreakable-beta.git;a=summary
- OpenSuSE since 11.2 (2009) [see mm/tmem-xen.c]
http://kernel.opensuse.org/cgit/kernel/
- a popular Gentoo distro
http://forums.gentoo.org/viewtopic-t-862105.html

Xen distros supporting Linux guests with frontswap:
- Xen hypervisor backend since Xen 4.0 (2009)
http://www.xen.org/files/Xen_4_0_Datasheet.pdf
- OracleVM since 2.2 (2009)
http://twitter.com/#!/Djelibeybi/status/113876514688352256

Public visibility for frontswap (as part of transcendent memory):
- presented at OSDI'08, OLS'09, LCA'10, LPC'10, LinuxCon NA 11, Oracle
Open World 2011, two LSF/MM Summits (2010,2011), and three
Xen Summits (2009,2010,2011)
- http://lwn.net/Articles/454795 (current overview)
- http://lwn.net/Articles/386090 (2010)
- http://lwn.net/Articles/340080 (2009)

Kurt Hackel
2011-10-27 19:30:20 UTC
Hi,
As the dev manager for OracleVM (x86), I'd like to express my interest in seeing frontswap get merged upstream. The OracleVM product has been capable of working with frontswap for over a year now, and we'd very much like to see the complete cleancache+frontswap feature set fully upstreamed. Oracle is also fully committed to the ongoing maintenance of frontswap.
thanks
kurt

Kurt C. Hackel
Development Director
Oracle VM
***@oracle.com



David Rientjes
2011-10-27 20:18:40 UTC
On Thu, 27 Oct 2011, Dan Magenheimer wrote:

> Hi Linus --
>
> Frontswap now has FOUR users: Two already merged in-tree (zcache
> and Xen) and two still in development but in public git trees
> (RAMster and KVM). Frontswap is part 2 of 2 of the core kernel
> changes required to support transcendent memory; part 1 was cleancache
> which you merged at 3.0 (and which now has FIVE users).
>
> Frontswap patches have been in linux-next since June 3 (with zero
> changes since Sep 22). First posted to lkml in June 2009, frontswap
> is now at version 11 and has incorporated feedback from a wide range
> of kernel developers. For a good overview, see
> http://lwn.net/Articles/454795.
> If further rationale is needed, please see the end of this email
> for more info.
>
> SO... Please pull:
>
> git://oss.oracle.com/git/djm/tmem.git #tmem
>

Isn't this something that should go through the -mm tree?

Christoph Hellwig
2011-10-27 21:11:57 UTC
On Thu, Oct 27, 2011 at 01:18:40PM -0700, David Rientjes wrote:
> Isn't this something that should go through the -mm tree?

It should have. It should also have ACKs from the core VM developers,
and at least the few I talked to about it really didn't seem to like it.

Avi Miller
2011-10-27 21:44:17 UTC
Hi Linus et al,

If further support is required:

On 28/10/2011, at 5:52 AM, Dan Magenheimer wrote:

> Linux kernel distros incorporating frontswap:
> - Oracle UEK 2.6.39 Beta:

I have been testing this kernel for a while now as well, and it is performing well. I have tested Xen HVM, HVPVM and PVM guests, all with tmem enabled. Automated testing is scheduled to go into our test farm (which runs ~80,000 hours of QA testing of Oracle products on Oracle Linux per day) soon.

> - OracleVM since 2.2 (2009)

Likewise. We are planning to incorporate Transcendent Memory support into future Oracle VM 3.0 releases as supported functionality, i.e. it will be enabled on a per-server/per-guest basis so that guests are capable of reducing their memory footprint. We see this as a critical feature to compete with other hypervisors' memory sharing/de-duplication functionality.

Thanks,
Avi

---
Oracle <http://www.oracle.com>
Avi Miller | Principal Program Manager | +61 (412) 229 687
Oracle Linux and Virtualization
417 St Kilda Road, Melbourne, Victoria 3004 Australia

Dan Magenheimer
2011-10-27 21:49:31 UTC
> From: Christoph Hellwig [mailto:***@infradead.org]
> Sent: Thursday, October 27, 2011 3:12 PM
> To: David Rientjes
> Cc: Dan Magenheimer; Linus Torvalds; linux-***@kvack.org; LKML; Andrew Morton; Konrad Wilk; Jeremy
> Fitzhardinge; Seth Jennings; ***@vflare.org; ***@gmail.com; Chris Mason;
> ***@novell.com; Dave Hansen; Jonathan Corbet; Neo Jia
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
>
> On Thu, Oct 27, 2011 at 01:18:40PM -0700, David Rientjes wrote:
> > Isn't this something that should go through the -mm tree?
>
> It should have. It should also have ACKs from the core VM developers,
> and at least the few I talked to about it really didn't seem to like it.

Yes, it would have been nice to have it go through the -mm tree.
But, *sigh*, I guess it will be up to Linus again to decide if
"didn't seem to like it" is sufficient to block functionality
that has found use by a number of in-kernel users and by
real shipping products... and continues to grow in usefulness.

If Linux truly subscribes to the "code rules" mantra, no core
VM developer has proposed anything -- even a design, let alone
working code -- that comes close to providing the functionality
and flexibility that frontswap (and cleancache) provides, and
frontswap provides it with a very VERY small impact on existing
kernel code AND has been posted and working for 2+ years.
(And during that 2+ years, excellent feedback has improved the
"kernel-ness" of the code, but NONE of the core frontswap
design/hooks have changed... because frontswap _just works_!)

Perhaps other frontswap users would be so kind as to reply
on this thread with their opinions...

Dan

Christoph Hellwig
2011-10-27 21:52:43 UTC
On Thu, Oct 27, 2011 at 02:49:31PM -0700, Dan Magenheimer wrote:
> If Linux truly subscribes to the "code rules" mantra, no core
> VM developer has proposed anything -- even a design, let alone
> working code -- that comes close to providing the functionality
> and flexibility that frontswap (and cleancache) provides, and
> frontswap provides it with a very VERY small impact on existing
> kernel code AND has been posted and working for 2+ years.
> (And during that 2+ years, excellent feedback has improved the
> "kernel-ness" of the code, but NONE of the core frontswap
> design/hooks have changed... because frontswap _just works_!)

It might work for whatever definition of work, but you certainly couldn't
convince anyone that matters that it's actually sexy and we'd actually
need it. Only actually working on Xen of course doesn't help.

In the end it's a bunch of really ugly hooks over core code, without
a clear definition of how they work or a killer use case.

Sasha Levin
2011-10-28 07:12:36 UTC
On Thu, 2011-10-27 at 17:52 -0400, Christoph Hellwig wrote:
> On Thu, Oct 27, 2011 at 02:49:31PM -0700, Dan Magenheimer wrote:
> > If Linux truly subscribes to the "code rules" mantra, no core
> > VM developer has proposed anything -- even a design, let alone
> > working code -- that comes close to providing the functionality
> > and flexibility that frontswap (and cleancache) provides, and
> > frontswap provides it with a very VERY small impact on existing
> > kernel code AND has been posted and working for 2+ years.
> > (And during that 2+ years, excellent feedback has improved the
> > "kernel-ness" of the code, but NONE of the core frontswap
> > design/hooks have changed... because frontswap _just works_!)
>
> It might work for whatever defintion of work, but you certainly couldn't
> convince anyone that matters that it's actually sexy and we'd actually
> need it. Only actually working on Xen of course doesn't help.

There's a working POC of it on KVM, mostly based on reusing in-kernel Xen
code.

I felt it would be difficult to try and merge any tmem KVM patches until
both frontswap and cleancache are in the kernel; that's why the
development is currently paused at the POC level.

--

Sasha.
Cyclonus J
2011-10-28 07:30:10 UTC
On Fri, Oct 28, 2011 at 12:12 AM, Sasha Levin <***@gmail.com> wrote:
> On Thu, 2011-10-27 at 17:52 -0400, Christoph Hellwig wrote:
>> On Thu, Oct 27, 2011 at 02:49:31PM -0700, Dan Magenheimer wrote:
>> > If Linux truly subscribes to the "code rules" mantra, no core
>> > VM developer has proposed anything -- even a design, let alone
>> > working code -- that comes close to providing the functionality
>> > and flexibility that frontswap (and cleancache) provides, and
>> > frontswap provides it with a very VERY small impact on existing
>> > kernel code AND has been posted and working for 2+ years.
>> > (And during that 2+ years, excellent feedback has improved the
>> > "kernel-ness" of the code, but NONE of the core frontswap
>> > design/hooks have changed... because frontswap _just works_!)
>>
>> It might work for whatever defintion of work, but you certainly couldn't
>> convince anyone that matters that it's actually sexy and we'd actually
>> need it.  Only actually working on Xen of course doesn't help.
>
> Theres a working POC of it on KVM, mostly based on reusing in-kernel Xen
> code.
>
> I felt it would be difficult to try and merge any tmem KVM patches until
> both frontswap and cleancache are in the kernel, thats why the
> development is currently paused at the POC level.

Same here. I am working on KVM support for Transcendent Memory as well.
It would be nice to see this in the mainline.

Thanks,
CJ

>
> --
>
> Sasha.
>
>

Pekka Enberg
2011-10-28 14:26:24 UTC
On Fri, Oct 28, 2011 at 10:30 AM, Cyclonus J <***@gmail.com> wrote:
>> I felt it would be difficult to try and merge any tmem KVM patches until
>> both frontswap and cleancache are in the kernel, thats why the
>> development is currently paused at the POC level.
>
> Same here. I am working a KVM support for Transcedent Memory as well.
> It would be nice to see this in the mainline.

We don't really merge code for future projects - especially when it
touches the core kernel.

As for the frontswap patches, there are pretty much no ACKs from MM people
apart from one Reviewed-by from Andrew. I really don't see why the
pull request is sent directly to Linus...

Pekka

Dan Magenheimer
2011-10-28 16:37:25 UTC
> You are changing core kernel code without ACKs from relevant
> maintainers. That's very unfortunate. Existing users certainly matter
> but that doesn't mean you get to merge code without maintainers even
> looking at it.
>
> So really, why don't you just use scripts/get_maintainer.pl and simply
> ask the relevant people for their ACK?

Actually I had done that before posting the patches and,
doing it now again, I *do* have many of the relevant people
on the ack list, and nearly all on the cc list of the
patch postings. (I apologize; I see I missed you
on my list.)

I think every relevant maintainer has had the chance to
review and acknowledge but some have, for whatever reason,
chosen not to.

> Looking at your patches, there's no trace that anyone outside your own
> development team even looked at the patches.

Hmmm... I have reviews/acks from IBM, Fujitsu, and Citrix (and
a long list of documented Cc's) in the git comments, so I'm
not sure what you are seeing.

Ah, perhaps you are referring to the naming changes in the
cleancache hooks? Akpm required me to rename various frontswap
hooks to use "invalidate" in the function name instead of
"flush". I took the opportunity to rename the cleancache
hooks for consistency in this same patchset, and this occurred
only in the most recent version of the patchset. It is true
that I didn't ask for Ack's from those maintainers, though
these changes would probably have gone through the trivial
patch monkey later anyway.

> Why do you feel that it's OK to ask Linus to pull them?

Frontswap is essentially the second half of the cleancache
patchset (or, more accurately, both are halves of the
transcendent memory patchset). They are similar in that
the hooks in core MM code are fairly trivial and the
real value/functionality lies outside of the core kernel;
as a result core MM maintainers don't have much interest
I guess.

Linus personally merged cleancache for 3.0 (quoting from his
offlist email: "I've looked through it, and it seems simple
enough, with a pretty minimal support burden"); I was assuming
a similar path for frontswap.

I repeat that I'm not trying to subvert any process. There
just doesn't seem to be much of a process in place for this kind
of a patchset, and I'm not letting silence or indifference
or "don't like it much" get in the way.

Thanks,
Dan
Pekka Enberg
2011-10-28 16:59:58 UTC
On Fri, Oct 28, 2011 at 7:37 PM, Dan Magenheimer
<***@oracle.com> wrote:
>> Why do you feel that it's OK to ask Linus to pull them?
>
> Frontswap is essentially the second half of the cleancache
> patchset (or, more accurately, both are halves of the
> transcendent memory patchset). They are similar in that
> the hooks in core MM code are fairly trivial and the
> real value/functionality lies outside of the core kernel;
> as a result core MM maintainers don't have much interest
> I guess.

I would not call this commit trivial:

http://oss.oracle.com/git/djm/tmem.git/?p=djm/tmem.git;a=commitdiff;h=6ce5607c1edf80f168d1e1f22dc7a85290cf094a

You are exporting bunch of mm/swapfile.c variables (including locks)
and adding hooks to mm/page_io.c and mm/swapfile.c. Furthermore, code
like this:

> +		if (frontswap) {
> +			if (frontswap_test(si, i))
> +				break;
> +			else
> +				continue;
> +		}

does not really help your case.

Pekka
Dan Magenheimer
2011-10-28 15:21:31 UTC
> From: Pekka Enberg [mailto:***@kernel.org]
>
> On Fri, Oct 28, 2011 at 10:30 AM, Cyclonus J <***@gmail.com> wrote:
> >> I felt it would be difficult to try and merge any tmem KVM patches until
> >> both frontswap and cleancache are in the kernel, thats why the
> >> development is currently paused at the POC level.
> >
> > Same here. I am working a KVM support for Transcedent Memory as well.
> > It would be nice to see this in the mainline.
>
> We don't really merge code for future projects - especially when it
> touches the core kernel.

Hi Pekka --

If you grep the 3.1 source for CONFIG_FRONTSWAP, you will find
two users already in-kernel waiting for frontswap to be merged.
I think Sasha and Neo (and Brian and Nitin and ...) are simply
indicating that there can be more, but there is a chicken-and-egg
problem that can best be resolved by merging the (really very small
and barely invasive) frontswap patchset.

> As for the frontswap patches, there's pretty no ACKs from MM people
> apart from one Reviewed-by from Andrew. I really don't see why the
> pull request is sent directly to Linus...

Has there not been ample opportunity (in 2-1/2 years) for other
MM people to contribute? I'm certainly not trying to subvert any
useful technical discussion and if there is some documented MM process
I am failing to follow, please point me to it. But there are
real users and real distros and real products waiting, so if there
are any real issues, let's get them resolved.

Thanks,
Dan

P.S. before commenting further, I suggest that you read the
background material at http://lwn.net/Articles/454795/
(with an open mind :-).

Pekka Enberg
2011-10-28 15:36:03 UTC
Hi Dan,

On Fri, Oct 28, 2011 at 6:21 PM, Dan Magenheimer
<***@oracle.com> wrote:
> If you grep the 3.1 source for CONFIG_FRONTSWAP, you will find
> two users already in-kernel waiting for frontswap to be merged.
> I think Sasha and Neo (and Brian and Nitin and ...) are simply
> indicating that there can be more, but there is a chicken-and-egg
> problem that can best be resolved by merging the (really very small
> and barely invasive) frontswap patchset.

Yup, I was referring to the two external projects. I also happen to
think that only Xen matters because zcache is in staging. So that's
one user in the tree.

On Fri, Oct 28, 2011 at 6:21 PM, Dan Magenheimer
<***@oracle.com> wrote:
>> As for the frontswap patches, there's pretty no ACKs from MM people
>> apart from one Reviewed-by from Andrew. I really don't see why the
>> pull request is sent directly to Linus...
>
> Has there not been ample opportunity (in 2-1/2 years) for other
> MM people to contribute? I'm certainly not trying to subvert any
> useful technical discussion and if there is some documented MM process
> I am failing to follow, please point me to it. But there are
> real users and real distros and real products waiting, so if there
> are any real issues, let's get them resolved.

You are changing core kernel code without ACKs from relevant
maintainers. That's very unfortunate. Existing users certainly matter
but that doesn't mean you get to merge code without maintainers even
looking at it.

Looking at your patches, there's no trace that anyone outside your own
development team even looked at the patches. Why do you feel that it's
OK to ask Linus to pull them?

> P.S. before commenting further, I suggest that you read the
> background material at http://lwn.net/Articles/454795/
> (with an open mind :-).

I'm not for or against frontswap. I assume we need something like that
since Xen and KVM folks are interested. That doesn't mean you get a
free pass to add more complexity to the VM.

So really, why don't you just use scripts/get_maintainer.pl and simply
ask the relevant people for their ACK?

Pekka
Johannes Weiner
2011-10-28 16:30:53 UTC
On Fri, Oct 28, 2011 at 06:36:03PM +0300, Pekka Enberg wrote:
> On Fri, Oct 28, 2011 at 6:21 PM, Dan Magenheimer
> <***@oracle.com> wrote:
> >> As for the frontswap patches, there's pretty no ACKs from MM people
> >> apart from one Reviewed-by from Andrew. I really don't see why the
> >> pull request is sent directly to Linus...
> >
> > Has there not been ample opportunity (in 2-1/2 years) for other
> > MM people to contribute?  I'm certainly not trying to subvert any
> > useful technical discussion and if there is some documented MM process
> > I am failing to follow, please point me to it.  But there are
> > real users and real distros and real products waiting, so if there
> > are any real issues, let's get them resolved.
>
> You are changing core kernel code without ACKs from relevant
> maintainers. That's very unfortunate. Existing users certainly matter
> but that doesn't mean you get to merge code without maintainers even
> looking at it.
>
> Looking at your patches, there's no trace that anyone outside your own
> development team even looked at the patches. Why do you feel that it's
> OK to ask Linus to pull them?

People did look at it.

In my case, the handwavy benefits did not convince me. The handwavy
'this is useful' from just more people of the same company does not
help, either.

I want to see a usecase that tangibly gains from this, not just more
marketing material. Then we can talk about boring infrastructure and
adding hooks to the VM.

Convincing the development community of the problem you are trying to
solve is the undocumented part of the process you fail to follow.

Pekka Enberg
2011-10-28 17:01:43 UTC
On Fri, Oct 28, 2011 at 7:30 PM, Johannes Weiner <***@redhat.com> wrote:
> People did look at it.
>
> In my case, the handwavy benefits did not convince me.  The handwavy
> 'this is useful' from just more people of the same company does not
> help, either.
>
> I want to see a usecase that tangibly gains from this, not just more
> marketing material.  Then we can talk about boring infrastructure and
> adding hooks to the VM.
>
> Convincing the development community of the problem you are trying to
> solve is the undocumented part of the process you fail to follow.

Indeed. I also don't understand why this is useful, nor am I convinced
enough to actually try to figure out how to do the swapfile hooks
cleanly.

Pekka

Dan Magenheimer
2011-10-28 20:19:01 UTC
> From: John Stoffel [mailto:***@stoffel.org]
> Subject: RE: [GIT PULL] mm: frontswap (for 3.2 window)
>
> >>>>> "Dan" == Dan Magenheimer <***@oracle.com> writes:
>
> Dan> Second, have you read http://lwn.net/Articles/454795/ ?
> Dan> If not, please do. If yes, please explain what you don't
> Dan> see as convincing or tangible or documented. All of this
> Dan> exists today as working publicly available code... it's
> Dan> not marketing material.
>
> I was vaguely interested, so I went and read the LWN article, and it
> didn't really provide any useful information on *why* this is such a
> good idea.

Hi John --

Thanks for taking the time to read the LWN article and sending
some feedback. I admit that, after being immersed in the
topic for three years, it's difficult to see it from the
perspective of a new reader, so I apologize if I may have
left out important stuff. I hope you'll take the time
to read this long reply.

"WHY" this is such a good idea is the same as WHY it is
useful to add RAM to your systems. Tmem expands the amount
of useful "space" available to a memory-constrained kernel
either via compression (transparent to the rest of the kernel
except for the handful of hooks for cleancache and frontswap,
using zcache) or via memory that was otherwise not visible
to the kernel (hypervisor memory from Xen or KVM, or physical
RAM on another clustered system using RAMster). Since a
kernel always eats memory until it runs out (and then does
its best to balance that maximum fixed amount), this is actually
much harder than it sounds.

So I'm asking: Is that not clear from the LWN article? Or
do you not believe that more "space" is a good idea? Or
do you not believe that tmem mitigates that problem?

Clearly if you always cram enough RAM into your system
so that you never have a paging/swapping problem (i.e your
RAM is always greater than your "working set"), tmem's
NOT a good idea. So the built-in assumption is that
RAM is a constrained resource. Increasingly (especially
in virtual machines, but elsewhere as well), this is true.

> Particularly, I didn't see any before/after numbers which compared the
> kernel running various loads both with and without these
> transcendental memory patches applied. And of course I'd like to see
> numbers when they patches are applied, but there's no TM
> (Transcendental Memory) in actual use, so as to quantify the overhead.

Actually there is. But the only serious performance analysis
has been on Xen, and I get reamed every time I use that word,
so I'm a bit gun-shy. If you are seriously interested and
willing to ignore that X-word, see the last few slides of:

http://oss.oracle.com/projects/tmem/dist/documentation/presentations/TranscendentMemoryXenSummit2010.pdf

There's some argument about whether the value will be as
high for KVM, but that obviously can't be measured until
there is a complete KVM implementation, which requires
frontswap.

It would be nice to also have some numbers for zcache, I agree.

> Your article would also be helped with a couple of diagrams showing
> how this really helps. Esp in the cases where the system just
> endlessly says "no" to all TM requests and the kernel or apps need to
> them fall back to the regular paths.

The "no" cases occur whenever there is NO additional memory,
so obviously it doesn't help for those cases; the appropriate
question for those cases is "how much does it hurt" and the
answer is (usually) effectively zero. Again if you know
you've always got enough RAM to exceed your working set,
don't enable tmem/frontswap/cleancache.

For the "does really help" cases, I apologize, but I just
can't think how to diagrammatically show clearly that having
more RAM is a good thing.

> In my case, $WORK is using linux with large memory to run EDA
> simulations, so if we swap, performance tanks and we're out of luck.
> So for my needs, I don't see how this helps.

Do you know what percent of your total system cost is spent
on RAM, including variable expense such as power/cooling?
Is reducing that cost relevant to your $WORK? Or have
you ever run into a "buy more RAM" situation where you couldn't
expand because your machine RAM slots were maxed out?

> For my home system, I run an 8Gb RAM box with a couple of KVM VMs, NFS
> file service to two or three clients (not counting the VMs which mount
> home dirs from there as well) as well as some light WWW developement
> and service. How would TM benefit me? I don't use Xen, don't want to
> play with it honestly because I'm busy enough as it is, and I just
> don't see the hard benefits.

(I use "tmem" since TM means "trademark" to many people.)

Does 8GB always cover the sum of the working sets of all your
KVM VMs? If so, tmem won't help. If a VM in your workload
sometimes spikes, tmem allows that spike to be statistically
"load balanced" across RAM claimed by other VMs which may be
idle or have a temporarily lower working set. This means less
paging/swapping and better sum-over-all-VMs performance.

> So the onus falls on *you* and the other TM developers to sell this
> code and it's benefits (and to acknowledge it's costs) to the rest of
> the Kernel developers, esp those who hack on the VM. If you can't
> come up with hard numbers and good examples with good numbers, then

Clearly there's a bit of a chicken-and-egg problem. Frontswap
(and cleancache) are the foundation, and it's hard to build
anything solid without a foundation.

For those who "hack on the VM", I can't imagine why the handful
of lines in the swap subsystem, which is probably the most stable
and barely touched subsystem in Linux or any OS on the planet,
is going to be a burden or much of a cost.

> you're out of luck.

Another way of looking at it is that the open source
community is out of luck. Tmem IS going into real shipping
distros, but it (and Xen support and zcache and KVM support and
cool things like RAMster) probably won't be in the distro "you"
care about because this handful of nearly innocuous frontswap hooks
didn't get merged. I'm trying to be a good kernel citizen
but I can't make people listen who don't want to.

Frontswap is the last missing piece. Why so much resistance?

Thanks,
Dan

John Stoffel
2011-10-28 20:52:28 UTC
>>>>> "Dan" == Dan Magenheimer <***@oracle.com> writes:

>> From: John Stoffel [mailto:***@stoffel.org]
>> Subject: RE: [GIT PULL] mm: frontswap (for 3.2 window)
>>
>> >>>>> "Dan" == Dan Magenheimer <***@oracle.com> writes:
>>
Dan> Second, have you read http://lwn.net/Articles/454795/ ?
Dan> If not, please do. If yes, please explain what you don't
Dan> see as convincing or tangible or documented. All of this
Dan> exists today as working publicly available code... it's
Dan> not marketing material.
>>
>> I was vaguely interested, so I went and read the LWN article, and it
>> didn't really provide any useful information on *why* this is such a
>> good idea.

Dan> Thanks for taking the time to read the LWN article and sending
Dan> some feedback. I admit that, after being immersed in the topic
Dan> for three years, it's difficult to see it from the perspective of
Dan> a new reader, so I apologize if I may have left out important
Dan> stuff. I hope you'll take the time to read this long reply.

Will do. But I'm not the person you need to convince here about the
usefulness of this code and approach, it's the core VM developers,
since they're the ones who will have to understand this stuff and know
how to maintain it. And keeping this maintainable is a key goal.

Dan> "WHY" this is such a good idea is the same as WHY it is useful to
Dan> add RAM to your systems.

So why would I use this instead of increasing the physical RAM? Yes,
it's an easier thing to do by just installing a new kernel and flipping
on the switch, but give me numbers showing an improvement.

Dan> Tmem expands the amount of useful "space" available to a
Dan> memory-constrained kernel either via compression (transparent to
Dan> the rest of the kernel except for the handful of hooks for
Dan> cleancache and frontswap, using zcache)

Ok, so why not just a targeted swap compression function instead?
Why is your method superior?

Dan> or via memory that was otherwise not visible to the kernel
Dan> (hypervisor memory from Xen or KVM, or physical RAM on another
Dan> clustered system using RAMster).

This needs more explaining, because I'm not sure I get your
assumptions here. For example, from reading your LWN article, I see
that one idea of RAMster is to use another systems memory if you run
low. Ideally when hooked up via something like Myrinet or some other
highspeed/low latency connection. And you do say it works over plane
ethernet. Great, show me the numbers! Show me the speedup of the
application(s) you've been testing.

Dan> Since a kernel always eats memory until it runs out (and then
Dan> does its best to balance that maximum fixed amount), this is
Dan> actually much harder than it sounds.

Yes, it is. I've been running into this issue myself on RHEL5.5 VNC
servers which are loaded down with lots of user sessions. If someone
kicks in a cp of a large multi-gig file on an NFS mount point, the box
slams to a halt. This is the kind of thing I think you need to
address and make sure you don't slow down.

Dan> So I'm asking: Is that not clear from the LWN article? Or
Dan> do you not believe that more "space" is a good idea? Or
Dan> do you not believe that tmem mitigates that problem?

The article doesn't give me a good diagram showing the memory layouts
and how you optimize/compress/share memory. And it also doesn't
compare performance to just increasing physical memory instead of your
approach.

Dan> Clearly if you always cram enough RAM into your system so that
Dan> you never have a paging/swapping problem (i.e your RAM is always
Dan> greater than your "working set"), tmem's NOT a good idea.

This is a statement that you should be making right up front. And
explaining why this is still a good idea to implement. I can see that
if I've got a large system which cannot physically use any more
memory, then it might be worth my while to use TMEM to get more
performance out of this expensive hardware. But if I've got the room,
why is your method better than just adding RAM?

Dan> So the built-in assumption is that RAM is a constrained resource.
Dan> Increasingly (especially in virtual machines, but elsewhere as
Dan> well), this is true.

Here's another place where you didn't explain yourself well, and where
a diagram would help. If you have a VM server with 16GB of RAM, does
TMEM allow you to run more guests (each of which takes, say, 2GB of RAM)
versus before? And what's the performance gain/loss/tradeoff?

>> Particularly, I didn't see any before/after numbers which compared the
>> kernel running various loads both with and without these
>> transcendental memory patches applied. And of course I'd like to see
>> numbers when they patches are applied, but there's no TM
>> (Transcendental Memory) in actual use, so as to quantify the overhead.

Dan> Actually there is. But the only serious performance analysis has
Dan> been on Xen, and I get reamed every time I use that word, so I'm
Dan> a bit gun-shy. If you are seriously interested and willing to
Dan> ignore that X-word, see the last few slides of:

I'm not that interested in Xen myself for various reasons, mostly
because it's not something I use at $WORK, and it's not something I've
spent any time playing with at $HOME in my free time.

Dan> http://oss.oracle.com/projects/tmem/dist/documentation/presentations/TranscendentMemoryXenSummit2010.pdf

Dan> There's some argument about whether the value will be as
Dan> high for KVM, but that obviously can't be measured until
Dan> there is a complete KVM implementation, which requires
Dan> frontswap.

Dan> It would be nice to also have some numbers for zcache, I agree.

It's not nice, it's REQUIRED. If you can't show numbers which give an
improvement, then why would it be accepted?

>> Your article would also be helped with a couple of diagrams showing
>> how this really helps. Esp in the cases where the system just
>> endlessly says "no" to all TM requests and the kernel or apps need to
>> them fall back to the regular paths.

Dan> The "no" cases occur whenever there is NO additional memory,
Dan> so obviously it doesn't help for those cases; the appropriate
Dan> question for those cases is "how much does it hurt" and the
Dan> answer is (usually) effectively zero. Again if you know
Dan> you've always got enough RAM to exceed your working set,
Dan> don't enable tmem/frontswap/cleancache.

Dan> For the "does really help" cases, I apologize, but I just can't
Dan> think how to diagrammatically show clearly that having more RAM
Dan> is a good thing.

>> In my case, $WORK is using linux with large memory to run EDA
>> simulations, so if we swap, performance tanks and we're out of luck.
>> So for my needs, I don't see how this helps.

Dan> Do you know what percent of your total system cost is spent on
Dan> RAM, including variable expense such as power/cooling?

Nope, can't quantify it unfortunately.

Dan> Is reducing that cost relevant to your $WORK? Or have you ever
Dan> ran into a "buy more RAM" situation where you couldn't expand
Dan> because your machine RAM slots were maxed out?

Generally, my engineers can and will take all the RAM they can, since
EDA simulations almost always work better with more RAM, esp as the
designs grow in size. But it's also not a hard and fast rule. If a
144GB box with dual CPUs and 4 cores each costs me $20k or so, then
the power/cooling costs aren't as big a concern, because my engineers'
*time* is where the real cost comes from. And my customers' turnaround
time to get a design done is another big $$$ center. The
hardware is cheap. Have you priced EDA licenses from Cadence,
Synopsys, or other vendors?

But that's beside the point. How much overhead does TMEM incur when
it's not being used, but when it's available?

>> For my home system, I run an 8Gb RAM box with a couple of KVM VMs, NFS
>> file service to two or three clients (not counting the VMs which mount
>> home dirs from there as well) as well as some light WWW developement
>> and service. How would TM benefit me? I don't use Xen, don't want to
>> play with it honestly because I'm busy enough as it is, and I just
>> don't see the hard benefits.

Dan> (I use "tmem" since TM means "trademark" to many people.)

Yeah, I like your phrase better too, I just got tired of typing the
full thing.

Dan> Does 8GB always cover the sum of the working sets of all your KVM
Dan> VMs? If so, tmem won't help. If a VM in your workload sometimes
Dan> spikes, tmem allows that spike to be statistically "load
Dan> balanced" across RAM claimed by other VMs which may be idle or
Dan> have a temporarily lower working set. This means less
Dan> paging/swapping and better sum-over-all-VMs performance.

So this is a good thing to show and get hard numbers on.

>> So the onus falls on *you* and the other TM developers to sell this
>> code and it's benefits (and to acknowledge it's costs) to the rest of
>> the Kernel developers, esp those who hack on the VM. If you can't
>> come up with hard numbers and good examples with good numbers, then

Dan> Clearly there's a bit of a chicken-and-egg problem. Frontswap
Dan> (and cleancache) are the foundation, and it's hard to build
Dan> anything solid without a foundation.

No one is stopping you from building your own house using the Linux
foundation, showing that it's a great house and then allowing you to
come and re-work the foundations and walls, etc to build the better
house.

Dan> For those who "hack on the VM", I can't imagine why the handful
Dan> of lines in the swap subsystem, which is probably the most stable
Dan> and barely touched subsystem in Linux or any OS on the planet,
Dan> is going to be a burden or much of a cost.

It's the performance and cleanliness aspects that people worry about.

>> you're out of luck.

Dan> Another way of looking at it is that the open source community is
Dan> out of luck. Tmem IS going into real shipping distros, but it
Dan> (and Xen support and zcache and KVM support and cool things like
Dan> RAMster) probably won't be in the distro "you" care about because
Dan> this handful of nearly innocuous frontswap hooks didn't get
Dan> merged. I'm trying to be a good kernel citizen but I can't make
Dan> people listen who don't want to.

No real skin off my nose, because I haven't seen a compelling reason
to use TMEM. And if I do run a large Oracle system, with lots of DBs
and table spaces, I don't see how TMEM helps me either, because the
hardware is such a small part of the cost of a large Oracle
deployment. Adding RAM is cheap. TMEM... well it could be useful in
an emergency, but unless it's stressed and used a lot, it could end up
causing more problems than it solves.


Dan> Frontswap is the last missing piece. Why so much resistance?

Because you haven't sold it well with numbers to show how much
overhead it has?

I'm being negative because I see no reason to use it. And because I
think you can do a better job of selling it and showing the benefits
with real numbers.

Load up a Xen box, have a VM spike its memory usage and show how TMEM
helps. Compare it to a non-TMEM setup with the same load.

John

Dan Magenheimer
2011-10-30 21:50:01 UTC
> From: Dave Hansen [mailto:***@linux.vnet.ibm.com]
> Subject: RE: [GIT PULL] mm: frontswap (for 3.2 window)

Thanks Dave (I think ;-) for chiming in.

> On Sun, 2011-10-30 at 12:18 -0700, Dan Magenheimer wrote:
> > > since they're the ones who will have to understand this stuff and know
> > > how to maintain it. And keeping this maintainable is a key goal.
> >
> > Absolutely agree. Count the number of frontswap lines that affect
> > the current VM core code and note also how they are very clearly
> > identified. It really is a very VERY small impact to the core VM
> > code (e.g. in the files swapfile.c and page_io.c).
>
> Granted, the impact on the core VM in lines of code is small. But, I
> think the behavioral impact is potentially huge since tmem's hooks add
> non-trivial amounts of framework underneath the VM in core paths. In
> zcache's case, this means a bunch of allocations and an entirely new
> memory allocator being used in the swap paths.

True BUT (and this is a big BUT) it ONLY affects the core VM
path if both CONFIG_FRONTSWAP=y AND if a "tmem backend" such as
zcache registers it. So not only is the code maintenance
impact very VERY small (which you granted), but there is
no impact on users or distros or products that don't turn it
on. I also should repeat that the core VM changes introduced
by frontswap have remained essentially identical since first
proposed circa 2.6.18... the impacted swap code is NOT frequently-
changing code. My point in my "Absolutely agree" above, is
that the maintenance burden to core VM developers is low.

> We're certainly still shaking bugs out of the interactions there like
> with zcache_direct_reclaim_lock. Granted, that's not a
> tmem/frontswap/cleancache bug, but it does speak to the difficulty and
> subtlety of writing one of those frameworks underneath the tmem API.

IMHO, that's coming perilously close to saying "we don't accept
code that has bugs in it". How many significant pieces of functionality
have been added to the kernel EVER where there were NO bugs found in
the next few months? How much MERGED functionality (such as new
filesystems) has gone into the kernel years before it was broadly deployed?

Zcache is currently a staging driver for a reason... I admit it...
I wrote zcache in a couple of months (and mostly over the holidays)
and it was really the first major Linux kernel driver I'd done.
I was surprised as hell when GregKH took it into staging. But
it works pretty darn well. Why? Because it is built on the
foundation of cleancache and frontswap, which _just work_!!
And Seth Jennings (also of IBM for those that don't know) has been
doing a great job of finding and fixing bottlenecks, as well as
looking at some interesting enhancements. I think he found ONE bug
so far... because I hadn't tested on 32-bit highmem machines.
Clearly, Seth and IBM see some value in zcache (perhaps, as Ed
Tomlinson pointed out, because AIX has similar capability?)

But let's not forget that there would be no zcache for Seth or
IBM to work on if you hadn't already taken the frontswap patchset
into your tree. Frontswap is an ENABLER for zcache, as well as
for Xen tmem, for RAMster and (soon according to two kernel developers)
possibly also for KVM. Given the tiny maintenance cost, why
not merge it?

So if you are saying that frontswap is not quite ready to be
merged, fine, I can accept that. But there are now a number
of features, developers, distros, and products depending on it,
so there's a few of us who would like to hear CONCRETE STEPS
we need to achieve to make it ready. (John Stoffel is the only
one to suggest any... not counting documentation he didn't
read, the big one is getting some measurements to show zcache
is valuable. Hoping Seth can help with that?)

Got any suggestions?

Thanks,
Dan

V***@vt.edu
2011-11-06 22:32:54 UTC
On Fri, 28 Oct 2011 16:52:28 EDT, John Stoffel said:
> Dan> "WHY" this is such a good idea is the same as WHY it is useful to
> Dan> add RAM to your systems.
>
> So why would I use this instead of increasing the physical RAM?

You're welcome to buy me a new laptop that has a third DIMM slot. :)

There's a lot of people running hardware that already has the max amount of
supported RAM, and who for budget or legacy-support reasons can't easily do a
forklift upgrade to a new machine.

> if I've got a large system which cannot physically use any more
> memory, then it might be worth my while to use TMEM to get more
> performance out of this expensive hardware.

It's not always a large system....
Ed Tomlinson
2011-11-08 12:15:04 UTC
On Sunday 06 November 2011 17:32:54 ***@vt.edu wrote:
> On Fri, 28 Oct 2011 16:52:28 EDT, John Stoffel said:
> > Dan> "WHY" this is such a good idea is the same as WHY it is useful to
> > Dan> add RAM to your systems.
> >
> > So why would I use this instead of increasing the physical RAM?
>
> You're welcome to buy me a new laptop that has a third DIMM slot. :)
>
> There's a lot of people running hardware that already has the max amount of
> supported RAM, and who for budget or legacy-support reasons can't easily do a
> forklift upgrade to a new machine.

I've got three boxes with this problem here. Hence my support for frontswap/cleancache.

Ed

> > if I've got a large system which cannot physically use any more
> > memory, then it might be worth my while to use TMEM to get more
> > performance out of this expensive hardware.
>
> It's not always a large system....

James Bottomley
2011-10-31 08:12:47 UTC
On Fri, 2011-10-28 at 13:19 -0700, Dan Magenheimer wrote:
> For those who "hack on the VM", I can't imagine why the handful
> of lines in the swap subsystem, which is probably the most stable
> and barely touched subsystem in Linux or any OS on the planet,
> is going to be a burden or much of a cost.

Saying things like this doesn't encourage anyone to trust you. The
whole of the MM is a complex, highly interacting system. The recent
issues we've had with kswapd and the shrinker code give a nice
demonstration of this ... and that was caused by well tested code
updates. You can't hand wave away the need for benchmarks and
performance tests.

You have also answered all questions about inactive cost by saying "the
code has zero cost when it's compiled out" This also is a non starter.
For the few use cases it has, this code has to be compiled in. I
suspect even Oracle isn't going to ship separate frontswap and
non-frontswap kernels in its distro. So you have to quantify what the
performance impact is when this code is compiled in but not used.
Please do so.

James
Dan Magenheimer
2011-11-01 18:10:28 UTC
> From: James Bottomley [mailto:***@HansenPartnership.com]
> Subject: RE: [GIT PULL] mm: frontswap (for 3.2 window)
>
> On Mon, 2011-10-31 at 08:39 -0700, Dan Magenheimer wrote:
> > > From: James Bottomley [mailto:***@HansenPartnership.com]
> > > Subject: RE: [GIT PULL] mm: frontswap (for 3.2 window)
> >
> > > On Fri, 2011-10-28 at 13:19 -0700, Dan Magenheimer wrote:
> > > > For those who "hack on the VM", I can't imagine why the handful
> > > > of lines in the swap subsystem, which is probably the most stable
> > > > and barely touched subsystem in Linux or any OS on the planet,
> > > > is going to be a burden or much of a cost.
> > >
> > > Saying things like this doesn't encourage anyone to trust you. The
> > > whole of the MM is a complex, highly interacting system. The recent
> > > issues we've had with kswapd and the shrinker code gives a nice
> > > demonstration of this ... and that was caused by well tested code
> > > updates.
> >
> > I do understand that. My point was that the hooks are
> > placed _statically_ in largely stable code so it's not
> > going to constantly get in the way of VM developers
> > adding new features and fixing bugs, particularly
> > any developers that don't care about whether frontswap
> > works or not. I do think that is a very relevant
> > point about maintenance... do you disagree?
>
> Well, as I've said, all the mm code is highly interacting, so I don't
> really see it as "stable" in the way you suggest. What I'm saying is
> that you need to test a variety of workloads to demonstrate there aren't
> any nasty interactions.

I guess I don't understand how there can be any interactions
at all, let alone _nasty_ interactions when there is no
code to interact with?

For clarity and brevity, let's call the three cases:

Case A) CONFIG_FRONTSWAP=n
Case B) CONFIG_FRONTSWAP=y and no tmem backend registers
Case C) CONFIG_FRONTSWAP=y and a tmem backend DOES register

There are no interactions in Case A, agreed? I'm not sure
if it is clear, but in Case B every hook checks to
see if a tmem backend is registered... if not, the
hook is a no-op except for the addition of a
compare-pointer-against-NULL op, so there is no
interaction there either.
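
To make the Case B cost concrete, each hook boils down to something
like the sketch below (structure and function names here are
placeholders for illustration, not the exact code in the patchset):

    /* Illustrative sketch only -- placeholder names, not the actual patch. */
    struct frontswap_ops {
            int (*store)(struct page *page); /* backend copies the page away */
    };

    static struct frontswap_ops *frontswap_ops; /* NULL until a backend registers */

    int frontswap_store(struct page *page)
    {
            if (!frontswap_ops)     /* Case B: no backend, hook is a no-op */
                    return -1;      /* caller falls through to normal swap I/O */
            return frontswap_ops->store(page); /* Case C: backend takes the page */
    }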

So the only case where interactions are possible is
Case C, which currently only can occur if a user
specifies a kernel boot parameter of "tmem" or "zcache".
(I know, a bit ugly, but there's a reason for doing
it this way, at least for now.)

> > Runtime interactions can only occur if the code is
> > config'ed and, if config'ed, only if a tmem backend (e.g.
> > Xen or zcache) enables it also at runtime.
>
> So this, I don't accept without proof ... that's what we initially said
> about the last set of shrinker updates that caused kswapd to hang
> sandybridge systems ...

This makes me think that you didn't understand the
code underlying Case B above, true?

> > When
> > both are enabled, runtime interactions do occur
> > and absolutely must be fully tested. My point was
> > that any _users_ who don't care about whether frontswap
> > works or not don't need to have any concerns about
> > VM system runtime interactions. I think this is also
> > a very relevant point about maintenance... do you
> > disagree?
>
> I'm sorry, what point about maintenance?

The point is that only Case C has possible interactions
so Case A and Case B end-users and kernel developers need
not worry about the maintenance.

IOW, if Johannes merges some super major swap subsystem rewrite
and he doesn't have a clue if/how to move the frontswap
hooks, his patch doesn't affect any Case A or Case B users
and not even any Case C users that aren't using latest upstream.

That seems relevant to me when we are discussing
how much maintenance cost frontswap requires which,
I think, was where this subthread started several
emails ago :-)

> > > You can't hand wave away the need for benchmarks and
> > > performance tests.
> >
> > I'm not. Conclusive benchmarks are available for one user
> > (Xen) but not (yet) for other users. I've already acknowledged
> > the feedback desiring benchmarking for zcache, but zcache
> > is already merged (albeit in staging), and Xen tmem
> > is already merged in both Linux and the Xen hypervisor,
> > and cleancache (the alter ego of frontswap) is already
> > merged.
>
> The test results for Xen I've seen are simply that "we're faster than
> swapping to disk, and we can be even better if you use self ballooning".
> There's no indication (at least in the Xen Summit presentation) what the
> actual workloads were.
>
> > So the question is not whether benchmarks are waived,
> > but whether one accepts (1) conclusive benchmarks for Xen;
> > PLUS (2) insufficiently benchmarked zcache; PLUS (3) at
> > least two other interesting-but-not-yet-benchmarkable users;
> > as sufficient for adding this small set of hooks into
> > swap code.
>
> That's the point: even for Xen, the benchmarks aren't "conclusive".
> There may be a workload for which transcendent memory works better, but
> make -j8 isn't enough of a variety of workloads.

OK, you got me, I guess "conclusive" is too strong a word.
It would be more accurate to say that the theoretical basis
for improvement, which some people were very skeptical about,
turns out, when measured, to be even better than expected.

I agree that one workload isn't enough... I can assure you that
there have been others. But I really don't think you are asking
for more _positive_ data, you are asking if there is _negative_
data. As you point out, "we are faster than swapping" is not
a hard bar to clear. IOW, comparing any workload that swaps a lot
against the same workload swapping a lot less, doesn't really
prove anything. OR DOES IT? Considering that reducing swapping
is the WHOLE POINT of frontswap, I would argue that it does.

Can we agree that if frontswap is doing its job properly on
any "normal" workload that is swapping, it is improving on a
bad situation?

Then let's get back to your implied question about _negative_
data. As described above there is NO impact for Case A
and Case B. (The zealot will point out that a pointer-compare
against-NULL per page-swapped-in/out is not "NO" impact,
but let's ignore him for now.) In Case C, there are
demonstrated benefits for SOME workloads... will frontswap
HARM some workloads?

I have openly admitted that for _cleancache_ on _zcache_,
sometimes the cost can exceed the benefits, and this was
actually demonstrated by one user on lkml. For _frontswap_
it's really hard to imagine even a very contrived workload
where frontswap fails to provide an advantage. I suppose
maybe if your swap disk lives on a PCI SSD and your CPU
is an ancient single-core that does extremely slow copying
and compression?

IOW, I feel like you are giving me busywork, and any additional
evidence I present you will wave away anyway.

> > I understand that some kernel developers (mostly from one
> > company) continue to completely discount Xen, and
> > thus won't even look at the Xen results. IMHO
> > that is mudslinging.
>
> OK, so let's look at this another way: one of the signs of a good ABI is
> generic applicability. Any good virtualisation ABI should thus work for
> all virtualisation systems (including VMware should they choose to take
> advantage of it). The fact that transcendent memory only seems to work
> well for Xen is a red flag in this regard.

I think the tmem ABI will work fine with any virtualization system,
and particularly frontswap will. There are some theoretical arguments
that KVM will get little or no benefit, but those arguments
pertain primarily to cleancache. And I've noted that the ABI
was designed to be very extensible, so if KVM wants a batching
interface, they can add one. To repeat from the LWN KS2011 report:

"[Linus] stated that, simply, code that actually is used is
code that is actually worth something... code aimed at
solving the same problem is just a vague idea that is
worthless by comparison... Even if it truly is crap,
we've had crap in the kernel before. The code does not
get better out of tree."

AND the API/ABI clearly supports other non-virtualization uses
as well. The in-kernel hooks are very simple and the layering
is very clean. The ABI is extensible, has been published for
nearly three years, and successfully rev'ed once (to accommodate
192-bit exportfs handles for cleancache). Your arguments are on
very thin ice here.

It sounds like you are saying that unless/until KVM has a completed
measurable implementation... and maybe VMware and Hyper-V as well...
you don't think the tiny set of hooks that are frontswap should
be merged. If so, that "red flag" sounds self-serving, not what I
would expect from someone like you. Sorry.

> So what I don't like about this style of argument is the sleight of
> hand: I would expect the inactive but configured case to show mostly in
> the shrinker paths, which is where our major problems have been, so that
> would be cleancache, not frontswap, wouldn't it?

Yes, this is cleancache (already merged). As described
above, frontswap executes no code in Case A or Case B so
can't possibly interact with the shrinker path.

> > So the remaining question is the performance impact when
> > compile-time AND runtime enabled; this is in the published
> > Xen presentation I've referenced -- the impact is much much
> > less than the performance gain. IMHO benchmark results can
> > be easily manipulated so I prefer to discuss the theoretical
> > underpinnings which, in short, is that just about anything
> > a tmem backend does (hypercall, compression, deduplication,
> > even moving data across a fast network) is a helluva lot
> > faster than swapping a page to disk.
> >
> > Are there corner cases and probably even real workloads
> > where the cost exceeds the benefits? Probably... though
> > less likely for frontswap than for cleancache because ONLY
> > pages that would actually be swapped out/in use frontswap.
> >
> > But I have never suggested that every kernel should always
> > unconditionally compile-time-enable and run-time-enable
> > frontswap... simply that it should be in-tree so those
> > who wish to enable it are able to enable it.
>
> In practice, most useful ABIs end up being compiled in ... and useful
> basically means useful to any constituency, however small. If your ABI
> is useless, then fine, we don't have to worry about the configured but
> inactive case (but then again, we wouldn't have to worry about the ABI
> at all). If it has a use, then kernels will end up shipping with it
> configured in which is why the inactive performance impact is so
> important to quantify.

So do you now understand/agree that the inactive performance is zero
and the interaction of an inactive configuration with the remainder
of the MM subsystem is zero? And that you and your users will be
completely unaffected unless you/they intentionally turn it on,
not only compiled in, but explicitly at runtime as well?

So... understanding your preference for more workloads and your
preference that KVM should be demonstrated as a profitable user
first... is there anything else that you think should stand
in the way of merging frontswap so that existing and planned
kernel developers can build on top of it in-tree?

Thanks,
Dan

Dan Magenheimer
2011-10-28 17:20:27 UTC
Permalink
> From: Pekka Enberg [mailto:***@kernel.org]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
>
> On Fri, Oct 28, 2011 at 7:37 PM, Dan Magenheimer
> <***@oracle.com> wrote:
> >> Why do you feel that it's OK to ask Linus to pull them?
> >
> > Frontswap is essentially the second half of the cleancache
> > patchset (or, more accurately, both are halves of the
> > transcendent memory patchset).  They are similar in that
> > the hooks in core MM code are fairly trivial and the
> > real value/functionality lies outside of the core kernel;
> > as a result core MM maintainers don't have much interest
> > I guess.
>
> I would not call this commit trivial:
>
> http://oss.oracle.com/git/djm/tmem.git/?p=djm/tmem.git;a=commitdiff;h=6ce5607c1edf80f168d1e1f22dc7a85290cf094a
>
> You are exporting bunch of mm/swapfile.c variables (including locks)
> and adding hooks to mm/page_io.c and mm/swapfile.c.

Oh, good, some real patch discussion! :-)

You'll note that those exports previously were global and
were made static in the recent past. The rationale for
this is discussed in the FAQ in frontswap.txt which is
part of the patchset.

The swapfile.c changes are really the meat of the patch.
The page_io.c hooks ARE trivial, don't you think?

> Furthermore, code
> like this:
>
> > + if (frontswap) {
> > + if (frontswap_test(si, i))
> > + break;
> > + else
> > + continue;
> > + }
>
> does not really help your case.

I don't like that much either, but I didn't see a better way
to write it without duplicating a bunch of rather obtuse
code. Suggestions welcome.
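
For anyone else puzzling over that hunk, here it is again with
comments added to spell out the intent (comments only, no functional
change, and shown without its surrounding scan-loop context):

    /* Scanning on behalf of frontswap: only a slot whose page
     * frontswap already holds is of interest here. */
    if (frontswap) {
            if (frontswap_test(si, i))
                    break;          /* slot is in frontswap: done */
            else
                    continue;       /* not in frontswap: keep scanning */
    }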

Thanks,
Dan

Dan Magenheimer
2011-10-28 17:07:12 UTC
Permalink
> From: Johannes Weiner [mailto:***@redhat.com]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
>
> On Fri, Oct 28, 2011 at 06:36:03PM +0300, Pekka Enberg wrote:
> > On Fri, Oct 28, 2011 at 6:21 PM, Dan Magenheimer
> > <***@oracle.com> wrote:
> > Looking at your patches, there's no trace that anyone outside your own
> > development team even looked at the patches. Why do you feel that it's
> > OK to ask Linus to pull them?
>
> People did look at it.
>
> In my case, the handwavy benefits did not convince me. The handwavy
> 'this is useful' from just more people of the same company does not
> help, either.
>
> I want to see a usecase that tangibly gains from this, not just more
> marketing material. Then we can talk about boring infrastructure and
> adding hooks to the VM.
>
> Convincing the development community of the problem you are trying to
> solve is the undocumented part of the process you fail to follow.

Hi Johannes --

First, there are several companies and several unaffiliated kernel
developers contributing here, building on top of frontswap. I happen
to be spearheading it, and my company is backing me up. (It
might be more appropriate to note that much of the resistance comes
from people of your company... but please let's keep our open-source
developer hats on and have a technical discussion rather than one
which pleases our respective corporate overlords.)

Second, have you read http://lwn.net/Articles/454795/ ?
If not, please do. If yes, please explain what you don't
see as convincing or tangible or documented. All of this
exists today as working publicly available code... it's
not marketing material.

Dan

John Stoffel
2011-10-28 18:28:20 UTC
Permalink
>>>>> "Dan" == Dan Magenheimer <***@oracle.com> writes:

Dan> Second, have you read http://lwn.net/Articles/454795/ ?
Dan> If not, please do. If yes, please explain what you don't
Dan> see as convincing or tangible or documented. All of this
Dan> exists today as working publicly available code... it's
Dan> not marketing material.

I was vaguely interested, so I went and read the LWN article, and it
didn't really provide any useful information on *why* this is such a
good idea.

Particularly, I didn't see any before/after numbers which compared the
kernel running various loads both with and without these
transcendental memory patches applied. And of course I'd like to see
numbers when the patches are applied, but there's no TM
(Transcendent Memory) in actual use, so as to quantify the overhead.

Your article would also be helped with a couple of diagrams showing
how this really helps. Esp in the cases where the system just
endlessly says "no" to all TM requests and the kernel or apps need to
them fall back to the regular paths.

In my case, $WORK is using linux with large memory to run EDA
simulations, so if we swap, performance tanks and we're out of luck.
So for my needs, I don't see how this helps.

For my home system, I run an 8GB RAM box with a couple of KVM VMs, NFS
file service to two or three clients (not counting the VMs which mount
home dirs from there as well), plus some light WWW development
and service. How would TM benefit me? I don't use Xen, don't want to
play with it honestly because I'm busy enough as it is, and I just
don't see the hard benefits.

So the onus falls on *you* and the other TM developers to sell this
code and its benefits (and to acknowledge its costs) to the rest of
the kernel developers, especially those who hack on the VM. If you can't
come up with hard numbers and good examples, then
you're out of luck.

Thanks,
John

Dan Magenheimer
2011-10-30 19:18:56 UTC
Permalink
> From: John Stoffel [mailto:***@stoffel.org]
> Dan> Thanks for taking the time to read the LWN article and sending
> Dan> some feedback. I admit that, after being immersed in the topic
> Dan> for three years, it's difficult to see it from the perspective of
> Dan> a new reader, so I apologize if I may have left out important
> Dan> stuff. I hope you'll take the time to read this long reply.
>
> Will do. But I'm not the person you need to convince here about the
> usefulness of this code and approach, it's the core VM developers,

True, but you are the one providing useful suggestions while
the core VM developers are mostly silent (except for saying things
like "don't like it much"). So thank you for your feedback
and for taking the time to provide it and for indulging my replies.

I/we will need to act on your suggestions, but I need to
answer a couple of points/questions you've raised.

> since they're the ones who will have to understand this stuff and know
> how to maintain it. And keeping this maintainable is a key goal.

Absolutely agree. Count the number of frontswap lines that affect
the current VM core code and note also how they are very clearly
identified. It really is a very VERY small impact to the core VM
code (e.g. in the files swapfile.c and page_io.c).

(And it's worth noting, and I'm not arguing that it is conclusive,
just relevant, that my company has stood up and claimed responsibility
to maintain it.)

> Ok, so why not just a targeted swap compression function instead?
> Why is your method superior?

The designer/implementor of zram (which is the closest thing to
"targetted swap compression" in the kernel today) has stated
elsewhere on this thread that frontswap has advantages
over his own zram code.

And the frontswap patchset (did I mention how small the impact is?)
provides a lot more than just a foundation for compression (zcache).

> But that's beside the point. How much overhead does TMEM incur when
> it's not being used, but when it's available?

This is answered in frontswap.txt in the patchset, but:

ZERO overhead if CONFIG_FRONTSWAP=n. All the hooks compile into no-ops.

If CONFIG_FRONTSWAP=y and no "tmem backend" registers to use it at
runtime, the overhead is one "compare pointer against NULL" for
every page actually swapped in or out, which is about as close to ZERO
overhead as any code can be.

If CONFIG_FRONTSWAP=y AND a "tmem backend" does register, the
answer depends on which tmem backend and what it is doing (and
yes I agree more numbers are needed), but the overhead is
incurred only in the case where a page would otherwise have
actually been swapped in or out and can replace the horrible
cost of swapping pages.
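
For the CONFIG_FRONTSWAP=n case above, "compile into no-ops" follows
the usual kernel stub pattern, roughly as sketched below (an
illustration only; the names may not exactly match the header in the
patchset):

    /* Sketch of the compile-away pattern, not the literal header: */
    #ifdef CONFIG_FRONTSWAP
    extern int frontswap_store(struct page *page);
    #else
    static inline int frontswap_store(struct page *page)
    {
            return -1;      /* stub: caller falls through to normal swap I/O,
                             * and the compiler drops the call entirely */
    }
    #endif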

> Dan> Frontswap is the last missing piece. Why so much resistance?
>
> Because you haven't sold it well with numbers to show how much
> overhead it has?
>
> I'm being negative because I see no reason to use it. And because I
> think you can do a better job of selling it and showing the benefits
> with real numbers.

In your environment where RAM is essentially infinite, and swapping
never occurs, I agree there would be no reason for you to enable it.
In which case there is no overhead to you.

Received loud and clear on the "need more real numbers," though
personally I don't have any machines with more than 4GB RAM, so
I won't be testing any EDA environments with 144GB :-}

So, in the context of "costs nothing if you don't need it and has
very VERY small core code impact", and given that various kernel
developers and real users and real distros and real products say
on this thread that they DO need it, and given that there
are "some" real numbers (for one user, Xen, and agree that some
are needed for zcache)... and assuming that the core VM developers
bother to read the documentation already provided that addresses
the above, let me ask again...

Why so much resistance?

Thanks,
Dan

Oops, one more (but I have to use the X-word)...

> Load up a XEN box, have a VM spike it's memory usage and show how TMEM
> helps. Compare it to a non-TMEM setup with the same load.

Yep, that's what the presentation URL I provided (for Xen) measures.
Overcommitment (more VMs than otherwise could fit in the physical
RAM) AND about an 8% performance improvement on all VMs doing
a kernel compile simultaneously. Pretty impressive.

Dave Hansen
2011-10-30 20:06:02 UTC
Permalink
On Sun, 2011-10-30 at 12:18 -0700, Dan Magenheimer wrote:
> > since they're the ones who will have to understand this stuff and know
> > how to maintain it. And keeping this maintainable is a key goal.
>
> Absolutely agree. Count the number of frontswap lines that affect
> the current VM core code and note also how they are very clearly
> identified. It really is a very VERY small impact to the core VM
> code (e.g. in the files swapfile.c and page_io.c).

Granted, the impact on the core VM in lines of code is small. But, I
think the behavioral impact is potentially huge since tmem's hooks add
non-trivial amounts of framework underneath the VM in core paths. In
zcache's case, this means a bunch of allocations and an entirely new
memory allocator being used in the swap paths.

We're certainly still shaking bugs out of the interactions there like
with zcache_direct_reclaim_lock. Granted, that's not a
tmem/frontswap/cleancache bug, but it does speak to the difficulty and
subtlety of writing one of those frameworks underneath the tmem API.

-- Dave

Rik van Riel
2011-11-02 19:45:41 UTC
Permalink
On 10/30/2011 04:06 PM, Dave Hansen wrote:
> On Sun, 2011-10-30 at 12:18 -0700, Dan Magenheimer wrote:
>>> since they're the ones who will have to understand this stuff and know
>>> how to maintain it. And keeping this maintainable is a key goal.
>>
>> Absolutely agree. Count the number of frontswap lines that affect
>> the current VM core code and note also how they are very clearly
>> identified. It really is a very VERY small impact to the core VM
>> code (e.g. in the files swapfile.c and page_io.c).
>
> Granted, the impact on the core VM in lines of code is small. But, I
> think the behavioral impact is potentially huge since tmem's hooks add
> non-trivial amounts of framework underneath the VM in core paths. In
> zcache's case, this means a bunch of allocations and an entirely new
> memory allocator being used in the swap paths.

My only real behaviour concern with tmem is that
/proc/sys/overcommit_memory will no longer be able
to do anything useful, since we'll never know in
advance how much memory is available.

That may be outweighed by the benefits of having
more memory available than before, and a reasonable
tradeoff to make for the users.

That leaves us with having the code cleaned up to
reasonable standards. To be honest, I would rather
have larger hooks in the existing mm code, than
exported variables and having the hooks live elsewhere
(where people changing the "normal" mm code won't see
it, and are more likely to break it).

Dan Magenheimer
2011-11-02 20:45:52 UTC
Permalink
> From: Rik van Riel [mailto:***@redhat.com]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
>
> On 10/30/2011 04:06 PM, Dave Hansen wrote:
> > On Sun, 2011-10-30 at 12:18 -0700, Dan Magenheimer wrote:
> >>> since they're the ones who will have to understand this stuff and know
> >>> how to maintain it. And keeping this maintainable is a key goal.
> >>
> >> Absolutely agree. Count the number of frontswap lines that affect
> >> the current VM core code and note also how they are very clearly
> >> identified. It really is a very VERY small impact to the core VM
> >> code (e.g. in the files swapfile.c and page_io.c).
> >
> > Granted, the impact on the core VM in lines of code is small. But, I
> > think the behavioral impact is potentially huge since tmem's hooks add
> > non-trivial amounts of framework underneath the VM in core paths. In
> > zcache's case, this means a bunch of allocations and an entirely new
> > memory allocator being used in the swap paths.
>
> My only real behaviour concern with tmem is that
> /proc/sys/vm/overcommit_memory will no longer be able
> to do anything useful, since we'll never know in
> advance how much memory is available.

True, for Case C (as defined in James Bottomley subthread).
For Case A and Case B (ie. no tmem backend enabled),
end-users can still rely on that existing mechanism,
so they have a choice.

> That may be outweighed by the benefits of having
> more memory available than before, and a reasonable
> tradeoff to make for the users.
>
> That leaves us with having the code cleaned up to
> reasonable standards. To be honest, I would rather
> have larger hooks in the existing mm code, than
> exported variables and having the hooks live elsewhere
> (where people changing the "normal" mm code won't see
> it, and are more likely to break it).

Hmmm... the original hooks in 2009 were larger, but there
was lots of feedback to hide the ugly details as much as
possible. As a side effect, higher level info is
passed via the hooks, e.g. a "struct page *" rather
than swaptype/entry, so backends have more flexibility
(and IIUC it looks like Andrea's proposed changes to
zcache may need the higher level info).
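
To illustrate the contrast, with hypothetical signatures (not the
exact ones in the patchset):

    /* lower-level shape: the backend sees only the swap slot */
    int frontswap_put(unsigned swap_type, pgoff_t swap_offset);

    /* higher-level shape, which is what the hooks pass today: the
     * backend gets the page itself and can derive type/offset on its
     * own if it needs them */
    int frontswap_put_page(struct page *page);

The higher-level shape is what lets a backend compress, deduplicate,
or move the page elsewhere without caring how the swap slot is encoded.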

But if you want to propose some code showing what
you mean by "larger" hooks and they result in the
same information available in the backends, and
if others agree your hooks are more maintainable,
I am certainly open to changing them and re-posting.

Note that this could happen post-frontswap-merge too
though which would, naturally, be my preference ;-)

Dan

Dan Magenheimer
2011-10-31 15:39:32 UTC
Permalink
> From: James Bottomley [mailto:***@HansenPartnership.com]
> Subject: RE: [GIT PULL] mm: frontswap (for 3.2 window)

Hi James --

Thanks for the reply. You raise some good points but
I hope you will read what I believe are reasonable though
long-winded answers.

> On Fri, 2011-10-28 at 13:19 -0700, Dan Magenheimer wrote:
> > For those who "hack on the VM", I can't imagine why the handful
> > of lines in the swap subsystem, which is probably the most stable
> > and barely touched subsystem in Linux or any OS on the planet,
> > is going to be a burden or much of a cost.
>
> Saying things like this doesn't encourage anyone to trust you. The
> whole of the MM is a complex, highly interacting system. The recent
> issues we've had with kswapd and the shrinker code give a nice
> demonstration of this ... and that was caused by well tested code
> updates.

I do understand that. My point was that the hooks are
placed _statically_ in largely stable code so it's not
going to constantly get in the way of VM developers
adding new features and fixing bugs, particularly
any developers that don't care about whether frontswap
works or not. I do think that is a very relevant
point about maintenance... do you disagree?

Runtime interactions can only occur if the code is
config'ed and, if config'ed, only if a tmem backend (e.g.
Xen or zcache) enables it also at runtime. When
both are enabled, runtime interactions do occur
and absolutely must be fully tested. My point was
that any _users_ who don't care about whether frontswap
works or not don't need to have any concerns about
VM system runtime interactions. I think this is also
a very relevant point about maintenance... do you
disagree?

> You can't hand wave away the need for benchmarks and
> performance tests.

I'm not. Conclusive benchmarks are available for one user
(Xen) but not (yet) for other users. I've already acknowledged
the feedback desiring benchmarking for zcache, but zcache
is already merged (albeit in staging), and Xen tmem
is already merged in both Linux and the Xen hypervisor,
and cleancache (the alter ego of frontswap) is already
merged.

So the question is not whether benchmarks are waived,
but whether one accepts (1) conclusive benchmarks for Xen;
PLUS (2) insufficiently benchmarked zcache; PLUS (3) at
least two other interesting-but-not-yet-benchmarkable users;
as sufficient for adding this small set of hooks into
swap code.

I understand that some kernel developers (mostly from one
company) continue to completely discount Xen, and
thus won't even look at the Xen results. IMHO
that is mudslinging.

> You have also answered all questions about inactive cost by saying "the
> code has zero cost when it's compiled out." This also is a non-starter.
> For the few use cases it has, this code has to be compiled in. I
> suspect even Oracle isn't going to ship separate frontswap and
> non-frontswap kernels in its distro. So you have to quantify what the
> performance impact is when this code is compiled in but not used.
> Please do so.

First, no, Oracle is not going to ship separate frontswap and
non-frontswap kernels. It IS going to ship a frontswap-enabled
kernel and this can be seen in Oracle's publicly-available
kernel git tree (the next release, now in Beta). Frontswap is
compiled in, but still must be enabled at runtime (e.g. for
a Xen guest, either manually by the guest's administrator
or automagically by the Oracle VM product's management layer).

I did fully quantify the performance impact elsewhere in
this thread. The performance impact with CONFIG_FRONTSWAP=n
(which is ZERO) is relevant for distros which choose to
ignore it entirely. The performance impact for CONFIG_FRONTSWAP=y
but not-enabled-at-runtime is one compare-pointer-against-NULL
per page actually swapped in or out (essentially ZERO);
this is relevant for distros which choose to configure it
enabled in case they wish to enable it at runtime in
the future.

So the remaining question is the performance impact when
compile-time AND runtime enabled; this is in the published
Xen presentation I've referenced -- the impact is much much
less than the performance gain. IMHO benchmark results can
be easily manipulated so I prefer to discuss the theoretical
underpinnings which, in short, is that just about anything
a tmem backend does (hypercall, compression, deduplication,
even moving data across a fast network) is a helluva lot
faster than swapping a page to disk.

Are there corner cases and probably even real workloads
where the cost exceeds the benefits? Probably... though
less likely for frontswap than for cleancache because ONLY
pages that would actually be swapped out/in use frontswap.

But I have never suggested that every kernel should always
unconditionally compile-time-enable and run-time-enable
frontswap... simply that it should be in-tree so those
who wish to enable it are able to enable it.

Thanks,
Dan

James Bottomley
2011-11-01 10:13:23 UTC
Permalink
On Mon, 2011-10-31 at 08:39 -0700, Dan Magenheimer wrote:
> > From: James Bottomley [mailto:***@HansenPartnership.com]
> > Subject: RE: [GIT PULL] mm: frontswap (for 3.2 window)
>
> Hi James --
>
> Thanks for the reply. You raise some good points but
> I hope you will read what I believe are reasonable though
> long-winded answers.
>
> > On Fri, 2011-10-28 at 13:19 -0700, Dan Magenheimer wrote:
> > > For those who "hack on the VM", I can't imagine why the handful
> > > of lines in the swap subsystem, which is probably the most stable
> > > and barely touched subsystem in Linux or any OS on the planet,
> > > is going to be a burden or much of a cost.
> >
> > Saying things like this doesn't encourage anyone to trust you. The
> > whole of the MM is a complex, highly interacting system. The recent
> > issues we've had with kswapd and the shrinker code give a nice
> > demonstration of this ... and that was caused by well tested code
> > updates.
>
> I do understand that. My point was that the hooks are
> placed _statically_ in largely stable code so it's not
> going to constantly get in the way of VM developers
> adding new features and fixing bugs, particularly
> any developers that don't care about whether frontswap
> works or not. I do think that is a very relevant
> point about maintenance... do you disagree?

Well, as I've said, all the mm code is highly interacting, so I don't
really see it as "stable" in the way you suggest. What I'm saying is
that you need to test a variety of workloads to demonstrate there aren't
any nasty interactions.

> Runtime interactions can only occur if the code is
> config'ed and, if config'ed, only if a tmem backend (e.g.
> Xen or zcache) enables it also at runtime.

So this, I don't accept without proof ... that's what we initially said
about the last set of shrinker updates that caused kswapd to hang
sandybridge systems ...

> When
> both are enabled, runtime interactions do occur
> and absolutely must be fully tested. My point was
> that any _users_ who don't care about whether frontswap
> works or not don't need to have any concerns about
> VM system runtime interactions. I think this is also
> a very relevant point about maintenance... do you
> disagree?

I'm sorry, what point about maintenance?

> > You can't hand wave away the need for benchmarks and
> > performance tests.
>
> I'm not. Conclusive benchmarks are available for one user
> (Xen) but not (yet) for other users. I've already acknowledged
> the feedback desiring benchmarking for zcache, but zcache
> is already merged (albeit in staging), and Xen tmem
> is already merged in both Linux and the Xen hypervisor,
> and cleancache (the alter ego of frontswap) is already
> merged.

The test results for Xen I've seen are simply that "we're faster than
swapping to disk, and we can be even better if you use self ballooning".
There's no indication (at least in the Xen Summit presentation) what the
actual workloads were.

> So the question is not whether benchmarks are waived,
> but whether one accepts (1) conclusive benchmarks for Xen;
> PLUS (2) insufficiently benchmarked zcache; PLUS (3) at
> least two other interesting-but-not-yet-benchmarkable users;
> as sufficient for adding this small set of hooks into
> swap code.

That's the point: even for Xen, the benchmarks aren't "conclusive".
There may be a workload for which transcendent memory works better, but
make -j8 isn't enough of a variety of workloads.

> I understand that some kernel developers (mostly from one
> company) continue to completely discount Xen, and
> thus won't even look at the Xen results. IMHO
> that is mudslinging.

OK, so let's look at this another way: one of the signs of a good ABI is
generic applicability. Any good virtualisation ABI should thus work for
all virtualisation systems (including VMware should they choose to take
advantage of it). The fact that transcendent memory only seems to work
well for Xen is a red flag in this regard.

> > You have also answered all questions about inactive cost by saying "the
> > code has zero cost when it's compiled out." This also is a non-starter.
> > For the few use cases it has, this code has to be compiled in. I
> > suspect even Oracle isn't going to ship separate frontswap and
> > non-frontswap kernels in its distro. So you have to quantify what the
> > performance impact is when this code is compiled in but not used.
> > Please do so.
>
> First, no, Oracle is not going to ship separate frontswap and
> non-frontswap kernels. It IS going to ship a frontswap-enabled
> kernel and this can be seen in Oracle's publicly-available
> kernel git tree (the next release, now in Beta). Frontswap is
> compiled in, but still must be enabled at runtime (e.g. for
> a Xen guest, either manually by the guest's administrator
> or automagically by the Oracle VM product's management layer).
>
> I did fully quantify the performance impact elsewhere in
> this thread. The performance impact with CONFIG_FRONTSWAP=n
> (which is ZERO) is relevant for distros which choose to
> ignore it entirely. The performance impact for CONFIG_FRONTSWAP=y
> but not-enabled-at-runtime is one compare-pointer-against-NULL
> per page actually swapped in or out (essentially ZERO);
> this is relevant for distros which choose to configure it
> enabled in case they wish to enable it at runtime in
> the future.

So what I don't like about this style of argument is the sleight of
hand: I would expect the inactive but configured case to show mostly in
the shrinker paths, which is where our major problems have been, so that
would be cleancache, not frontswap, wouldn't it?

> So the remaining question is the performance impact when
> compile-time AND runtime enabled; this is in the published
> Xen presentation I've referenced -- the impact is much much
> less than the performance gain. IMHO benchmark results can
> be easily manipulated so I prefer to discuss the theoretical
> underpinnings which, in short, is that just about anything
> a tmem backend does (hypercall, compression, deduplication,
> even moving data across a fast network) is a helluva lot
> faster than swapping a page to disk.
>
> Are there corner cases and probably even real workloads
> where the cost exceeds the benefits? Probably... though
> less likely for frontswap than for cleancache because ONLY
> pages that would actually be swapped out/in use frontswap.
>
> But I have never suggested that every kernel should always
> unconditionally compile-time-enable and run-time-enable
> frontswap... simply that it should be in-tree so those
> who wish to enable it are able to enable it.

In practice, most useful ABIs end up being compiled in ... and useful
basically means useful to any constituency, however small. If your ABI
is useless, then fine, we don't have to worry about the configured but
inactive case (but then again, we wouldn't have to worry about the ABI
at all). If it has a use, then kernels will end up shipping with it
configured in which is why the inactive performance impact is so
important to quantify.

James



Dan Magenheimer
2011-11-02 19:39:49 UTC
Permalink
> From: James Bottomley [mailto:***@HansenPartnership.com]
> Subject: RE: [GIT PULL] mm: frontswap (for 3.2 window)

> Hm, straw man and ad hominem....

Let me apologize to you also for being sarcastic and
disrespectful yesterday. I'm very sorry, I really do
appreciate your time and effort, and will try to focus
on the core of your excellent feedback, rather than
write another long rant.

> > Case A) CONFIG_FRONTSWAP=n
> > Case B) CONFIG_FRONTSWAP=y and no tmem backend registers
> > Case C) CONFIG_FRONTSWAP=y and a tmem backend DOES register
>
> OK, so what I'd like to see is benchmarks for B and C. B should confirm
> your contention of no cost (which is the ideal anyway) and C quantifies
> the passive cost to users.

OK, we'll see what we can do. For B, given the natural
variance in any workload that is doing heavy swapping,
I'm not sure that I can prove anything, but I suppose
it will at least reveal if there are any horrible
glaring bugs. However, in turn, I'd ask you to at
least confirm by code examination that, not counting
swapon and swapoff, the only change to the swapping
path is comparing a function pointer in struct
frontswap_ops against NULL. (And, for case B, it
is NULL, so no function call ever occurs.) OK?

For C, understood, benchmarks for zcache needed.

> Well, OK, so there's a performance issue in some workloads; what the
> above is basically asking is how bad is it and how widespread?

Just to clarify, the performance issue observed is
with cleancache with zcache, not frontswap. That issue
has been observed on high-throughput old-single-core-CPU
machines, see https://lkml.org/lkml/2011/8/29/225

That issue is because cleancache (like the pagecache)
has to speculate on what pages might be needed in
the future.

Frontswap with zcache ONLY compresses pages that would
otherwise be physically swapped to a swap device.

So I don't see a performance issue with frontswap.
(But, yes, will still provide some benchmarks.)

> What I said was "one of the signs of a
> good ABI is generic applicability". That doesn't mean you have to apply
> an ABI to every situation by coming up with a demonstration for the use
> case. It does mean that people should know how to do it. I'm not
> particularly interested in the hypervisor wars, but it does seem to me
> that there are legitimate questions about the applicability of this to
> KVM.

The guest->host ABI does work with KVM, and is in Sasha's
git tree. It is a very simple shim, very similar to what
Xen uses, and will feed the same "opportunities" for swapping
to host memory for KVM as for Xen.

The arguments regarding KVM are about whether, when the ABI is
used, there is a sufficient performance gain, because
each page requires a costly vmexit/vmenter sequence.
It seems obvious to me, but I've done what I can to
facilitate Sasha's and Neo's tmem-on-KVM work... their
code is just not finished yet. As I've discussed with
Andrea, the ABI is very extensible so if it makes a huge
difference to add "batching" for KVM, the ABI won't get
in the way.

> As I said above, just benchmark it for B and C. As long as nothing nasty
> is happening, I'm fine with it.
>
> > So... understanding your preference for more workloads and your
> > preference that KVM should be demonstrated as a profitable user
> > first... is there anything else that you think should stand
> > in the way of merging frontswap so that existing and planned
> > kernel developers can build on top of it in-tree?
>
> No, I think that's my list. The confusion over a KVM interface is
> solely because you keep saying it's not a Xen only ABI ... if it were,
> I'd be fine for it living in the xen tree.

OK, thanks! But the core frontswap hooks are in routines in
mm/swapfile.c and mm/page_io.c so can't live in the xen tree.
And the Xen-specific stuff already does.

Sorry, getting long-winded again, but at least not ranting :-}

Dan

Andrea Arcangeli
2011-10-31 18:44:43 UTC
Permalink
On Fri, Oct 28, 2011 at 02:28:20PM -0400, John Stoffel wrote:
> and service. How would TM benefit me? I don't use Xen, don't want to
> play with it honestly because I'm busy enough as it is, and I just
> don't see the hard benefits.

If you used Xen, tmem would be more or less the equivalent of
cache=writethrough/writeback. For us, tmem is, in short, the Linux host
pagecache running on the bare metal. But at least when we vmexit for a
read we read 128-512k of it (depending on if=virtio or others and the
guest kernel's readahead decision), not just a fixed absolute-worst-case
4k unit like tmem would do...

Without tmem Xen can only work like KVM cache=off.

If at least it saved us a copy... but no, it still does the bounce
buffer, so I'd rather bounce in the host kernel function
file_read_actor than in some superfluous (as far as KVM is concerned)
tmem code. Plus we normally read orders of magnitude more than 4k in
each vmexit, so our default cache=writeback/writethrough may already
be more efficient than if we'd use tmem for that.

We could only consider it for swap compression, but for swap compression
I've no idea why we still need to do a copy, instead of just
compressing from the userland page zero-copy (worst case using any
mechanism introduced to provide stable pages).

And when the host Linux pagecache goes hugepage we'll get a >4k copy in
one go, while the tmem bounce will still be stuck at 4k...

Johannes Weiner
2011-10-30 21:47:48 UTC
Permalink
On Fri, Oct 28, 2011 at 10:07:12AM -0700, Dan Magenheimer wrote:
>
> > From: Johannes Weiner [mailto:***@redhat.com]
> > Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
> >
> > On Fri, Oct 28, 2011 at 06:36:03PM +0300, Pekka Enberg wrote:
> > > On Fri, Oct 28, 2011 at 6:21 PM, Dan Magenheimer
> > > <***@oracle.com> wrote:
> > > Looking at your patches, there's no trace that anyone outside your own
> > > development team even looked at the patches. Why do you feel that it's
> > > OK to ask Linus to pull them?
> >
> > People did look at it.
> >
> > In my case, the handwavy benefits did not convince me. The handwavy
> > 'this is useful' from just more people of the same company does not
> > help, either.
> >
> > I want to see a usecase that tangibly gains from this, not just more
> > marketing material. Then we can talk about boring infrastructure and
> > adding hooks to the VM.
> >
> > Convincing the development community of the problem you are trying to
> > solve is the undocumented part of the process you fail to follow.
>
> Hi Johannes --
>
> First, there are several companies and several unaffiliated kernel
> developers contributing here, building on top of frontswap. I happen
> to be spearheading it, and my company is backing me up. (It
> might be more appropriate to note that much of the resistance comes
> from people of your company... but please let's keep our open-source
> developer hats on and have a technical discussion rather than one
> which pleases our respective corporate overlords.)

I didn't mean to start a mud fight about this, I only mentioned the
part about your company because I already assume it sees value in tmem
- it probably wouldn't fund its development otherwise. I just tend to
not care too much about Acks from the same company as the patch itself
and I believe other people do the same.

> Second, have you read http://lwn.net/Articles/454795/ ?
> If not, please do. If yes, please explain what you don't
> see as convincing or tangible or documented. All of this
> exists today as working publicly available code... it's
> not marketing material.

I remember answering this to you in private already some time ago when
discussing frontswap.

You keep proposing a bridge and I keep asking for proof that this is
not a bridge to nowhere. Unless that question is answered, I am not
interested in discussing the bridge's design.

According to the LWN article, there are the following backends:

1. Zcache: allow swapping into compressed memory

This sets aside a portion of memory which the kernel will swap
compressed pages into upon pressure. Now, obviously, reserving memory
from the system for this increases the pressure in the first place,
eating away at what space we have for anonymous memory and page cache.

Do you auto-size that region depending on workload?

If so, how? If not, is it documented how to size it manually?

Where are the performance numbers for various workloads, including
both those that benefit from every bit of page cache and those that
would fit into memory without zcache occupying space?

However, looking at the zcache code, it seems it wants to allocate
storage pages only when already trying to swap out. Are you sure this
works in reality?

2. RAMster: allow swapping between machines in a cluster

Are there people using it? It, too, sounds like a good idea but I
don't see any proof it actually works as intended.

3. Xen: allow guests to swap into the host.

The article mentions that there is code to put the guests under
pressure and let them swap to host memory when the pressure is too
high. This sounds useful.

Where is the code that controls the amount of pressure put on the
guests?

Where are the performance numbers? Surely you can construct a case
where the initial machine sizes are not quite right and then collect
data that demonstrates the machines are rebalancing as expected?

4. kvm: same as Xen

Apart from the questions that already apply to Xen, I remember KVM
people in particular complaining about the synchroneous single-page
interface that results in a hypercall per swapped page. What happened
to this concern?

---

I would really appreciate if you could pick one of those backends and
present them as a real and practical solution to real and practical
problems. With documentation on configuration and performance data of
real workloads. We can discuss implementation details like how memory
is exchanged between source and destination when we come to it.

I am not asking for just more code that uses your interface, I want to
know the real value for real people of the combination of all that
stuff. With proof, not just explanations of how it's supposed to
work.

Until you can accept that, please include

Nacked-by: Johannes Weiner <***@cmpxchg.org>

on all further stand-alone submissions of tmem core code and/or hooks
in the VM. Thanks.

Andrea Arcangeli
2011-10-31 18:34:23 UTC
Permalink
On Fri, Oct 28, 2011 at 10:07:12AM -0700, Dan Magenheimer wrote:
> First, there are several companies and several unaffiliated kernel
> developers contributing here, building on top of frontswap. I happen
> to be spearheading it, and my company is backing me up. (It
> might be more appropriate to note that much of the resistance comes
> from people of your company... but please let's keep our open-source
> developer hats on and have a technical discussion rather than one
> which pleases our respective corporate overlords.)

Fair enough to want an independent review, but I'd be interested to
also know how many of the several companies and unaffiliated kernel
developers contributing to it aren't using tmem with
Xen. Obviously bounce-buffered 4k vmexits are still faster than
Xen paravirt I/O hitting the disk platter...

Note, Hugh is working for another company... and they're using cgroups,
not KVM nor Xen, so I suggest he'd be a fair reviewer from a non-virt
standpoint, if he hopefully has the time to weigh in.

However keep in mind if we'd see something that can allow KVM to run
even faster, we'd be quite silly in not taking advantage of it too, to
beat our own SPECvirt record. The whole design idea of KVM (unlike
Xen) is to reuse the kernel improvements as much as possible so when
the guest runs faster the hypervisor also runs faster with the exact
same code. Problem is, a vmexit doing a bounce buffer every 4k doesn't mix
well with SPECvirt in my view, and that is probably what has kept us
from making any attempt to use the tmem API anywhere.

Dan Magenheimer
2011-10-30 23:19:26 UTC
Permalink
> From: Johannes Weiner [mailto:***@redhat.com]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)

Hi Johannes --

Thanks for taking the time for some real technical discussion (below).

> On Fri, Oct 28, 2011 at 10:07:12AM -0700, Dan Magenheimer wrote:
> >
> > > From: Johannes Weiner [mailto:***@redhat.com]
> > > Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
> > >
> > > On Fri, Oct 28, 2011 at 06:36:03PM +0300, Pekka Enberg wrote:
> > > > On Fri, Oct 28, 2011 at 6:21 PM, Dan Magenheimer
> > > > <***@oracle.com> wrote:
> > > > Looking at your patches, there's no trace that anyone outside your own
> > > > development team even looked at the patches. Why do you feel that it's
> > > > OK to ask Linus to pull them?
> > >
> > > People did look at it.
> > >
> > > In my case, the handwavy benefits did not convince me. The handwavy
> > > 'this is useful' from just more people of the same company does not
> > > help, either.
> > >
> > > I want to see a usecase that tangibly gains from this, not just more
> > > marketing material. Then we can talk about boring infrastructure and
> > > adding hooks to the VM.
> > >
> > > Convincing the development community of the problem you are trying to
> > > solve is the undocumented part of the process you fail to follow.
> >
> > Hi Johannes --
> >
> > First, there are several companies and several unaffiliated kernel
> > developers contributing here, building on top of frontswap. I happen
> > to be spearheading it, and my company is backing me up. (It
> > might be more appropriate to note that much of the resistance comes
> > from people of your company... but please let's keep our open-source
> > developer hats on and have a technical discussion rather than one
> > which pleases our respective corporate overlords.)
>
> I didn't mean to start a mud fight about this, I only mentioned the
> part about your company because I already assume it sees value in tmem
> - it probably wouldn't fund its development otherwise. I just tend to
> not care too much about Acks from the same company as the patch itself
> and I believe other people do the same.

Oops, sorry for mudslinging if none was intended.

Although I understand your position about Acks from the same company,
isn't that challenging the integrity of the individual's ack/review,
implying that they are not really reviewing the code with the same
intensity as if it came from another company? Especially with
something like tmem, maybe the review is just as valid, and people
from the same company have just had more incentive to truly
understand the intent and potential of the functionality, as well as
the syntax in the code? And maybe, on some patches, reviewers ARE
from different companies are "good buddies" and watch each others'
back and those reviews are not really complete?

So perhaps this default assumption about code review is flawed?

> > Second, have you read http://lwn.net/Articles/454795/ ?
> > If not, please do. If yes, please explain what you don't
> > see as convincing or tangible or documented. All of this
> > exists today as working publicly available code... it's
> > not marketing material.
>
> I remember answering this to you in private already some time ago when
> discussing frontswap.

Yes, reading ahead, all the questions sound familiar and I thought
they were all answered (albeit some offlist). I think the conversation
ended at that point, so I assumed any issues were resolved.

> You keep proposing a bridge and I keep asking for proof that this is
> not a bridge to nowhere. Unless that question is answered, I am not
> interested in discussing the bridge's design.
>
> According to the LWN article, there are the following backends:
>
> 1. Zcache: allow swapping into compressed memory
>
> This sets aside a portion of memory which the kernel will swap
> compressed pages into upon pressure. Now, obviously, reserving memory
> from the system for this increases the pressure in the first place,
> eating away on what space we have for anonymous memory and page cache.
>
> Do you auto-size that region depending on workload?

Yes. A key value of the whole transcendent memory design
is that everything is done dynamically. That's one
reason that Nitin Gupta (author of zram) supports zcache.

> If so, how? If not, is it documented how to size it manually?

See above. There are some zcache policy parameters that can be
adjusted manually (currently through sysfs) so we can adjust
the defaults as necessary over time.

> Where are the performance numbers for various workloads, including
> both those that benefit from every bit of page cache and those that
> would fit into memory without zcache occupying space?

I have agreed already that more zcache measurement is warranted
(though I maintain it will get a lot more measurement merged than
it ever would unmerged). So I can only answer theoretically, though
I would appreciate your comment if you disagree.

Space used for page cache is almost always opportunistic; it is
a "guess" that the page will be needed again in the future.
Frontswap only stores pages that MUST otherwise be swapped.
Swapping occurs only if the clean list is empty (or if the
MM system is too slow to respond to changes in workload).
In fact some of the pages-to-be-swapped that end up in
frontswap can be dirty page cache pages.

All of this is handled dynamically. The kernel is still deciding
which pages to keep and which to reclaim and which to swap.
The hooks simply grab pages as they are going by. That's
why the frontswap patch can be so simple and can have many "users"
built on top of it.
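
For concreteness, here is a minimal sketch of how such a hook can sit
in the swap-out path. This is illustrative only, not the literal
patch; swap_writepage_to_disk() is a made-up name standing in for the
normal block I/O path:

#include <linux/frontswap.h>
#include <linux/pagemap.h>
#include <linux/swap.h>
#include <linux/writeback.h>

/* Illustrative stand-in for the normal swap block-I/O path. */
int swap_writepage_to_disk(struct page *page, struct writeback_control *wbc);

/*
 * Sketch only, not the literal frontswap patch: a page the VM has
 * already decided to swap is offered to the backend first, and only
 * falls through to real block I/O if the backend declines it.
 */
static int sketch_swap_writepage(struct page *page,
                                 struct writeback_control *wbc)
{
        if (frontswap_put_page(page) == 0) {
                /* The backend accepted the page; no block I/O is needed. */
                set_page_writeback(page);
                unlock_page(page);
                end_page_writeback(page);
                return 0;
        }
        /* Backend full or absent: take the normal swap-to-disk path. */
        return swap_writepage_to_disk(page, wbc);
}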

> However, looking at the zcache code, it seems it wants to allocate
> storage pages only when already trying to swap out. Are you sure this
> works in reality?

Yes. I'd encourage you to try it. I'd be a fool if I tried
to guarantee that there are no bugs of course.

> 2. RAMster: allow swapping between machines in a cluster
>
> Are there people using it? It, too, sounds like a good idea but I
> don't see any proof it actually works as intended.

No. I've posted the code publicly but it's still a godawful mess
and I'd be embarrassed if anyone looked at it. But the code
does work and I've got some ideas on how to make it more
upstreamable. If anybody seriously wants to work on it right
now, I could do that, but I'd prefer some more time alone with
it first.

Conceptually, it's just a matter of moving pages to a different
machine instead of across a hypercall interface. All the "magic"
is in the frontswap and cleancache hooks. They run on both
machines, both dynamically managing space (and compressing it
too). The code uses ocfs2 for "cluster" discovery and is built
on top of a modified zcache.

> 3. Xen: allow guests to swap into the host.
>
> The article mentions that there is code to put the guests under
> pressure and let them swap to host memory when the pressure is too
> high. This sounds useful.
>
> Where is the code that controls the amount of pressure put on the
> guests?

See drivers/xen/xen-selfballoon.c, which was just merged at 3.1,
though there have been versions of it floating around for 2+ years.
Note there's a bug fix pending that makes the pressure a little less
aggressive. I think it is/was submitted for the open 3.2 window.
(Note the same file manipulates the number of pages in frontswap.)

> Where are the performance numbers? Surely you can construct a case
> where the initial machine sizes are not quite right and then collect
> data that demonstrates the machines are rebalancing as expected?

Yes I can. It just works and with the right tools running, it's
even fun to watch. Some interesting performance numbers were
published at Xen Summit 2010. See the last few pages of:

http://oss.oracle.com/projects/tmem/dist/documentation/presentations/TranscendentMemoryXenSummit2010.pdf

The speakers notes (so you can follow the presentation without video)
are in the same dir.

> 4. kvm: same as Xen
>
> Apart from the questions that already apply to Xen, I remember KVM
> people in particular complaining about the synchroneous single-page
> interface that results in a hypercall per swapped page. What happened
> to this concern?

I think we (me and the KVM people) agreed that the best way to determine
if this is a concern is to just measure it. Sasha and Neo are working on
a KVM implementation which should make this possible (but neither wants
to invest a lot of time if frontswap isn't merged or doesn't have a
clear path to merging).

So, again, theoretically, and please argue if you disagree...
(and yes I know real measurements are better, but I think we all
know how easy it is to manipulate benchmarks so IMHO a
theoretical understanding is useful too).

What is the cost of a KVM hypercall (vmexit/vmenter) vs the cost of
swapping a page? Clearly, reading/writing a disk is a very slow
operation, but has very little CPU overhead (though preparing a
page to be swapped via blkio is not exactly cheap). But if
you are swapping, it is almost never the case that the CPU is busy,
especially on a multicore CPU.

I expect that on old, slow processors (e.g. first-gen single-core
VT-x parts) this might sometimes be measurable, but rarely an issue.
On modern processors, I don't expect it to be significant.

BTW, it occurs to me that this is now measurable on Xen too, since
Xen tmem works now for fully-virtualized guests. I don't have
the machines to reproduce the same experiment, but if you look at
the graphs in the Xen presentation, you can see that CPU utilization
goes up substantially, but throughput still improves. I am almost
positive that the CPU cost of compression/decompression plus the
cost of deduplication insert/fetch exceeds the cost of a vmexit/vmenter,
so the additional cost of vmexit/vmenter will at most increase
the CPU utilization. The real performance gain comes from avoiding
(waiting for) disk accesses.

> I would really appreciate if you could pick one of those backends and
> present them as a real and practical solution to real and practical
> problems. With documentation on configuration and performance data of
> real workloads. We can discuss implementation details like how memory
> is exchanged between source and destination when we come to it.
>
> I am not asking for just more code that uses your interface, I want to
> know the real value for real people of the combination of all that
> stuff. With proof, not just explanations of how it's supposed to
> work.

Well, the Xen implementation is by far the most mature and the
Xen presentation above is reasonably conclusive though, as always,
more measurements of more workloads would be good.

Not to get back into the mudslinging, but certain people from certain
companies try to ignore or minimize the value of Xen, so I've been
trying to emphasize the other (non-Xen, non-virtualization) code.
Personally, I think the Xen use case is sufficient by itself as it
solves a problem nobody else has ever solved (or, more precisely,
that VMware attempted to solve but, as real VMware customers will
attest, did so very poorly).

To be a good Linux kernel citizen, I've encouraged my company to hold
off on widespread support for Xen tmem until all the parts are upstream
in Linux, so there isn't a wide existing body of "proof" data. And
releasing customer data from my employer requires an act of God. But
private emails to Linus for cleancache seemed to convince him that
there was enough justification for cleancache. I thought frontswap
was simpler and would be the easy part, but was clearly mistaken :-(
We are now proceeding fully with Xen tmem with both frontswap
and cleancache in the kernel.

> Until you can accept that, please include
>
> Nacked-by: Johannes Weiner <***@cmpxchg.org>
>
> on all further stand-alone submissions of tmem core code and/or hooks
> in the VM. Thanks.

If you are willing to accept that Xen is a valid use case, I
think I have provided that (although I agree that more data would
be good and would be happy to take suggestions for what data to
provide). If not, I would call that a form of mudslinging
but will add your Nack. Please let me know.

Dan

Andrea Arcangeli
2011-10-31 18:16:51 UTC
Permalink
On Fri, Oct 28, 2011 at 08:21:31AM -0700, Dan Magenheimer wrote:
> real users and real distros and real products waiting, so if there
> are any real issues, let's get them resolved.

We already told you the real issues there are and you did nothing so
far to address them; so much was built on top of a flawed API that I
guess an earthquake of massive scale would have to hit to actually
convince Xen to change any of the huge amount of code built on the
flawed API.

I don't know the exact Xen details (it's possible the Xen design doesn't
allow the four issues below to be fixed, I've no idea), but for all
other non-virt usages (compressed swap/compressed pagecache, RAMster)
I doubt it is impossible to change the design of the tmem API to
address at least some of the basic, huge troubles that such an API
imposes:

1) 4k page limit (no way to handle hugepages)

OK, swapcache and pagecache are always 4k today, but that may change.
Plus, it's generally flawed these days to add a new API that people
will build code on top of which can't handle hugepages; at least
hugetlbfs should be handled. And especially considering it was born
for virt: in virt space we only work with hugepages.

2) synchronous

3) not zerocopy, requires one bounce buffer for every get and one
bounce buffer again for every put (like highmem I/O with 32bit pci)

In my view, point 3 is definitely fixable for swapcache compression
and pagecache compression. There's no way we can accept a copy before
starting to compress the data; the source for the compression
algorithm must be the _userland_ page, but instead you copy first and
compress on the copy destination. Correct me if I'm wrong.

4) can't handle batched requests

It requires one vmexit for each 4k page accessed if the KVM hypervisor
wants to access tmem; there's no way we want to use this in KVM. At
most we could consider exiting every 2M page; it's impossible to
vmexit every 4k or performance is destroyed and we'd run as slow as
no-EPT/NPT.

Address these 4 points (or at least the ones that are solvable) and
it'll become appealing. Or at least try to explain why it's impossible
to solve all these 4 points to convince us this API is the best we can
get for the non-virt usages (let's ignore Xen/KVM for the sake of this
discussion, as Xen may have legitimate reasons for why those 4 above
points are impossible to fix).

At the moment it still looks to me like a legacy-compatibility API to
make life easier for Xen users: it uses a limited API (I'd agree it is
at least simpler this way) to share cache across different guests, and
it tries to impose those four limits above (and horrendous performance
in accessing tmem from a Xen guest, though still faster than I/O,
isn't it? :) even on the non-virt usages.

Even for frontswap, there is no way we can accept doing synchronous
bounce buffering for every single 4k page that is going to hit swap. That's
worse than HIGHMEM 32bit... Obviously you must be mlocking all Oracle
db memory so you won't hit that bounce buffering ever with
Oracle. Also note, historically there's nobody that hated bounce
buffers more than Oracle (at least I remember the highmem issues with
pci32 cards :). Also Oracle was the biggest user of hugetlbfs.

So it sounds weird that you like an API that forces cache-destroying
bounce buffering and 4k page units on everything that passes through
it.

If I'm wrong please correct me; I haven't had a lot of time to check the
code. But we already raised these points before without much of an answer.

Thanks,
Andrea

James Bottomley
2011-11-01 10:16:30 UTC
Permalink
On Mon, 2011-10-31 at 19:16 +0100, Andrea Arcangeli wrote:
> On Fri, Oct 28, 2011 at 08:21:31AM -0700, Dan Magenheimer wrote:
> > real users and real distros and real products waiting, so if there
> > are any real issues, let's get them resolved.
>
> We already told you the real issues there are and you did nothing so
> far to address them; so much was built on top of a flawed API that I
> guess an earthquake of massive scale would have to hit to actually
> convince Xen to change any of the huge amount of code built on the
> flawed API.
>
> I don't know the exact Xen details (it's possible the Xen design doesn't
> allow the four issues below to be fixed, I've no idea), but for all
> other non-virt usages (compressed swap/compressed pagecache, RAMster)
> I doubt it is impossible to change the design of the tmem API to
> address at least some of the basic, huge troubles that such an API
> imposes:

Actually, I think there's an unexpressed fifth requirement:

5. The optimised use case should be for non-paging situations.

The problem here is that almost every data centre person tries very hard
to make sure their systems never tip into the swap zone. A lot of
hosting datacentres use tons of cgroup controllers for this and
deliberately never configure swap which makes transcendent memory
useless to them under the current API. I'm not sure this is fixable,
but it's the reason why a large swathe of users would never be
interested in the patches, because they by design never operate in the
region transcendent memory is currently looking to address.

This isn't an inherent design flaw, but it does ask the question "is
your design scope too narrow?"

James


Avi Kivity
2011-11-02 15:44:50 UTC
Permalink
On 11/01/2011 12:16 PM, James Bottomley wrote:
> Actually, I think there's an unexpressed fifth requirement:
>
> 5. The optimised use case should be for non-paging situations.
>
> The problem here is that almost every data centre person tries very hard
> to make sure their systems never tip into the swap zone. A lot of
> hosting datacentres use tons of cgroup controllers for this and
> deliberately never configure swap which makes transcendent memory
> useless to them under the current API. I'm not sure this is fixable,
> but it's the reason why a large swathe of users would never be
> interested in the patches, because they by design never operate in the
> region transcendent memory is currently looking to address.
>
> This isn't an inherent design flaw, but it does ask the question "is
> your design scope too narrow?"

If you look at cleancache, then it addresses this concern - it extends
pagecache through host memory. When dropping a page from the tail of
the LRU it first goes into tmem, and when reading in a page from disk
you first try to read it from tmem. However in many workloads,
cleancache is actually detrimental. If you have a lot of cache misses,
then every one of them causes a pointless vmexit; considering that
servers today can chew hundreds of megabytes per second, this adds up.
On the other side, if you have a use-once workload, then every page that
falls off the tail of the LRU causes a vmexit and a pointless page copy.

--
error compiling committee.c: too many arguments to function

Andrea Arcangeli
2011-11-02 16:02:01 UTC
Permalink
Hi Avi,

On Wed, Nov 02, 2011 at 05:44:50PM +0200, Avi Kivity wrote:
> If you look at cleancache, then it addresses this concern - it extends
> pagecache through host memory. When dropping a page from the tail of
> the LRU it first goes into tmem, and when reading in a page from disk
> you first try to read it from tmem. However in many workloads,
> cleancache is actually detrimental. If you have a lot of cache misses,
> then every one of them causes a pointless vmexit; considering that
> servers today can chew hundreds of megabytes per second, this adds up.
> On the other side, if you have a use-once workload, then every page that
> falls off the tail of the LRU causes a vmexit and a pointless page copy.

I also think it's bad design for Virt usage, but hey, without this
they can't even run with cache=writeback/writethrough and they're
forced to cache=off, and then they claim specvirt is marketing, so for
Xen it's better than nothing I guess.

I'm trying right now to evaluate it as a pure zcache host side
optimization. If it can drive us in the right long term direction and
we're free to modify it as we wish to boost swapping I/O too using
compressed data, then it may be viable. Otherwise it's better they add
some Xen-specific hook and leave the zcache infrastructure "free to be
modified as the VM needs", not "as Xen needs". I currently don't
know exactly where the Xen ABI starts and the kernel stops in tmem so
it's hard to tell how hackable it is and if it is actually a
complication to try to hide things away from the VM or not. Certainly
the highly advertised automatic dynamic sizing of the tmem pools is an
OOM timebomb without proper VM control on it. So it just can't stay
away from the VM too much. Currently it's unlikely to be safe in all
workloads (i.e. mlockall growing fast).

Whatever happens in tmem, it must still be "owned by the kernel" so it
can be written out to disk with bios. That doesn't need to happen
immediately and doesn't need to be perfect, but it must definitely be
possible to add later without the Xen folks complaining about whatever
change we do in tmem.

The fact that not a line of Xen code was written over the last two
years doesn't mean there aren't dependencies on the code; maybe those
just never broke, so Xen never needed to be modified either, because
they kept the tmem ABI/API fixed while adding the other backends of
tmem (zcache etc.). I mean, just the fact that I read the word "ABI"
in those emails signals something is wrong. There can't be any ABI
there, only an API, and even the API is a kernel-internal one so it
must be allowed to break freely. Or we can't innovate. Again, if we
can't change whatever ABI/API without first talking with the Xen
folks, I think it's better they split the two projects and just submit
the Xen hooks separately. That wouldn't remove value from tmem
(assuming it's the way to go, which I'm not entirely convinced of yet).

In any case, starting to fix up the zcache layer sounds good to me.
The first things that come to mind are to document with a comment why
it disables irqs and exactly which code races with the compression
from irqs or softirqs, fix the casts in tmem_put, rename tmem_put to
tmem_store, etc. Then we see whether the Xen side complains about just
those small, needed cleanups.

Ideally the API should also be stackable, so you can do RAMster on top
of zcache on top of cleancache/frontswap. Then we could write a swap
driver for zcache and do swapper -> zcache -> frontswap; we could even
write compressed pagecache to disk that way.

And the whole thing should handle all allocation failures with a
fallback all the way up to the top layer (which for swap would mean
going to the regular swapout path if OOM happens within those calls,
and for pagecache would mean really freeing the page, not compressing
it into some tmem memory). That is a design that may be good. I
haven't had a huge amount of time to think about it, but if you remove
virt from the equation it looks less bad.

Avi Kivity
2011-11-02 16:13:11 UTC
Permalink
On 11/02/2011 06:02 PM, Andrea Arcangeli wrote:
> Hi Avi,
>
> On Wed, Nov 02, 2011 at 05:44:50PM +0200, Avi Kivity wrote:
> > If you look at cleancache, then it addresses this concern - it extends
> > pagecache through host memory. When dropping a page from the tail of
> > the LRU it first goes into tmem, and when reading in a page from disk
> > you first try to read it from tmem. However in many workloads,
> > cleancache is actually detrimental. If you have a lot of cache misses,
> > then every one of them causes a pointless vmexit; considering that
> > servers today can chew hundreds of megabytes per second, this adds up.
> > On the other side, if you have a use-once workload, then every page that
> > falls off the tail of the LRU causes a vmexit and a pointless page copy.
>
> I also think it's bad design for Virt usage, but hey, without this
> they can't even run with cache=writeback/writethrough and they're
> forced to cache=off, and then they claim specvirt is marketing, so for
> Xen it's better than nothing I guess.

Surely Xen can use the pagecache, it uses Linux for I/O just like kvm.

> I'm trying right now to evaluate it as a pure zcache host side
> optimization.

zcache style usage is fine. It's purely internal so no ABI constraints,
and no hypercalls either. It's still synchronous though so RAMster like
approaches will not work well.


<snip>

--
error compiling committee.c: too many arguments to function

Dan Magenheimer
2011-11-02 20:27:01 UTC
Permalink
> From: Avi Kivity [mailto:***@redhat.com]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
>
> On 11/02/2011 06:02 PM, Andrea Arcangeli wrote:
> > Hi Avi,
> >
> > On Wed, Nov 02, 2011 at 05:44:50PM +0200, Avi Kivity wrote:
> > > If you look at cleancache, then it addresses this concern - it extends
> > > pagecache through host memory. When dropping a page from the tail of
> > > the LRU it first goes into tmem, and when reading in a page from disk
> > > you first try to read it from tmem. However in many workloads,
> > > cleancache is actually detrimental. If you have a lot of cache misses,
> > > then every one of them causes a pointless vmexit; considering that
> > > servers today can chew hundreds of megabytes per second, this adds up.
> > > On the other side, if you have a use-once workload, then every page that
> > > falls off the tail of the LRU causes a vmexit and a pointless page copy.
> >
> > I also think it's bad design for Virt usage, but hey, without this
> > they can't even run with cache=writeback/writethrough and they're
> > forced to cache=off, and then they claim specvirt is marketing, so for
> > Xen it's better than nothing I guess.
>
> Surely Xen can use the pagecache, it uses Linux for I/O just like kvm.
>
> > I'm trying right now to evaluate it as a pure zcache host side
> > optimization.
>
> zcache style usage is fine. It's purely internal so no ABI constraints,
> and no hypercalls either. It's still synchronous though so RAMster like
> approaches will not work well.

Still experimental, but only the initial local put must be synchronous.
RAMster uses a separate thread to "remotify" pre-compressed pages.
The "get" still needs to be synchronous, but (if I ever have time to
get back to coding it) I've got some ideas on how to fix that. If
I manage to get that working, perhaps it could be used for Andrea's
write-precompressed-zcache-pages-to-disk.
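
To make the "remotify" idea concrete, here is a hedged sketch of the
shape such a worker could take (illustrative only, not RAMster's
actual code; struct staged_obj and the remotify_* names are made up).
The synchronous local put only stages a compressed copy; a workqueue
item later pushes staged objects to the remote node:

#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

struct staged_obj {                     /* illustrative structure */
        struct list_head list;
        void *compressed_data;
        size_t len;
};

static LIST_HEAD(staging_list);
static DEFINE_SPINLOCK(staging_lock);

/* Illustrative stub: transmit one object over the cluster transport. */
static void remotify_send(struct staged_obj *obj)
{
}

static void remotify_work_fn(struct work_struct *work)
{
        struct staged_obj *obj, *tmp;
        LIST_HEAD(batch);

        /* Grab everything staged so far without holding the lock long. */
        spin_lock(&staging_lock);
        list_splice_init(&staging_list, &batch);
        spin_unlock(&staging_lock);

        list_for_each_entry_safe(obj, tmp, &batch, list) {
                remotify_send(obj);
                list_del(&obj->list);
                kfree(obj->compressed_data);
                kfree(obj);
        }
}
static DECLARE_WORK(remotify_work, remotify_work_fn);
/* A put would stage an object and then schedule_work(&remotify_work). */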

Dan

Dan Magenheimer
2011-11-02 20:08:14 UTC
Permalink
> From: James Bottomley [mailto:***@HansenPartnership.com]
> Subject: RE: [GIT PULL] mm: frontswap (for 3.2 window)
>
> > Not quite sure what you mean here (especially for frontswap)...
>
> I mean could it be used in a more controlled situation than an
> alternative to swap?

I think it could, but I have focused on the cases which reduce
disk I/O: cleancache, which replaces refaults from disk, and frontswap,
which replaces swap-ins/outs to disk. Did you have some other
kernel data in mind?

> OK, I still don't think you understand what I'm saying. Machines in a
> Data Centre tend to be provisioned to criticality. What this means is
> that the Data Centre has a bunch of mandatory work and a bunch of Best
> Effort work (and grades in between). We load up the mandatory work
> according to the resource limits being careful not to overprovision the
> capacity then we look at the spare capacity and slot in the Best effort
> stuff. We want the machine to run at capacity, not over it; plus we
> need to respond instantly for demands of the mandatory work, which
> usually involves either dialling down or pushing away best effort work.
> In this situation, action is taken long before the swap paths become
> active because if they activate, the entire machine bogs and you've just
> blown the SLA on the mandatory work.
>
> > It's true, those that are memory-rich and can spend nearly
> > infinite amounts on more RAM (and on high-end platforms that
> > can expand to hold massive amounts of RAM) are not tmem's
> > target audience.
>
> Where do you get the infinite RAM idea from? The most concrete example
> of what I said above are Lean Data Centres, which are highly resource
> constrained but they want to run at (or just below) criticality so that
> they get through all of the Mandatory and as much of the best effort
> work as they can.

OK, I think you are asking the same question as I answered for
Kame earlier today.

By "infinite" I am glibly describing any environment where the
data centre administrator positively knows the maximum working
set of every machine (physical or virtual) and can ensure in
advance that the physical RAM always exceeds that maximum
working set. As you say, these machines need not be configured
with a swap device as they, by definition, will never swap.

The point of tmem is to use RAM more efficiently by taking
advantage of all the unused RAM when the current working set
size is less than the maximum working set size. This is very
common in many data centers too, especially virtualized. It
turned out that an identical set of hooks made pagecache compression
possible, and swappage compression more flexible than zram,
and that became the single-kernel user, zcache.

RAM optimization and QoS guarantees are generally mutually
exclusive, so this doesn't seem like a good test case for tmem
(but see below).

> > > This isn't an inherent design flaw, but it does ask the question "is
> > > your design scope too narrow?"
> >
> > Considering all the hazing that I've gone through to get
> > this far, you think I should _expand_ my design scope?!? :-)
> > Thanks, I guess I'll pass. :-)

(Sorry again for the sarcasm :-(

> Sure, I think the conclusion that Transcendent Memory has no
> applicability to a lean Data Centre isn't unreasonable; I was just
> probing to see if it was the only conclusion.

Now that I understand it better, I think it does have
a limited application for your Lean Data Centre...
but only to optimize the "best effort" part of the
data centre workload. That would probably be a relatively
easy enhancement... but, please, my brain is full now and
my typing fingers hurt, so can we consider it post-merge?

Thanks,
Dan

Theodore Tso
2011-11-03 10:30:07 UTC
Permalink
On Nov 2, 2011, at 4:08 PM, Dan Magenheimer wrote:

> By "infinite" I am glibly describing any environment where the
> data centre administrator positively knows the maximum working
> set of every machine (physical or virtual) and can ensure in
> advance that the physical RAM always exceeds that maximum
> working set. As you say, these machines need not be configured
> with a swap device as they, by definition, will never swap.
>
> The point of tmem is to use RAM more efficiently by taking
> advantage of all the unused RAM when the current working set
> size is less than the maximum working set size. This is very
> common in many data centers too, especially virtualized.

That doesn't match with my experience, especially with "cloud" deployments, where in order to make the business plans work, the machines tend to be memory constrained, since you want to pack a large number of jobs/VM's onto a single machine, and high density memory is expensive and/or you are DIMM slot constrained. Of course, if you are running multiple Java runtimes in each guest OS (i.e., a J2EE server, and another Java VM for management, and yet another Java VM for the backup manager, etc. --- really, I've seen cloud architectures that work that way), things get worse even faster...

-- Ted

Dan Magenheimer
2011-11-02 20:19:39 UTC
Permalink
> From: Avi Kivity [mailto:***@redhat.com]
> Sent: Wednesday, November 02, 2011 9:45 AM
> To: James Bottomley
> Cc: Andrea Arcangeli; Dan Magenheimer; Pekka Enberg; Cyclonus J; Sasha Levin; Christoph Hellwig; David
> Rientjes; Linus Torvalds; linux-***@kvack.org; LKML; Andrew Morton; Konrad Wilk; Jeremy Fitzhardinge;
> Seth Jennings; ***@vflare.org; Chris Mason; ***@novell.com; Dave Hansen; Jonathan Corbet
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
>
> On 11/01/2011 12:16 PM, James Bottomley wrote:
> > Actually, I think there's an unexpressed fifth requirement:
> >
> > 5. The optimised use case should be for non-paging situations.
> >
> > The problem here is that almost every data centre person tries very hard
> > to make sure their systems never tip into the swap zone. A lot of
> > hosting datacentres use tons of cgroup controllers for this and
> > deliberately never configure swap which makes transcendent memory
> > useless to them under the current API. I'm not sure this is fixable,
> > but it's the reason why a large swathe of users would never be
> > interested in the patches, because they by design never operate in the
> > region transcendent memory is currently looking to address.
> >
> > This isn't an inherent design flaw, but it does ask the question "is
> > your design scope too narrow?"
>
> If you look at cleancache, then it addresses this concern - it extends
> pagecache through host memory. When dropping a page from the tail of
> the LRU it first goes into tmem, and when reading in a page from disk
> you first try to read it from tmem. However in many workloads,
> cleancache is actually detrimental. If you have a lot of cache misses,
> then every one of them causes a pointless vmexit; considering that
> servers today can chew hundreds of megabytes per second, this adds up.
> On the other side, if you have a use-once workload, then every page that
> falls off the tail of the LRU causes a vmexit and a pointless page copy.

I agree with everything you've said except "_many_ workloads".
I would characterize this as "some" workloads, and increasingly
fewer machines... because core-counts are increasing faster than
the ability to attach RAM to them (according to published research).

I did code a horrible hack to fix this, but haven't gotten back
to RFC'ing it to see if there were better, less horrible, ideas.
It essentially only puts into tmem pages that are being reclaimed
but previously had the PageActive bit set... a smaller but
higher-hit-ratio source of pages, I think.
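
A minimal illustration of the shape of that hack (this is NOT the
patch in the lkml link below; PageWasActive() is a hypothetical test
standing in for whatever "was previously active" record the real
patch keeps):

#include <linux/cleancache.h>
#include <linux/mm.h>

/* Hypothetical: records whether the page was ever on the active list. */
bool PageWasActive(struct page *page);

/* Illustrative only: filter use-once pages out of cleancache puts. */
static void filtered_cleancache_put(struct page *page)
{
        /* Skip pages that never earned a spot on the active list. */
        if (!PageWasActive(page))
                return;

        cleancache_put_page(page);      /* existing cleancache entry point */
}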

Anyway, I've been very open about this (see
https://lkml.org/lkml/2011/8/29/225 ), but it affects cleancache.
Frontswap ONLY deals with pages that would otherwise have
been swapped in/out to a physical swap device.

Dan

Dan Magenheimer
2011-10-31 21:45:25 UTC
Permalink
> From: Andrea Arcangeli [mailto:***@redhat.com]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
>
> On Fri, Oct 28, 2011 at 10:07:12AM -0700, Dan Magenheimer wrote:
> > First, there are several companies and several unaffiliated kernel
> > developers contributing here, building on top of frontswap. I happen
> > to be spearheading it, and my company is backing me up. (It
> > might be more appropriate to note that much of the resistance comes
> > from people of your company... but please let's keep our open-source
> > developer hats on and have a technical discussion rather than one
> > which pleases our respective corporate overlords.)
>
> Fair enough to want an independent review but I'd be interesting to
> also know how many of the several companies and unaffiliated kernel
> developers are contributing to it that aren't using tmem with Xen.

Well, just to summarize the supportive non-Oracle, non-tmem-team responses
so far to this frontswap thread:

Nitin Gupta, for zcache
Brian King (IBM), for Linux on Power
Sasha Levin and Neo Jia, affiliation unspecified, working on tmem for KVM
Ed Tomlinson, affiliation unspecified, end-user of zcache

This doesn't count those that replied offlist to Linus to support the
merging of cleancache earlier this year, and doesn't count the fair
number of people who have offlist asked me about zcache or if KVM
supports tmem or when RAMster will be ready. I suppose I could
do a better job advertising others' interest...

> Note, Hugh is working for another company... and they're using cgroups
> not KVM nor Xen, so I suggests he'd be a fair reviewer from a non-virt
> standpoint, if he hopefully has the time to weight in.

I spent an hour with Hugh at Google this summer, and he (like you)
expressed some dislike of the ABI/API and the hooks but he has since
told both me and Andrew he doesn't have time to pursue this.

Others in Google have shown vague interest in tmem for cgroups but
I've been too busy myself to even think about that.

> However keep in mind if we'd see something that can allow KVM to run
> even faster, we'd be quite silly in not taking advantage of it too, to
> beat our own SPECvirt record. The whole design idea of KVM (unlike
> Xen) is to reuse the kernel improvements as much as possible so when
> the guest runs faster the hypervisor also runs faster with the exact
> same code. The problem is that a vmexit doing a bounce buffer every 4k
> doesn't mix well into SPECvirt in my view, and that probably is what
> has kept us from making any attempt to use the tmem API anywhere.

If SPECvirt does any swapping that actually goes to disk (doubtful?),
frontswap will help.

Personally, I think SPECvirt was hand-designed by VMware to favor
their platform, but they were chagrined to find that you and KVM
cleverly re-implemented transparent content-based page sharing
which was the feature for which they were designing SPECvirt.
IOW, SPECvirt is benchmarketing not benchmarking... but I know
that's important too. :-)

Sorry for the topic drift...

Thanks,
Dan

Dan Magenheimer
2011-11-01 18:21:20 UTC
Permalink
> From: James Bottomley [mailto:***@HansenPartnership.com]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
>
> Actually, I think there's an unexpressed fifth requirement:
>
> 5. The optimised use case should be for non-paging situations.

Not quite sure what you mean here (especially for frontswap)...

> The problem here is that almost every data centre person tries very hard
> to make sure their systems never tip into the swap zone. A lot of
> hosting datacentres use tons of cgroup controllers for this and
> deliberately never configure swap which makes transcendent memory
> useless to them under the current API. I'm not sure this is fixable,

I can't speak for cgroups, but the generic "state-of-the-art"
that you describe is a big part of what frontswap DOES try
to fix, or at least ameliorate. Tipping "into the swap zone"
is currently very bad. Very very bad. Frontswap doesn't
"solve" swapping, but it is the foundation for some of the
first things in a long time that aren't just "add more RAM."

> but it's the reason why a large swathe of users would never be
> interested in the patches, because they by design never operate in the
> region transcendent memory is currently looking to address.

It's true, those that are memory-rich and can spend nearly
infinite amounts on more RAM (and on high-end platforms that
can expand to hold massive amounts of RAM) are not tmem's
target audience.

> This isn't an inherent design flaw, but it does ask the question "is
> your design scope too narrow?"

Considering all the hazing that I've gone through to get
this far, you think I should _expand_ my design scope?!? :-)
Thanks, I guess I'll pass. :-)

Dan

Dan Magenheimer
2011-11-03 14:59:17 UTC
Permalink
> From: Theodore Tso [mailto:***@mit.edu]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)

Hi Ted --

Thanks for your reply!

> On Nov 2, 2011, at 4:08 PM, Dan Magenheimer wrote:
>
> > By "infinite" I am glibly describing any environment where the
> > data centre administrator positively knows the maximum working
> > set of every machine (physical or virtual) and can ensure in
> > advance that the physical RAM always exceeds that maximum
> > working set. As you say, these machines need not be configured
> > with a swap device as they, by definition, will never swap.
> >
> > The point of tmem is to use RAM more efficiently by taking
> > advantage of all the unused RAM when the current working set
> > size is less than the maximum working set size. This is very
> > common in many data centers too, especially virtualized.
>
> That doesn't match with my experience, especially with "cloud" deployments, where in order to make the
> business plans work, the machines tend to be memory constrained, since you want to pack a large number
> of jobs/VM's onto a single machine, and high density memory is expensive and/or you are DIMM slot
> constrained. Of course, if you are running multiple Java runtimes in each guest OS (i.e., a J2EE
> server, and another Java VM for management, and yet another Java VM for the backup manager, etc. ---
> really, I've seen cloud architectures that work that way), things get worse even faster...

Hmmm... since your memory-constrained example is highly
similar to one I use in my presentations, I _think_ we are
in total agreement, but I am confused by "doesn't match
with my experience", or maybe you are countering James'
lean data centre example?

To clarify, for a multi-tenancy environment (such as
virtualization or RAMster), tmem enables the ability
to redistribute the constrained RAM resource, i.e.
"steal from the rich and give to the poor," which is
otherwise very difficult because each kernel is a
memory hog. Frontswap's role is really to announce
"I'm overconstrained and am about to swap to disk,
which would be embarrassing for my performance...
can someone hold this swap page for me, please?"

Dan

Dan Magenheimer
2011-10-31 20:58:39 UTC
Permalink
> From: Andrea Arcangeli [mailto:***@redhat.com]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)

Hi Andrea --

Thanks for your input. It's good to have some real technical
discussion about the core of tmem. I hope you will
take the time to read and consider my reply,
and comment on any disagreements.

OK, let's go over your concerns about the "flawed API."

> 1) 4k page limit (no way to handle hugepages)

FALSE. The API/ABI was designed from the beginning to handle
different pagesizes. It can even dynamically handle more than
one page size, though a different "pool" must be created on
the kernel side for each different pagesize. (At the risk
of derision, remember I used to code for IA64 so I am
very familiar with different pagesizes.)

It is true that the current tmem _backends_ (Xen and
zcache) reject pagesizes other than 4K, but if there are
"frontends" that have a different pagesize, the API/ABI
supports it.

For hugepages, I agree copying 2M seems odd. But talking
about hugepages in the swap subsystem, I think we are
talking about a very remote future. (Remember cleancache
is _already_ merged so I'm limiting this to swap.) Perhaps
in that far future, Intel will have an optimized "copy2M"
instruction that can circumvent cache pollution?

> 2) synchronous

TRUE. (Well, mostly.... RAMster is exploiting some asynchrony
but that's all still experimental.)

Remember the whole point of tmem/cleancache/frontswap is in
environments where memory is scarce and CPU is plentiful,
which is increasingly common (especially in virtualization).
We all cut our teeth on kernel work in an environment where
saving every CPU cycle was important, but in these new
memory-constrained many-core environments, the majority of
CPU cycles are idle. So does it really matter if the CPU is
idle because it is waiting on the disk vs being used for
synchronous copying/compression/dedup? See the published
Xen benchmarks: CPU utilization goes up, but throughput
goes up too. Why? Because physical memory is being used
more efficiently.

Also IMHO the reason the frontswap hooks and the cleancache
hooks can be so simple and elegant and can support many
different users is because the API/ABI is synchronous.
If you change that, I think you will introduce all sorts
of special cases and races and bugs on both sides of the
ABI/API. And (IMHO) the end result is that most CPUs
are still mostly sitting idle waiting for work to do.

> 3) not zerocopy, requires one bounce buffer for every get and one
> bounce buffer again for every put (like highmem I/O with 32bit pci)

Hmmm... not sure I understand this one. It IS copy-based
so is not zerocopy; the page of data is actually moving out
of memory controlled/directly-addressable by the kernel into
memory that is not controlled/directly-addressable by the kernel.
But neither the Xen implementation nor the zcache implementation
uses any bounce buffers, even when compressing or dedup'ing.

So unless I misunderstand, this one is FALSE.

> 4) can't handle batched requests

TRUE. Tell me again why a vmexit/vmenter per 4K page is
"impossible"? Again you are assuming (1) the CPU had some
real work to do instead and (2) that vmexit/vmenter is horribly
slow. Even if vmexit/vmenter is thousands of cycles, it is still
orders of magnitude faster than a disk access. And vmexit/vmenter
is about the same order of magnitude as page copy, and much
faster than compression/decompression, both of which still
result in a nice win.

You are also assuming that frontswap puts/gets are highly
frequent. By definition they are not, because they are
replacing single-page disk reads/writes due to swapping.

That said, the API/ABI is very extensible, so if it were
proven that batching was sufficiently valuable, it could
be added later... but I don't see it as a showstopper.
Really do you?

> worse than HIGHMEM 32bit... Obviously you must be mlocking all Oracle
> db memory so you won't hit that bounce buffering ever with
> Oracle. Also note, historically there's nobody that hated bounce
> buffers more than Oracle (at least I remember the highmem issues with
> pci32 cards :). Also Oracle was the biggest user of hugetlbfs.

I already noted that there are no bounce buffers, but Oracle is
not pursuing this because of the Oracle _database_ (though
it does work on single-node databases). While "Oracle" is
often used as shorthand for its eponymous database, tmem works
on lots of workloads, and Oracle (even pre-Sun-merger) sells
tons of non-DB software. In fact I personally take some heat
for putting more emphasis on getting tmem into Linux than on
using it to proprietarily improve other Oracle products.

> If I'm wrong please correct me; I haven't had a lot of time to check the
> code. But we already raised these points before without much of an answer.

OK, so you're wrong on two of the points and I've corrected
you. On two of the points, synchrony and non-batchability,
you make claims that (1) these are bad and (2) that there
is a better way to achieve the same results with asynchrony
and batchability.

I do agree you've raised the points before, but I am pretty
sure I've always given the same answers, so you shouldn't
say that you haven't gotten "much answer"; rather, you disagree
with the answer you got.

I've got working code, it's going in real distros and products and
has growing usage by (non-Oracle) kernel developers as well as
real users clamoring for it or already using it. You claim
that by making it asynchronous it would be better, while I claim
that it would make it impossibly complicated. (We'd essentially
be rewriting, or creating a parallel, blkio subsystem.) You claim
that a batch interface is necessary, while I claim that if it is
proven that it is needed, it could be added later.

We've been talking about this since July 2009, right?
If you can do it better, where's your code? I have the
highest degree of respect for your abilities and I have no
doubt that you could do something similar for KVM over a
long weekend... but can you also make it work for Xen, for
in-kernel compression, and for cross-kernel clustering
(not to mention for other "users" in my queue)? The foundation
tmem code in the core kernel (frontswap and cleancache)
is elegant in its simplicity and _it works_.

REALLY no disrespect intended and I'm sorry if I am flaming,
so let me calm down by quoting Linus from the LWN KS2011
article:

"[Linus] stated that, simply, code that actually is used is
code that is actually worth something... code aimed at
solving the same problem is just a vague idea that is
worthless by comparison... Even if it truly is crap,
we've had crap in the kernel before. The code does not
get better out of tree."

So, please, all the other parts necessary for tmem are
already in-tree, why all the resistance about frontswap?

Thanks,
Dan

Andrea Arcangeli
2011-10-31 22:37:17 UTC
Permalink
On Mon, Oct 31, 2011 at 01:58:39PM -0700, Dan Magenheimer wrote:
> Hmmm... not sure I understand this one. It IS copy-based
> so is not zerocopy; the page of data is actually moving out

copy-based is my main problem, being synchronous is no big deal I
agree.

I mean, I don't see why you have to make one copy before you start
compressing and then you write to disk the output of the compression
algorithm. To me it looks like this API forces on zcache one more copy
than necessary.

I can't see why this copy is necessary and why zcache isn't working on
"struct page" on core kernel structures instead of moving the memory
off to a memory object invisible to the core VM.

> TRUE. Tell me again why a vmexit/vmenter per 4K page is
> "impossible"? Again you are assuming (1) the CPU had some

It's sure not impossible; it's just that there's no way we'd want it,
as it'd be too slow.

> real work to do instead and (2) that vmexit/vmenter is horribly

Sure, the CPU has another 1000 VMs to schedule. This is like saying
virtio-blk isn't needed on desktop virt because the desktop isn't
doing much I/O. Absurd argument: there are another 1000 desktops doing
I/O at the same time, of course.

> slow. Even if vmexit/vmenter is thousands of cycles, it is still
> orders of magnitude faster than a disk access. And vmexit/vmenter

I fully agree tmem is faster for Xen than no tmem. That's not the
point. We don't need such an elaborate hack that hides pages from the
guest OS in order to share pagecache; our hypervisor is just a bit
more powerful and has a function called file_read_actor that does what
your tmem copy does...

> is about the same order of magnitude as page copy, and much
> faster than compression/decompression, both of which still
> result in a nice win.

Saying it's a small overhead is not the same as saying it is _needed_. Why
not add a udelay(1) in it too? Sure it won't be noticeable.

> You are also assuming that frontswap puts/gets are highly
> frequent. By definition they are not, because they are
> replacing single-page disk reads/writes due to swapping.

They'll be as frequent as the highmem bounce buffers...

> That said, the API/ABI is very extensible, so if it were
> proven that batching was sufficiently valuable, it could
> be added later... but I don't see it as a showstopper.
> Really do you?

That's fine with me... but like ->writepages, it'll take ages for the
fs to switch from writepage to writepages. Considering this is a new
API, I don't think it's unreasonable to ask that it at least handle
zerocopy behavior immediately: show the userland mapping to the
tmem layer so it can avoid the copy and read from the userland
address. Xen will badly choke if it ever tries to do that, but zcache
should be ok with that.

Now, there may be algorithms where the page must be stable, but others
will be perfectly fine even if the page is changing under the
compression; in that case the page won't be discarded and it'll be
marked dirty again. So even if wrong data goes to disk, we'll
rewrite it later. I see no reason why there always has to be a copy
before starting any compression/encryption, as long as the algorithm
will not crash if its input data is changing under it.

The ideal API would be to send down page pointers (and handle
compound pages too), not to copy. Maybe with a flag where you can also
specify offsets, so you can send down partial pages too, down to byte
granularity. The "copy input data before anything else can happen"
looks flawed to me. It is not flawed for Xen, because Xen has no
knowledge of the guest "struct page", but here I'm talking about the
non-virt usages.
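
For concreteness, a purely hypothetical sketch of the kind of
interface I mean (none of these names exist anywhere; it only
illustrates pass-by-reference pages, partial pages and batching):

#include <linux/mm.h>
#include <linux/types.h>

struct tmem_page_vec {                  /* hypothetical */
        struct page     *page;          /* may be a compound/huge page */
        unsigned int    offset;         /* byte offset within the page */
        unsigned int    len;            /* bytes to store from offset */
};

/*
 * Hypothetical: store nr pages in one call, straight from the source
 * pages, with no bounce buffer.  stable_source == false means the data
 * may change underneath the backend and that is acceptable (the page
 * stays dirty and will be rewritten later).
 */
int tmem_store_pages(int pool_id, pgoff_t first_index,
                     struct tmem_page_vec *vec, unsigned int nr,
                     bool stable_source);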

> So, please, all the other parts necessary for tmem are
> already in-tree, why all the resistance about frontswap?

Well my comments are generic not specific to frontswap.

Dan Magenheimer
2011-11-02 21:14:16 UTC
Permalink
> From: Rik van Riel [mailto:***@redhat.com]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
>
> On 10/31/2011 07:36 PM, Dan Magenheimer wrote:
> >> From: Andrea Arcangeli [mailto:***@redhat.com]
>
> >>> real work to do instead and (2) that vmexit/vmenter is horribly
> >>
> >> Sure, the CPU has another 1000 VMs to schedule. This is like saying
> >> virtio-blk isn't needed on desktop virt because the desktop isn't
> >> doing much I/O. Absurd argument: there are another 1000 desktops doing
> >> I/O at the same time, of course.
> >
> > But this is truly different, I think at least for the most common
> > cases, because the guest is essentially out of physical memory if it
> > is swapping. And the vmexit/vmenter (I assume, I don't really
> > know KVM) gives the KVM scheduler the opportunity to schedule
> > another of those 1000 VMs if it wishes.
>
> I believe the problem Andrea is trying to point out here is
> that the proposed API cannot handle a batch of pages to be
> pushed into frontswap/cleancache at one time.

That wasn't the part of Andrea's discussion I meant, but I
am getting foggy now, so let's address your point rather
than mine.

> Even if the current back-end implementations are synchronous
> and can only do one page at a time, I believe it would still
> be a good idea to have the API able to handle a vector with
> a bunch of pages all at once.
>
> That way we can optimize the back-ends as required, at some
> later point in time.
>
> If enough people start using tmem, such bottlenecks will show
> up at some point :)

It occurs to me that batching could be done locally without
changing the in-kernel "API" (i.e. frontswap_ops)... the
guest-side KVM tmem-backend-driver could do the compression
into guest-side memory and make a single
hypercall=vmexit/vmenter whenever it has collected enough for
a batch. The "get" and "flush" would have to search this guest-side
local cache and, if not local, make a hypercall.

This is more or less what RAMster does, except it (currently)
still transmits the "batch" one (pre-compressed) page at a time.

And, when I think about it deeper (with my currently admittedly
fried brain), this may even be the best way to do batching
anyway. I can't think offhand where else you would put
a "put batch" hook in the swap subsystem because I think
the current swap subsystem batching code only works with
adjacent "entry" numbers.

And, one more thing occurs to me then... this shows the KVM
"ABI" (hypercall) is not constrained by the existing Xen
ABI. It can be arbitrarily more functional.

/me gets hand slapped remotely from Oracle HQ ;-)

Rik van Riel
2011-11-15 16:29:27 UTC
Permalink
On 11/02/2011 05:14 PM, Dan Magenheimer wrote:

> It occurs to me that batching could be done locally without
> changing the in-kernel "API" (i.e. frontswap_ops)... the
> guest-side KVM tmem-backend-driver could do the compression
> into guest-side memory and make a single
> hypercall=vmexit/vmenter whenever it has collected enough for
> a batch.

That seems like the best way to do it, indeed.

Do the current hooks allow that mode of operation,
or do the hooks only return after the entire operation
has completed?

Jeremy Fitzhardinge
2011-11-15 17:33:40 UTC
Permalink
On 11/15/2011 08:29 AM, Rik van Riel wrote:
> On 11/02/2011 05:14 PM, Dan Magenheimer wrote:
>
>> It occurs to me that batching could be done locally without
>> changing the in-kernel "API" (i.e. frontswap_ops)... the
>> guest-side KVM tmem-backend-driver could do the compression
>> into guest-side memory and make a single
>> hypercall=vmexit/vmenter whenever it has collected enough for
>> a batch.
>
> That seems like the best way to do it, indeed.
>
> Do the current hooks allow that mode of operation,
> or do the hooks only return after the entire operation
> has completed?

The APIs are synchronous, but need only return once the memory has been
dealt with in some way. If you were batching before making a hypercall,
then the implementation would just have to make a copy into its private
memory and you'd have to make sure that lookups on batched but
unsubmitted pages work.

(It's been a while since I've looked at these patches, but I'm assuming
nothing fundamental has changed about them lately.)

J

Konrad Rzeszutek Wilk
2011-11-16 14:49:59 UTC
Permalink
On Tue, Nov 15, 2011 at 09:33:40AM -0800, Jeremy Fitzhardinge wrote:
> On 11/15/2011 08:29 AM, Rik van Riel wrote:
> > On 11/02/2011 05:14 PM, Dan Magenheimer wrote:
> >
> >> It occurs to me that batching could be done locally without
> >> changing the in-kernel "API" (i.e. frontswap_ops)... the
> >> guest-side KVM tmem-backend-driver could do the compression
> >> into guest-side memory and make a single
> >> hypercall=vmexit/vmenter whenever it has collected enough for
> >> a batch.
> >
> > That seems like the best way to do it, indeed.
> >
> > Do the current hooks allow that mode of operation,
> > or do the hooks only return after the entire operation
> > has completed?
>
> The APIs are synchronous, but need only return once the memory has been
> dealt with in some way. If you were batching before making a hypercall,
> then the implementation would just have to make a copy into its private
> memory and you'd have to make sure that lookups on batched but
> unsubmitted pages work.
>
> (It's been a while since I've looked at these patches, but I'm assuming
> nothing fundamental has changed about them lately.)

Yup, what you describe is possible, and nothing fundamental has changed about
them.

Dan Magenheimer
2011-10-31 23:36:04 UTC
Permalink
> From: Andrea Arcangeli [mailto:***@redhat.com]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
>
> On Mon, Oct 31, 2011 at 01:58:39PM -0700, Dan Magenheimer wrote:
> > Hmmm... not sure I understand this one. It IS copy-based
> > so is not zerocopy; the page of data is actually moving out
>
> copy-based is my main problem, being synchronous is no big deal I
> agree.
>
> I mean, I don't see why you have to make one copy before you start
> compressing and then you write to disk the output of the compression
> algorithm. To me it looks like this API forces on zcache one more copy
> than necessary.
>
> I can't see why this copy is necessary and why zcache isn't working on
> "struct page" on core kernel structures instead of moving the memory
> off to a memory object invisible to the core VM.

Do you see code doing this? I am pretty sure zcache is
NOT doing an extra copy, it is compressing from the source
page. And I am pretty sure Xen tmem is not doing the
extra copy either.

Seth and I had discussed ADDING the extra copy in zcache
to make the synchronous/irq-disabled time shorter for puts
and doing the compression as a separate thread, but I
don't think I have seen any patch to implement that.

So if this is true (no extra copy), are you happy?

Maybe you are saying that the extra copy would be necessary
in a KVM implementation of tmem? If so, I haven't thought
about a KVM+tmem design enough to comment on that.

> > TRUE. Tell me again why a vmexit/vmenter per 4K page is
> > "impossible"? Again you are assuming (1) the CPU had some
>
> It's sure not impossible, it's just impossible we want it as it'd be
> too slow.

You are clearly speculating here. Wouldn't it be nice to
try it and find out?

> > real work to do instead and (2) that vmexit/vmenter is horribly
>
> Sure the CPU has another 1000 VM to schedule. This is like saying
> virtio-blk isn't needed on desktop virt because the desktop isn't
> doing much I/O. Absurd argument, there are another 1000 desktops doing
> I/O at the same time of course.

But this is truly different, I think at least for the most common
cases, because the guest is essentially out of physical memory if it
is swapping. And the vmexit/vmenter (I assume, I don't really
know KVM) gives the KVM scheduler the opportunity to schedule
another of those 1000 VMs if it wishes.

Also I'll venture to guess (without any proof) that the path through
the blkio subsystem to deal with any swap page and set up the disk
I/O is not much shorter than the cost of a vmexit/vmenter on
modern systems ;-)

Now we are both speculating. :-)

> > slow. Even if vmexit/vmenter is thousands of cycles, it is still
> > orders of magnitude faster than a disk access. And vmexit/vmenter
>
> I fully agree tmem is faster for Xen than no tmem. That's not the
> point, we don't need such an elaborate hack hiding pages from the
> guest OS in order to share pagecache, our hypervisor is just a bit
> more powerful and has a function called file_read_actor that does what
> your tmem copy does...

Well then, either KVM doesn't need frontswap at all and need
not be interfering with a patch that works fine for the
other users, or Sasha and Neo will implement it and find
that frontswap does (sometimes?) provide some benefits.

In either case, I'm not sure why you would be objecting
to merging frontswap.

> > is about the same order of magnitude as page copy, and much
> > faster than compression/decompression, both of which still
> > result in a nice win.
>
> Saying it's a small overhead, is not like saying it is _needed_. Why
> not add a udelay(1) in it too? Sure it won't be noticeable.

Actually the current implementation of RAMster over LAN adds
quite a bit more than udelay(1). But that's all still experimental.
It might be interesting to try adding udelay(1) in zcache
to see if there is any noticeable effect.

> > You are also assuming that frontswap puts/gets are highly
> > frequent. By definition they are not, because they are
> > replacing single-page disk reads/writes due to swapping.
>
> They'll be as frequent as the highmem bounce buffers...

I don't understand. Sorry, I really am ignorant of
highmem systems as I grew up on PA-RISC and IA-64.

> > That said, the API/ABI is very extensible, so if it were
> > proven that batching was sufficiently valuable, it could
> > be added later... but I don't see it as a showstopper.
> > Really do you?
>
> That's fine with me... but like ->writepages it'll take ages for the
> fs to switch from writepage to writepages. Considering this is a new
> API I don't think it's unreasonable to ask it to at least handle
> zerocopy behavior immediately. So show the userland mapping to the
> tmem layer so it can avoid the copy and read from the userland
> address. Xen will badly choke if it ever tries to do that, but zcache
> should be ok with that.
>
> Now there may be algorithms where the page must be stable, but others
> will be perfectly fine even if the page is changing under the
> compression, and in that case the page won't be discarded and it'll be
> marked dirty again. So even if wrong data goes on disk, we'll
> rewrite it later. I see no reason why there always has to be a copy
> before starting any compression/encryption, as long as the algorithm
> will not crash if its input data is changing under it.
>
> The ideal API would be to send down page pointers (and handling
> compound pages too), not to copy. Maybe with a flag where you can also
> specify offsets so you can send down partial pages too down to a byte
> granularity. The "copy input data before anything else can happen"
> looks flawed to me. It is not flawed for Xen because Xen has no
> knowledge of the guest "struct page" but here I'm talking about the
> non-virt usages.

Again, I think you are assuming things work differently than
I think they do. I don't think there is an extra copy before
the compression. And Xen isn't choking, nor is zcache.
(Note that the Xen tmem implementation, as all of Xen will be
soon, is 64-bit only... Seth recently fixed a bug keeping
zcache from working in 32-bit highmem systems, so I know
32-bit works for zcache.)

So if this is true (no extra copy), are you happy?

> > So, please, all the other parts necessary for tmem are
> > already in-tree, why all the resistance about frontswap?
>
> Well my comments are generic not specific to frontswap.

OK, but cleancache is already in-tree and open to any improvement
ideas you may have. Frontswap is only using the existing ABI/API
that cleancache already uses.

Thanks,
Dan

Andrea Arcangeli
2011-11-01 01:20:17 UTC
Permalink
On Mon, Oct 31, 2011 at 04:36:04PM -0700, Dan Magenheimer wrote:
> Do you see code doing this? I am pretty sure zcache is
> NOT doing an extra copy, it is compressing from the source
> page. And I am pretty sure Xen tmem is not doing the
> extra copy either.

So below you describe put as a copy of a page from the kernel into the
newly allocated PAM space... I guess there's some improvement needed
for the documentation at least, a compression is done sometime instead
of a copy... I thought you always had to copy first sorry.

* "Put" a page, e.g. copy a page from the kernel into newly allocated
* PAM space (if such space is available). Tmem_put is complicated by
* a corner case: What if a page with matching handle already exists in
* tmem? To guarantee coherency, one of two actions is necessary: Either
* the data for the page must be overwritten, or the page must be
* "flushed" so that the data is not accessible to a subsequent "get".
* Since these "duplicate puts" are relatively rare, this implementation
* always flushes for simplicity.
*/
int tmem_put(struct tmem_pool *pool, struct tmem_oid *oidp, uint32_t index,
char *data, size_t size, bool raw, bool ephemeral)
{
struct tmem_obj *obj = NULL, *objfound = NULL, *objnew = NULL;
void *pampd = NULL, *pampd_del = NULL;
int ret = -ENOMEM;
struct tmem_hashbucket *hb;

hb = &pool->hashbucket[tmem_oid_hash(oidp)];
spin_lock(&hb->lock);
obj = objfound = tmem_obj_find(hb, oidp);
if (obj != NULL) {
pampd = tmem_pampd_lookup_in_obj(objfound, index);
if (pampd != NULL) {
/* if found, is a dup put, flush the old one */
pampd_del = tmem_pampd_delete_from_obj(obj, index);
BUG_ON(pampd_del != pampd);
(*tmem_pamops.free)(pampd, pool, oidp, index);
if (obj->pampd_count == 0) {
objnew = obj;
objfound = NULL;
}
pampd = NULL;
}
} else {
obj = objnew = (*tmem_hostops.obj_alloc)(pool);
if (unlikely(obj == NULL)) {
ret = -ENOMEM;
goto out;
}
tmem_obj_init(obj, hb, pool, oidp);
}
BUG_ON(obj == NULL);
BUG_ON(((objnew != obj) && (objfound != obj)) || (objnew == objfound));
pampd = (*tmem_pamops.create)(data, size, raw, ephemeral,
obj->pool, &obj->oid, index);

So then .create is calls zcache_pampd_create...

static void *zcache_pampd_create(char *data, size_t size, bool raw, int eph,
^^^^^^^^^^
struct tmem_pool *pool, struct tmem_oid *oid,
uint32_t index)
{
void *pampd = NULL, *cdata;
size_t clen;
int ret;
unsigned long count;
struct page *page = (struct page *)(data);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
struct zcache_client *cli = pool->client;
uint16_t client_id = get_client_id_from_client(cli);
unsigned long zv_mean_zsize;
unsigned long curr_pers_pampd_count;
u64 total_zsize;

if (eph) {
ret = zcache_compress(page, &cdata, &clen);


zcache_compress then does:

static int zcache_compress(struct page *from, void **out_va, size_t *out_len)
{
int ret = 0;
unsigned char *dmem = __get_cpu_var(zcache_dstmem);
unsigned char *wmem = __get_cpu_var(zcache_workmem);
char *from_va;

BUG_ON(!irqs_disabled());
if (unlikely(dmem == NULL || wmem == NULL))
goto out; /* no buffer, so can't compress */
from_va = kmap_atomic(from, KM_USER0);
mb();
ret = lzo1x_1_compress(from_va, PAGE_SIZE, dmem, out_len, wmem);
^^^^^^^^^

tmem is called from frontswap_put_page.

+int __frontswap_put_page(struct page *page)
+{
+ int ret = -1, dup = 0;
+ swp_entry_t entry = { .val = page_private(page), };
+ int type = swp_type(entry);
+ struct swap_info_struct *sis = swap_info[type];
+ pgoff_t offset = swp_offset(entry);
+
+ BUG_ON(!PageLocked(page));
+ BUG_ON(sis == NULL);
+ if (frontswap_test(sis, offset))
+ dup = 1;
+ ret = (*frontswap_ops.put_page)(type, offset, page);

In turn called by swap_writepage:

@@ -98,6 +99,12 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
unlock_page(page);
goto out;
}
+ if (frontswap_put_page(page) == 0) {
+ set_page_writeback(page);
+ unlock_page(page);
+ end_page_writeback(page);
+ goto out;
+ }
bio = get_swap_bio(GFP_NOIO, page, end_swap_bio_write);
if (bio == NULL) {
set_page_dirty(page);


And zcache-main.c has #ifdef for both frontswap and cleancache, and
the above frontswap_ops.put_page points to the below
zcache_frontswap_put_page which even shows a local_irq_save() for the
whole time of the compression... did you ever check irq latency with
zcache+frontswap? Wonder what the RT folks will say about
zcache+frontswap considering local_irq_save is a blocker for preempt-RT.

#ifdef CONFIG_CLEANCACHE
#include <linux/cleancache.h>
#endif
#ifdef CONFIG_FRONTSWAP
#include <linux/frontswap.h>
#endif

#ifdef CONFIG_FRONTSWAP
/* a single tmem poolid is used for all frontswap "types" (swapfiles) */
static int zcache_frontswap_poolid = -1;

/*
* Swizzling increases objects per swaptype, increasing tmem concurrency
* for heavy swaploads. Later, larger nr_cpus -> larger SWIZ_BITS
*/
#define SWIZ_BITS 4
#define SWIZ_MASK ((1 << SWIZ_BITS) - 1)
#define _oswiz(_type, _ind) ((_type << SWIZ_BITS) | (_ind & SWIZ_MASK))
#define iswiz(_ind) (_ind >> SWIZ_BITS)

static inline struct tmem_oid oswiz(unsigned type, u32 ind)
{
struct tmem_oid oid = { .oid = { 0 } };
oid.oid[0] = _oswiz(type, ind);
return oid;
}

static int zcache_frontswap_put_page(unsigned type, pgoff_t offset,
struct page *page)
{
u64 ind64 = (u64)offset;
u32 ind = (u32)offset;
struct tmem_oid oid = oswiz(type, ind);
int ret = -1;
unsigned long flags;

BUG_ON(!PageLocked(page));
if (likely(ind64 == ind)) {
local_irq_save(flags);
ret = zcache_put_page(LOCAL_CLIENT, zcache_frontswap_poolid,
&oid, iswiz(ind), page);
local_irq_restore(flags);
}
return ret;
}

/* returns 0 if the page was successfully gotten from frontswap, -1 if
* was not present (should never happen!) */
static int zcache_frontswap_get_page(unsigned type, pgoff_t offset,
struct page *page)
{
u64 ind64 = (u64)offset;
u32 ind = (u32)offset;
struct tmem_oid oid = oswiz(type, ind);
int ret = -1;

BUG_ON(!PageLocked(page));
if (likely(ind64 == ind))
ret = zcache_get_page(LOCAL_CLIENT, zcache_frontswap_poolid,
&oid, iswiz(ind), page);
return ret;
}

/* flush a single page from frontswap */
static void zcache_frontswap_flush_page(unsigned type, pgoff_t offset)
{
u64 ind64 = (u64)offset;
u32 ind = (u32)offset;
struct tmem_oid oid = oswiz(type, ind);

if (likely(ind64 == ind))
(void)zcache_flush_page(LOCAL_CLIENT, zcache_frontswap_poolid,
&oid, iswiz(ind));
}

/* flush all pages from the passed swaptype */
static void zcache_frontswap_flush_area(unsigned type)
{
struct tmem_oid oid;
int ind;

for (ind = SWIZ_MASK; ind >= 0; ind--) {
oid = oswiz(type, ind);
(void)zcache_flush_object(LOCAL_CLIENT,
zcache_frontswap_poolid, &oid);
}
}

static void zcache_frontswap_init(unsigned ignored)
{
/* a single tmem poolid is used for all frontswap "types" (swapfiles) */
if (zcache_frontswap_poolid < 0)
zcache_frontswap_poolid =
zcache_new_pool(LOCAL_CLIENT, TMEM_POOL_PERSIST);
}

static struct frontswap_ops zcache_frontswap_ops = {
.put_page = zcache_frontswap_put_page,
.get_page = zcache_frontswap_get_page,
.flush_page = zcache_frontswap_flush_page,
.flush_area = zcache_frontswap_flush_area,
.init = zcache_frontswap_init
};

struct frontswap_ops zcache_frontswap_register_ops(void)
{
struct frontswap_ops old_ops =
frontswap_register_ops(&zcache_frontswap_ops);

return old_ops;
}
#endif

#ifdef CONFIG_CLEANCACHE
static void zcache_cleancache_put_page(int pool_id,
struct cleancache_filekey key,
pgoff_t index, struct page *page)
{
u32 ind = (u32) index;
struct tmem_oid oid = *(struct tmem_oid *)&key;

if (likely(ind == index))
(void)zcache_put_page(LOCAL_CLIENT, pool_id, &oid, index, page);
}

static int zcache_cleancache_get_page(int pool_id,
struct cleancache_filekey key,
pgoff_t index, struct page *page)
{
u32 ind = (u32) index;
struct tmem_oid oid = *(struct tmem_oid *)&key;
int ret = -1;

if (likely(ind == index))
ret = zcache_get_page(LOCAL_CLIENT, pool_id, &oid, index, page);
return ret;
}

static void zcache_cleancache_flush_page(int pool_id,
struct cleancache_filekey key,
pgoff_t index)
{
u32 ind = (u32) index;
struct tmem_oid oid = *(struct tmem_oid *)&key;

if (likely(ind == index))
(void)zcache_flush_page(LOCAL_CLIENT, pool_id, &oid, ind);
}

static void zcache_cleancache_flush_inode(int pool_id,
struct cleancache_filekey key)
{
struct tmem_oid oid = *(struct tmem_oid *)&key;

(void)zcache_flush_object(LOCAL_CLIENT, pool_id, &oid);
}

static void zcache_cleancache_flush_fs(int pool_id)
{
if (pool_id >= 0)
(void)zcache_destroy_pool(LOCAL_CLIENT, pool_id);
}

static int zcache_cleancache_init_fs(size_t pagesize)
{
BUG_ON(sizeof(struct cleancache_filekey) !=
sizeof(struct tmem_oid));
BUG_ON(pagesize != PAGE_SIZE);
return zcache_new_pool(LOCAL_CLIENT, 0);
}

static int zcache_cleancache_init_shared_fs(char *uuid, size_t pagesize)
{
/* shared pools are unsupported and map to private */
BUG_ON(sizeof(struct cleancache_filekey) !=
sizeof(struct tmem_oid));
BUG_ON(pagesize != PAGE_SIZE);
return zcache_new_pool(LOCAL_CLIENT, 0);
}

static struct cleancache_ops zcache_cleancache_ops = {
.put_page = zcache_cleancache_put_page,
.get_page = zcache_cleancache_get_page,
.flush_page = zcache_cleancache_flush_page,
.flush_inode = zcache_cleancache_flush_inode,
.flush_fs = zcache_cleancache_flush_fs,
.init_shared_fs = zcache_cleancache_init_shared_fs,
.init_fs = zcache_cleancache_init_fs
};

struct cleancache_ops zcache_cleancache_register_ops(void)
{
struct cleancache_ops old_ops =
cleancache_register_ops(&zcache_cleancache_ops);

return old_ops;
}
#endif

This zcache functionality is anything but pluggable if you have to create a
slightly different zcache implementation for each user
(frontswap/cleancache etc...). And the cast of the page when it enters
tmem to char:

static int zcache_put_page(int cli_id, int pool_id, struct tmem_oid *oidp,
uint32_t index, struct page *page)
{
struct tmem_pool *pool;
int ret = -1;

BUG_ON(!irqs_disabled());
pool = zcache_get_pool_by_id(cli_id, pool_id);
if (unlikely(pool == NULL))
goto out;
if (!zcache_freeze && zcache_do_preload(pool) == 0) {
/* preload does preempt_disable on success */
ret = tmem_put(pool, oidp, index, (char *)(page),
PAGE_SIZE, 0, is_ephemeral(pool));

It's so weird... and then it returns a page when it exits tmem and
enters zcache again in zcache_pampd_create.

And the "len" get lost at some point inside zcache but I guess that's
fixable and not part of the API at least.... but the whole thing looks
an exercise to pass through tmem. I don't really understand why one
page must become a char at some point and what benefit it would ever
provide.

I also don't understand how you plan to ever swap the compressed data
considering it's held outside of the kernel, no longer in a struct
page. If swap compression was done right, the on-disk data should be
stored in the compressed format in a compact way so you spend the CPU
once and you also gain disk speed by writing less. How do you plan to
achieve this with this design?

I like the failing when the size of the compressed data is bigger than
the uncompressed one, only in that case the data should go to swap
uncompressed of course. That's something in software we can handle and
hardware can't handle so well and that's why some older hardware
compression for RAM probably didn't take off.

I have a hard time being convinced this is the best way to do swap
compression, especially not seeing how it will ever reach swap on
disk. But yes it's not doing an additional copy unlike the tmem_put
comment would imply (it's disabling irqs for the whole duration of the
compression though).

Rik van Riel
2011-11-02 20:51:31 UTC
Permalink
On 10/31/2011 07:36 PM, Dan Magenheimer wrote:
>> From: Andrea Arcangeli [mailto:***@redhat.com]

>>> real work to do instead and (2) that vmexit/vmenter is horribly
>>
>> Sure the CPU has another 1000 VM to schedule. This is like saying
>> virtio-blk isn't needed on desktop virt because the desktop isn't
>> doing much I/O. Absurd argument, there are another 1000 desktops doing
>> I/O at the same time of course.
>
> But this is truly different, I think at least for the most common
> cases, because the guest is essentially out of physical memory if it
> is swapping. And the vmexit/vmenter (I assume, I don't really
> know KVM) gives the KVM scheduler the opportunity to schedule
> another of those 1000 VMs if it wishes.

I believe the problem Andrea is trying to point out here is
that the proposed API cannot handle a batch of pages to be
pushed into frontswap/cleancache at one time.

Even if the current back-end implementations are synchronous
and can only do one page at a time, I believe it would still
be a good idea to have the API able to handle a vector with
a bunch of pages all at once.

That way we can optimize the back-ends as required, at some
later point in time.

If enough people start using tmem, such bottlenecks will show
up at some point :)

Dan Magenheimer
2011-11-01 16:41:38 UTC
Permalink
> From: Andrea Arcangeli [mailto:***@redhat.com]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
>
> On Mon, Oct 31, 2011 at 04:36:04PM -0700, Dan Magenheimer wrote:
> > Do you see code doing this? I am pretty sure zcache is
> > NOT doing an extra copy, it is compressing from the source
> > page. And I am pretty sure Xen tmem is not doing the
> > extra copy either.
>
> So below you describe put as a copy of a page from the kernel into the
> newly allocated PAM space... I guess there's some improvement needed
> for the documentation at least, a compression is done sometime instead
> of a copy... I thought you always had to copy first sorry.

I suppose this documentation (note, it is in drivers/staging/zcache,
not in the proposed frontswap patchset) could be misleading. It is
really tough in a short comment to balance between describing
the general concept to readers trying to understand the big
picture, and the high level of detail needed if you are trying
to really understand what is going on in the code. But one
can always read the code.

> zcache_compress then does:
>
> ret = lzo1x_1_compress(from_va, PAGE_SIZE, dmem, out_len, wmem);
> ^^^^^^^^
>
> tmem is called from frontswap_put_page.
>
> In turn called by swap_writepage:
>
> the above frontswap_ops.put_page points to the below
> zcache_frontswap_put_page which even shows a local_irq_save() for the
> whole time of the compression... did you ever check irq latency with
> zcache+frontswap? Wonder what the RT folks will say about
> zcache+frontswap considering local_irq_save is a blocker for preempt-RT.

This is a known problem: zcache is currently not very
good for high-response RT environments because it currently
compresses a page of data with interrupts disabled, which
takes (IIRC) about 20000 cycles. (I suspect though, without proof,
that this is not the worst irq-disabled path in the kernel.)
As noted earlier, this is fixable at the cost of the extra copy
which could be implemented as an option later if needed.
Or, as always, the RT folks can just not enable zcache.
Or maybe smarter developers than me will find a solution
that will work even better.
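
For the record, that "extra copy" option would be roughly the two-phase
flow below (sketch only, with a stub standing in for the compressor;
none of this is actual zcache code):

/* Phase 1 copies under the short irq-disabled window, phase 2 compresses
 * the stable snapshot with interrupts enabled (e.g. from a workqueue). */
#include <string.h>
#include <stddef.h>

#define PAGE_SIZE 4096

static char staging[PAGE_SIZE];        /* would be a per-cpu buffer */
static char dstmem[2 * PAGE_SIZE];

static size_t stub_compress(const char *in, char *out, size_t len)
{
	memcpy(out, in, len);          /* pretend this is lzo1x_1_compress */
	return len;
}

static size_t put_with_copy(const char *page)
{
	/* Phase 1: in the kernel this memcpy would run under
	 * local_irq_save(), and nothing else. */
	memcpy(staging, page, PAGE_SIZE);

	/* Phase 2: compress at leisure, outside the irq-disabled window. */
	return stub_compress(staging, dstmem, PAGE_SIZE);
}

int main(void)
{
	char page[PAGE_SIZE] = { 0 };

	return put_with_copy(page) == PAGE_SIZE ? 0 : 1;
}

The cost is exactly the extra page copy being discussed in this thread,
which is why no patch implements it yet.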

Also, yes, as I said, zcache currently is written to assume
4k pagesize, but the tmem.c code/API (see below for more
on that file) is pagesize-independent.

> And zcache-main.c has #ifdef for both frontswap and cleancache
> #ifdef CONFIG_CLEANCACHE
> #include <linux/cleancache.h>
> #endif
> #ifdef CONFIG_FRONTSWAP
> #include <linux/frontswap.h>
> #endif

Yeah, remember zcache was merged before either cleancache or
frontswap, so this ugliness was necessary to get around the
chicken-and-egg problem. Zcache will definitely need some
work before it is ready to move out of staging, and your
feedback here is useful for that, but I don't see that as
condemning frontswap, do you?

> This zcache functionality is anything but pluggable if you have to create a
> slightly different zcache implementation for each user
> (frontswap/cleancache etc...).

Not quite sure what you are saying here, but IIUC, the alternative
was to push the tmem semantics up into the hooks (e.g. into
swapfile.c). This is what the very first tmem patch did, before
I was advised to (1) split cleancache and frontswap so that
they could be reviewed separately and (2) move the details
of tmem into a different "layer" (cleancache.c/h and frontswap.c/h).
So in order to move ugliness out of the core VM, a bit more
ugliness is required in the tmem shim/backend.

> struct page *page = (struct page *)(data);
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> And the cast of the page when it enters
> tmem to char:
>
> ret = tmem_put(pool, oidp, index, (char *)(page),
> PAGE_SIZE, 0, is_ephemeral(pool));
>
> It's so weird... and then it returns a page when it exits tmem and
> enters zcache again in zcache_pampd_create.
>
> And the "len" get lost at some point inside zcache but I guess that's
> fixable and not part of the API at least.... but the whole thing looks
> an exercise to pass through tmem. I don't really understand why one
> page must become a char at some point and what benefit it would ever
> provide.

This is the "fix highmem" bug fix from Seth Jennings. The file
tmem.c in zcache is an attempt to separate out the core tmem
functionality and data structures so that it can (eventually)
be in the lib/ directory and be used by multiple backends.
(RAMster uses tmem.c unchanged.) The code in tmem.c reflects
my "highmem-blindness" in that a single pointer is assumed to
be able to address the "PAMPD" (as opposed to a struct page *
and an offset, necessary for a 32-bit highmem system). Seth
cleverly discovered this ugly two-line fix that (at least for now)
avoided major mods to tmem.c.

> I also don't understand how you plan to ever swap the compressed data
> considering it's held outside of the kernel, no longer in a struct
> page. If swap compression was done right, the on-disk data should be
> stored in the compressed format in a compact way so you spend the CPU
> once and you also gain disk speed by writing less. How do you plan to
> achieve this with this design?

First ignoring frontswap, there is currently no way to move a
page of swap data from one swap device to another swap device
except by moving it first into RAM (in the swap cache), right?
Frontswap doesn't solve that problem either, though it would
be cool if it could. The "partial swapoff" functionality
in the patch, added so that it can be called from frontswap_shrink,
enables pages to be pulled out of frontswap into swap cache
so that they can be moved if desired/necessary onto a real
swap device.

The selfballooning code in drivers/xen calls frontswap_shrink
to pull swap pages out of the Xen hypervisor when memory pressure
is reduced. Frontswap_shrink is not yet called from zcache.
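
Just to illustrate the kind of policy a caller could use (sketch only;
I am assuming the frontswap_shrink(target_pages) form from the posted
series, and frontswap_curr_pages() here is just a stand-in stub):

#include <stdio.h>

/* Userspace stubs standing in for the kernel-side calls. */
static unsigned long frontswap_curr_pages(void)
{
	return 1024;                   /* pretend 1024 pages are in frontswap */
}

static void frontswap_shrink(unsigned long target_pages)
{
	printf("shrink frontswap to %lu pages\n", target_pages);
}

/* When memory pressure eases, pull half of the frontswap pages back
 * into the swap cache so they can reach a real swap device if needed. */
static void relieve_frontswap(void)
{
	unsigned long cur = frontswap_curr_pages();

	if (cur)
		frontswap_shrink(cur / 2);
}

int main(void)
{
	relieve_frontswap();
	return 0;
}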

Note, however, that unlike swap-disks, compressed pages in
frontswap CAN be silently moved to another "device". This is
the foundation of RAMster, which moves those compressed pages
to the RAM of another machine. The device _could_ be some
special type of real-swap-disk, I suppose.

> I like the failing when the size of the compressed data is bigger than
> the uncompressed one, only in that case the data should go to swap
> uncompressed of course. That's something in software we can handle and
> hardware can't handle so well and that's why some older hardware
> compression for RAM probably didn't take off.

Yes, this is a good example of the most important feature of
tmem/frontswap: Every frontswap_put can be rejected for whatever reason
the tmem backend chooses, entirely dynamically. Not only is it true
that hardware can't handle this well, but the Linux block I/O subsystem
can't handle it either. I've suggested in the frontswap documentation
that this is also a key to allowing "mixed RAM + phase-change RAM"
systems to be useful.
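
As a concrete toy example of the backend saying "no" (stub compressor,
made-up names, nothing here is the real zcache code):

#include <string.h>
#include <stddef.h>

#define PAGE_SIZE 4096
/* Hypothetical policy: only keep pages that compress below 3/4 of a page. */
#define ACCEPT_THRESHOLD ((PAGE_SIZE * 3) / 4)

static char pool[PAGE_SIZE];           /* stands in for tmem/zcache storage */

static size_t stub_compress(const char *in, char *out)
{
	memcpy(out, in, PAGE_SIZE);    /* pretend the page is incompressible */
	return PAGE_SIZE;
}

/* Returns 0 on success; any other value sends the page to disk instead,
 * just like a failed frontswap put in the swap_writepage() hook above. */
static int backend_store(const char *page)
{
	char buf[2 * PAGE_SIZE];
	size_t clen = stub_compress(page, buf);

	if (clen > ACCEPT_THRESHOLD)
		return -1;             /* reject: not worth keeping in RAM */
	memcpy(pool, buf, clen);
	return 0;
}

int main(void)
{
	char page[PAGE_SIZE] = { 0 };

	return backend_store(page) == -1 ? 0 : 1;
}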

Also I think this is also why many linux vm/vfs/fs/bio developers
"don't like it much" (where "it" is cleancache or frontswap).
They are not used to losing control of data to some other
non-kernel-controlled entity and not used to being told "NO"
when they are trying to move data somewhere. IOW, they are
control freaks and tmem is out of their control so it must
be defeated ;-)

> I have a hard time being convinced this is the best way to do swap
> compression, especially not seeing how it will ever reach swap on
> disk. But yes it's not doing an additional copy unlike the tmem_put
> comment would imply (it's disabling irqs for the whole duration of the
> compression though).

I hope the earlier explanation about frontswap_shrink helps.
It's also good to note that the only other successful Linux
implementation of swap compression is zram, and zram's
creator fully supports frontswap (https://lkml.org/lkml/2011/10/28/8)

So where are we now? Are you now supportive of merging
frontswap? If not, can you suggest any concrete steps
that will gain your support?

Thanks,
Dan

Andrea Arcangeli
2011-11-01 18:07:02 UTC
Permalink
On Tue, Nov 01, 2011 at 09:41:38AM -0700, Dan Magenheimer wrote:
> I suppose this documentation (note, it is in drivers/staging/zcache,
> not in the proposed frontswap patchset) could be misleading. It is

Yep, I got the comment from tmem.c in staging, and the lwn link I
read before reading the tmem_put comment also only mentioned
tmem_put doing a copy. So I erroneously assumed that all memory
passing through tmem was being copied and you lost reference of the
"struct page" when it entered zcache.

But instead there is this obscure cast of a "struct page *" to a "char
*", that is casted back to a struct page * from a char * in zcache
code, and kmap() runs on the page, to avoid the unnecessary copy.

So far so good, now the question is why do you have that cast at all?

I mean it's hard to be convinced of the sanity of an API that
requires the caller to cast a "struct page *" to a "char *" to run
zerocopy. And well that is the very core tmem_put API I'm talking
about.

I assume the explanation of the cast is: before it was passing
page_address(page) to tmem, but that breaks with highmem because
highmem requires kmap(page). So then you cast the page.

This basically proves the API must be fixed. In the kernel we work
with _pages_ not char *, exactly for this reason, and tmem_put must be
fixed to take a page structure. (in fact better would be an array of
pages and ranges start/end for each entry in the array but hey at
least a page+len would be sane). A char * is flawed and the cast of
the page to char * and back to struct page, kind of proves it. So I
think that must be fixed in tmem_put. Unfortunately it's already
merged with this cast back and forth in the upstream kernel.

About the rest of zcache I think it's interesting but because it works
inside tmem I'm unsure how we're going to write it to disk.

It would be nice to understand why the local_irq_save is needed for
frontswap but not for pagecache. All that VM code never runs from
irqs, so it's hard to see how the irq disabling is relevant. A big fat
comment on why local_irq_save is needed in zcache code (in staging
already) would be helpful. Maybe it's tmem that can run from irq? The
only thing running from irqs is the tlb flush and I/O completion
handlers, everything else in the VM isn't irq/softirq driven so we
never have to clear irqs.

My feeling is this zcache should be based on a memory pool abstraction
that we can write to disk with a bio and working with "pages".

I'm also not sure how you balance the pressure in the tmem pool, when
you fail the allocation and swap to disk, or when you keep moving to
compressed swap.

> This is a known problem: zcache is currently not very
> good for high-response RT environments because it currently
> compresses a page of data with interrupts disabled, which
> takes (IIRC) about 20000 cycles. (I suspect though, without proof,
> that this is not the worst irq-disabled path in the kernel.)

That's certainly more than the irq latency so it's probably something
the rt folks don't want and yes they should keep it in mind not to use
frontswap+zcache in embedded RT environments.

Besides there was no benchmark comparing zram performance to zcache
performance so latency aside we miss a lot of info.

> As noted earlier, this is fixable at the cost of the extra copy
> which could be implemented as an option later if needed.
> Or, as always, the RT folks can just not enable zcache.
> Or maybe smarter developers than me will find a solution
> that will work even better.

And what is the exact reason for the local_irq_save when doing it
zerocopy?

> Yeah, remember zcache was merged before either cleancache or
> frontswap, so this ugliness was necessary to get around the
> chicken-and-egg problem. Zcache will definitely need some
> work before it is ready to move out of staging, and your
> feedback here is useful for that, but I don't see that as
> condemning frontswap, do you?

What I'd like is a mechanism where you:

1) add swapcache to zcache (with fallback to swap immediately if zcache
allocation fails)

2) when some threshold is hit or zcache allocation fails, we write the
compressed data in a compact way to swap (freeing zcache memory),
or swapcache directly to swap if no zcache is present

3) newly added swapcache is added to zcache (old zcache was written to
swap device compressed and freed)

Once we already did the compression it's silly to write to disk the
uncompressed data. Ok initially it's ok because compacting the stuff
on disk is super tricky but we want a design that will allow writing
the zcache to disk and add new swapcache to zcache, instead of the
current way of swapping the new swapcache to disk uncompressed and not
being able to writeout the compressed zcache.

If nobody called zcache_get and uncompressed it, it means it's
probably less likely to be used than the newly added swapcache that
wants to be compressed.
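
In pseudo-C the flow I have in mind is roughly this (none of these
helpers exist, it's only meant to show the routing of a new swapcache
page):

/* Illustrative only: all helpers below are invented for this sketch. */
enum dest { TO_ZCACHE, TO_DISK_RAW };

static int zcache_has_room(void)                     { return 0; }  /* pretend full */
static int zcache_store_compressed(void)             { return -1; }
static int writeback_oldest_compressed_to_swap(void) { return 0; }

static enum dest route_swap_page(void)
{
	/* point 1: new swapcache goes to zcache when possible */
	if (zcache_has_room() && zcache_store_compressed() == 0)
		return TO_ZCACHE;

	/* point 2: make room by writing the oldest *compressed* data to the
	 * swap device in a compact format, instead of writing the new page raw */
	if (writeback_oldest_compressed_to_swap() == 0 &&
	    zcache_store_compressed() == 0)
		return TO_ZCACHE;      /* point 3: the new data stays in RAM */

	return TO_DISK_RAW;            /* fallback: ordinary uncompressed swapout */
}

int main(void)
{
	return route_swap_page() == TO_DISK_RAW ? 0 : 1;
}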

I'm afraid adding frontswap in this form will still get us stuck in
the wrong model and most of it will have to be dropped and rewritten
to do just the above 3 points I described to do proper swap
compression.

Also I'm skeptical we need to pass through tmem at all to do that. I
mean done right the swap compression could be a feature to enable
across the board without needing tmem at all. Then if you want to add
ramster just add a frontswap on the already compressed
swapcache... before it goes to the hard swap device.

The final swap design must also include the pre-swapout from Avi by
writing data to swapcache in advance and relying on the dirty bit to
rewrite it. And the pre-swapin as well (original idea from Con). The
pre-swapout would need to stop before compressing. The pre-swapin
should stop before decompressing.

I mean I see a huge potential for improvement in the swap space, but
I guess most are busy with more pressing issues; like James said, most
data centers don't use swap, desktop is irrelevant, and android (as
relevant as the data center) doesn't use swap.

But your improvements to frontswap don't look like the right direction if
you really want to improve swap for the long term. It may be better
than nothing but I don't see it going the way it should go, and I'd
prefer to remove the tmem dependency from zcache altogether. Zcache
alone would be way more interesting.

And tmem_put must be fixed to take a page; that cast of a page to
char *, to avoid crashing on highmem, is not allowed.

Of course I didn't have the time to read 100% of the code so please
correct me again if I misunderstood something.

> This is the "fix highmem" bug fix from Seth Jennings. The file
> tmem.c in zcache is an attempt to separate out the core tmem
> functionality and data structures so that it can (eventually)
> be in the lib/ directory and be used by multiple backends.
> (RAMster uses tmem.c unchanged.) The code in tmem.c reflects
> my "highmem-blindness" in that a single pointer is assumed to
> be able to address the "PAMPD" (as opposed to a struct page *
> and an offset, necessary for a 32-bit highmem system). Seth
> cleverly discovered this ugly two-line fix that (at least for now)
> avoided major mods to tmem.c.

Well you need to do the major mods, it's not ok to do that cast,
passing pages is correct instead. Let's fix the tmem_put API before
people can use it wrong. Maybe then I'll dislike passing through tmem
less? Dunno.

int tmem_put(struct tmem_pool *pool, struct tmem_oid *oidp, uint32_t index,
- char *data, size_t size, bool raw, bool ephemeral)
+ struct page *page, size_t size, bool raw, bool ephemeral)


> First ignoring frontswap, there is currently no way to move a
> page of swap data from one swap device to another swap device
> except by moving it first into RAM (in the swap cache), right?

Yes.

> Frontswap doesn't solve that problem either, though it would
> be cool if it could. The "partial swapoff" functionality
> in the patch, added so that it can be called from frontswap_shrink,
> enables pages to be pulled out of frontswap into swap cache
> so that they can be moved if desired/necessary onto a real
> swap device.

The whole logic deciding the size of the frontswap zcache is going to
be messy. But to do the real swapout you should not pull the memory
out of frontswap zcache, you should write it to disk compacted and
compressed compared to how it was inserted in frontswap... That would
be the ideal.

> The selfballooning code in drivers/xen calls frontswap_shrink
> to pull swap pages out of the Xen hypervisor when memory pressure
> is reduced. Frontswap_shrink is not yet called from zcache.

So I wonder how zcache is dealing with the dynamic size. Or has it a
fixed size? How do you pull pages out of zcache to max out the real
RAM availability?

> Note, however, that unlike swap-disks, compressed pages in
> frontswap CAN be silently moved to another "device". This is
> the foundation of RAMster, which moves those compressed pages
> to the RAM of another machine. The device _could_ be some
> special type of real-swap-disk, I suppose.

Yeah you can do ramster with frontswap+zcache but not writing the
zcache to disk into the swap device. Writing to disk doesn't require
new allocations. Migrating to another node does. And you must deal with
OOM conditions there, or it'll deadlock. So the baseline should be to
write compressed data to disk (which at least can be done reliably for
swapcache, unlike ramster which has the same issues as nfs swapping
and nbd swapping and iscsi swapping) before wondering whether to send it to
another node.

> Yes, this is a good example of the most important feature of
> tmem/frontswap: Every frontswap_put can be rejected for whatever reason
> the tmem backend chooses, entirely dynamically. Not only is it true
> that hardware can't handle this well, but the Linux block I/O subsystem
> can't handle it either. I've suggested in the frontswap documentation
> that this is also a key to allowing "mixed RAM + phase-change RAM"
> systems to be useful.

Yes, what is not clear is how the size of the zcache is chosen.

> Also I think this is also why many linux vm/vfs/fs/bio developers
> "don't like it much" (where "it" is cleancache or frontswap).
> They are not used to losing control of data to some other
> non-kernel-controlled entity and not used to being told "NO"
> when they are trying to move data somewhere. IOW, they are
> control freaks and tmem is out of their control so it must
> be defeated ;-)

Either tmem works on something that is a core MM structure and is
compatible with all bios and operations we can want to do on memory,
or I have a hard time thinking it's a good thing to try to make the
memory it handles not-kernel-controlled.

This non-kernel-controlled approach to me looks like exactly a
requirement coming from Xen, not really something useful.

There is no reason why a kernel abstraction should stay away from
using kernel data structures like "struct page" just to cast it back
from char * to struct page * when it needs to handle highmem in
zcache. Something seriously wrong is going on there in API terms so
you can start by fixing that bit.

> I hope the earlier explanation about frontswap_shrink helps.
> It's also good to note that the only other successful Linux
> implementation of swap compression is zram, and zram's
> creator fully supports frontswap (https://lkml.org/lkml/2011/10/28/8)
>
> So where are we now? Are you now supportive of merging
> frontswap? If not, can you suggest any concrete steps
> that will gain your support?

My problem is that this is like zram: as mentioned, it only solves the
compression part. There is no way it can store the compressed data on
disk. And this is way more complex than zram, and it only makes the
pooling size not fixed at swapon time... so very very small gain and
huge complexity added (again compared to zram). zram in fact required
absolutely zero changes to the VM. So it's hard to see how this is
overall better than zram. If we deal with that amount of complexity we
should at least be a little better than zram at runtime, while this is
the same.

Dan Magenheimer
2011-11-02 19:06:02 UTC
Permalink
> From: Andrea Arcangeli [mailto:***@redhat.com]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
>
> Hi Dan.
>
> On Tue, Nov 01, 2011 at 02:00:34PM -0700, Dan Magenheimer wrote:
> > Pardon me for complaining about my typing fingers, but it seems
> > like you are making statements and asking questions as if you
> > are not reading the whole reply before you start responding
> > to the first parts. So it's going to be hard to answer each
> > sub-thread in order. So let me hit a couple of the high
> > points first.
>
> I'm actually reading all your reply, if I skip some part it may be
> because the email is too long already :). I'm just trying to
> understand it and I wish I had more time to dedicate to this too but
> I've other pending stuff too.

Hi Andrea --

First, let me apologize for yesterday. I was unnecessarily
sarcastic and disrespectful, and I am sorry. I very much appreciate
your time and discussion, and good hard technical questions
that have allowed me to clarify some of the design and
implementation under discussion.

I agree this email is too long, though it has been very useful.
You've got some great feedback and insights in improving
zcache, so let me be the first to cry "uncle" (surrender)
and cut to the end....

> If you confirm it's free to go and there's no ABI/API we get stuck
> into, I'm fairly positive about it, it's clearly "alpha" feature
> behavior (almost no improvement with zram today) but it could very
> well be in the right direction and give huge benefit compared to zram
> in the future. I definitely don't pretend things to be perfect... but
> they must be in the right design direction for me to be sold off on
> those. Just like KVM in virt space.

Confirmed. Anything below the "struct frontswap_ops" (and
"struct cleancache_ops), that is anything in the staging/zcache
directory, is wide open for your ideas and improvement.
In fact, I would very much welcome your contribution and
I think IBM and Nitin would also.

Thanks,
Dan

Andrea Arcangeli
2011-11-03 00:32:54 UTC
Permalink
On Wed, Nov 02, 2011 at 12:06:02PM -0700, Dan Magenheimer wrote:
> First, let me apologize for yesterday. I was unnecessarily
> sarcastic and disrespectful, and I am sorry. I very much appreciate
> your time and discussion, and good hard technical questions
> that have allowed me to clarify some of the design and
> implementation under discussion.

No problem, I know it must be frustrating to wait so long to get
something merged.

Like somebody already pointed out (and I agree) it'd be nice to get
the patches posted to the mailing list (with git send-email/hg
email/quilt) and get them merged into -mm first.

About the subject, git is a super powerful tool, its design saved our
day with kernel.org too. Awesome backed design (I have to admit way
better then mercurial backend in the end, well after the packs have
been introduced) [despite the user interface is still horrible in my
view but it's very well worth the pain to learn to take advantage of
the backend]. The pulls are extremely scalable way to merge stuff, but
they tends to hide stuff and the VM/MM is such a critical piece of the
kernel that in my view it's probably better to go through the pain of
patchbombing linux-mm (maybe not lkml) and pass through -mm for
merging. It's a less scalable approach but it will get more eyes on
the code and if just a single bug is noticed that way, we all win. So
I think you could try to submit the origin/master..origin/tmem with
Andrew and Hugh in CC and see if more comments show up.

> I agree this email is too long, though it has been very useful.

Sure useful to me. I think it's normal and healthy if it gets down to
more lowlevel issues and long emails... There are still a couple of
unanswered issues left in that mail but they're not major if they can be
fixed.

> Confirmed. Anything below the "struct frontswap_ops" (and
> "struct cleancache_ops), that is anything in the staging/zcache
> directory, is wide open for your ideas and improvement.
> In fact, I would very much welcome your contribution and
> I think IBM and Nitin would also.

Thanks. So this overall sounds fairly positive (or at least better
than neutral) to me.

The VM camp is large so it'd be nice to get comments from others too,
especially if they had time to read our exchange to see if their
concerns were similar to mine. Hugh's knowledge of the swap path would
really help (last time he added swapping to KSM).

On my side I hope it gets improved over time to get the best out of
it. I've not been hugely impressed so far because at this point in
time it doesn't seem a vast improvement in runtime behavior compared
to what zram could provide, like Rik said there's no iov/SG/vectored
input to tmem_put (which I'd find more intuitive renamed to
tmem_store), like Avi said ramster is synchronous and not good having
to wait a long time. But if we can make these plugins stackable and we
can put a storage backend at the end we could do
storage+zcache+frontswap.

It needs to have future potential to be worthwhile considering it's
not self contained and modifies the core VM actively in a way that
must be maintained over time. I think I already clarified myself well
enough in the prev long email to explain the reasons that would
make me like it or not. And well, if I don't like it, it wouldn't mean it
won't get merged; like I wrote in the prev mail, it's not my decision and I
understand the distro issues you pointed out.

Now that you have cleared up the fact that there is no API/ABI in the
staging/zcache directory to worry about, frankly I'm a lot more happy,
I thought at some point Xen would get into the equation in the tmem
code. So I certainly don't want to take the slightest risk of stifling
innovation saying no to something that makes sense and is free to
evolve :).

Dan Magenheimer
2011-11-03 22:29:34 UTC
Permalink
> From: Andrea Arcangeli [mailto:***@redhat.com]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)

Hi Andrea --

Sorry for the delayed response... and for continuing this
thread further, but I want to ensure I answer your
points.

First, did you see my reply to Rik that suggested a design
as to how KVM could do batching with no change to the
hooks or frontswap_ops API? (Basically a guest-side
cache and add a batching op to the KVM-tmem ABI.) I think
it resolves your last remaining concern (too many vmexits),
so am eager to see if you agree.

> Like somebody already pointed out (and I agree) it'd be nice to get
> the patches posted to the mailing list (with git send-emails/hg

Frontswap v10 https://lkml.org/lkml/2011/9/15/367 as last posted
to linux-mm has identical code to the git commits... in response
to Konrad and Kame, the commit-set was slightly reorganized and
extended from 6 commits to 8, but absolutely no code differences.
Since no code was changed between v10 and v11, I didn't repost v11
to linux-mm.

Note, every version of frontswap was posted to linux-mm and
cc'ed to Andrew, Hugh, Nick and Rik and I was very diligent
in responding to all comments... Wish I would have
cc'ed you all along as this has been a great discussion.

> email/quilt) and get them merged into -mm first.

Sorry, I'm still a newbie on this process, but just to clarify,
"into -mm" means Andrew merges the patches, right? Andrew
said in the first snippet of https://lkml.org/lkml/2011/11/1/317
that linux-next is fine, so I'm not sure whether to follow your
advice or not.

> Thanks. So this overall sounds fairly positive (or at least better
> than neutral) to me.

Excellent!

> On my side I hope it get improved over time to get the best out of
> it. I've not been hugely impressed so far because at this point in
> time it doesn't seem a vast improvement in runtime behavior compared
> to what zram could provide, like Rik said there's no iov/SG/vectored
> input to tmem_put (which I'd find more intuitive renamed to
> tmem_store), like Avi said ramster is synchronous and not good having
> to wait a long time. But if we can make these plugins stackable and we
> can put a storage backend at the end we could do
> storage+zcache+frontswap.

This thread has been so long, I don't even remember what I've
replied to who, so just to clarify on these several points,
in case you didn't see these elsewhere in the thread:

- Nitin Gupta, author of zram, thinks zcache is an improvement
over zram because it is more flexible/dynamic
- KVM can do batching fairly easily with no changes to the
hooks or frontswap_ops with the design I recently proposed
- RAMster is synchronous, but the requirement is _only_ on the
"local" put... once the data is "in tmem", asynchronous threads
can do other things with it (like RAMster moving the pages
to a tmem pool on a remote system)
- the plugins as they exist today (Xen, zcache) aren't stackable,
but the frontswap_ops registration already handles stacking,
so it is certainly a good future enhancement... RAMster
already does "stacking", but by incorporating a copy of
the zcache code. (I think that's just a code organization
issue that can be resolved if/when RAMster goes into staging.)

With these in mind, I hope you will now be even a "lot more
happy now" with frontswap and MUCH better than neutral. :-) :-)

Dan

Dan Magenheimer
2011-10-27 22:21:39 UTC
Permalink
> From: Christoph Hellwig [mailto:***@infradead.org]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
>
> On Thu, Oct 27, 2011 at 02:49:31PM -0700, Dan Magenheimer wrote:
> > If Linux truly subscribes to the "code rules" mantra, no core
> > VM developer has proposed anything -- even a design, let alone
> > working code -- that comes close to providing the functionality
> > and flexibility that frontswap (and cleancache) provides, and
> > frontswap provides it with a very VERY small impact on existing
> > kernel code AND has been posted and working for 2+ years.
> > (And during that 2+ years, excellent feedback has improved the
> > "kernel-ness" of the code, but NONE of the core frontswap
> > design/hooks have changed... because frontswap _just works_!)
>
> It might work for whatever definition of work, but you certainly couldn't
> convince anyone that matters that it's actually sexy and we'd actually
> need it. Only actually working on Xen of course doesn't help.
>
> In the end it's a bunch of really ugly hooks over core code, without
> a clear definition of how they work or a killer use case.

Hi Christoph --

You might find it useful to read the whole base email and/or
the lwn article referenced. Frontswap and cleancache
have now gone far beyond X-e-n** and even beyond virtualization.
That's why my talk at Linuxcon was titled "Transcendent Memory:
Not Just for Virtualization Anymore". (And I stated at
that talk that I have personally not written a line of
X-e-n code in over a year now.) The same frontswap hooks
_just work_ for zcache, RAMster and (soon) KVM too...
and there's more uses coming. Those that take the time
to understand its use model DO find frontswap useful.

Is "sexy" or "killer use case" a requirement for Linus
to merge code now? If so, he can plan to spend a lot
more time diving as I'll bet there isn't much code that
measures up.

Thanks,
Dan

** /me suspects that Christoph has a /dev/null filter for
email containing that word so has cleverly spelled it out
to defeat that filter :-)

Brian King
2011-10-27 22:33:55 UTC
Permalink
On 10/27/2011 01:52 PM, Dan Magenheimer wrote:
> Hi Linus --
>
> Frontswap now has FOUR users: Two already merged in-tree (zcache
> and Xen) and two still in development but in public git trees
> (RAMster and KVM). Frontswap is part 2 of 2 of the core kernel
> changes required to support transcendent memory; part 1 was cleancache
> which you merged at 3.0 (and which now has FIVE users).

We are also actively looking at utilizing frontswap for IBM Power and would
welcome its inclusion in mainline.

Thanks,

Brian

--
Brian King
Linux on Power Virtualization
IBM Linux Technology Center
Nitin Gupta
2011-10-28 05:17:16 UTC
Permalink
Hi Dan,

On 10/27/2011 02:52 PM, Dan Magenheimer wrote:

> Hi Linus --
>
> Frontswap now has FOUR users: Two already merged in-tree (zcache
> and Xen) and two still in development but in public git trees
> (RAMster and KVM). Frontswap is part 2 of 2 of the core kernel
> changes required to support transcendent memory; part 1 was cleancache
> which you merged at 3.0 (and which now has FIVE users).
>


I think frontswap would be really useful. Without this, zcache would be
limited to compressed caching of just the page cache pages, but with
frontswap, we can balance compressed memory usage between swap cache
and page cache pages. It also provides many advantages over existing
solutions like zram, which presents a fixed-size virtual (compressed)
block device interface. Since frontswap doesn't have to "pretend" to be a
block device, it can incorporate many dynamic resizing policies, a
critical factor for compressed caching.

Thanks,
Nitin

Ed Tomlinson
2011-10-29 13:43:08 UTC
Permalink
On Thursday 27 October 2011 11:52:22 Dan Magenheimer wrote:
> Hi Linus --

> SO... Please pull:
>
> git://oss.oracle.com/git/djm/tmem.git #tmem
>

My wife has an old PC that's short on memory. It's got Ubuntu
running on it. It also has cleancache and zram enabled. The
box works better when using these. Frontswap would improve
things further. It will balance tmem vs. physical memory
dynamically, making it a better solution than zram.

I'd love to see this in the kernel.

Thanks
Ed Tomlinson

PS. At work we use AIX with memory compression. With the
workloads we run, compression lets the OS act like it has 30%
more memory. It works. It would be nice to have a similar
facility in Linux.

KAMEZAWA Hiroyuki
2011-10-31 08:13:21 UTC
Permalink
On Thu, 27 Oct 2011 11:52:22 -0700 (PDT)
Dan Magenheimer <***@oracle.com> wrote:

> Hi Linus --
>
> Frontswap now has FOUR users: Two already merged in-tree (zcache
> and Xen) and two still in development but in public git trees
> (RAMster and KVM). Frontswap is part 2 of 2 of the core kernel
> changes required to support transcendent memory; part 1 was cleancache
> which you merged at 3.0 (and which now has FIVE users).
>
> Frontswap patches have been in linux-next since June 3 (with zero
> changes since Sep 22). First posted to lkml in June 2009, frontswap
> is now at version 11 and has incorporated feedback from a wide range
> of kernel developers. For a good overview, see
> http://lwn.net/Articles/454795.
> If further rationale is needed, please see the end of this email
> for more info.
>
> SO... Please pull:
>
> git://oss.oracle.com/git/djm/tmem.git #tmem
>
> since git commit b6fd41e29dea9c6753b1843a77e50433e6123bcb
> Linus Torvalds (1):
>

Why bypass -mm tree ?

I think you planned to merge this via -mm tree and, then, posted patches
to linux-mm with CC -mm guys.

I think you posted 2011/09/16 at the last time, v10. But no further submission
to gather acks/reviews from Mel, Johannes, Andrew, Hugh etc.. and no inclusion
request to -mm or -next. _AND_, IIUC, at v10, the number of posted patches was 6.
Why now 8 ? Just because it's simple changes ?

I don't have heavy concerns about the code itself, but this process of bypassing -mm
or linux-next seems ugly.

Thanks,
-Kame

Dan Magenheimer
2011-11-01 15:25:38 UTC
Permalink
> From: KAMEZAWA Hiroyuki [mailto:***@jp.fujitsu.com]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
>
> On Mon, 31 Oct 2011 09:38:12 -0700 (PDT)
> Dan Magenheimer <***@oracle.com> wrote:
>
> > > I think you planned to merge this via -mm tree and, then, posted patches
> > > to linux-mm with CC -mm guys.
> >
> > Hmmm... the mm process is not clear or well-documented.
>
> not complicated to me.
>
> post -> akpm's -mm tree -> mainline.
>
> But your tree seems to be in -mm via linux-next. Hmm, complicated ;(
> I'm sorry I didn't notice frontswap.c was there....

Am I correct that the "post -> akpm's -mm tree" part requires
akpm to personally merge the posted linux-mm patches into
his -mm tree? So no git tree? I guess I didn't understand
that, which is why I never posted v11 and just put it into my
git tree, which was being pulled into linux-next.

Anyway, I am learning now... thanks.

> > > I think you posted 2011/09/16 at the last time, v10. But no further submission
> > > to gather acks/reviews from Mel, Johannes, Andrew, Hugh etc.. and no inclusion
> > > request to -mm or -next. _AND_, IIUC, at v10, the number of posted patches was 6.
> > > Why now 8 ? Just because it's simple changes ?
> >
> > See https://lkml.org/lkml/2011/9/21/373. Konrad Wilk
> > helped me to reorganize the patches (closer to what you
> > suggested I think), but there were no code changes between
> > v10 and v11, just dividing up the patches differently
> > as Konrad thought there should be more smaller commits.
> > So no code change between v10 and v11 but the number of
> > patches went from 6 to 8.
> >
> > My last line in that post should also make it clear that
> > I thought I was done and ready for the 3.2 window, so there
> > was no evil intent on my part to subvert a process.
> > It would have been nice if someone had told me there
> > were uncompleted steps in the -mm process or, even better,
> > pointed me to a (non-existent?) document where I could see
> > for myself if I was missing steps!
> >
> > So... now what?
>
> As far as I know, patches for memory management should go through akpm's tree.
> And most of the developers in that area watch that tree.
> Now, your tree goes through linux-next. It complicates the problem.
>
> When a patch goes through the -mm tree, its justification is already checked
> by, at least, akpm. And while in the -mm tree, other developers check it and
> some improvements are made there.
>
> Now, you are trying to push patches via linux-next, and the
> justification for your patches is being checked _now_. That's what happens.
> It's not complicated. I think other linux-next patches have their
> justification checked at pull-request time.

OK, I will then coordinate with sfr to remove it from the linux-next
tree when (if?) akpm puts the patchset into the -mm tree. But
since very few linux-mm experts had responded to previous postings
of the frontswap patchset, I am glad to have a much wider audience
to discuss it now because of the lkml git-pull request.

> So, all your work here will be to convince people that this feature is
> necessary and not intrusive.
>
> From my point of view,
>
> - I have no concerns with performance cost. But, at the same time,
> I want to see performance improvement numbers.

There are numbers published for Xen. I have received
the feedback that benchmarks are needed for zcache also.

> - While discussing with a Fujitsu user support guy (just now), he asked
>   'why is it not designed as a device driver?'
>   I couldn't answer.
>
> So, I have small concerns with frontswap.ops ABI design.
> Do we need ABI and other modules should be pluggable ?
> Can frontswap be implemented as something like
>
> # setup frontswap via device-mapper or some.
> # swapon /dev/frontswap
> ?
> It seems required hooks are just before/after read/write swap device.
> other hooks can be implemented in notifier..no ?

A good question, and it is answered in FAQ #4 included in
the patchset (Documentation/vm/frontswap.txt). The short
answer is that the tmem ABI/API used by frontswap is
intentionally very very dynamic -- ANY attempt to put
a page into it can be rejected by the backend. This is
not possible with block I/O or swap, at least without
a massive rewrite. And this dynamic capability is the
key to supporting the many users that frontswap supports.
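
To make the "any put can be rejected" point concrete, here is a minimal
sketch of the idea. This is illustrative only -- the names below
approximate the frontswap_ops shape described in the patchset, and the
example_* helpers are hypothetical:

    /* Roughly the shape of the ops a backend registers (illustrative). */
    struct frontswap_ops {
            void (*init)(unsigned type);
            int (*put_page)(unsigned type, pgoff_t offset, struct page *page);
            int (*get_page)(unsigned type, pgoff_t offset, struct page *page);
            void (*invalidate_page)(unsigned type, pgoff_t offset);
            void (*invalidate_area)(unsigned type);
    };

    /* Hypothetical helpers, shown only for illustration. */
    static bool example_pool_has_room(void);
    static int example_store_compressed(unsigned type, pgoff_t offset,
                                        struct page *page);

    /* A backend may refuse any page at any time, e.g. when its
     * (dynamically sized) pool is full or compression isn't worthwhile. */
    static int example_put_page(unsigned type, pgoff_t offset,
                                struct page *page)
    {
            if (!example_pool_has_room())
                    return -1;  /* rejected: page goes to the real swap device */
            return example_store_compressed(type, offset, page);
    }

A rejected put is not an error: the swap subsystem simply falls back to
writing that page to the configured swap device.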

By the way, what your fujitsu user support guy suggests is
exactly what zram does. The author of zram (Nitin Gupta)
agrees that frontswap has many advantages over zram,
see https://lkml.org/lkml/2011/10/28/8 and he supports
merging frontswap. And Ed Tomlinson, a current user
of zram says that he would use frontswap instead of
zram: https://lkml.org/lkml/2011/10/29/53

Kame, can I add you to the list of people who support
merging frontswap, assuming more good performance numbers
are posted?

Thanks,
Dan

Rik van Riel
2011-11-02 21:03:26 UTC
Permalink
On 11/01/2011 05:43 PM, Andrew Morton wrote:

> I will confess to and apologise for dropping the ball on cleancache and
> frontswap. I was never really able to convince myself that it met the
> (very vague) cost/benefit test,

I believe that it can, but if it does, we also have to
operate under the assumption that the major distros will
enable it.

This means that "no overhead when not compiled in" is
not going to apply to the majority of the users out there,
and we need clear numbers on what the overhead is when it
is enabled, but not used.

We also need an API that can handle arbitrarily heavy
workloads, since that is what people will throw at it
if it is enabled everywhere.

I believe that means addressing some of Andrea's concerns,
specifically that the API should be able to handle vectors
of pages and handle them asynchronously.
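
To illustrate the kind of interface meant here -- purely hypothetical,
not from any posted patch -- a batched, asynchronous put might look
something like this:

    /* Hypothetical sketch of "vectors of pages, handled asynchronously";
     * not part of the frontswap patchset. */
    struct frontswap_batch {
            unsigned type;
            int nr_pages;
            pgoff_t *offsets;
            struct page **pages;
            /* called by the backend when the batch completes, with
             * per-page success/failure status */
            void (*done)(struct frontswap_batch *batch, int *status);
    };

    int frontswap_put_batch(struct frontswap_batch *batch);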

Even if the current back-ends do not handle that today,
chances are that (if tmem were to be enabled everywhere)
people will end up throwing workloads at tmem that pretty
much require such a thing.

An asynchronous interface would probably be a requirement
for something as high latency as encrypted ramster :)

API concerns like this are things that should be solved
before a merge IMHO, since afterwards we would end up with
the "we cannot change the API, because that breaks users"
scenario that we always end up finding ourselves in.

Dan Magenheimer
2011-11-02 21:42:06 UTC
Permalink
> From: Rik van Riel [mailto:***@redhat.com]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
>
> On 11/01/2011 05:43 PM, Andrew Morton wrote:
>
> > I will confess to and apologise for dropping the ball on cleancache and
> > frontswap. I was never really able to convince myself that it met the
> > (very vague) cost/benefit test,
>
> I believe that it can, but if it does, we also have to
> operate under the assumption that the major distros will
> enable it.
> This means that "no overhead when not compiled in" is
> not going to apply to the majority of the users out there,
> and we need clear numbers on what the overhead is when it
> is enabled, but not used.

Right. That's Case B (see the James Bottomley subthread)
and the overhead is one pointer comparison against
NULL per page physically swapped in/out to a swap
device (i.e., essentially zero). Rik, would you
be willing to examine the code to confirm that
statement?
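
For reference, the shape of that check is roughly the following; this is
an illustrative sketch rather than the exact code in the patchset:

    /* Sketch of the swap-path hook: when no backend has registered,
     * the added cost is a single NULL test per page swapped in or out.
     * frontswap_ops and __frontswap_put_page() stand in for whatever
     * the patchset actually names them. */
    static inline int frontswap_put_page(struct page *page)
    {
            if (frontswap_ops == NULL)      /* no backend registered */
                    return -1;              /* do normal swap device I/O */
            return __frontswap_put_page(page);
    }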

> We also need an API that can handle arbitrarily heavy
> workloads, since that is what people will throw at it
> if it is enabled everywhere.
>
> I believe that means addressing some of Andrea's concerns,
> specifically that the API should be able to handle vectors
> of pages and handle them asynchronously.
>
> Even if the current back-ends do not handle that today,
> chances are that (if tmem were to be enabled everywhere)
> people will end up throwing workloads at tmem that pretty
> much require such a thing.

Wish I'd been a little faster typing the previous
message. Rik, could you make sure to respond to yourself
here if you are happy with my proposed design
to do the batching that you and Andrea want? (And if
you are not happy, provide code to show where you
would place a new batch-put hook?)

> An asynchronous interface would probably be a requirement
> for something as high latency as encrypted ramster :)

Pure asynchrony is a show-stopper for me. But the
only synchrony required is to move/transform the
data locally. Asynchronous things can still be done
but as a separate thread AFTER the data has been
"put" to tmem (which is exactly what RAMster does).

If asynchrony at frontswap_ops is demanded (and
I think Andrea has already retracted that), I would
have to ask you to present alternate code, both hooks
and driver, that work successfully, because my claim
is that it can't be done, certainly not without
massive changes to the swap subsystem (and likely
corresponding massive changes to VFS for cleancache).
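
To make the synchronous-put-plus-deferred-work pattern concrete, here is
a rough sketch of the idea; this is not RAMster's actual code, and the
example_* names are made up:

    /* Backend-owned copy of a page, with a work item for deferred
     * (asynchronous) processing such as shipping it to a remote node. */
    struct example_pampd {
            struct work_struct remote_work;
            /* ... compressed data, bookkeeping ... */
    };

    /* The put itself is synchronous: the page is copied/compressed into
     * backend-owned memory before returning.  Only the slow work is
     * deferred, after the caller no longer depends on the page. */
    static int example_put_page(unsigned type, pgoff_t offset,
                                struct page *page)
    {
            struct example_pampd *pd = example_copy_and_compress(page);

            if (pd == NULL)
                    return -1;      /* reject; page goes to the swap device */
            example_index_insert(type, offset, pd);
            queue_work(example_wq, &pd->remote_work);  /* async, after the copy */
            return 0;
    }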

> API concerns like this are things that should be solved
> before a merge IMHO, since afterwards we would end up with
> the "we cannot change the API, because that breaks users"
> scenario that we always end up finding ourselves in.

I think I've amply shown that the API is
minimal and extensible, as demonstrated by the
above points. Many of Andrea's concerns were due to
a misunderstanding of the code in staging/zcache,
thinking it was part of the API; the only "API"
being considered here is defined by frontswap_ops.

Also, the API for frontswap_ops is almost identical to the
API for cleancache_ops and uses a much simpler, much
more isolated set of hooks. Frontswap "finishes"
tmem; cleancache is already merged. Leaving tmem
unfinished is worse than not having it at all (and
I can already hear Christoph cackling and jumping
to his keyboard ;-)

Thanks,
Dan

OK, I really need to discontinue my participation in
this for a couple of days for personal/health reasons,
so I hope I've made my case.

Dan Magenheimer
2011-10-31 16:38:12 UTC
Permalink
> From: KAMEZAWA Hiroyuki [mailto:***@jp.fujitsu.com]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)

Hi Kame --

Thanks for your reply and for your earlier reviews of frontswap,
and my apologies that I accidentally left you off the Cc list
for the basenote of this git-pull request.

> I don't have heavy concerns about the code itself, but this process of bypassing -mm
> or linux-next seems ugly.

First, frontswap IS in linux-next and it has been since June 3
and v11 has been in linux-next since September 23. This
is stated in the base git-pull request.

> Why bypass -mm tree ?
>
> I think you planned to merge this via -mm tree and, then, posted patches
> to linux-mm with CC -mm guys.

Hmmm... the mm process is not clear or well-documented.
I am a relative newbie here. Linus has repeatedly spoken
of ensuring that code is in linux-next, and there is no
(last I checked) current -mm git tree. I was aware that
the mm tree still existed, but thought it was for shaking
out major features, not for adding a handful of hooks.
I was aware that akpm's blessing was highly desirable,
but his (offlist) reply was essentially "I'm not interested,
I don't have time to deal with this, and I don't think anyone
will use it." I explained about all the users (many of whom
have replied to this thread to support frontswap), but got
no further reply. I was advised by several people that, in
the case of disagreement, Linus will decide, so I pushed
forward. This is the same as the process I used for
cleancache, which Linus merged.

I have been instructed offlist and onlist that this was a big
mistake, that it appears that I am subverting the process,
and that I am probably insulting akpm. If so, I am
truly sorry and would be happy to take instruction
on how to proceed correctly. However, in turn, I hope
that those driving the process aren't blocking useful
code simply due to lack of time.

> I think you posted 2011/09/16 at the last time, v10. But no further submission
> to gather acks/reviews from Mel, Johannes, Andrew, Hugh etc.. and no inclusion
> request to -mm or -next. _AND_, IIUC, at v10, the number of posted patches was 6.
> Why now 8 ? Just because it's simple changes ?

See https://lkml.org/lkml/2011/9/21/373. Konrad Wilk
helped me to reorganize the patches (closer to what you
suggested I think), but there were no code changes between
v10 and v11, just dividing up the patches differently
as Konrad thought there should be more smaller commits.
So no code change between v10 and v11 but the number of
patches went from 6 to 8.

My last line in that post should also make it clear that
I thought I was done and ready for the 3.2 window, so there
was no evil intent on my part to subvert a process.
It would have been nice if someone had told me there
were uncompleted steps in the -mm process or, even better,
pointed me to a (non-existent?) document where I could see
for myself if I was missing steps!

So... now what?

Thanks,
Dan

P.S. It appears that this excerpt from the LWN KS2011 report
might be related to the problem?

"Andrew complained about the acceptance of entirely new
features into the kernel. Those features often land on
his doorstep without much justification, forcing him to
ask the developers to explain their motivations. The kernel
community, he complained, is not supporting him well. Who
can tell him if a given patch makes sense? Mistakes have
been made in the past; bad features have been merged and
good stuff has been lost. How, he asked, can he find
people who know better about the desirability of
specific patches?"

KAMEZAWA Hiroyuki
2011-11-01 00:50:38 UTC
Permalink
On Mon, 31 Oct 2011 09:38:12 -0700 (PDT)
Dan Magenheimer <***@oracle.com> wrote:

> > From: KAMEZAWA Hiroyuki [mailto:***@jp.fujitsu.com]
> > Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
>
> Hi Kame --
>
> Thanks for your reply and for your earlier reviews of frontswap,
> and my apologies that I accidentally left you off of the Cc list \
> for the basenote of this git-pull request.
>
> > I don't have heavy concerns about the code itself, but this process of bypassing -mm
> > or linux-next seems ugly.
>
> First, frontswap IS in linux-next and it has been since June 3
> and v11 has been in linux-next since September 23. This
> is stated in the base git-pull request.
>

Ok, I'm sorry. I found frontswap.c in my tree.


> > Why bypass -mm tree ?
> >
> > I think you planned to merge this via -mm tree and, then, posted patches
> > to linux-mm with CC -mm guys.
>
> Hmmm... the mm process is not clear or well-documented.

not complicated to me.

post -> akpm's -mm tree -> mainline.

But your tree seems to be in -mm via linux-next. Hmm, complicated ;(
I'm sorry I didn't notice frontswap.c was there....


> > I think you posted 2011/09/16 at the last time, v10. But no further submission
> > to gather acks/reviews from Mel, Johannes, Andrew, Hugh etc.. and no inclusion
> > request to -mm or -next. _AND_, IIUC, at v10, the number of posted patches was 6.
> > Why now 8 ? Just because it's simple changes ?
>
> See https://lkml.org/lkml/2011/9/21/373. Konrad Wilk
> helped me to reorganize the patches (closer to what you
> suggested I think), but there were no code changes between
> v10 and v11, just dividing up the patches differently
> as Konrad thought there should be more smaller commits.
> So no code change between v10 and v11 but the number of
> patches went from 6 to 8.
>
> My last line in that post should also make it clear that
> I thought I was done and ready for the 3.2 window, so there
> was no evil intent on my part to subvert a process.
> It would have been nice if someone had told me there
> were uncompleted steps in the -mm process or, even better,
> pointed me to a (non-existent?) document where I could see
> for myself if I was missing steps!
>
> So... now what?
>

As far as I know, patches for memory management should go through akpm's tree.
And most of the developers in that area watch that tree.
Now, your tree goes through linux-next. It complicates the problem.

When a patch goes through the -mm tree, its justification is already checked
by, at least, akpm. And while in the -mm tree, other developers check it and
some improvements are made there.

Now, you are trying to push patches via linux-next, and the
justification for your patches is being checked _now_. That's what happens.
It's not complicated. I think other linux-next patches have their
justification checked at pull-request time.

So, all your work here will be to convince people that this feature is
necessary and not intrusive.

From my point of view,

- I have no concerns with performance cost. But, at the same time,
I want to see performance improvement numbers.

- While discussing with a Fujitsu user support guy (just now), he asked
  'why is it not designed as a device driver?'
  I couldn't answer.

So, I have small concerns with frontswap.ops ABI design.
Do we need ABI and other modules should be pluggable ?
Can frontswap be implemented as something like

# setup frontswap via device-mapper or some.
# swapon /dev/frontswap
?
It seems required hooks are just before/after read/write swap device.
other hooks can be implemented in notifier..no ?

Thanks,
-Kame

Dan Magenheimer
2011-11-02 15:12:01 UTC
Permalink
> From: KAMEZAWA Hiroyuki [mailto:***@jp.fujitsu.com]

Hi Kame --

> > By the way, what your fujitsu user support guy suggests is
> > exactly what zram does. The author of zram (Nitin Gupta)
> > agrees that frontswap has many advantages over zram,
> > see https://lkml.org/lkml/2011/10/28/8 and he supports
> > merging frontswap. And Ed Tomlinson, a current user
> > of zram says that he would use frontswap instead of
> > zram: https://lkml.org/lkml/2011/10/29/53
> >
> > Kame, can I add you to the list of people who support
> > merging frontswap, assuming more good performance numbers
> > are posted?
> >
> Before answering, let me explain my attitude to this project.
>
> As a hobby, I like this kind of work, which allows me to imagine what kind
> of new fancy features it will allow us. So I reviewed the patches.
>
> As people who sell enterprise systems and support, I can't recommend this
> to our customers. IIUC, cleancache/frontswap/zcache hides its available
> resources from the user's view and makes the system performance invisible and
> not predictable. That's one of the reasons why I asked whether or not
> you have plans to make frontswap (cleancache) cgroup aware.
> (Hmm, but for making a product which offers best-effort performance to customers,
> this project may make sense. But I am not very interested in best-effort
> service.)

I agree that zcache is not a good choice for enterprise customers
trying to achieve predictable QoS. Tmem works to improve
memory efficiency (with zcache backend) and/or take advantage
of statistical variations in working sets across multiple virtual
(Xen backend and KVM work-in-progress backend) or physical
(RAMster backend) machines, so you are correct that there will
be some non-visible and non-predictable effects of tmem.

In a strict QoS environment, the data center must ensure that all
resources are overprovisioned, including RAM. RAM on each machine
must exceed the peak working set on that machine or QoS guarantees
won't be met. Tmem has no value when RAM is "infinite", that is,
when RAM can be increased arbitrarily to ensure that it always exceeds
the peak working set. Tmem has great value when RAM is sometimes less
than the working set. This is most obvious today in consolidated
virtualization environments, but (as shown in my presentations)
is increasingly common in other system topologies as well. For example:

Resource optimization across a broad set of users with unknown and
time-varying workloads (and thus working sets) is necessary for
"cloud providers" to profit. In many such environments,
RAM is becoming the bottleneck and cloud providers can't
ensure that RAM is "infinite". Cloud users that require absolute
control over their performance are instructed to pay a much
higher price to "rent" a physical server.

In some parts of the US (and I think in other countries as well),
electricity providers offer a discount to customers that are willing
to allow the provider to remotely disable their air conditioning
units when electricity demand peaks across the entire grid.
Tmem allows cloud providers to offer a similar feature to
their users. This is neither guaranteed-QoS nor "best effort"
but allows the provider to expand the capabilities of their
data center as needed, rather than predict peak demand and
pre-provision for it.

I agree, IMHO, zcache is more for small single machines (possibly
mobile units) where RAM is limited or at capacity and the workload
is bumping into that limit (resulting in swapping). Ed Tomlinson
presents a good example: https://lkml.org/lkml/2011/10/29/53
But IBM seems to be _very_ interested in zcache and is not
in the desktop business, so probably is working on some cool
use model for servers that I've never thought of.

> I wonder if there could be a 'static size simple victim cache per cgroup' project
> under frontswap/cleancache that would help with users' workload isolation
> even if there is no VM or zcache/tmem. It sounds wonderful.
>
> So, I'd like to ask whether you have any enhancement plans for the future,
> rather than asking about 'current' performance. The reason I hesitate to say "Okay!"
> is that I can't see enterprise usage of this, a feature which cannot
> be controlled by admins and which makes performance prediction difficult on a busy system.

Personally, my only enhancement plan is to work on RAMster
until it is ready for the staging tree. But once the
foundations of tmem (frontswap and cleancache) are in-tree,
I hope that you and other developers will find other clever
ways to exploit it. For example, Larry Bassel's postings on
linux-mm uncovered a new use for cleancache that I had not
considered (so I think cleancache now has five users).

> > Kame, can I add you to the list of people who support
> > merging frontswap, assuming more good performance numbers
> > are posted?

So I'm not asking you if Fujitsu enterprise QoS-guarantee
customers will use zcache.... Andrew said yesterday:

"At kernel summit there was discussion and overall agreement
that we've been paying insufficient attention to the
big-picture "should we include this feature at all" issues.
We resolved to look more intensely and critically at new
features with a view to deciding whether their usefulness
justified their maintenance burden."

I am asking you, who are an open source Linux developer and
a respected -mm developer, do you think the usefulness
of frontswap justifies the maintenance burden, and frontswap
should be merged?

Dan

KAMEZAWA Hiroyuki
2011-11-04 04:19:32 UTC
Permalink
On Wed, 2 Nov 2011 08:12:01 -0700 (PDT)
Dan Magenheimer <***@oracle.com> wrote:

> > > Kame, can I add you to the list of people who support
> > > merging frontswap, assuming more good performance numbers
> > > are posted?
>
> So I'm not asking you if Fujitsu enterprise QoS-guarantee
> customers will use zcache.... Andrew said yesterday:
>
> "At kernel summit there was discussion and overall agreement
> that we've been paying insufficient attention to the
> big-picture "should we include this feature at all" issues.
> We resolved to look more intensely and critically at new
> features with a view to deciding whether their usefulness
> justified their maintenance burden."
>
> I am asking you, who are an open source Linux developer and
> a respected -mm developer, do you think the usefulness
> of frontswap justifies the maintenance burden, and frontswap
> should be merged?
>

When you convince other guys that the design is good.
Reading the whole thread, it seems other developers raise
2 problems:
1. justification of usage
2. API design.

For 1, you'll need to show performance and benefits. I think
you tried and will do so again. But please take care of "2"; it
seems some guys (Rik and Andrea) have concerns.

Please CC me; I'd like to join the code review process, at least.
I'd like to think of a new usage for frontswap/cleancache beneficial
for enterprise users.

Thanks,
-Kame

Jan Beulich
2011-11-03 16:49:27 UTC
Permalink
>>> On 27.10.11 at 20:52, Dan Magenheimer <***@oracle.com> wrote:
> Hi Linus --
>
> Frontswap now has FOUR users: Two already merged in-tree (zcache
> and Xen) and two still in development but in public git trees
> (RAMster and KVM). Frontswap is part 2 of 2 of the core kernel
> changes required to support transcendent memory; part 1 was cleancache
> which you merged at 3.0 (and which now has FIVE users).
>
> Frontswap patches have been in linux-next since June 3 (with zero
> changes since Sep 22). First posted to lkml in June 2009, frontswap
> is now at version 11 and has incorporated feedback from a wide range
> of kernel developers. For a good overview, see
> http://lwn.net/Articles/454795.
> If further rationale is needed, please see the end of this email
> for more info.
>
> SO... Please pull:
>
> git://oss.oracle.com/git/djm/tmem.git #tmem
>
>...
> Linux kernel distros incorporating frontswap:
> - Oracle UEK 2.6.39 Beta:
> http://oss.oracle.com/git/?p=linux-2.6-unbreakable-beta.git;a=summary
> - OpenSuSE since 11.2 (2009) [see mm/tmem-xen.c]
> http://kernel.opensuse.org/cgit/kernel/

I've been away so I am too far behind to read this entire
very long thread, but wanted to confirm that we've been
carrying an earlier version of this code as indicated above
and it would simplify our kernel maintenance if frontswap
got merged. So please count me as supporting frontswap.

Thanks, Jan

> - a popular Gentoo distro
> http://forums.gentoo.org/viewtopic-t-862105.html
>
> Xen distros supporting Linux guests with frontswap:
> - Xen hypervisor backend since Xen 4.0 (2009)
> http://www.xen.org/files/Xen_4_0_Datasheet.pdf
> - OracleVM since 2.2 (2009)
> http://twitter.com/#!/Djelibeybi/status/113876514688352256
>
> Public visibility for frontswap (as part of transcendent memory):
> - presented at OSDI'08, OLS'09, LCA'10, LPC'10, LinuxCon NA 11, Oracle
> Open World 2011, two LSF/MM Summits (2010,2011), and three
> Xen Summits (2009,2010,2011)
> - http://lwn.net/Articles/454795 (current overview)
> - http://lwn.net/Articles/386090 (2010)
> - http://lwn.net/Articles/340080 (2009)



Andrew Morton
2011-11-04 00:54:10 UTC
Permalink
On Thu, 03 Nov 2011 16:49:27 +0000 "Jan Beulich" <***@suse.com> wrote:

> >>> On 27.10.11 at 20:52, Dan Magenheimer <***@oracle.com> wrote:
> > Hi Linus --
> >
> > Frontswap now has FOUR users: Two already merged in-tree (zcache
> > and Xen) and two still in development but in public git trees
> > (RAMster and KVM). Frontswap is part 2 of 2 of the core kernel
> > changes required to support transcendent memory; part 1 was cleancache
> > which you merged at 3.0 (and which now has FIVE users).
> >
> > Frontswap patches have been in linux-next since June 3 (with zero
> > changes since Sep 22). First posted to lkml in June 2009, frontswap
> > is now at version 11 and has incorporated feedback from a wide range
> > of kernel developers. For a good overview, see
> > http://lwn.net/Articles/454795.
> > If further rationale is needed, please see the end of this email
> > for more info.
> >
> > SO... Please pull:
> >
> > git://oss.oracle.com/git/djm/tmem.git #tmem
> >
> >...
> > Linux kernel distros incorporating frontswap:
> > - Oracle UEK 2.6.39 Beta:
> > http://oss.oracle.com/git/?p=linux-2.6-unbreakable-beta.git;a=summary
> > - OpenSuSE since 11.2 (2009) [see mm/tmem-xen.c]
> > http://kernel.opensuse.org/cgit/kernel/
>
> I've been away so I am too far behind to read this entire
> very long thread, but wanted to confirm that we've been
> carrying an earlier version of this code as indicated above
> and it would simplify our kernel maintenance if frontswap
> got merged. So please count me as supporting frontswap.

Are you able to tell us *why* you're carrying it, and what benefit it
is providing to your users?

Jan Beulich
2011-11-04 08:49:34 UTC
Permalink
>>> On 04.11.11 at 01:54, Andrew Morton <***@linux-foundation.org> wrote:
> On Thu, 03 Nov 2011 16:49:27 +0000 "Jan Beulich" <***@suse.com> wrote:
>
>> >>> On 27.10.11 at 20:52, Dan Magenheimer <***@oracle.com> wrote:
>> > Hi Linus --
>> >
>> > Frontswap now has FOUR users: Two already merged in-tree (zcache
>> > and Xen) and two still in development but in public git trees
>> > (RAMster and KVM). Frontswap is part 2 of 2 of the core kernel
>> > changes required to support transcendent memory; part 1 was cleancache
>> > which you merged at 3.0 (and which now has FIVE users).
>> >
>> > Frontswap patches have been in linux-next since June 3 (with zero
>> > changes since Sep 22). First posted to lkml in June 2009, frontswap
>> > is now at version 11 and has incorporated feedback from a wide range
>> > of kernel developers. For a good overview, see
>> > http://lwn.net/Articles/454795.
>> > If further rationale is needed, please see the end of this email
>> > for more info.
>> >
>> > SO... Please pull:
>> >
>> > git://oss.oracle.com/git/djm/tmem.git #tmem
>> >
>> >...
>> > Linux kernel distros incorporating frontswap:
>> > - Oracle UEK 2.6.39 Beta:
>> > http://oss.oracle.com/git/?p=linux-2.6-unbreakable-beta.git;a=summary
>> > - OpenSuSE since 11.2 (2009) [see mm/tmem-xen.c]
>> > http://kernel.opensuse.org/cgit/kernel/
>>
>> I've been away so I am too far behind to read this entire
>> very long thread, but wanted to confirm that we've been
>> carrying an earlier version of this code as indicated above
>> and it would simplify our kernel maintenance if frontswap
>> got merged. So please count me as supporting frontswap.
>
> Are you able to tell us *why* you're carrying it, and what benefit it
> is providing to your users?

Because we're supporting/using Xen, where this (within the general
tmem picture) allows for better overall memory utilization.

Jan

Clayton Weaver
2011-11-04 12:37:21 UTC
Permalink
"So where I can I buy Network Attached Ram and skip all of
this byzantine VM complication?"

So let me see if I have this right: when the frontswap
backend fills up, the current design would force dumping
newer pages to real on-disk swap (to avoid OOM), possibly
compressed, while keeping older pages in the compressed
RAM swap cache? It seems like it should instead dump
(blocksize/pagesize) * pagesize multiples of its oldest
compressed pages to disk, and store and compress
the new pages that are submitted to it, thus preserving
the "least recently used" logic in the frontswap backend.

A backend to frontswap should not be able to fail a put
at all (unless the whole machine or container is OOM and
no physical swap is configured, so the backend contains
no pages and has no space to allocate from).
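
A rough sketch of what such a policy might look like in a backend,
purely hypothetical and with made-up helper names:

    /* Hypothetical: when the compressed pool is full, write the oldest
     * entries back to the real swap device to make room, rather than
     * rejecting the newly submitted page. */
    static int lru_backend_put_page(unsigned type, pgoff_t offset,
                                    struct page *page)
    {
            while (!pool_has_room())
                    writeback_oldest_entry();       /* oldest pages go to disk */
            return store_compressed(type, offset, page);
    }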

--

Clayton Weaver
cgweav at fastmail dot fm
Clayton Weaver
2011-11-05 17:08:23 UTC
Permalink
(NB: My only dog in this hunt is the length of this thread.)

When swapping to rotating media, all swapped pages have the same
age. Is there any performance reason to keep this property when
swapping to in-memory swap space that has rotating media or
some other longer-latency swap space for worst-case swap storage?
Is there any performance reason to extend LRU logic to this type
of low-latency/high-latency swap?

Seems like an obvious question.

Will all of these potential frontswap backends want page compression?
(Should it be factored out into a common page compression
implementation that anything can use? Does this already exist? How
many pages should it operate on at one time, batched together to get
higher average compression ratios?)
--

Clayton Weaver
cgweav at fastmail dot fm