direct_access, pinning and truncation

[... figuring out how g_u_p() references can prevent freeing and
re-using the underlying mapped pmem addresses given the lack of struct
pages for the mapping]

Post by Matthew Wilcox
1. If get_user_pages() is called, copy from PMEM into DRAM, and provide
the caller with the struct pages of the DRAM. Modify DAX to handle some
file pages being in the page cache, and make sure that we know whether
the PMEM or DRAM is up to date. This has the obvious downside that
get_user_pages() becomes slow.

And serialize transitions and fs stores to pmem regions. And now
storing to dram-fronted pmem goes through all the dirtying and writeback
machinery. This sounds like a nightmare to me, to be honest.

Post by Matthew Wilcox
2. Modify filesystems that support DAX to handle pinning blocks.
Some filesystems (that support COW and snapshots) already support
reference-counting individual blocks. We may be able to do better by
using a tree of pinned extents or something. This makes it much harder
to modify a filesystem to support DAX, and I don't see patches adding
this capability to ext2 being warmly welcomed.

This seems.. doable? Recording the referenced pmem in free lists in the
fs is fine as long as the pmem isn't modified until the references are
released, right?

Maybe in the allocator you skip otherwise free blocks if they intersect
with the run time structure (rbtree of extents, presumably) that is
taking the place of reference counts in struct page. There aren't
*that* many allocator entry points. I guess you'd need to avoid other
modifications of free space like trimming :/. It still seems reasonably
doable?

And hey, lord knows we love to implement rbtrees of extents in file
systems! (btrfs: struct extent_state, ext4: struct extent_status)

The tricky part would be maintaining that structure behind g_u_p() and
put_page() calls. Probably a richer interface that gives callers
something more than just raw page pointers.
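
To make the allocator idea above concrete, here is a minimal userspace sketch, not kernel code. It assumes a toy filesystem with a flat block map and a small array standing in for the rbtree of pinned extents. Every name in it (toy_fs, pin_extent, alloc_block, and so on) is hypothetical and exists only for illustration. Freeing updates the free-space map as usual; only the allocator consults the runtime pin structure and skips free blocks that intersect it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NBLOCKS  64   /* toy device size, in blocks */
#define MAX_PINS 16   /* toy limit; a real fs would use an rbtree */

/* Hypothetical runtime-only record of a pinned range [start, start + len). */
struct pinned_extent {
    uint64_t start;
    uint64_t len;
};

struct toy_fs {
    bool in_use[NBLOCKS];                 /* stands in for the free-space map */
    struct pinned_extent pins[MAX_PINS];  /* stands in for the extent rbtree */
    size_t npins;
};

/* Record a pin; this is what a g_u_p()-like caller would trigger. */
static bool pin_extent(struct toy_fs *fs, uint64_t start, uint64_t len)
{
    if (fs->npins == MAX_PINS || len == 0 ||
        start >= NBLOCKS || len > NBLOCKS - start)
        return false;
    fs->pins[fs->npins++] = (struct pinned_extent){ start, len };
    return true;
}

/* Drop a pin previously recorded with the same start and length. */
static bool unpin_extent(struct toy_fs *fs, uint64_t start, uint64_t len)
{
    for (size_t i = 0; i < fs->npins; i++) {
        if (fs->pins[i].start == start && fs->pins[i].len == len) {
            fs->pins[i] = fs->pins[--fs->npins];
            return true;
        }
    }
    return false;
}

static bool block_is_pinned(const struct toy_fs *fs, uint64_t blk)
{
    for (size_t i = 0; i < fs->npins; i++)
        if (blk >= fs->pins[i].start &&
            blk - fs->pins[i].start < fs->pins[i].len)
            return true;
    return false;
}

/* Freeing only touches the free-space map; pins are not consulted here. */
static void free_block(struct toy_fs *fs, uint64_t blk)
{
    assert(blk < NBLOCKS);
    fs->in_use[blk] = false;
}

/* Return the lowest block that is both free and unpinned, or -1. */
static int64_t alloc_block(struct toy_fs *fs)
{
    for (uint64_t blk = 0; blk < NBLOCKS; blk++) {
        if (fs->in_use[blk] || block_is_pinned(fs, blk))
            continue;
        fs->in_use[blk] = true;
        return (int64_t)blk;
    }
    return -1;
}
```

Because the pin records live only in memory, nothing in this sketch changes the persistent free-space state: after a crash the map is already correct and the pins are simply gone.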

Post by Matthew Wilcox
3. Make truncate() block if it hits a pinned page. There's really no
good reason to truncate a file that has pinned pages; it's either a bug
or you're trying to be nasty to someone. We actually already have code
for this; inode_dio_wait() / inode_dio_done(). But page pinning isn't
just for O_DIRECT I/Os and other transient users like crypto, it's also
for long-lived things like RDMA, where we could potentially block for
an indefinite time.

I have no concrete examples, but I agree that it sounds like the sort of
thing that would bite us in the ass if we miss some use case :/.

I guess my initial vote is for trying a less-than-perfect prototype of
#2 to see just how hairy the rough outline gets.

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Matthew Wilcox

2014-10-09 16:44:44 UTC

Post by Zach Brown
[... figuring out how g_u_p() references can prevent freeing and
re-using the underlying mapped pmem addresses given the lack of struct
pages for the mapping]

And serialize transitions and fs stores to pmem regions. And now
storing to dram-fronted pmem goes through all the dirtying and writeback
machinery. This sounds like a nightmare to me, to be honest.

That's not so bad ... it's just normal page-cache stuff, really. It'd be
per-page serialisation, just like the current gunk we go through to get
sparse loads to not allocate backing store.

Post by Zach Brown

This seems.. doable? Recording the referenced pmem in free lists in the
fs is fine as long as the pmem isn't modified until the references are
released, right?

As long as it's not *allocated* to anything else (which seems to be what
you're actually saying in the next paragraph).

Post by Zach Brown
Maybe in the allocator you skip otherwise free blocks if they intersect
with the run time structure (rbtree of extents, presumably) that is
taking the place of reference counts in struct page. There aren't
*that* many allocator entry points. I guess you'd need to avoid other
modifications of free space like trimming :/. It still seems reasonably
doable?

Ah, so on reboot, the on-disk data structures are all correct, and
the in-memory data structures went away with the runtime pinning of
the memory. Nice.

Post by Zach Brown
And hey, lord knows we love to implement rbtrees of extents in file
systems! (btrfs: struct extent_state, ext4: struct extent_status)
The tricky part would be maintaining that structure behind g_u_p() and
put_page() calls. Probably a richer interface that gives callers
something more than just raw page pointers.

Thinking about it now, it seems less hairy than I initially thought. I'll
give it a quick try and see how it goes.


Zach Brown

2014-10-09 19:14:02 UTC

Ah, so on reboot, the on-disk data structures are all correct, and
the in-memory data structures went away with the runtime pinning of
the memory. Nice.

Yeah, that's what I was picturing. The part I'm most fuzzy on is how to
get current g_u_p() callers consuming the mappings without full struct
pages.

- z

Jan Kara

2014-10-10 10:01:30 UTC

Post by Zach Brown

Ah, so on reboot, the on-disk data structures are all correct, and
the in-memory data structures went away with the runtime pinning of
the memory. Nice.

Yeah, that's what I was picturing. The part I'm most fuzzy on is how to
get current g_u_p() callers consuming the mappings without full struct
pages.

So the direct IO layer could be converted relatively easily to use just
PFNs instead of struct pages. But you'd also have to change the block layer
(bios) to work with PFNs instead of struct pages, and that's going to be
non-trivial IMHO.

Honza

--
Jan Kara <***@suse.cz>
SUSE Labs, CR

Dave Chinner

2014-10-09 01:10:38 UTC

Post by Matthew Wilcox
One of the things on my todo list is making O_DIRECT work to a
memory-mapped direct_access file.

I don't understand the motivation or the use case: O_DIRECT is
purely for bypassing the page cache, and DAX already bypasses the
page cache. What difference is there between the DAX read/write
path and a DAX-based O_DIRECT IO path, and why doesn't just ignoring
O_DIRECT for DAX enabled filesystems simply do what you need?

Cheers,

Dave.

--
Dave Chinner
***@fromorbit.com

Matthew Wilcox

2014-10-09 15:25:24 UTC

Post by Matthew Wilcox
One of the things on my todo list is making O_DIRECT work to a
memory-mapped direct_access file.

There are two filesystems involved ... if both (or neither!) are DAX,
everything's fine. The problem comes when you do things this way around:

int cachefd = open("/dax/cache", O_RDWR);
int datafd = open("/nfs/bigdata", O_RDWR | O_DIRECT);
void *cache = mmap(NULL, 1024 * 1024 * 1024, PROT_READ | PROT_WRITE,
                   MAP_SHARED, cachefd, 0);
read(datafd, cache, 1024 * 1024);

The non-DAX filesystem needs to pin pages from the DAX filesystem while
they're under I/O.

Another attempt to solve this problem might be to turn the O_DIRECT
read into a read into a page of DRAM, followed by a copy from DRAM
to PMEM. Conversely, writes could be done as a copy to DRAM followed
by a page-based write.

You also elided the paragraphs where I point out that this is an example
of a more general problem; there really are people who want to do RDMA
to DAX memory (the HPC crowd, of course), and we need to not open up
security holes when enabling that. Since it's a potentially long-duration
and bi-directional mapping, the copy solution isn't going to work here
(without going all the way to solution 1).

Dave Chinner

2014-10-13 01:19:32 UTC

Post by Matthew Wilcox
One of the things on my todo list is making O_DIRECT work to a
memory-mapped direct_access file.

There are two filesystems involved ... if both (or neither!) are DAX,
everything's fine. The problem comes when you do things this way around:

int cachefd = open("/dax/cache", O_RDWR);
int datafd = open("/nfs/bigdata", O_RDWR | O_DIRECT);
void *cache = mmap(NULL, 1024 * 1024 * 1024, PROT_READ | PROT_WRITE,
                   MAP_SHARED, cachefd, 0);
read(datafd, cache, 1024 * 1024);

The non-DAX filesystem needs to pin pages from the DAX filesystem while
they're under I/O.

OK, that's what I was missing - it's not direct IO into/out of the
DAX filesystem - it's when you use the mmap()d DAX pages as the
source/destination of said direct IO.

Cheers,

Dave.

Boaz Harrosh

2014-10-19 09:51:38 UTC

Post by Matthew Wilcox
One of the things on my todo list is making O_DIRECT work to a
memory-mapped direct_access file.

This BTW works today. What happens is that get_user_pages() fails, so
the direct IO on NFS above fails and the VFS just reverts to buffered IO, which
works fine with a simple memcpy to/from NFS's page cache.

Post by Matthew Wilcox
The non-DAX filesystem needs to pin pages from the DAX filesystem while
they're under I/O.
Another attempt to solve this problem might be to turn the O_DIRECT
read into a read into a page of DRAM, followed by a copy from DRAM
to PMEM. Conversely, writes could be done as a copy to DRAM followed
by a page-based write.

So that's kind of stupid; why not let it be @datafd's page cache, like
what actually happens today?

Post by Matthew Wilcox
You also elided the paragraphs where I point out that this is an example
of a more general problem; there really are people who want to do RDMA
to DAX memory (the HPC crowd, of course),

I do not yet see how, in your proposal, you can ever do RDMA without my
page-structs-for-pmem patch. This was exactly my motivation to enable this,
and to enable direct block layer access to pmem.

And yes, once the page-struct ref is held, say by RDMA, the block must be left
unallocatable until its refcount drops. This is exactly what we did in our
pmem+pages based FS.

Today RDMA and/or any other subsystem access is not possible, and does not
have this problem.

Post by Matthew Wilcox
And we need to not open up
security holes when enabling that. Since it's a potentially long-duration
and bi-directional mapping, the copy solution isn't going to work here

I agree we should be careful to not open any holes. If done right it should be
good. A pmem aware FS should monitor the reference count of the pmem-page-struct
and if still held must not recycle that block to free-store but keep it held
until the reference drops. It is quite simple really.

That said, a sane application should not have this problem. There should not be
a possibility for the RDMA to access loosely coupled pages that belong to nothing
(that used to belong to an mmapped file). For example, taking some kind of flock on
the file will make the truncate wait until the file is closed by the app. And the app
does not close it until the RDMA mapping is closed. Otherwise what is the point of this app?

I agree that exposing pmem to external subsystems, unlike today, might pose new
challenges. But these are doable.

On top of Matthew's DAX patches, there can be a simple API established with the FS
where dax_truncate_page can communicate that a certain block must not yet be returned
to free-store after the truncate, and will be returned to free-store later on.

Thanks
Boaz


Jan Kara

2014-10-10 13:08:05 UTC

Post by Matthew Wilcox
One of the things on my todo list is making O_DIRECT work to a
memory-mapped direct_access file. Right now, it simply doesn't work
because there's no struct page for the memory, so get_user_pages() fails.
Boaz has posted a patch to create struct pages for direct_access files,
which is certainly one way of solving the immediate problem, but it
ignores the deeper problem.

Maybe we can set some terminology - direct IO has two 'endpoints' (I
don't want to talk about source / target because that swaps when talking
about reads / writes). One endpoint is a 'buffer' and another endpoint is a
'storage'. Now 'buffer' may be a memory mapped file on some filesystem.
In your case what isn't working is when 'buffer' is an mmapped file on a DAX
filesystem.

Post by Matthew Wilcox
For normal files, get_user_pages() elevates the reference count on
the pages. If those pages are subsequently truncated from the file,
the underlying file blocks are released to the filesystem's free pool.
The pages are removed from the page cache and the process's address space,
but hang around until the caller of get_user_pages() calls put_page() on
them again at which point they are released into the pool of free pages.

Once we have a struct page for (or some other way to handle pinning of)
persistent memory blocks, truncating a file that has pinned pages will
still cause the disk blocks to be released to the free pool. But there
weren't any pages of DRAM between the filesystem and the application!
So those blocks are "freed" while still referenced. And that reference
might well be programmed into a piece of hardware that's doing DMA;
it can't be stopped.

1. If get_user_pages() is called, copy from PMEM into DRAM, and provide
the caller with the struct pages of the DRAM. Modify DAX to handle some
file pages being in the page cache, and make sure that we know whether
the PMEM or DRAM is up to date. This has the obvious downside that
get_user_pages() becomes slow.

2. Modify filesystems that support DAX to handle pinning blocks.
Some filesystems (that support COW and snapshots) already support
reference-counting individual blocks. We may be able to do better by
using a tree of pinned extents or something. This makes it much harder
to modify a filesystem to support DAX, and I don't see patches adding
this capability to ext2 being warmly welcomed.

3. Make truncate() block if it hits a pinned page. There's really no
good reason to truncate a file that has pinned pages; it's either a bug
or you're trying to be nasty to someone. We actually already have code
for this; inode_dio_wait() / inode_dio_done(). But page pinning isn't
just for O_DIRECT I/Os and other transient users like crypto, it's also
for long-lived things like RDMA, where we could potentially block for
an indefinite time.

What option 3 seems to implicitly assume is that there are 'struct
pages' to pin. So do you expect to add struct page to PFNs which were a
target of get_user_pages()? And then check whether PFN is pinned (has
corresponding struct page) in the truncate code?

Note that inode_dio_wait() isn't really what you look for. That waits for
DIO pending against 'storage'. Currently we don't track in any way (except
for elevated page reference counts) that 'buffer' is an endpoint of direct
IO.

Thinking about options over and over again, I think trying something like
2) might be good. I'd still attach struct page to pinned PFNs to avoid some
troubles but you could delay freeing of fs blocks if they are pinned by
get_user_pages(). You could just hook into a path where filesystem frees
blocks - e.g. ext4 already does this anyway in ext4_mb_free_metadata()
since we free blocks in in-memory bitmaps only after the current
transaction is committed (changes in in-memory bitmaps happen from
ext4_journal_commit_callback(), which calls ext4_free_data_callback()). So
ext4 already handles the situation where in-memory bitmaps are different
from on disk ones and what you need is no different.
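
As a rough illustration of delaying the free while a block is pinned, here is a small userspace sketch. It is not ext4 code and does not model journalling. The per-block pin counter and the names toy_alloc, pin_block, fs_free_block and unpin_block are all assumptions made up for this example. The point it shows is that the filesystem's free request is honoured immediately for unpinned blocks and parked until the last reference drops for pinned ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NBLOCKS 32   /* toy device size, in blocks */

struct toy_alloc {
    bool     allocated[NBLOCKS];     /* in-memory free-space bitmap */
    bool     free_pending[NBLOCKS];  /* freed by the fs while still pinned */
    uint32_t pins[NBLOCKS];          /* stand-in for get_user_pages() refs */
};

/* Take a reference on an allocated block. */
static void pin_block(struct toy_alloc *a, uint32_t blk)
{
    assert(blk < NBLOCKS && a->allocated[blk]);
    a->pins[blk]++;
}

/* Called from the (hypothetical) path where the filesystem frees a block. */
static void fs_free_block(struct toy_alloc *a, uint32_t blk)
{
    assert(blk < NBLOCKS && a->allocated[blk]);
    if (a->pins[blk] > 0)
        a->free_pending[blk] = true;   /* defer: someone may still DMA here */
    else
        a->allocated[blk] = false;     /* nobody is looking, free right away */
}

/* Drop a reference; complete a deferred free when the last one goes. */
static void unpin_block(struct toy_alloc *a, uint32_t blk)
{
    assert(blk < NBLOCKS && a->pins[blk] > 0);
    if (--a->pins[blk] == 0 && a->free_pending[blk]) {
        a->free_pending[blk] = false;
        a->allocated[blk] = false;
    }
}

/* True once the block may be handed out to another file. */
static bool block_reusable(const struct toy_alloc *a, uint32_t blk)
{
    return blk < NBLOCKS && !a->allocated[blk];
}
```

In this sketch a block pinned twice and then freed stays unavailable until both pins are dropped, which is the behaviour the delayed-free hook is meant to provide.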

Honza

Matthew Wilcox

2014-10-10 14:24:30 UTC

Post by Jan Kara

Good terminology :-)

Post by Jan Kara

Post by Matthew Wilcox
2. Modify filesystems that support DAX to handle pinning blocks.
Some filesystems (that support COW and snapshots) already support
reference-counting individual blocks. We may be able to do better by
using a tree of pinned extents or something. This makes it much harder
to modify a filesystem to support DAX, and I don't see patches adding
this capability to ext2 being warmly welcomed.
3. Make truncate() block if it hits a pinned page. There's really no
good reason to truncate a file that has pinned pages; it's either a bug
or you're trying to be nasty to someone. We actually already have code
for this; inode_dio_wait() / inode_dio_done(). But page pinning isn't
just for O_DIRECT I/Os and other transient users like crypto, it's also
for long-lived things like RDMA, where we could potentially block for
an indefinite time.

I'm assuming that we come up with *some* way to solve the missing struct
page problem. Whether it's restructuring splice, O_DIRECT and RDMA to do
without struct pages, whether it's dynamically allocating struct pages,
whether it's statically allocating struct pages, whether it's coming up
with some other data structure that takes the place of struct page for
DAX ... doesn't matter for this part of the conversation.

Post by Jan Kara
Note that inode_dio_wait() isn't really what you look for. That waits for
DIO pending against 'storage'. Currently we don't track in any way (except
for elevated page reference counts) that 'buffer' is an endpoint of direct
IO.

Ah, I wasn't clear ... I was proposing incrementing i_dio_count on the
buffer's inode when get_user_pages() was called.
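
A single-threaded toy model of that proposal might look like the following. It is only a sketch: the real kernel counter and wait primitives are not used, and toy_inode, toy_pin, toy_unpin and toy_try_truncate are hypothetical names. Where the kernel would sleep until the count reaches zero, this model just reports that the truncate would have to wait.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy inode for the file backing the 'buffer' endpoint. */
struct toy_inode {
    uint64_t size;
    uint32_t pin_count;   /* plays the role of i_dio_count in this sketch */
};

/* Would be called when a get_user_pages()-like call pins the buffer. */
static void toy_pin(struct toy_inode *inode)
{
    inode->pin_count++;
}

/* Would be called when the caller releases its pages. */
static void toy_unpin(struct toy_inode *inode)
{
    assert(inode->pin_count > 0);
    inode->pin_count--;
}

/*
 * Returns true if the size was changed. Returns false if pins are
 * outstanding, i.e. the point where a real truncate() would block.
 */
static bool toy_try_truncate(struct toy_inode *inode, uint64_t new_size)
{
    if (inode->pin_count > 0)
        return false;
    inode->size = new_size;
    return true;
}
```

The model also makes the RDMA concern visible: if toy_unpin() is never called, toy_try_truncate() never succeeds.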

Post by Jan Kara
Thinking about options over and over again, I think trying something like
2) might be good. I'd still attach struct page to pinned PFNs to avoid some
troubles but you could delay freeing of fs blocks if they are pinned by
get_user_pages(). You could just hook into a path where filesystem frees
blocks - e.g. ext4 already does this anyway in ext4_mb_free_metadata()
since we free blocks in in-memory bitmaps only after the current
transaction is committed (changes in in-memory bitmaps happen from
ext4_journal_commit_callback(), which calls ext4_free_data_callback()). So
ext4 already handles the situation where in-memory bitmaps are different
from on disk ones and what you need is no different.

If this is something that (some) filesystems already do, then I feel
much happier about this idea!

Boaz Harrosh

2014-10-19 11:08:07 UTC

On 10/10/2014 05:24 PM, Matthew Wilcox wrote:
<>

Post by Matthew Wilcox
I'm assuming that we come up with *some* way to solve the missing struct
page problem. Whether it's restructuring splice, O_DIRECT and RDMA to do
without struct pages,

That makes no sense to me; where will it end? You are doubling the size of the
code to have two paths, and there will always be a subsystem you did not touch
that is missing support. And why? struct page was already invented to do exactly
what you want: track the state of a PFN.

Post by Matthew Wilcox
whether it's dynamically allocating struct pages,

I have tried this. It does not work. The PFN <-> page mapping is calculated directly
from the physical/virtual addresses, through the use of the section object.

struct page is actually just a part of a bigger "section" object. You do
not allocate an individual page-struct. You need to allocate a memory "section".
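
The section relationship can be illustrated with a toy model along these lines. This is a simplified userspace sketch, not the kernel's memory-model code: the section size, the structures and the toy_* names are invented for the example, and storing the section number in each page descriptor is an assumption made to keep the reverse lookup simple. It shows why a descriptor for one PFN only exists once the whole section containing it has been populated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PFN_SECTION_SHIFT 4                      /* toy: 16 pages per section */
#define PAGES_PER_SECTION (1UL << PFN_SECTION_SHIFT)
#define NR_SECTIONS       8

/* Toy page descriptor; remembers which section it belongs to. */
struct toy_page {
    unsigned long section_nr;
    uint32_t refcount;
};

/* One entry per section; mem_map is NULL until the section is populated. */
struct toy_section {
    struct toy_page *mem_map;
};

static struct toy_section sections[NR_SECTIONS];

/* Populate a whole section at once, in the spirit of memory hotplug. */
static void toy_online_section(unsigned long nr, struct toy_page *map)
{
    assert(nr < NR_SECTIONS);
    for (unsigned long i = 0; i < PAGES_PER_SECTION; i++) {
        map[i].section_nr = nr;
        map[i].refcount = 0;
    }
    sections[nr].mem_map = map;
}

/* PFN -> descriptor is pure arithmetic on the section table. */
static struct toy_page *toy_pfn_to_page(unsigned long pfn)
{
    unsigned long nr = pfn >> PFN_SECTION_SHIFT;

    if (nr >= NR_SECTIONS || sections[nr].mem_map == NULL)
        return NULL;                             /* no descriptors here */
    return &sections[nr].mem_map[pfn & (PAGES_PER_SECTION - 1)];
}

/* Descriptor -> PFN uses the recorded section and the array offset. */
static unsigned long toy_page_to_pfn(const struct toy_page *page)
{
    const struct toy_page *base = sections[page->section_nr].mem_map;

    return page->section_nr * PAGES_PER_SECTION + (unsigned long)(page - base);
}
```

Since the lookup is arithmetic rather than a search, there is no slot in which to hang a single dynamically allocated descriptor for one pinned PFN; the granularity is the section.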

Post by Matthew Wilcox
whether it's statically allocating struct pages,

The best I came up with was "hotplug" allocation. It is rather static, but
hot-pluggable. Please inspect my patches. This came out very simple: the minimum code
possible that gives you all the above support, without needing to change any of these
subsystems, and it plugs nicely into all the other subsystems you have not mentioned.

Post by Matthew Wilcox
whether it's coming up
with some other data structure that takes the place of struct page for
DAX ...

Again, why reinvent the wheel when the old one works perfectly and does
everything you want, including the most important aspect: not adding any
new infrastructure or modifying any code? So why even think about it?

Post by Matthew Wilcox
doesn't matter for this part of the conversation.

I agree, this does not solve the reference problem; in this case DAX will
need a new entry into the FS to communicate a delayed free-block. But as Jan
pointed out, this is not against current FS structure.

I think lots of current DAX problems and performance shortcomings can be
solved very nicely if we assume we have struct-page for pmem. For example,
the use of the page-lock instead of the i_mutex we take today.

Thanks
Boaz


Dave Chinner

2014-10-19 23:01:52 UTC

Post by Boaz Harrosh
<>

Post by Matthew Wilcox
I'm assuming that we come up with *some* way to solve the missing struct
page problem. Whether it's restructuring splice, O_DIRECT and RDMA to do
without struct pages,

.....

Post by Boaz Harrosh

Post by Matthew Wilcox
whether it's coming up
with some other data structure that takes the place of struct page for
DAX ...

Post by Matthew Wilcox
doesn't matter for this part of the conversation.

Which makes me look at what DAX is intended for.

DAX is an enabler, allowing us to get direct access to PMEM with
*existing filesystem technologies*. I don't want to have to add new
extent management functions to XFS to add temporary references to
allow DAX to hold onto extents after an inode has been freed because
some RDMA app has pinned the PMEM and forgot to let it go. That way
lies madness for existing filesystems - yes, we can add such warts
to them, but it's ugly, nasty and needed only by a very, very small
lunatic fringe of users.

IMO, this proposal is way outside the original DAX-replaces-XIP scope;
I really don't think that requiring extensive modifications to
filesystems to use DAX is a good idea. Apart from it being contrary to the
original architectural goal of DAX (which was "enable direct access
with minimal filesystem implementation impact"), we risk significant
impact on non-DAX users by requiring architectural changes to the
underlying filesystems to support DAX.

So my question is this: at what point do we say "out of scope for
DAX, make this work with a native PMEM filesystem"? DAX as it
stands fills the "95% of what people need" goal with minimal effort;
our efforts should be focussed on merging what we have, not creeping
the scope and making it harder to implement and get merged.

If we want RDMA into PMEM devices or direct IO to/from persistent
memory, then I'd suggest that this is functionality that belongs in
native PMEM storage devices/filesystems and should be designed to be
efficient in that environment from the ground up.

Cheers,

Dave.

Boaz Harrosh

2014-10-21 09:17:42 UTC