[PATCH v11 01/21] axonram: Fix bug in direct_access

Matthew Wilcox

2014-09-25 20:33:18 UTC

The 'pfn' returned by axonram was completely bogus, and has been since
2008.

Signed-off-by: Matthew Wilcox <***@intel.com>
Reviewed-by: Jan Kara <***@suse.cz>
---
arch/powerpc/sysdev/axonram.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 47b6b9f..830edc8 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -156,7 +156,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector,
 	}

 	*kaddr = (void *)(bank->ph_addr + offset);
-	*pfn = virt_to_phys(kaddr) >> PAGE_SHIFT;
+	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;

 	return 0;
 }

--
2.1.0
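
The one-character change is easier to see in isolation. What follows is a minimal user-space sketch, not the driver code: fake_virt_to_phys(), FAKE_PAGE_SHIFT, pfn_buggy() and pfn_fixed() are hypothetical names invented here, and the identity "translation" is an assumption made only so the example compiles outside the kernel. It shows that passing kaddr translates the address of the caller's pointer variable, while passing *kaddr translates the memory address that was just stored through it.

```c
#include <stddef.h>
#include <stdint.h>

#define FAKE_PAGE_SHIFT 12	/* hypothetical stand-in for PAGE_SHIFT */

/* Hypothetical stand-in for virt_to_phys(): an identity mapping. */
static uintptr_t fake_virt_to_phys(const void *addr)
{
	return (uintptr_t)addr;
}

/* Buggy form: translates the address of the caller's pointer variable. */
static uintptr_t pfn_buggy(char *base, size_t offset, void **kaddr)
{
	*kaddr = base + offset;
	return fake_virt_to_phys(kaddr) >> FAKE_PAGE_SHIFT;
}

/* Fixed form: translates the address that was just stored in *kaddr. */
static uintptr_t pfn_fixed(char *base, size_t offset, void **kaddr)
{
	*kaddr = base + offset;
	return fake_virt_to_phys(*kaddr) >> FAKE_PAGE_SHIFT;
}
```

With the buggy form the result depends on where the caller happens to keep its pointer variable, not on the memory bank at all.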

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Matthew Wilcox

2014-09-25 20:33:22 UTC

Use an inode flag to tag inodes which should avoid using the page cache.
Convert ext2 to use it instead of mapping_is_xip(). Prevent I/Os to
files tagged with the DAX flag from falling back to buffered I/O.

Signed-off-by: Matthew Wilcox <***@intel.com>
Reviewed-by: Jan Kara <***@suse.cz>
---
fs/ext2/inode.c | 9 ++++++---
fs/ext2/xip.h | 2 --
include/linux/fs.h | 6 ++++++
mm/filemap.c | 19 ++++++++++++-------
4 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 36d35c3..0cb0448 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -731,7 +731,7 @@ static int ext2_get_blocks(struct inode *inode,
goto cleanup;
}

- if (ext2_use_xip(inode->i_sb)) {
+ if (IS_DAX(inode)) {
/*
* we need to clear the block
*/
@@ -1201,7 +1201,7 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)

inode_dio_wait(inode);

- if (mapping_is_xip(inode->i_mapping))
+ if (IS_DAX(inode))
error = xip_truncate_page(inode->i_mapping, newsize);
else if (test_opt(inode->i_sb, NOBH))
error = nobh_truncate_page(inode->i_mapping,
@@ -1273,7 +1273,8 @@ void ext2_set_inode_flags(struct inode *inode)
{
unsigned int flags = EXT2_I(inode)->i_flags;

- inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+ inode->i_flags &= ~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME |
+ S_DIRSYNC | S_DAX);
if (flags & EXT2_SYNC_FL)
inode->i_flags |= S_SYNC;
if (flags & EXT2_APPEND_FL)
@@ -1284,6 +1285,8 @@ void ext2_set_inode_flags(struct inode *inode)
inode->i_flags |= S_NOATIME;
if (flags & EXT2_DIRSYNC_FL)
inode->i_flags |= S_DIRSYNC;
+ if (test_opt(inode->i_sb, XIP))
+ inode->i_flags |= S_DAX;
}

/* Propagate flags from i_flags to EXT2_I(inode)->i_flags */
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 18b34d2..29be737 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -16,9 +16,7 @@ static inline int ext2_use_xip (struct super_block *sb)
}
int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
void **, unsigned long *);
-#define mapping_is_xip(map) unlikely(map->a_ops->get_xip_mem)
#else
-#define mapping_is_xip(map) 0
#define ext2_xip_verify_sb(sb) do { } while (0)
#define ext2_use_xip(sb) 0
#define ext2_clear_xip_target(inode, chain) 0
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9418772..e99e5c4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1605,6 +1605,7 @@ struct super_operations {
#define S_IMA 1024 /* Inode has an associated IMA struct */
#define S_AUTOMOUNT 2048 /* Automount/referral quasi-directory */
#define S_NOSEC 4096 /* no suid or xattr security attributes */
+#define S_DAX 8192 /* Direct Access, avoiding the page cache */

/*
* Note that nosuid etc flags are inode-specific: setting some file-system
@@ -1642,6 +1643,11 @@ struct super_operations {
#define IS_IMA(inode) ((inode)->i_flags & S_IMA)
#define IS_AUTOMOUNT(inode) ((inode)->i_flags & S_AUTOMOUNT)
#define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC)
+#ifdef CONFIG_FS_XIP
+#define IS_DAX(inode) ((inode)->i_flags & S_DAX)
+#else
+#define IS_DAX(inode) 0
+#endif

/*
* Inode state bits. Protected by inode->i_lock
diff --git a/mm/filemap.c b/mm/filemap.c
index 90effcd..fec4db9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1718,9 +1718,11 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
* we've already read everything we wanted to, or if
* there was a short read because we hit EOF, go ahead
* and return. Otherwise fallthrough to buffered io for
- * the rest of the read.
+ * the rest of the read. Buffered reads will not work for
+ * DAX files, so don't bother trying.
*/
- if (retval < 0 || !iov_iter_count(iter) || *ppos >= size) {
+ if (retval < 0 || !iov_iter_count(iter) || *ppos >= size ||
+ IS_DAX(inode)) {
file_accessed(file);
goto out;
}
@@ -2584,13 +2586,16 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
loff_t endbyte;

written = generic_file_direct_write(iocb, from, pos);
- if (written < 0 || written == count)
- goto out;
-
/*
- * direct-io write to a hole: fall through to buffered I/O
- * for completing the rest of the request.
+ * If the write stopped short of completing, fall back to
+ * buffered writes. Some filesystems do this for writes to
+ * holes, for example. For DAX files, a buffered write will
+ * not succeed (even if it did, DAX does not handle dirty
+ * page-cache pages correctly).
*/
+ if (written < 0 || written == count || IS_DAX(inode))
+ goto out;
+
pos += written;
count -= written;

--
2.1.0
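
To make the two halves of this patch concrete, here is a small user-space sketch under stated assumptions. S_NOSEC, S_DAX and IS_DAX() are taken from the hunks above; struct toy_inode, toy_set_inode_flags(), toy_skip_buffered_fallback() and the mount_has_xip argument are hypothetical and exist only for illustration. The first function follows the clear-then-set pattern of ext2_set_inode_flags(), the second the new early-exit test in __generic_file_write_iter().

```c
#include <stdbool.h>
#include <stddef.h>

/* Flag values copied from the include/linux/fs.h hunk above. */
#define S_NOSEC	4096
#define S_DAX	8192

/* Hypothetical, cut-down inode: only the field this sketch needs. */
struct toy_inode {
	unsigned int i_flags;
};

#define IS_DAX(inode)	((inode)->i_flags & S_DAX)

/*
 * Clear the bit first, then set it again only if the (hypothetical)
 * mount_has_xip argument says the filesystem was mounted with XIP.
 */
static void toy_set_inode_flags(struct toy_inode *inode, bool mount_has_xip)
{
	inode->i_flags &= ~S_DAX;
	if (mount_has_xip)
		inode->i_flags |= S_DAX;
}

/*
 * Return true when the write path should stop after the direct write
 * instead of falling back to buffered I/O for the remainder.
 */
static bool toy_skip_buffered_fallback(const struct toy_inode *inode,
				       long written, size_t count)
{
	return written < 0 || (size_t)written == count || IS_DAX(inode);
}
```

A short write on a DAX inode therefore returns the short count to the caller, whereas the same short write on an ordinary inode continues into the buffered path.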


Mathieu Desnoyers

2014-10-16 09:35:17 UTC

Post by Matthew Wilcox
Use an inode flag to tag inodes which should avoid using the page cache.
Convert ext2 to use it instead of mapping_is_xip(). Prevent I/Os to
files tagged with the DAX flag from falling back to buffered I/O.

I agree that a DAX-enabled filesystem should not silently fall back to
buffered I/O, since that would void some guarantees about the persistence
of data that has been written to a DAX mmap()'d region.


--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Key fingerprint: 2A0B 4ED9 15F2 D3FA 45F5 B162 1728 0A97 8118 6ACF
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Matthew Wilcox

2014-09-25 20:33:23 UTC

From: Matthew Wilcox <***@linux.intel.com>

For DAX, we want to be able to copy between iovecs and kernel addresses
that don't necessarily have a struct page. This is a fairly simple
rearrangement for bvec iters to kmap the pages outside and pass them in,
but for user iovecs it gets more complicated because we might try various
different ways to kmap the memory. Duplicating the existing logic works
out best in this case.

We need to be able to write zeroes to an iovec for reads from unwritten
ranges in a file. This is performed by the new iov_iter_zero() function,
again patterned after the existing code that handles iovec iterators.

Signed-off-by: Matthew Wilcox <***@linux.intel.com>
---
include/linux/uio.h | 3 +
mm/iov_iter.c | 237 ++++++++++++++++++++++++++++++++++++++++++++++++----
2 files changed, 226 insertions(+), 14 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 48d64e6..1863ddd 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -80,6 +80,9 @@ size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i);
size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i);
+size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i);
+size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
+size_t iov_iter_zero(size_t bytes, struct iov_iter *);
unsigned long iov_iter_alignment(const struct iov_iter *i);
void iov_iter_init(struct iov_iter *i, int direction, const struct iovec *iov,
unsigned long nr_segs, size_t count);
diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index ab88dc0..d481fd8 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -4,6 +4,96 @@
#include <linux/slab.h>
#include <linux/vmalloc.h>

+static size_t copy_to_iter_iovec(void *from, size_t bytes, struct iov_iter *i)
+{
+ size_t skip, copy, left, wanted;
+ const struct iovec *iov;
+ char __user *buf;
+
+ if (unlikely(bytes > i->count))
+ bytes = i->count;
+
+ if (unlikely(!bytes))
+ return 0;
+
+ wanted = bytes;
+ iov = i->iov;
+ skip = i->iov_offset;
+ buf = iov->iov_base + skip;
+ copy = min(bytes, iov->iov_len - skip);
+
+ left = __copy_to_user(buf, from, copy);
+ copy -= left;
+ skip += copy;
+ from += copy;
+ bytes -= copy;
+ while (unlikely(!left && bytes)) {
+ iov++;
+ buf = iov->iov_base;
+ copy = min(bytes, iov->iov_len);
+ left = __copy_to_user(buf, from, copy);
+ copy -= left;
+ skip = copy;
+ from += copy;
+ bytes -= copy;
+ }
+
+ if (skip == iov->iov_len) {
+ iov++;
+ skip = 0;
+ }
+ i->count -= wanted - bytes;
+ i->nr_segs -= iov - i->iov;
+ i->iov = iov;
+ i->iov_offset = skip;
+ return wanted - bytes;
+}
+
+static size_t copy_from_iter_iovec(void *to, size_t bytes, struct iov_iter *i)
+{
+ size_t skip, copy, left, wanted;
+ const struct iovec *iov;
+ char __user *buf;
+
+ if (unlikely(bytes > i->count))
+ bytes = i->count;
+
+ if (unlikely(!bytes))
+ return 0;
+
+ wanted = bytes;
+ iov = i->iov;
+ skip = i->iov_offset;
+ buf = iov->iov_base + skip;
+ copy = min(bytes, iov->iov_len - skip);
+
+ left = __copy_from_user(to, buf, copy);
+ copy -= left;
+ skip += copy;
+ to += copy;
+ bytes -= copy;
+ while (unlikely(!left && bytes)) {
+ iov++;
+ buf = iov->iov_base;
+ copy = min(bytes, iov->iov_len);
+ left = __copy_from_user(to, buf, copy);
+ copy -= left;
+ skip = copy;
+ to += copy;
+ bytes -= copy;
+ }
+
+ if (skip == iov->iov_len) {
+ iov++;
+ skip = 0;
+ }
+ i->count -= wanted - bytes;
+ i->nr_segs -= iov - i->iov;
+ i->iov = iov;
+ i->iov_offset = skip;
+ return wanted - bytes;
+}
+
static size_t copy_page_to_iter_iovec(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i)
{
@@ -166,6 +256,50 @@ done:
return wanted - bytes;
}

+static size_t zero_iovec(size_t bytes, struct iov_iter *i)
+{
+ size_t skip, copy, left, wanted;
+ const struct iovec *iov;
+ char __user *buf;
+
+ if (unlikely(bytes > i->count))
+ bytes = i->count;
+
+ if (unlikely(!bytes))
+ return 0;
+
+ wanted = bytes;
+ iov = i->iov;
+ skip = i->iov_offset;
+ buf = iov->iov_base + skip;
+ copy = min(bytes, iov->iov_len - skip);
+
+ left = __clear_user(buf, copy);
+ copy -= left;
+ skip += copy;
+ bytes -= copy;
+
+ while (unlikely(!left && bytes)) {
+ iov++;
+ buf = iov->iov_base;
+ copy = min(bytes, iov->iov_len);
+ left = __clear_user(buf, copy);
+ copy -= left;
+ skip = copy;
+ bytes -= copy;
+ }
+
+ if (skip == iov->iov_len) {
+ iov++;
+ skip = 0;
+ }
+ i->count -= wanted - bytes;
+ i->nr_segs -= iov - i->iov;
+ i->iov = iov;
+ i->iov_offset = skip;
+ return wanted - bytes;
+}
+
static size_t __iovec_copy_from_user_inatomic(char *vaddr,
const struct iovec *iov, size_t base, size_t bytes)
{
@@ -412,12 +546,17 @@ static void memcpy_to_page(struct page *page, size_t offset, char *from, size_t
kunmap_atomic(to);
}

-static size_t copy_page_to_iter_bvec(struct page *page, size_t offset, size_t bytes,
- struct iov_iter *i)
+static void memzero_page(struct page *page, size_t offset, size_t len)
+{
+ char *addr = kmap_atomic(page);
+ memset(addr + offset, 0, len);
+ kunmap_atomic(addr);
+}
+
+static size_t copy_to_iter_bvec(void *from, size_t bytes, struct iov_iter *i)
{
size_t skip, copy, wanted;
const struct bio_vec *bvec;
- void *kaddr, *from;

if (unlikely(bytes > i->count))
bytes = i->count;
@@ -430,8 +569,6 @@ static size_t copy_page_to_iter_bvec(struct page *page, size_t offset, size_t by
skip = i->iov_offset;
copy = min_t(size_t, bytes, bvec->bv_len - skip);

- kaddr = kmap_atomic(page);
- from = kaddr + offset;
memcpy_to_page(bvec->bv_page, skip + bvec->bv_offset, from, copy);
skip += copy;
from += copy;
@@ -444,7 +581,6 @@ static size_t copy_page_to_iter_bvec(struct page *page, size_t offset, size_t by
from += copy;
bytes -= copy;
}
- kunmap_atomic(kaddr);
if (skip == bvec->bv_len) {
bvec++;
skip = 0;
@@ -456,12 +592,10 @@ static size_t copy_page_to_iter_bvec(struct page *page, size_t offset, size_t by
return wanted - bytes;
}

-static size_t copy_page_from_iter_bvec(struct page *page, size_t offset, size_t bytes,
- struct iov_iter *i)
+static size_t copy_from_iter_bvec(void *to, size_t bytes, struct iov_iter *i)
{
size_t skip, copy, wanted;
const struct bio_vec *bvec;
- void *kaddr, *to;

if (unlikely(bytes > i->count))
bytes = i->count;
@@ -473,10 +607,6 @@ static size_t copy_page_from_iter_bvec(struct page *page, size_t offset, size_t
bvec = i->bvec;
skip = i->iov_offset;

- kaddr = kmap_atomic(page);
-
- to = kaddr + offset;
-
copy = min(bytes, bvec->bv_len - skip);

memcpy_from_page(to, bvec->bv_page, bvec->bv_offset + skip, copy);
@@ -493,7 +623,6 @@ static size_t copy_page_from_iter_bvec(struct page *page, size_t offset, size_t
to += copy;
bytes -= copy;
}
- kunmap_atomic(kaddr);
if (skip == bvec->bv_len) {
bvec++;
skip = 0;
@@ -505,6 +634,61 @@ static size_t copy_page_from_iter_bvec(struct page *page, size_t offset, size_t
return wanted;
}

+static size_t copy_page_to_iter_bvec(struct page *page, size_t offset,
+ size_t bytes, struct iov_iter *i)
+{
+ void *kaddr = kmap_atomic(page);
+ size_t wanted = copy_to_iter_bvec(kaddr + offset, bytes, i);
+ kunmap_atomic(kaddr);
+ return wanted;
+}
+
+static size_t copy_page_from_iter_bvec(struct page *page, size_t offset,
+ size_t bytes, struct iov_iter *i)
+{
+ void *kaddr = kmap_atomic(page);
+ size_t wanted = copy_from_iter_bvec(kaddr + offset, bytes, i);
+ kunmap_atomic(kaddr);
+ return wanted;
+}
+
+static size_t zero_bvec(size_t bytes, struct iov_iter *i)
+{
+ size_t skip, copy, wanted;
+ const struct bio_vec *bvec;
+
+ if (unlikely(bytes > i->count))
+ bytes = i->count;
+
+ if (unlikely(!bytes))
+ return 0;
+
+ wanted = bytes;
+ bvec = i->bvec;
+ skip = i->iov_offset;
+ copy = min_t(size_t, bytes, bvec->bv_len - skip);
+
+ memzero_page(bvec->bv_page, skip + bvec->bv_offset, copy);
+ skip += copy;
+ bytes -= copy;
+ while (bytes) {
+ bvec++;
+ copy = min(bytes, (size_t)bvec->bv_len);
+ memzero_page(bvec->bv_page, bvec->bv_offset, copy);
+ skip = copy;
+ bytes -= copy;
+ }
+ if (skip == bvec->bv_len) {
+ bvec++;
+ skip = 0;
+ }
+ i->count -= wanted - bytes;
+ i->nr_segs -= bvec - i->bvec;
+ i->bvec = bvec;
+ i->iov_offset = skip;
+ return wanted - bytes;
+}
+
static size_t copy_from_user_bvec(struct page *page,
struct iov_iter *i, unsigned long offset, size_t bytes)
{
@@ -668,6 +852,31 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
}
EXPORT_SYMBOL(copy_page_from_iter);

+size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
+{
+ if (i->type & ITER_BVEC)
+ return copy_to_iter_bvec(addr, bytes, i);
+ else
+ return copy_to_iter_iovec(addr, bytes, i);
+}
+
+size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
+{
+ if (i->type & ITER_BVEC)
+ return copy_from_iter_bvec(addr, bytes, i);
+ else
+ return copy_from_iter_iovec(addr, bytes, i);
+}
+
+size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
+{
+ if (i->type & ITER_BVEC) {
+ return zero_bvec(bytes, i);
+ } else {
+ return zero_iovec(bytes, i);
+ }
+}
+
size_t iov_iter_copy_from_user_atomic(struct page *page,
struct iov_iter *i, unsigned long offset, size_t bytes)
{

--
2.1.0
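
The segment-walking logic shared by copy_to_iter_iovec() and zero_iovec() can be sketched in user space as follows. This is a simplified illustration, not the kernel code: struct toy_iovec, struct toy_iter and the toy_* functions are hypothetical, and memcpy()/memset() stand in for __copy_to_user()/__clear_user(). Because those stand-ins cannot fault, the partial-copy ("left") handling from the patch is omitted.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down versions of struct iovec and struct iov_iter. */
struct toy_iovec {
	char *iov_base;
	size_t iov_len;
};

struct toy_iter {
	const struct toy_iovec *iov;	/* current segment */
	size_t nr_segs;			/* segments left, including current */
	size_t iov_offset;		/* bytes already used in current segment */
	size_t count;			/* total bytes left in the iterator */
};

/*
 * Copy (from != NULL) or zero-fill (from == NULL) up to 'bytes' bytes into
 * the iterator, clamped to i->count, and advance the iterator.
 */
static size_t toy_fill_iter(const char *from, size_t bytes, struct toy_iter *i)
{
	size_t done = 0;

	if (bytes > i->count)
		bytes = i->count;

	while (done < bytes) {
		const struct toy_iovec *iov = i->iov;
		size_t room = iov->iov_len - i->iov_offset;
		size_t copy = bytes - done < room ? bytes - done : room;
		char *dst = iov->iov_base + i->iov_offset;

		if (from)
			memcpy(dst, from + done, copy);
		else
			memset(dst, 0, copy);

		done += copy;
		i->iov_offset += copy;
		if (i->iov_offset == iov->iov_len) {
			/* Segment is full: step to the next one. */
			i->iov++;
			i->nr_segs--;
			i->iov_offset = 0;
		}
	}
	i->count -= done;
	return done;
}

/* Copy from a plain address; addr is assumed to be non-NULL. */
static size_t toy_copy_to_iter(const void *addr, size_t bytes, struct toy_iter *i)
{
	return toy_fill_iter(addr, bytes, i);
}

/* Zero-fill, as a read from an unwritten file range would need. */
static size_t toy_iter_zero(size_t bytes, struct toy_iter *i)
{
	return toy_fill_iter(NULL, bytes, i);
}
```

The source is a plain address rather than a struct page, which is the property the commit message asks for; a read that crosses from written data into an unwritten range could call the copy helper and then the zeroing helper on the same iterator.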


Mathieu Desnoyers

2014-10-16 13:33:55 UTC

Post by Matthew Wilcox
For DAX, we want to be able to copy between iovecs and kernel addresses
that don't necessarily have a struct page. This is a fairly simple
rearrangement for bvec iters to kmap the pages outside and pass them in,
but for user iovecs it gets more complicated because we might try various
different ways to kmap the memory. Duplicating the existing logic works
out best in this case.
We need to be able to write zeroes to an iovec for reads from unwritten
ranges in a file. This is performed by the new iov_iter_zero() function,
again patterned after the existing code that handles iovec iterators.
---
include/linux/uio.h | 3 +
mm/iov_iter.c | 237 ++++++++++++++++++++++++++++++++++++++++++++++++----
2 files changed, 226 insertions(+), 14 deletions(-)
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 48d64e6..1863ddd 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -80,6 +80,9 @@ size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i);
size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i);
+size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i);
+size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
+size_t iov_iter_zero(size_t bytes, struct iov_iter *);
unsigned long iov_iter_alignment(const struct iov_iter *i);
void iov_iter_init(struct iov_iter *i, int direction, const struct iovec *iov,
unsigned long nr_segs, size_t count);
diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index ab88dc0..d481fd8 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -4,6 +4,96 @@
#include <linux/slab.h>
#include <linux/vmalloc.h>
+static size_t copy_to_iter_iovec(void *from, size_t bytes, struct iov_iter *i)
+{
+ size_t skip, copy, left, wanted;
+ const struct iovec *iov;
+ char __user *buf;
+
+ if (unlikely(bytes > i->count))
+ bytes = i->count;
+
+ if (unlikely(!bytes))
+ return 0;
+
+ wanted = bytes;
+ iov = i->iov;
+ skip = i->iov_offset;
+ buf = iov->iov_base + skip;
+ copy = min(bytes, iov->iov_len - skip);
+
+ left = __copy_to_user(buf, from, copy);

How come this function uses __copy_to_user() without any access_ok()
check? This has security implications.

Post by Matthew Wilcox
+ copy -= left;
+ skip += copy;
+ from += copy;
+ bytes -= copy;
+ while (unlikely(!left && bytes)) {
+ iov++;
+ buf = iov->iov_base;
+ copy = min(bytes, iov->iov_len);
+ left = __copy_to_user(buf, from, copy);

same here.

Post by Matthew Wilcox
+ copy -= left;
+ skip = copy;
+ from += copy;
+ bytes -= copy;
+ }
+
+ if (skip == iov->iov_len) {
+ iov++;
+ skip = 0;
+ }
+ i->count -= wanted - bytes;
+ i->nr_segs -= iov - i->iov;
+ i->iov = iov;
+ i->iov_offset = skip;
+ return wanted - bytes;
+}
+
+static size_t copy_from_iter_iovec(void *to, size_t bytes, struct iov_iter *i)
+{
+ size_t skip, copy, left, wanted;
+ const struct iovec *iov;
+ char __user *buf;
+
+ if (unlikely(bytes > i->count))
+ bytes = i->count;
+
+ if (unlikely(!bytes))
+ return 0;
+
+ wanted = bytes;
+ iov = i->iov;
+ skip = i->iov_offset;
+ buf = iov->iov_base + skip;
+ copy = min(bytes, iov->iov_len - skip);
+
+ left = __copy_from_user(to, buf, copy);

same here.

Post by Matthew Wilcox
+ copy -= left;
+ skip += copy;
+ to += copy;
+ bytes -= copy;
+ while (unlikely(!left && bytes)) {
+ iov++;
+ buf = iov->iov_base;
+ copy = min(bytes, iov->iov_len);
+ left = __copy_from_user(to, buf, copy);

same.

Post by Matthew Wilcox
+ copy -= left;
+ skip = copy;
+ to += copy;
+ bytes -= copy;
+ }
+
+ if (skip == iov->iov_len) {
+ iov++;
+ skip = 0;
+ }
+ i->count -= wanted - bytes;
+ i->nr_segs -= iov - i->iov;
+ i->iov = iov;
+ i->iov_offset = skip;
+ return wanted - bytes;
+}
+
static size_t copy_page_to_iter_iovec(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i)
{
return wanted - bytes;
}
+static size_t zero_iovec(size_t bytes, struct iov_iter *i)
+{
+ size_t skip, copy, left, wanted;
+ const struct iovec *iov;
+ char __user *buf;
+
+ if (unlikely(bytes > i->count))
+ bytes = i->count;
+
+ if (unlikely(!bytes))
+ return 0;
+
+ wanted = bytes;
+ iov = i->iov;
+ skip = i->iov_offset;
+ buf = iov->iov_base + skip;
+ copy = min(bytes, iov->iov_len - skip);
+
+ left = __clear_user(buf, copy);

I would guess an access_ok() check is needed here too.

Post by Matthew Wilcox
+ copy -= left;
+ skip += copy;
+ bytes -= copy;
+
+ while (unlikely(!left && bytes)) {
+ iov++;
+ buf = iov->iov_base;
+ copy = min(bytes, iov->iov_len);
+ left = __clear_user(buf, copy);

Same.

Post by Matthew Wilcox
+ copy -= left;
+ skip = copy;
+ bytes -= copy;
+ }
+
+ if (skip == iov->iov_len) {
+ iov++;
+ skip = 0;
+ }
+ i->count -= wanted - bytes;
+ i->nr_segs -= iov - i->iov;
+ i->iov = iov;
+ i->iov_offset = skip;
+ return wanted - bytes;
+}
+
static size_t __iovec_copy_from_user_inatomic(char *vaddr,
const struct iovec *iov, size_t base, size_t bytes)
{
@@ -412,12 +546,17 @@ static void memcpy_to_page(struct page *page, size_t offset, char *from, size_t
kunmap_atomic(to);
}
-static size_t copy_page_to_iter_bvec(struct page *page, size_t offset, size_t bytes,
- struct iov_iter *i)
+static void memzero_page(struct page *page, size_t offset, size_t len)
+{
+ char *addr = kmap_atomic(page);
+ memset(addr + offset, 0, len);
+ kunmap_atomic(addr);
+}
+
+static size_t copy_to_iter_bvec(void *from, size_t bytes, struct iov_iter *i)
{
size_t skip, copy, wanted;
const struct bio_vec *bvec;
- void *kaddr, *from;
if (unlikely(bytes > i->count))
bytes = i->count;
@@ -430,8 +569,6 @@ static size_t copy_page_to_iter_bvec(struct page *page, size_t offset, size_t by
skip = i->iov_offset;
copy = min_t(size_t, bytes, bvec->bv_len - skip);
- kaddr = kmap_atomic(page);
- from = kaddr + offset;
memcpy_to_page(bvec->bv_page, skip + bvec->bv_offset, from, copy);
skip += copy;
from += copy;
@@ -444,7 +581,6 @@ static size_t copy_page_to_iter_bvec(struct page *page, size_t offset, size_t by
from += copy;
bytes -= copy;
}
- kunmap_atomic(kaddr);
if (skip == bvec->bv_len) {
bvec++;
skip = 0;
@@ -456,12 +592,10 @@ static size_t copy_page_to_iter_bvec(struct page *page, size_t offset, size_t by
return wanted - bytes;
}
-static size_t copy_page_from_iter_bvec(struct page *page, size_t offset, size_t bytes,
- struct iov_iter *i)
+static size_t copy_from_iter_bvec(void *to, size_t bytes, struct iov_iter *i)
{
size_t skip, copy, wanted;
const struct bio_vec *bvec;
- void *kaddr, *to;
if (unlikely(bytes > i->count))
bytes = i->count;
@@ -473,10 +607,6 @@ static size_t copy_page_from_iter_bvec(struct page *page, size_t offset, size_t
bvec = i->bvec;
skip = i->iov_offset;
- kaddr = kmap_atomic(page);
-
- to = kaddr + offset;
-
copy = min(bytes, bvec->bv_len - skip);
memcpy_from_page(to, bvec->bv_page, bvec->bv_offset + skip, copy);
@@ -493,7 +623,6 @@ static size_t copy_page_from_iter_bvec(struct page *page, size_t offset, size_t
to += copy;
bytes -= copy;
}
- kunmap_atomic(kaddr);
if (skip == bvec->bv_len) {
bvec++;
skip = 0;
@@ -505,6 +634,61 @@ static size_t copy_page_from_iter_bvec(struct page *page, size_t offset, size_t
return wanted;
}
+static size_t copy_page_to_iter_bvec(struct page *page, size_t offset,
+ size_t bytes, struct iov_iter *i)
+{
+ void *kaddr = kmap_atomic(page);
+ size_t wanted = copy_to_iter_bvec(kaddr + offset, bytes, i);

missing newline.

Post by Matthew Wilcox
+ kunmap_atomic(kaddr);
+ return wanted;
+}
+
+static size_t copy_page_from_iter_bvec(struct page *page, size_t offset,
+ size_t bytes, struct iov_iter *i)
+{
+ void *kaddr = kmap_atomic(page);
+ size_t wanted = copy_from_iter_bvec(kaddr + offset, bytes, i);

missing newline.

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Key fingerprint: 2A0B 4ED9 15F2 D3FA 45F5 B162 1728 0A97 8118 6ACF

Matthew Wilcox

2014-10-16 13:59:03 UTC

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+static size_t copy_to_iter_iovec(void *from, size_t bytes, struct iov_iter *i)
+{

[...]

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+ left = __copy_to_user(buf, from, copy);

How come this function uses __copy_to_user() without any access_ok()
check? This has security implications.

The access_ok() check is done higher up the call-chain if it's appropriate.
These functions can be (intentionally) called to access kernel addresses,
so it wouldn't be appropriate to do that here.
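
The layering described here can be sketched in user space as follows. Everything in this block is hypothetical: toy_user_arena stands in for the user address range, toy_access_ok() for access_ok(), and toy_copy_unchecked() for a helper built on __copy_to_user(). The thread does not say where in the real call chain the check lives, so the entry point below is only an assumed illustration of "check once at the boundary, trust below it".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical "user" address space: one static arena. */
static char toy_user_arena[64];

/* Hypothetical stand-in for access_ok(): is [ptr, ptr + len) in the arena? */
static bool toy_access_ok(const void *ptr, size_t len)
{
	uintptr_t start = (uintptr_t)toy_user_arena;
	uintptr_t p = (uintptr_t)ptr;

	return p >= start && p - start <= sizeof(toy_user_arena) &&
	       len <= sizeof(toy_user_arena) - (p - start);
}

/* Low-level helper: performs no range check and trusts its caller. */
static size_t toy_copy_unchecked(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
	return len;
}

/* Syscall-style entry point: validates the destination once, up front. */
static long toy_read_to_user(void *user_dst, const void *src, size_t len)
{
	if (!toy_access_ok(user_dst, len))
		return -1;	/* the kernel would return -EFAULT here */
	return (long)toy_copy_unchecked(user_dst, src, len);
}

/* Internal caller: the destination is a kernel buffer, so no check applies. */
static size_t toy_read_to_kernel(void *kernel_dst, const void *src, size_t len)
{
	return toy_copy_unchecked(kernel_dst, src, len);
}
```

If the check lived inside the low-level helper, the internal caller would be rejected even though its destination is legitimate, which is the situation the reply above describes.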

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+static size_t copy_page_to_iter_bvec(struct page *page, size_t offset,
+ size_t bytes, struct iov_iter *i)
+{
+ void *kaddr = kmap_atomic(page);
+ size_t wanted = copy_to_iter_bvec(kaddr + offset, bytes, i);

missing newline.

Post by Matthew Wilcox
+ kunmap_atomic(kaddr);
+ return wanted;
+}

Are you seriously suggesting that:

static size_t copy_page_to_iter_bvec(struct page *page, size_t offset,
size_t bytes, struct iov_iter *i)
{
void *kaddr = kmap_atomic(page);
size_t wanted = copy_to_iter_bvec(kaddr + offset, bytes, i);

kunmap_atomic(kaddr);
return wanted;
}

is more readable than without the newline? I can see the point of the
rule for functions with a lot of variables, or a lot of lines, but I
don't see the point of it for such a small function.

In any case, this patch is now upstream, so I shan't be proposing any
stylistic changes for it.


Mathieu Desnoyers

2014-10-16 14:12:06 UTC

----- Original Message -----

Sent: Thursday, October 16, 2014 3:59:03 PM
Subject: Re: [PATCH v11 06/21] vfs: Add copy_to_iter(), copy_from_iter() and iov_iter_zero()

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+static size_t copy_to_iter_iovec(void *from, size_t bytes, struct iov_iter *i)
+{

[...]

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+ left = __copy_to_user(buf, from, copy);

How come this function uses __copy_to_user() without any access_ok()
check? This has security implications.

The access_ok() check is done higher up the call-chain if it's appropriate.
These functions can be (intentionally) called to access kernel addresses,
so it wouldn't be appropriate to do that here.

If the access_ok() checks are expected to have been done already higher in
the call-chain, we might want to rename e.g. copy_to_iter_iovec() to
__copy_to_iter_iovec(). That helps clarify the check expectations for the
caller.

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+static size_t copy_page_to_iter_bvec(struct page *page, size_t offset,
+ size_t bytes, struct iov_iter *i)
+{
+ void *kaddr = kmap_atomic(page);
+ size_t wanted = copy_to_iter_bvec(kaddr + offset, bytes, i);

missing newline.

Post by Matthew Wilcox
+ kunmap_atomic(kaddr);
+ return wanted;
+}

Are you seriously suggesting that:

static size_t copy_page_to_iter_bvec(struct page *page, size_t offset,
					 size_t bytes, struct iov_iter *i)
{
	void *kaddr = kmap_atomic(page);
	size_t wanted = copy_to_iter_bvec(kaddr + offset, bytes, i);

	kunmap_atomic(kaddr);
	return wanted;
}

is more readable than without the newline? I can see the point of the
rule for functions with a lot of variables, or a lot of lines, but I
don't see the point of it for such a small function.

I usually find it easier to read when variables and code are split,
but I don't feel strongly about this in this particular case.

In any case, this patch is now upstream, so I shan't be proposing any
stylistic changes for it.

The leading __ prefix before the function names appears to be important
enough though, since it allows future changes of this code to take into
account the specific check expectations of those functions.

Thanks,

Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Matthew Wilcox

2014-10-16 22:21:46 UTC

Post by Mathieu Desnoyers

Post by Matthew Wilcox
The access_ok() check is done higher up the call-chain if it's appropriate.
These functions can be (intentionally) called to access kernel addresses,
so it wouldn't be appropriate to do that here.

If the access_ok() checks are expected to have been done already higher in
the call-chain, we might want to rename e.g. copy_to_iter_iovec() to
__copy_to_iter_iovec(). That helps clarify the check expectations for the
caller.

I'm following the existing convention in this file; it already had
copy_page_to_iter() and copy_page_from_iter() as exported symbols. I
just added copy_to_iter() and copy_from_iter().

Mathieu Desnoyers

2014-10-17 15:39:37 UTC

----- Original Message -----

Sent: Friday, October 17, 2014 12:21:46 AM
Subject: Re: [PATCH v11 06/21] vfs: Add copy_to_iter(), copy_from_iter() and iov_iter_zero()

Post by Mathieu Desnoyers

Post by Matthew Wilcox
The access_ok() check is done higher up the call-chain if it's appropriate.
These functions can be (intentionally) called to access kernel addresses,
so it wouldn't be appropriate to do that here.

If the access_ok() checks are expected to have been done already higher in
the call-chain, we might want to rename e.g. copy_to_iter_iovec() to
__copy_to_iter_iovec(). That helps clarify the check expectations for the
caller.

I'm following the existing convention in this file; it already had
copy_page_to_iter() and copy_page_from_iter() as exported symbols. I
just added copy_to_iter() and copy_from_iter().

I understand that you are following the local style. However, since style
nits like this have been known to let security issues creep into the
kernel in the past, it would be good to change this file to add the __
prefix to the pre-existing functions as well, perhaps in a separate patch.

Thanks,

Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Matthew Wilcox

2014-09-25 20:33:34 UTC

We shouldn't need a special address_space_operations any more

Signed-off-by: Matthew Wilcox <***@intel.com>
---
fs/ext2/ext2.h | 1 -
fs/ext2/inode.c | 7 +------
fs/ext2/namei.c | 4 ++--
3 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index b30c3bd..b8b1c11 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -793,7 +793,6 @@ extern const struct file_operations ext2_xip_file_operations;

/* inode.c */
extern const struct address_space_operations ext2_aops;
-extern const struct address_space_operations ext2_aops_xip;
extern const struct address_space_operations ext2_nobh_aops;

/* namei.c */
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 154cbcf..034fd42 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -891,11 +891,6 @@ const struct address_space_operations ext2_aops = {
.error_remove_page = generic_error_remove_page,
};

-const struct address_space_operations ext2_aops_xip = {
- .bmap = ext2_bmap,
- .direct_IO = ext2_direct_IO,
-};
-
const struct address_space_operations ext2_nobh_aops = {
.readpage = ext2_readpage,
.readpages = ext2_readpages,
@@ -1394,7 +1389,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext2_file_inode_operations;
if (test_opt(inode->i_sb, XIP)) {
- inode->i_mapping->a_ops = &ext2_aops_xip;
+ inode->i_mapping->a_ops = &ext2_aops;
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
inode->i_mapping->a_ops = &ext2_nobh_aops;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 7ca803f..0db888c 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode

inode->i_op = &ext2_file_inode_operations;
if (test_opt(inode->i_sb, XIP)) {
- inode->i_mapping->a_ops = &ext2_aops_xip;
+ inode->i_mapping->a_ops = &ext2_aops;
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
inode->i_mapping->a_ops = &ext2_nobh_aops;
@@ -126,7 +126,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)

inode->i_op = &ext2_file_inode_operations;
if (test_opt(inode->i_sb, XIP)) {
- inode->i_mapping->a_ops = &ext2_aops_xip;
+ inode->i_mapping->a_ops = &ext2_aops;
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
inode->i_mapping->a_ops = &ext2_nobh_aops;

--
2.1.0


Mathieu Desnoyers

2014-10-16 12:29:08 UTC

Post by Matthew Wilcox
We shouldn't need a special address_space_operations any more

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Key fingerprint: 2A0B 4ED9 15F2 D3FA 45F5 B162 1728 0A97 8118 6ACF


Matthew Wilcox

2014-09-25 20:33:32 UTC

These files are now empty, so delete them

Signed-off-by: Matthew Wilcox <***@intel.com>
---
fs/ext2/Makefile | 1 -
fs/ext2/inode.c | 1 -
fs/ext2/namei.c | 1 -
fs/ext2/super.c | 1 -
fs/ext2/xip.c | 15 ---------------
fs/ext2/xip.h | 16 ----------------
6 files changed, 35 deletions(-)
delete mode 100644 fs/ext2/xip.c
delete mode 100644 fs/ext2/xip.h

diff --git a/fs/ext2/Makefile b/fs/ext2/Makefile
index f42af45..445b0e9 100644
--- a/fs/ext2/Makefile
+++ b/fs/ext2/Makefile
@@ -10,4 +10,3 @@ ext2-y := balloc.o dir.o file.o ialloc.o inode.o \
ext2-$(CONFIG_EXT2_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o
ext2-$(CONFIG_EXT2_FS_POSIX_ACL) += acl.o
ext2-$(CONFIG_EXT2_FS_SECURITY) += xattr_security.o
-ext2-$(CONFIG_EXT2_FS_XIP) += xip.o
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index cba3833..154cbcf 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -34,7 +34,6 @@
#include <linux/aio.h>
#include "ext2.h"
#include "acl.h"
-#include "xip.h"
#include "xattr.h"

static int __ext2_write_inode(struct inode *inode, int do_sync);
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 846c356..7ca803f 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -35,7 +35,6 @@
#include "ext2.h"
#include "xattr.h"
#include "acl.h"
-#include "xip.h"

static inline int ext2_add_nondir(struct dentry *dentry, struct inode *inode)
{
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index d862031..0393c6d 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -35,7 +35,6 @@
#include "ext2.h"
#include "xattr.h"
#include "acl.h"
-#include "xip.h"

static void ext2_sync_super(struct super_block *sb,
struct ext2_super_block *es, int wait);
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
deleted file mode 100644
index 66ca113..0000000
--- a/fs/ext2/xip.c
+++ /dev/null
@@ -1,15 +0,0 @@
-/*
- * linux/fs/ext2/xip.c
- *
- * Copyright (C) 2005 IBM Corporation
- * Author: Carsten Otte (***@de.ibm.com)
- */
-
-#include <linux/mm.h>
-#include <linux/fs.h>
-#include <linux/genhd.h>
-#include <linux/buffer_head.h>
-#include <linux/blkdev.h>
-#include "ext2.h"
-#include "xip.h"
-
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
deleted file mode 100644
index 87eeb04..0000000
--- a/fs/ext2/xip.h
+++ /dev/null
@@ -1,16 +0,0 @@
-/*
- * linux/fs/ext2/xip.h
- *
- * Copyright (C) 2005 IBM Corporation
- * Author: Carsten Otte (***@de.ibm.com)
- */
-
-#ifdef CONFIG_EXT2_FS_XIP
-static inline int ext2_use_xip (struct super_block *sb)
-{
- struct ext2_sb_info *sbi = EXT2_SB(sb);
- return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
-}
-#else
-#define ext2_use_xip(sb) 0
-#endif

--
2.1.0


Mathieu Desnoyers

2014-10-16 12:21:15 UTC

Post by Matthew Wilcox
These files are now empty, so delete them

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Key fingerprint: 2A0B 4ED9 15F2 D3FA 45F5 B162 1728 0A97 8118 6ACF


Matthew Wilcox

2014-09-25 20:33:29 UTC

All callers of get_xip_mem() are now gone. Remove checks for it,
initialisers of it, documentation of it and the only implementation of it.
Also remove mm/filemap_xip.c as it is now empty.

Signed-off-by: Matthew Wilcox <***@intel.com>
---
Documentation/filesystems/Locking | 3 ---
fs/exofs/inode.c | 1 -
fs/ext2/inode.c | 1 -
fs/ext2/xip.c | 45 ---------------------------------------
fs/ext2/xip.h | 3 ---
fs/open.c | 5 +----
include/linux/fs.h | 2 --
mm/Makefile | 1 -
mm/fadvise.c | 6 ++++--
mm/filemap_xip.c | 23 --------------------
mm/madvise.c | 2 +-
11 files changed, 6 insertions(+), 86 deletions(-)
delete mode 100644 mm/filemap_xip.c

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index f1997e9..226ccc3 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -197,8 +197,6 @@ prototypes:
int (*releasepage) (struct page *, int);
void (*freepage)(struct page *);
int (*direct_IO)(int, struct kiocb *, struct iov_iter *iter, loff_t offset);
- int (*get_xip_mem)(struct address_space *, pgoff_t, int, void **,
- unsigned long *);
int (*migratepage)(struct address_space *, struct page *, struct page *);
int (*launder_page)(struct page *);
int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
@@ -223,7 +221,6 @@ invalidatepage: yes
releasepage: yes
freepage: yes
direct_IO:
-get_xip_mem: maybe
migratepage: yes (both)
launder_page: yes
is_partially_uptodate: yes
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index 3f9cafd..c408a53 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -985,7 +985,6 @@ const struct address_space_operations exofs_aops = {
.direct_IO = exofs_direct_IO,

/* With these NULL has special meaning or default is not exported */
- .get_xip_mem = NULL,
.migratepage = NULL,
.launder_page = NULL,
.is_partially_uptodate = NULL,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 5ac0a34..59d6c7d 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -894,7 +894,6 @@ const struct address_space_operations ext2_aops = {

const struct address_space_operations ext2_aops_xip = {
.bmap = ext2_bmap,
- .get_xip_mem = ext2_get_xip_mem,
.direct_IO = ext2_direct_IO,
};

diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index 8cfca3a..132d4da 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,35 +13,6 @@
#include "ext2.h"
#include "xip.h"

-static inline long __inode_direct_access(struct inode *inode, sector_t block,
- void **kaddr, unsigned long *pfn, long size)
-{
- struct block_device *bdev = inode->i_sb->s_bdev;
- sector_t sector = block * (PAGE_SIZE / 512);
- return bdev_direct_access(bdev, sector, kaddr, pfn, size);
-}
-
-static inline int
-__ext2_get_block(struct inode *inode, pgoff_t pgoff, int create,
- sector_t *result)
-{
- struct buffer_head tmp;
- int rc;
-
- memset(&tmp, 0, sizeof(struct buffer_head));
- tmp.b_size = 1 << inode->i_blkbits;
- rc = ext2_get_block(inode, pgoff, &tmp, create);
- *result = tmp.b_blocknr;
-
- /* did we get a sparse block (hole in the file)? */
- if (!tmp.b_blocknr && !rc) {
- BUG_ON(create);
- rc = -ENODATA;
- }
-
- return rc;
-}
-
void ext2_xip_verify_sb(struct super_block *sb)
{
struct ext2_sb_info *sbi = EXT2_SB(sb);
@@ -54,19 +25,3 @@ void ext2_xip_verify_sb(struct super_block *sb)
"not supported by bdev");
}
}
-
-int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
- void **kmem, unsigned long *pfn)
-{
- long rc;
- sector_t block;
-
- /* first, retrieve the sector number */
- rc = __ext2_get_block(mapping->host, pgoff, create, &block);
- if (rc)
- return rc;
-
- /* retrieve address of the target data */
- rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE);
- return (rc < 0) ? rc : 0;
-}
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index b2592f2..e7b9f0a 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -12,10 +12,7 @@ static inline int ext2_use_xip (struct super_block *sb)
struct ext2_sb_info *sbi = EXT2_SB(sb);
return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
}
-int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
- void **, unsigned long *);
#else
#define ext2_xip_verify_sb(sb) do { } while (0)
#define ext2_use_xip(sb) 0
-#define ext2_get_xip_mem NULL
#endif
diff --git a/fs/open.c b/fs/open.c
index d6fd3ac..ca68e47 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -655,11 +655,8 @@ int open_check_o_direct(struct file *f)
{
/* NB: we're sure to have correct a_ops only after f_op->open */
if (f->f_flags & O_DIRECT) {
- if (!f->f_mapping->a_ops ||
- ((!f->f_mapping->a_ops->direct_IO) &&
- (!f->f_mapping->a_ops->get_xip_mem))) {
+ if (!f->f_mapping->a_ops || !f->f_mapping->a_ops->direct_IO)
return -EINVAL;
- }
}
return 0;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index eee848d..d73db11 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -349,8 +349,6 @@ struct address_space_operations {
int (*releasepage) (struct page *, gfp_t);
void (*freepage)(struct page *);
ssize_t (*direct_IO)(int, struct kiocb *, struct iov_iter *iter, loff_t offset);
- int (*get_xip_mem)(struct address_space *, pgoff_t, int,
- void **, unsigned long *);
/*
* migrate the contents of a page to the specified target. If
* migrate_mode is MIGRATE_ASYNC, it must not block.
diff --git a/mm/Makefile b/mm/Makefile
index 632ae77..b2c7623 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -47,7 +47,6 @@ obj-$(CONFIG_SLUB) += slub.o
obj-$(CONFIG_KMEMCHECK) += kmemcheck.o
obj-$(CONFIG_FAILSLAB) += failslab.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
-obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 3bcfd81..1f1925f 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -28,6 +28,7 @@
SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
{
struct fd f = fdget(fd);
+ struct inode *inode;
struct address_space *mapping;
struct backing_dev_info *bdi;
loff_t endbyte; /* inclusive */
@@ -39,7 +40,8 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
if (!f.file)
return -EBADF;

- if (S_ISFIFO(file_inode(f.file)->i_mode)) {
+ inode = file_inode(f.file);
+ if (S_ISFIFO(inode->i_mode)) {
ret = -ESPIPE;
goto out;
}
@@ -50,7 +52,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
goto out;
}

- if (mapping->a_ops->get_xip_mem) {
+ if (IS_DAX(inode)) {
switch (advice) {
case POSIX_FADV_NORMAL:
case POSIX_FADV_RANDOM:
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
deleted file mode 100644
index 6316578..0000000
--- a/mm/filemap_xip.c
+++ /dev/null
@@ -1,23 +0,0 @@
-/*
- * linux/mm/filemap_xip.c
- *
- * Copyright (C) 2005 IBM Corporation
- * Author: Carsten Otte <***@de.ibm.com>
- *
- * derived from linux/mm/filemap.c - Copyright (C) Linus Torvalds
- *
- */
-
-#include <linux/fs.h>
-#include <linux/pagemap.h>
-#include <linux/export.h>
-#include <linux/uio.h>
-#include <linux/rmap.h>
-#include <linux/mmu_notifier.h>
-#include <linux/sched.h>
-#include <linux/seqlock.h>
-#include <linux/mutex.h>
-#include <linux/gfp.h>
-#include <asm/tlbflush.h>
-#include <asm/io.h>
-
diff --git a/mm/madvise.c b/mm/madvise.c
index 0938b30..1611ebf 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -236,7 +236,7 @@ static long madvise_willneed(struct vm_area_struct *vma,
if (!file)
return -EBADF;

- if (file->f_mapping->a_ops->get_xip_mem) {
+ if (IS_DAX(file_inode(file))) {
/* no bad return value, but ignore advice */
return 0;
}

--
2.1.0


Mathieu Desnoyers

2014-10-16 12:14:46 UTC

Post by Matthew Wilcox
All callers of get_xip_mem() are now gone. Remove checks for it,
initialisers of it, documentation of it and the only implementation of it.
Also remove mm/filemap_xip.c as it is now empty.
---
Documentation/filesystems/Locking | 3 ---
fs/exofs/inode.c | 1 -
fs/ext2/inode.c | 1 -
fs/ext2/xip.c | 45 ---------------------------------------
fs/ext2/xip.h | 3 ---
fs/open.c | 5 +----
include/linux/fs.h | 2 --
mm/Makefile | 1 -
mm/fadvise.c | 6 ++++--
mm/filemap_xip.c | 23 --------------------
mm/madvise.c | 2 +-
11 files changed, 6 insertions(+), 86 deletions(-)
delete mode 100644 mm/filemap_xip.c
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index f1997e9..226ccc3 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
int (*releasepage) (struct page *, int);
void (*freepage)(struct page *);
int (*direct_IO)(int, struct kiocb *, struct iov_iter *iter, loff_t offset);
- int (*get_xip_mem)(struct address_space *, pgoff_t, int, void **,
- unsigned long *);
int (*migratepage)(struct address_space *, struct page *, struct page *);
int (*launder_page)(struct page *);
int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
@@ -223,7 +221,6 @@ invalidatepage: yes
releasepage: yes
freepage: yes
-get_xip_mem: maybe
migratepage: yes (both)
launder_page: yes
is_partially_uptodate: yes
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index 3f9cafd..c408a53 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -985,7 +985,6 @@ const struct address_space_operations exofs_aops = {
.direct_IO = exofs_direct_IO,
/* With these NULL has special meaning or default is not exported */
- .get_xip_mem = NULL,
.migratepage = NULL,
.launder_page = NULL,
.is_partially_uptodate = NULL,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 5ac0a34..59d6c7d 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -894,7 +894,6 @@ const struct address_space_operations ext2_aops = {
const struct address_space_operations ext2_aops_xip = {
.bmap = ext2_bmap,
- .get_xip_mem = ext2_get_xip_mem,
.direct_IO = ext2_direct_IO,
};
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index 8cfca3a..132d4da 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,35 +13,6 @@
#include "ext2.h"
#include "xip.h"
-static inline long __inode_direct_access(struct inode *inode, sector_t block,
- void **kaddr, unsigned long *pfn, long size)
-{
- struct block_device *bdev = inode->i_sb->s_bdev;
- sector_t sector = block * (PAGE_SIZE / 512);
- return bdev_direct_access(bdev, sector, kaddr, pfn, size);
-}
-
-static inline int
-__ext2_get_block(struct inode *inode, pgoff_t pgoff, int create,
- sector_t *result)
-{
- struct buffer_head tmp;
- int rc;
-
- memset(&tmp, 0, sizeof(struct buffer_head));
- tmp.b_size = 1 << inode->i_blkbits;
- rc = ext2_get_block(inode, pgoff, &tmp, create);
- *result = tmp.b_blocknr;
-
- /* did we get a sparse block (hole in the file)? */
- if (!tmp.b_blocknr && !rc) {
- BUG_ON(create);
- rc = -ENODATA;
- }
-
- return rc;
-}
-
void ext2_xip_verify_sb(struct super_block *sb)
{
struct ext2_sb_info *sbi = EXT2_SB(sb);
@@ -54,19 +25,3 @@ void ext2_xip_verify_sb(struct super_block *sb)
"not supported by bdev");
}
}
-
-int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
- void **kmem, unsigned long *pfn)
-{
- long rc;
- sector_t block;
-
- /* first, retrieve the sector number */
- rc = __ext2_get_block(mapping->host, pgoff, create, &block);
- if (rc)
- return rc;
-
- /* retrieve address of the target data */
- rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE);
- return (rc < 0) ? rc : 0;
-}
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index b2592f2..e7b9f0a 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -12,10 +12,7 @@ static inline int ext2_use_xip (struct super_block *sb)
struct ext2_sb_info *sbi = EXT2_SB(sb);
return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
}
-int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
- void **, unsigned long *);
#else
#define ext2_xip_verify_sb(sb) do { } while (0)
#define ext2_use_xip(sb) 0
-#define ext2_get_xip_mem NULL
#endif
diff --git a/fs/open.c b/fs/open.c
index d6fd3ac..ca68e47 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -655,11 +655,8 @@ int open_check_o_direct(struct file *f)
{
/* NB: we're sure to have correct a_ops only after f_op->open */
if (f->f_flags & O_DIRECT) {
- if (!f->f_mapping->a_ops ||
- ((!f->f_mapping->a_ops->direct_IO) &&
- (!f->f_mapping->a_ops->get_xip_mem))) {
+ if (!f->f_mapping->a_ops || !f->f_mapping->a_ops->direct_IO)

Why is it OK to remove the check for the get_xip_mem callback here, rather
than replacing it with an IS_DAX() check like the rest of this patch does?
I'm probably missing something.

Thanks,

Mathieu

Post by Matthew Wilcox
return -EINVAL;
- }
}
return 0;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index eee848d..d73db11 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -349,8 +349,6 @@ struct address_space_operations {
int (*releasepage) (struct page *, gfp_t);
void (*freepage)(struct page *);
ssize_t (*direct_IO)(int, struct kiocb *, struct iov_iter *iter, loff_t offset);
- int (*get_xip_mem)(struct address_space *, pgoff_t, int,
- void **, unsigned long *);
/*
* migrate the contents of a page to the specified target. If
* migrate_mode is MIGRATE_ASYNC, it must not block.
diff --git a/mm/Makefile b/mm/Makefile
index 632ae77..b2c7623 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -47,7 +47,6 @@ obj-$(CONFIG_SLUB) += slub.o
obj-$(CONFIG_KMEMCHECK) += kmemcheck.o
obj-$(CONFIG_FAILSLAB) += failslab.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
-obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 3bcfd81..1f1925f 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -28,6 +28,7 @@
SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
{
struct fd f = fdget(fd);
+ struct inode *inode;
struct address_space *mapping;
struct backing_dev_info *bdi;
loff_t endbyte; /* inclusive */
@@ -39,7 +40,8 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
if (!f.file)
return -EBADF;
- if (S_ISFIFO(file_inode(f.file)->i_mode)) {
+ inode = file_inode(f.file);
+ if (S_ISFIFO(inode->i_mode)) {
ret = -ESPIPE;
goto out;
}
@@ -50,7 +52,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
goto out;
}
- if (mapping->a_ops->get_xip_mem) {
+ if (IS_DAX(inode)) {
switch (advice) {
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
deleted file mode 100644
index 6316578..0000000
--- a/mm/filemap_xip.c
+++ /dev/null
@@ -1,23 +0,0 @@
-/*
- * linux/mm/filemap_xip.c
- *
- * Copyright (C) 2005 IBM Corporation
- *
- * derived from linux/mm/filemap.c - Copyright (C) Linus Torvalds
- *
- */
-
-#include <linux/fs.h>
-#include <linux/pagemap.h>
-#include <linux/export.h>
-#include <linux/uio.h>
-#include <linux/rmap.h>
-#include <linux/mmu_notifier.h>
-#include <linux/sched.h>
-#include <linux/seqlock.h>
-#include <linux/mutex.h>
-#include <linux/gfp.h>
-#include <asm/tlbflush.h>
-#include <asm/io.h>
-
diff --git a/mm/madvise.c b/mm/madvise.c
index 0938b30..1611ebf 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -236,7 +236,7 @@ static long madvise_willneed(struct vm_area_struct *vma,
if (!file)
return -EBADF;
- if (file->f_mapping->a_ops->get_xip_mem) {
+ if (IS_DAX(file_inode(file))) {
/* no bad return value, but ignore advice */
return 0;
}
--
2.1.0

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Key fingerprint: 2A0B 4ED9 15F2 D3FA 45F5 B162 1728 0A97 8118 6ACF
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Matthew Wilcox

2014-10-16 21:44:10 UTC

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+++ b/fs/open.c
@@ -655,11 +655,8 @@ int open_check_o_direct(struct file *f)
{
/* NB: we're sure to have correct a_ops only after f_op->open */
if (f->f_flags & O_DIRECT) {
- if (!f->f_mapping->a_ops ||
- ((!f->f_mapping->a_ops->direct_IO) &&
- (!f->f_mapping->a_ops->get_xip_mem))) {
+ if (!f->f_mapping->a_ops || !f->f_mapping->a_ops->direct_IO)

Why is it OK to remove the check for the get_xip_mem callback here, rather
than replacing it with an IS_DAX() check like the rest of this patch does?
I'm probably missing something.

XIP used to intercept I/Os by having the filesystem's ->read & ->write
methods call xip_file_read() or xip_file_write(). Those did the I/O
themselves, so there was no need for a ->direct_IO entry in a_ops. For
DAX, we use the generic VFS code to call back into the filesystem's
->direct_IO entry point, so the check above for ->direct_IO now covers
both regular O_DIRECT and DAX support.

Or to put it another way, DAX now requires that the filesystem support
O_DIRECT. Which is pretty much the way it has to be anyway, since DAX
is direct!
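
To make the simplified check concrete, here is a minimal userspace sketch of
the post-patch logic. It is not the kernel code: struct model_aops, struct
model_file, model_direct_io() and model_open_check_o_direct() are hypothetical
stand-ins for a_ops, struct file, a filesystem's ->direct_IO method and
open_check_o_direct(). The only behaviour taken from the thread is that an
O_DIRECT open is rejected when ->direct_IO is missing.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct address_space_operations. */
struct model_aops {
	int (*direct_IO)(void);	/* non-NULL when O_DIRECT (and so DAX) is supported */
};

/* Hypothetical stand-in for struct file. */
struct model_file {
	bool o_direct;			/* models f->f_flags & O_DIRECT */
	const struct model_aops *a_ops;	/* models f->f_mapping->a_ops */
};

/* Placeholder ->direct_IO method; a real filesystem would do the I/O here. */
int model_direct_io(void)
{
	return 0;
}

/*
 * After the patch only ->direct_IO is consulted; there is no separate
 * get_xip_mem test because DAX I/O also goes through ->direct_IO.
 */
int model_open_check_o_direct(const struct model_file *f)
{
	assert(f != NULL);
	if (f->o_direct) {
		if (!f->a_ops || !f->a_ops->direct_IO)
			return -EINVAL;
	}
	return 0;
}
```

In this model a filesystem that wants DAX has to provide direct_IO, which
matches the statement above that DAX now requires O_DIRECT support.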

Matthew Wilcox

2014-09-25 20:33:28 UTC

From: Matthew Wilcox <***@linux.intel.com>

Based on the original XIP documentation, this documents the current
state of affairs, and includes instructions on how users can enable DAX
if their devices and kernel support it.

Signed-off-by: Matthew Wilcox <***@linux.intel.com>
Reviewed-by: Randy Dunlap <***@infradead.org>
---
Documentation/filesystems/dax.txt | 89 +++++++++++++++++++++++++++++++++++++++
Documentation/filesystems/xip.txt | 71 -------------------------------
2 files changed, 89 insertions(+), 71 deletions(-)
create mode 100644 Documentation/filesystems/dax.txt
delete mode 100644 Documentation/filesystems/xip.txt

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
new file mode 100644
index 0000000..635adaa
--- /dev/null
+++ b/Documentation/filesystems/dax.txt
@@ -0,0 +1,89 @@
+Direct Access for files
+-----------------------
+
+Motivation
+----------
+
+The page cache is usually used to buffer reads and writes to files.
+It is also used to provide the pages which are mapped into userspace
+by a call to mmap.
+
+For block devices that are memory-like, the page cache pages would be
+unnecessary copies of the original storage. The DAX code removes the
+extra copy by performing reads and writes directly to the storage device.
+For file mappings, the storage device is mapped directly into userspace.
+
+
+Usage
+-----
+
+If you have a block device which supports DAX, you can make a filesystem
+on it as usual. When mounting it, use the -o dax option manually
+or add 'dax' to the options in /etc/fstab.
+
+
+Implementation Tips for Block Driver Writers
+--------------------------------------------
+
+To support DAX in your block driver, implement the 'direct_access'
+block device operation. It is used to translate the sector number
+(expressed in units of 512-byte sectors) to a page frame number (pfn)
+that identifies the physical page for the memory. It also returns a
+kernel virtual address that can be used to access the memory.
+
+The direct_access method takes a 'size' parameter that indicates the
+number of bytes being requested. The function should return the number
+of bytes that can be contiguously accessed at that offset. It may also
+return a negative errno if an error occurs.
+
+In order to support this method, the storage must be byte-accessible by
+the CPU at all times. If your device uses paging techniques to expose
+a large amount of memory through a smaller window, then you cannot
+implement direct_access. Equally, if your device can occasionally
+stall the CPU for an extended period, you should also not attempt to
+implement direct_access.
+
+These block devices may be used for inspiration:
+- axonram: Axon DDR2 device driver
+- brd: RAM backed block device driver
+- dcssblk: s390 dcss block device driver
+
+
+Implementation Tips for Filesystem Writers
+------------------------------------------
+
+Filesystem support consists of
+- adding support to mark inodes as being DAX by setting the S_DAX flag in
+ i_flags
+- implementing the direct_IO address space operation, and calling
+ dax_do_io() instead of blockdev_direct_IO() if S_DAX is set
+- implementing an mmap file operation for DAX files which sets the
+ VM_MIXEDMAP flag on the VMA, and setting the vm_ops to include handlers
+ for fault and page_mkwrite (which should probably call dax_fault() and
+ dax_mkwrite(), passing the appropriate get_block() callback)
+- calling dax_truncate_page() instead of block_truncate_page() for DAX files
+- ensuring that there is sufficient locking between reads, writes,
+ truncates and page faults
+
+The get_block() callback passed to the DAX functions may return
+uninitialised extents. If it does, it must ensure that simultaneous
+calls to get_block() (for example by a page-fault racing with a read()
+or a write()) work correctly.
+
+These filesystems may be used for inspiration:
+- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
+
+
+Shortcomings
+------------
+
+Even if the kernel or its modules are stored on a filesystem that supports
+DAX on a block device that supports DAX, they will still be copied into RAM.
+
+Calling get_user_pages() on a range of user memory that has been mmaped
+from a DAX file will fail as there are no 'struct page' to describe
+those pages. This problem is being worked on. That means that O_DIRECT
+reads/writes to those memory ranges from a non-DAX file will fail (note
+that O_DIRECT reads/writes _of a DAX file_ do work, it is the memory
+that is being accessed that is key here). Other things that will not
+work include RDMA, sendfile() and splice().
diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
deleted file mode 100644
index b774729..0000000
--- a/Documentation/filesystems/xip.txt
+++ /dev/null
@@ -1,71 +0,0 @@
-Execute-in-place for file mappings
-----------------------------------
-
-Motivation
-----------
-File mappings are performed by mapping page cache pages to userspace. In
-addition, read&write type file operations also transfer data from/to the page
-cache.
-
-For memory backed storage devices that use the block device interface, the page
-cache pages are in fact copies of the original storage. Various approaches
-exist to work around the need for an extra copy. The ramdisk driver for example
-does read the data into the page cache, keeps a reference, and discards the
-original data behind later on.
-
-Execute-in-place solves this issue the other way around: instead of keeping
-data in the page cache, the need to have a page cache copy is eliminated
-completely. With execute-in-place, read&write type operations are performed
-directly from/to the memory backed storage device. For file mappings, the
-storage device itself is mapped directly into userspace.
-
-This implementation was initially written for shared memory segments between
-different virtual machines on s390 hardware to allow multiple machines to
-share the same binaries and libraries.
-
-Implementation
---------------
-Execute-in-place is implemented in three steps: block device operation,
-address space operation, and file operations.
-
-A block device operation named direct_access is used to translate the
-block device sector number to a page frame number (pfn) that identifies
-the physical page for the memory. It also returns a kernel virtual
-address that can be used to access the memory.
-
-The direct_access method takes a 'size' parameter that indicates the
-number of bytes being requested. The function should return the number
-of bytes that can be contiguously accessed at that offset. It may also
-return a negative errno if an error occurs.
-
-The block device operation is optional, these block devices support it as of
-today:
-- dcssblk: s390 dcss block device driver
-
-An address space operation named get_xip_mem is used to retrieve references
-to a page frame number and a kernel address. To obtain these values a reference
-to an address_space is provided. This function assigns values to the kmem and
-pfn parameters. The third argument indicates whether the function should allocate
-blocks if needed.
-
-This address space operation is mutually exclusive with readpage&writepage that
-do page cache read/write operations.
-The following filesystems support it as of today:
-- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
-
-A set of file operations that do utilize get_xip_page can be found in
-mm/filemap_xip.c . The following file operation implementations are provided:
-- aio_read/aio_write
-- readv/writev
-- sendfile
-
-The generic file operations do_sync_read/do_sync_write can be used to implement
-classic synchronous IO calls.
-
-Shortcomings
-------------
-This implementation is limited to storage devices that are cpu addressable at
-all times (no highmem or such). It works well on rom/ram, but enhancements are
-needed to make it work with flash in read+write mode.
-Putting the Linux kernel and/or its modules on a xip filesystem does not mean
-they are not copied.

--
2.1.0
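
The direct_access contract described in dax.txt above (translate a 512-byte
sector to a kernel address and a pfn, and return how many bytes are
contiguously accessible) can be illustrated with a small userspace model.
This is a sketch under stated assumptions, not a driver: struct model_bdev,
model_direct_access(), the 4 KiB page size, the page-alignment check, the
capping of the result at the requested size and the specific errno values
are all choices made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MODEL_SECTOR_SHIFT 9	/* 512-byte sectors, as in dax.txt */
#define MODEL_PAGE_SHIFT   12	/* assumption: 4 KiB pages */

/* Hypothetical memory-like device: one flat, always-mapped region. */
struct model_bdev {
	uintptr_t phys_base;	/* assumed physical base address of the region */
	char *virt_base;	/* CPU-visible mapping of the same region */
	uint64_t size;		/* region size in bytes */
};

/*
 * Returns the number of bytes contiguously accessible at 'sector'
 * (capped at 'size'), or a negative errno. The errno choices are
 * illustrative only.
 */
long model_direct_access(const struct model_bdev *dev, uint64_t sector,
			 void **kaddr, unsigned long *pfn, long size)
{
	uint64_t offset = sector << MODEL_SECTOR_SHIFT;
	uint64_t remaining;

	assert(dev && kaddr && pfn);
	if (size < 0 || offset >= dev->size)
		return -ERANGE;
	/* A pfn names a whole page, so require a page-aligned offset. */
	if (offset & ((UINT64_C(1) << MODEL_PAGE_SHIFT) - 1))
		return -EINVAL;

	*kaddr = dev->virt_base + offset;
	*pfn = (unsigned long)((dev->phys_base + offset) >> MODEL_PAGE_SHIFT);

	remaining = dev->size - offset;
	return (long)(remaining < (uint64_t)size ? remaining : (uint64_t)size);
}
```

Note that the pfn is derived from the physical address of the memory itself,
never from the address of the kaddr output pointer, which is the mistake the
axonram fix in patch 01/21 corrects.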


Mathieu Desnoyers

2014-10-16 12:08:20 UTC

Post by Matthew Wilcox
Based on the original XIP documentation, this documents the current
state of affairs, and includes instructions on how users can enable DAX
if their devices and kernel support it.
---
Documentation/filesystems/dax.txt | 89 +++++++++++++++++++++++++++++++++++++++
Documentation/filesystems/xip.txt | 71 -------------------------------
2 files changed, 89 insertions(+), 71 deletions(-)
create mode 100644 Documentation/filesystems/dax.txt
delete mode 100644 Documentation/filesystems/xip.txt
diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
new file mode 100644
index 0000000..635adaa
--- /dev/null
+++ b/Documentation/filesystems/dax.txt
@@ -0,0 +1,89 @@
+Direct Access for files
+-----------------------
+
+Motivation
+----------
+
+The page cache is usually used to buffer reads and writes to files.
+It is also used to provide the pages which are mapped into userspace
+by a call to mmap.
+
+For block devices that are memory-like, the page cache pages would be
+unnecessary copies of the original storage. The DAX code removes the
+extra copy by performing reads and writes directly to the storage device.
+For file mappings, the storage device is mapped directly into userspace.
+
+
+Usage
+-----
+
+If you have a block device which supports DAX, you can make a filesystem
+on it as usual. When mounting it, use the -o dax option manually
+or add 'dax' to the options in /etc/fstab.
+
+
+Implementation Tips for Block Driver Writers
+--------------------------------------------
+
+To support DAX in your block driver, implement the 'direct_access'
+block device operation. It is used to translate the sector number
+(expressed in units of 512-byte sectors) to a page frame number (pfn)
+that identifies the physical page for the memory. It also returns a
+kernel virtual address that can be used to access the memory.
+
+The direct_access method takes a 'size' parameter that indicates the
+number of bytes being requested. The function should return the number
+of bytes that can be contiguously accessed at that offset. It may also
+return a negative errno if an error occurs.
+
+In order to support this method, the storage must be byte-accessible by
+the CPU at all times. If your device uses paging techniques to expose
+a large amount of memory through a smaller window, then you cannot
+implement direct_access. Equally, if your device can occasionally
+stall the CPU for an extended period, you should also not attempt to
+implement direct_access.
+
+These block devices may be used for inspiration:
+- axonram: Axon DDR2 device driver
+- brd: RAM backed block device driver
+- dcssblk: s390 dcss block device driver

Perhaps we should add some quickstart info on how to use this with
memory reserved by a boot parameter, and also with a pstore-style
memory area (if that applies)? Expanding the info on how to use this
with the RAM-backed block device driver might encourage more people to
try it out and test it.

Walking users through the initial step of checking whether their BIOS
resets memory upon soft reboots might be good to have here too.

Thanks,

Mathieu


Matthew Wilcox

2014-09-25 20:33:26 UTC

Instead of calling aops->get_xip_mem from the fault handler, the
filesystem passes a get_block_t that is used to find the appropriate
blocks.

Signed-off-by: Matthew Wilcox <***@intel.com>
Reviewed-by: Jan Kara <***@suse.cz>
---
fs/dax.c | 232 +++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/ext2/file.c | 35 +++++++-
include/linux/fs.h | 4 +-
mm/filemap_xip.c | 206 -----------------------------------------------
4 files changed, 268 insertions(+), 209 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 02e226f..ac5d3a6 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -19,9 +19,13 @@
#include <linux/buffer_head.h>
#include <linux/fs.h>
#include <linux/genhd.h>
+#include <linux/highmem.h>
+#include <linux/memcontrol.h>
+#include <linux/mm.h>
#include <linux/mutex.h>
#include <linux/sched.h>
#include <linux/uio.h>
+#include <linux/vmstat.h>

int dax_clear_blocks(struct inode *inode, sector_t block, long size)
{
@@ -228,3 +232,231 @@ ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
return retval;
}
EXPORT_SYMBOL_GPL(dax_do_io);
+
+/*
+ * The user has performed a load from a hole in the file. Allocating
+ * a new page in the file would cause excessive storage usage for
+ * workloads with sparse files. We allocate a page cache page instead.
+ * We'll kick it out of the page cache if it's ever written to,
+ * otherwise it will simply fall out of the page cache under memory
+ * pressure without ever having been dirtied.
+ */
+static int dax_load_hole(struct address_space *mapping, struct page *page,
+ struct vm_fault *vmf)
+{
+ unsigned long size;
+ struct inode *inode = mapping->host;
+ if (!page)
+ page = find_or_create_page(mapping, vmf->pgoff,
+ GFP_KERNEL | __GFP_ZERO);
+ if (!page)
+ return VM_FAULT_OOM;
+ /* Recheck i_size under page lock to avoid truncate race */
+ size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ if (vmf->pgoff >= size) {
+ unlock_page(page);
+ page_cache_release(page);
+ return VM_FAULT_SIGBUS;
+ }
+
+ vmf->page = page;
+ return VM_FAULT_LOCKED;
+}
+
+static int copy_user_bh(struct page *to, struct buffer_head *bh,
+ unsigned blkbits, unsigned long vaddr)
+{
+ void *vfrom, *vto;
+ if (dax_get_addr(bh, &vfrom, blkbits) < 0)
+ return -EIO;
+ vto = kmap_atomic(to);
+ copy_user_page(vto, vfrom, vaddr, to);
+ kunmap_atomic(vto);
+ return 0;
+}
+
+static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
+ struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ struct address_space *mapping = inode->i_mapping;
+ sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
+ unsigned long vaddr = (unsigned long)vmf->virtual_address;
+ void *addr;
+ unsigned long pfn;
+ pgoff_t size;
+ int error;
+
+ mutex_lock(&mapping->i_mmap_mutex);
+
+ /*
+ * Check truncate didn't happen while we were allocating a block.
+ * If it did, this block may or may not be still allocated to the
+ * file. We can't tell the filesystem to free it because we can't
+ * take i_mutex here. In the worst case, the file still has blocks
+ * allocated past the end of the file.
+ */
+ size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ if (unlikely(vmf->pgoff >= size)) {
+ error = -EIO;
+ goto out;
+ }
+
+ error = bdev_direct_access(bh->b_bdev, sector, &addr, &pfn, bh->b_size);
+ if (error < 0)
+ goto out;
+ if (error < PAGE_SIZE) {
+ error = -EIO;
+ goto out;
+ }
+
+ if (buffer_unwritten(bh) || buffer_new(bh))
+ clear_page(addr);
+
+ error = vm_insert_mixed(vma, vaddr, pfn);
+
+ out:
+ mutex_unlock(&mapping->i_mmap_mutex);
+
+ if (bh->b_end_io)
+ bh->b_end_io(bh, 1);
+
+ return error;
+}
+
+static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+ get_block_t get_block)
+{
+ struct file *file = vma->vm_file;
+ struct inode *inode = file_inode(file);
+ struct address_space *mapping = file->f_mapping;
+ struct page *page;
+ struct buffer_head bh;
+ unsigned long vaddr = (unsigned long)vmf->virtual_address;
+ unsigned blkbits = inode->i_blkbits;
+ sector_t block;
+ pgoff_t size;
+ int error;
+ int major = 0;
+
+ size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ if (vmf->pgoff >= size)
+ return VM_FAULT_SIGBUS;
+
+ memset(&bh, 0, sizeof(bh));
+ block = (sector_t)vmf->pgoff << (PAGE_SHIFT - blkbits);
+ bh.b_size = PAGE_SIZE;
+
+ repeat:
+ page = find_get_page(mapping, vmf->pgoff);
+ if (page) {
+ if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
+ page_cache_release(page);
+ return VM_FAULT_RETRY;
+ }
+ if (unlikely(page->mapping != mapping)) {
+ unlock_page(page);
+ page_cache_release(page);
+ goto repeat;
+ }
+ }
+
+ error = get_block(inode, block, &bh, 0);
+ if (!error && (bh.b_size < PAGE_SIZE))
+ error = -EIO;
+ if (error)
+ goto unlock_page;
+
+ if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page) {
+ if (vmf->flags & FAULT_FLAG_WRITE) {
+ error = get_block(inode, block, &bh, 1);
+ count_vm_event(PGMAJFAULT);
+ mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
+ major = VM_FAULT_MAJOR;
+ if (!error && (bh.b_size < PAGE_SIZE))
+ error = -EIO;
+ if (error)
+ goto unlock_page;
+ } else {
+ return dax_load_hole(mapping, page, vmf);
+ }
+ }
+
+ if (vmf->cow_page) {
+ struct page *new_page = vmf->cow_page;
+ if (buffer_written(&bh))
+ error = copy_user_bh(new_page, &bh, blkbits, vaddr);
+ else
+ clear_user_highpage(new_page, vaddr);
+ if (error)
+ goto unlock_page;
+ vmf->page = page;
+ if (!page) {
+ mutex_lock(&mapping->i_mmap_mutex);
+ /* Check we didn't race with truncate */
+ size = (i_size_read(inode) + PAGE_SIZE - 1) >>
+ PAGE_SHIFT;
+ if (vmf->pgoff >= size) {
+ mutex_unlock(&mapping->i_mmap_mutex);
+ error = -EIO;
+ goto out;
+ }
+ }
+ return VM_FAULT_LOCKED;
+ }
+
+ /* Check we didn't race with a read fault installing a new page */
+ if (!page && major)
+ page = find_lock_page(mapping, vmf->pgoff);
+
+ if (page) {
+ unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
+ PAGE_CACHE_SIZE, 0);
+ delete_from_page_cache(page);
+ unlock_page(page);
+ page_cache_release(page);
+ }
+
+ error = dax_insert_mapping(inode, &bh, vma, vmf);
+
+ out:
+ if (error == -ENOMEM)
+ return VM_FAULT_OOM | major;
+ /* -EBUSY is fine, somebody else faulted on the same PTE */
+ if ((error < 0) && (error != -EBUSY))
+ return VM_FAULT_SIGBUS | major;
+ return VM_FAULT_NOPAGE | major;
+
+ unlock_page:
+ if (page) {
+ unlock_page(page);
+ page_cache_release(page);
+ }
+ goto out;
+}
+
+/**
+ * dax_fault - handle a page fault on a DAX file
+ * @vma: The virtual memory area where the fault occurred
+ * @vmf: The description of the fault
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * When a page fault occurs, filesystems may call this helper in their
+ * fault handler for DAX files.
+ */
+int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+ get_block_t get_block)
+{
+ int result;
+ struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+
+ if (vmf->flags & FAULT_FLAG_WRITE) {
+ sb_start_pagefault(sb);
+ file_update_time(vma->vm_file);
+ }
+ result = do_dax_fault(vma, vmf, get_block);
+ if (vmf->flags & FAULT_FLAG_WRITE)
+ sb_end_pagefault(sb);
+
+ return result;
+}
+EXPORT_SYMBOL_GPL(dax_fault);
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index a247123..da8dc64 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -25,6 +25,37 @@
#include "xattr.h"
#include "acl.h"

+#ifdef CONFIG_EXT2_FS_XIP
+static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ return dax_fault(vma, vmf, ext2_get_block);
+}
+
+static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ return dax_mkwrite(vma, vmf, ext2_get_block);
+}
+
+static const struct vm_operations_struct ext2_dax_vm_ops = {
+ .fault = ext2_dax_fault,
+ .page_mkwrite = ext2_dax_mkwrite,
+ .remap_pages = generic_file_remap_pages,
+};
+
+static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ if (!IS_DAX(file_inode(file)))
+ return generic_file_mmap(file, vma);
+
+ file_accessed(file);
+ vma->vm_ops = &ext2_dax_vm_ops;
+ vma->vm_flags |= VM_MIXEDMAP;
+ return 0;
+}
+#else
+#define ext2_file_mmap generic_file_mmap
+#endif
+
/*
* Called when filp is released. This happens when all file descriptors
* for a single struct file are closed. Note that different open() calls
@@ -70,7 +101,7 @@ const struct file_operations ext2_file_operations = {
#ifdef CONFIG_COMPAT
.compat_ioctl = ext2_compat_ioctl,
#endif
- .mmap = generic_file_mmap,
+ .mmap = ext2_file_mmap,
.open = dquot_file_open,
.release = ext2_release_file,
.fsync = ext2_fsync,
@@ -89,7 +120,7 @@ const struct file_operations ext2_xip_file_operations = {
#ifdef CONFIG_COMPAT
.compat_ioctl = ext2_compat_ioctl,
#endif
- .mmap = xip_file_mmap,
+ .mmap = ext2_file_mmap,
.open = dquot_file_open,
.release = ext2_release_file,
.fsync = ext2_fsync,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c04d371..338f04b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -49,6 +49,7 @@ struct swap_info_struct;
struct seq_file;
struct workqueue_struct;
struct iov_iter;
+struct vm_fault;

extern void __init inode_init(void);
extern void __init inode_init_early(void);
@@ -2491,10 +2492,11 @@ extern int nonseekable_open(struct inode * inode, struct file * filp);

#ifdef CONFIG_FS_XIP
int dax_clear_blocks(struct inode *, sector_t block, long size);
-extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
extern int xip_truncate_page(struct address_space *mapping, loff_t from);
ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
loff_t, get_block_t, dio_iodone_t, int flags);
+int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
+#define dax_mkwrite(vma, vmf, gb) dax_fault(vma, vmf, gb)
#else
static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
{
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index f7c37a1..9dd45f3 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -22,212 +22,6 @@
#include <asm/io.h>

/*
- * We do use our own empty page to avoid interference with other users
- * of ZERO_PAGE(), such as /dev/zero
- */
-static DEFINE_MUTEX(xip_sparse_mutex);
-static seqcount_t xip_sparse_seq = SEQCNT_ZERO(xip_sparse_seq);
-static struct page *__xip_sparse_page;
-
-/* called under xip_sparse_mutex */
-static struct page *xip_sparse_page(void)
-{
- if (!__xip_sparse_page) {
- struct page *page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
-
- if (page)
- __xip_sparse_page = page;
- }
- return __xip_sparse_page;
-}
-
-/*
- * __xip_unmap is invoked from xip_unmap and
- * xip_write
- *
- * This function walks all vmas of the address_space and unmaps the
- * __xip_sparse_page when found at pgoff.
- */
-static void
-__xip_unmap (struct address_space * mapping,
- unsigned long pgoff)
-{
- struct vm_area_struct *vma;
- struct mm_struct *mm;
- unsigned long address;
- pte_t *pte;
- pte_t pteval;
- spinlock_t *ptl;
- struct page *page;
- unsigned count;
- int locked = 0;
-
- count = read_seqcount_begin(&xip_sparse_seq);
-
- page = __xip_sparse_page;
- if (!page)
- return;
-
-retry:
- mutex_lock(&mapping->i_mmap_mutex);
- vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
- mm = vma->vm_mm;
- address = vma->vm_start +
- ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
- BUG_ON(address < vma->vm_start || address >= vma->vm_end);
- pte = page_check_address(page, mm, address, &ptl, 1);
- if (pte) {
- /* Nuke the page table entry. */
- flush_cache_page(vma, address, pte_pfn(*pte));
- pteval = ptep_clear_flush(vma, address, pte);
- page_remove_rmap(page);
- dec_mm_counter(mm, MM_FILEPAGES);
- BUG_ON(pte_dirty(pteval));
- pte_unmap_unlock(pte, ptl);
- /* must invalidate_page _before_ freeing the page */
- mmu_notifier_invalidate_page(mm, address);
- page_cache_release(page);
- }
- }
- mutex_unlock(&mapping->i_mmap_mutex);
-
- if (locked) {
- mutex_unlock(&xip_sparse_mutex);
- } else if (read_seqcount_retry(&xip_sparse_seq, count)) {
- mutex_lock(&xip_sparse_mutex);
- locked = 1;
- goto retry;
- }
-}
-
-/*
- * xip_fault() is invoked via the vma operations vector for a
- * mapped memory region to read in file data during a page fault.
- *
- * This function is derived from filemap_fault, but used for execute in place
- */
-static int xip_file_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
-{
- struct file *file = vma->vm_file;
- struct address_space *mapping = file->f_mapping;
- struct inode *inode = mapping->host;
- pgoff_t size;
- void *xip_mem;
- unsigned long xip_pfn;
- struct page *page;
- int error;
-
- /* XXX: are VM_FAULT_ codes OK? */
-again:
- size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
- if (vmf->pgoff >= size)
- return VM_FAULT_SIGBUS;
-
- error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 0,
- &xip_mem, &xip_pfn);
- if (likely(!error))
- goto found;
- if (error != -ENODATA)
- return VM_FAULT_OOM;
-
- /* sparse block */
- if ((vma->vm_flags & (VM_WRITE | VM_MAYWRITE)) &&
- (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) &&
- (!(mapping->host->i_sb->s_flags & MS_RDONLY))) {
- int err;
-
- /* maybe shared writable, allocate new block */
- mutex_lock(&xip_sparse_mutex);
- error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 1,
- &xip_mem, &xip_pfn);
- mutex_unlock(&xip_sparse_mutex);
- if (error)
- return VM_FAULT_SIGBUS;
- /* unmap sparse mappings at pgoff from all other vmas */
- __xip_unmap(mapping, vmf->pgoff);
-
-found:
- /* We must recheck i_size under i_mmap_mutex */
- mutex_lock(&mapping->i_mmap_mutex);
- size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
- PAGE_CACHE_SHIFT;
- if (unlikely(vmf->pgoff >= size)) {
- mutex_unlock(&mapping->i_mmap_mutex);
- return VM_FAULT_SIGBUS;
- }
- err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
- xip_pfn);
- mutex_unlock(&mapping->i_mmap_mutex);
- if (err == -ENOMEM)
- return VM_FAULT_OOM;
- /*
- * err == -EBUSY is fine, we've raced against another thread
- * that faulted-in the same page
- */
- if (err != -EBUSY)
- BUG_ON(err);
- return VM_FAULT_NOPAGE;
- } else {
- int err, ret = VM_FAULT_OOM;
-
- mutex_lock(&xip_sparse_mutex);
- write_seqcount_begin(&xip_sparse_seq);
- error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 0,
- &xip_mem, &xip_pfn);
- if (unlikely(!error)) {
- write_seqcount_end(&xip_sparse_seq);
- mutex_unlock(&xip_sparse_mutex);
- goto again;
- }
- if (error != -ENODATA)
- goto out;
-
- /* We must recheck i_size under i_mmap_mutex */
- mutex_lock(&mapping->i_mmap_mutex);
- size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
- PAGE_CACHE_SHIFT;
- if (unlikely(vmf->pgoff >= size)) {
- ret = VM_FAULT_SIGBUS;
- goto unlock;
- }
- /* not shared and writable, use xip_sparse_page() */
- page = xip_sparse_page();
- if (!page)
- goto unlock;
- err = vm_insert_page(vma, (unsigned long)vmf->virtual_address,
- page);
- if (err == -ENOMEM)
- goto unlock;
-
- ret = VM_FAULT_NOPAGE;
-unlock:
- mutex_unlock(&mapping->i_mmap_mutex);
-out:
- write_seqcount_end(&xip_sparse_seq);
- mutex_unlock(&xip_sparse_mutex);
-
- return ret;
- }
-}
-
-static const struct vm_operations_struct xip_file_vm_ops = {
- .fault = xip_file_fault,
- .page_mkwrite = filemap_page_mkwrite,
- .remap_pages = generic_file_remap_pages,
-};
-
-int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
-{
- BUG_ON(!file->f_mapping->a_ops->get_xip_mem);
-
- file_accessed(file);
- vma->vm_ops = &xip_file_vm_ops;
- vma->vm_flags |= VM_MIXEDMAP;
- return 0;
-}
-EXPORT_SYMBOL_GPL(xip_file_mmap);
-
-/*
* truncate a page used for execute in place
* functionality is analog to block_truncate_page but does use get_xip_mem
* to get the page instead of page cache

--
2.1.0


Mathieu Desnoyers

2014-10-16 10:20:47 UTC

Post by Matthew Wilcox
Instead of calling aops->get_xip_mem from the fault handler, the
filesystem passes a get_block_t that is used to find the appropriate
blocks.
---
fs/dax.c | 232 +++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/ext2/file.c | 35 +++++++-
include/linux/fs.h | 4 +-
mm/filemap_xip.c | 206 -----------------------------------------------
4 files changed, 268 insertions(+), 209 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 02e226f..ac5d3a6 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -19,9 +19,13 @@
#include <linux/buffer_head.h>
#include <linux/fs.h>
#include <linux/genhd.h>
+#include <linux/highmem.h>
+#include <linux/memcontrol.h>
+#include <linux/mm.h>
#include <linux/mutex.h>
#include <linux/sched.h>
#include <linux/uio.h>
+#include <linux/vmstat.h>
int dax_clear_blocks(struct inode *inode, sector_t block, long size)
{
@@ -228,3 +232,231 @@ ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
return retval;
}
EXPORT_SYMBOL_GPL(dax_do_io);
+
+/*
+ * The user has performed a load from a hole in the file. Allocating
+ * a new page in the file would cause excessive storage usage for
+ * workloads with sparse files. We allocate a page cache page instead.
+ * We'll kick it out of the page cache if it's ever written to,
+ * otherwise it will simply fall out of the page cache under memory
+ * pressure without ever having been dirtied.

Nice trick :)

Post by Matthew Wilcox
+ */
+static int dax_load_hole(struct address_space *mapping, struct page *page,
+ struct vm_fault *vmf)
+{
+ unsigned long size;
+ struct inode *inode = mapping->host;

missing newline.

Post by Matthew Wilcox
+ if (!page)
+ page = find_or_create_page(mapping, vmf->pgoff,
+ GFP_KERNEL | __GFP_ZERO);
+ if (!page)
+ return VM_FAULT_OOM;
+ /* Recheck i_size under page lock to avoid truncate race */
+ size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ if (vmf->pgoff >= size) {
+ unlock_page(page);
+ page_cache_release(page);
+ return VM_FAULT_SIGBUS;
+ }
+
+ vmf->page = page;
+ return VM_FAULT_LOCKED;
+}
+
+static int copy_user_bh(struct page *to, struct buffer_head *bh,
+ unsigned blkbits, unsigned long vaddr)
+{
+ void *vfrom, *vto;

missing newline.

Post by Matthew Wilcox
+ if (dax_get_addr(bh, &vfrom, blkbits) < 0)
+ return -EIO;
+ vto = kmap_atomic(to);
+ copy_user_page(vto, vfrom, vaddr, to);
+ kunmap_atomic(vto);
+ return 0;
+}
+
+static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
+ struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ struct address_space *mapping = inode->i_mapping;
+ sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
+ unsigned long vaddr = (unsigned long)vmf->virtual_address;
+ void *addr;
+ unsigned long pfn;
+ pgoff_t size;
+ int error;
+
+ mutex_lock(&mapping->i_mmap_mutex);
+
+ /*
+ * Check truncate didn't happen while we were allocating a block.
+ * If it did, this block may or may not be still allocated to the
+ * file. We can't tell the filesystem to free it because we can't
+ * take i_mutex here. In the worst case, the file still has blocks
+ * allocated past the end of the file.
+ */
+ size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ if (unlikely(vmf->pgoff >= size)) {
+ error = -EIO;
+ goto out;
+ }
+
+ error = bdev_direct_access(bh->b_bdev, sector, &addr, &pfn, bh->b_size);
+ if (error < 0)
+ goto out;
+ if (error < PAGE_SIZE) {
+ error = -EIO;
+ goto out;
+ }
+
+ if (buffer_unwritten(bh) || buffer_new(bh))
+ clear_page(addr);
+
+ error = vm_insert_mixed(vma, vaddr, pfn);
+
+ out:
+ mutex_unlock(&mapping->i_mmap_mutex);
+
+ if (bh->b_end_io)
+ bh->b_end_io(bh, 1);
+
+ return error;
+}
+
+static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+ get_block_t get_block)
+{
+ struct file *file = vma->vm_file;
+ struct inode *inode = file_inode(file);
+ struct address_space *mapping = file->f_mapping;
+ struct page *page;
+ struct buffer_head bh;
+ unsigned long vaddr = (unsigned long)vmf->virtual_address;
+ unsigned blkbits = inode->i_blkbits;

unsigned -> unsigned int

Post by Matthew Wilcox
+ sector_t block;
+ pgoff_t size;
+ int error;
+ int major = 0;
+
+ size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ if (vmf->pgoff >= size)
+ return VM_FAULT_SIGBUS;
+
+ memset(&bh, 0, sizeof(bh));
+ block = (sector_t)vmf->pgoff << (PAGE_SHIFT - blkbits);
+ bh.b_size = PAGE_SIZE;
+
+ repeat:
+ page = find_get_page(mapping, vmf->pgoff);
+ if (page) {
+ if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
+ page_cache_release(page);
+ return VM_FAULT_RETRY;
+ }
+ if (unlikely(page->mapping != mapping)) {
+ unlock_page(page);
+ page_cache_release(page);
+ goto repeat;
+ }
+ }
+
+ error = get_block(inode, block, &bh, 0);
+ if (!error && (bh.b_size < PAGE_SIZE))
+ error = -EIO;
+ if (error)
+ goto unlock_page;
+
+ if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page) {
+ if (vmf->flags & FAULT_FLAG_WRITE) {
+ error = get_block(inode, block, &bh, 1);
+ count_vm_event(PGMAJFAULT);
+ mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
+ major = VM_FAULT_MAJOR;
+ if (!error && (bh.b_size < PAGE_SIZE))
+ error = -EIO;
+ if (error)
+ goto unlock_page;
+ } else {
+ return dax_load_hole(mapping, page, vmf);
+ }
+ }
+
+ if (vmf->cow_page) {
+ struct page *new_page = vmf->cow_page;

add newline.

Post by Matthew Wilcox
+ if (buffer_written(&bh))
+ error = copy_user_bh(new_page, &bh, blkbits, vaddr);
+ else
+ clear_user_highpage(new_page, vaddr);
+ if (error)
+ goto unlock_page;
+ vmf->page = page;
+ if (!page) {
+ mutex_lock(&mapping->i_mmap_mutex);
+ /* Check we didn't race with truncate */
+ size = (i_size_read(inode) + PAGE_SIZE - 1) >>
+ PAGE_SHIFT;
+ if (vmf->pgoff >= size) {
+ mutex_unlock(&mapping->i_mmap_mutex);
+ error = -EIO;
+ goto out;
+ }
+ }

If page is non-NULL, is it possible that we return VM_FAULT_LOCKED
without actually holding i_mmap_mutex? Is that on purpose?

Post by Matthew Wilcox
+ return VM_FAULT_LOCKED;
+ }

Thanks,

Mathieu

Post by Matthew Wilcox
+
+ /* Check we didn't race with a read fault installing a new page */
+ if (!page && major)
+ page = find_lock_page(mapping, vmf->pgoff);
+
+ if (page) {
+ unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
+ PAGE_CACHE_SIZE, 0);
+ delete_from_page_cache(page);
+ unlock_page(page);
+ page_cache_release(page);
+ }
+
+ error = dax_insert_mapping(inode, &bh, vma, vmf);
+
+ out:
+ if (error == -ENOMEM)
+ return VM_FAULT_OOM | major;
+ /* -EBUSY is fine, somebody else faulted on the same PTE */
+ if ((error < 0) && (error != -EBUSY))
+ return VM_FAULT_SIGBUS | major;
+ return VM_FAULT_NOPAGE | major;
+
+ unlock_page:
+ if (page) {
+ unlock_page(page);
+ page_cache_release(page);
+ }
+ goto out;
+}
+
+/**
+ * dax_fault - handle a page fault on a DAX file
+ *
+ * When a page fault occurs, filesystems may call this helper in their
+ * fault handler for DAX files.
+ */
+int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+ get_block_t get_block)
+{
+ int result;
+ struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+
+ if (vmf->flags & FAULT_FLAG_WRITE) {
+ sb_start_pagefault(sb);
+ file_update_time(vma->vm_file);
+ }
+ result = do_dax_fault(vma, vmf, get_block);
+ if (vmf->flags & FAULT_FLAG_WRITE)
+ sb_end_pagefault(sb);
+
+ return result;
+}
+EXPORT_SYMBOL_GPL(dax_fault);
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index a247123..da8dc64 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -25,6 +25,37 @@
#include "xattr.h"
#include "acl.h"
+#ifdef CONFIG_EXT2_FS_XIP
+static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ return dax_fault(vma, vmf, ext2_get_block);
+}
+
+static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ return dax_mkwrite(vma, vmf, ext2_get_block);
+}
+
+static const struct vm_operations_struct ext2_dax_vm_ops = {
+ .fault = ext2_dax_fault,
+ .page_mkwrite = ext2_dax_mkwrite,
+ .remap_pages = generic_file_remap_pages,
+};
+
+static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ if (!IS_DAX(file_inode(file)))
+ return generic_file_mmap(file, vma);
+
+ file_accessed(file);
+ vma->vm_ops = &ext2_dax_vm_ops;
+ vma->vm_flags |= VM_MIXEDMAP;
+ return 0;
+}
+#else
+#define ext2_file_mmap generic_file_mmap
+#endif
+
/*
* Called when filp is released. This happens when all file descriptors
* for a single struct file are closed. Note that different open() calls
@@ -70,7 +101,7 @@ const struct file_operations ext2_file_operations = {
#ifdef CONFIG_COMPAT
.compat_ioctl = ext2_compat_ioctl,
#endif
- .mmap = generic_file_mmap,
+ .mmap = ext2_file_mmap,
.open = dquot_file_open,
.release = ext2_release_file,
.fsync = ext2_fsync,
@@ -89,7 +120,7 @@ const struct file_operations ext2_xip_file_operations = {
#ifdef CONFIG_COMPAT
.compat_ioctl = ext2_compat_ioctl,
#endif
- .mmap = xip_file_mmap,
+ .mmap = ext2_file_mmap,
.open = dquot_file_open,
.release = ext2_release_file,
.fsync = ext2_fsync,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c04d371..338f04b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -49,6 +49,7 @@ struct swap_info_struct;
struct seq_file;
struct workqueue_struct;
struct iov_iter;
+struct vm_fault;
extern void __init inode_init(void);
extern void __init inode_init_early(void);
@@ -2491,10 +2492,11 @@ extern int nonseekable_open(struct inode * inode, struct file * filp);
#ifdef CONFIG_FS_XIP
int dax_clear_blocks(struct inode *, sector_t block, long size);
-extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
extern int xip_truncate_page(struct address_space *mapping, loff_t from);
ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
loff_t, get_block_t, dio_iodone_t, int flags);
+int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
+#define dax_mkwrite(vma, vmf, gb) dax_fault(vma, vmf, gb)
#else
static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
{
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index f7c37a1..9dd45f3 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -22,212 +22,6 @@
#include <asm/io.h>
/*
- * We do use our own empty page to avoid interference with other users
- * of ZERO_PAGE(), such as /dev/zero
- */
-static DEFINE_MUTEX(xip_sparse_mutex);
-static seqcount_t xip_sparse_seq = SEQCNT_ZERO(xip_sparse_seq);
-static struct page *__xip_sparse_page;
-
-/* called under xip_sparse_mutex */
-static struct page *xip_sparse_page(void)
-{
- if (!__xip_sparse_page) {
- struct page *page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
-
- if (page)
- __xip_sparse_page = page;
- }
- return __xip_sparse_page;
-}
-
-/*
- * __xip_unmap is invoked from xip_unmap and
- * xip_write
- *
- * This function walks all vmas of the address_space and unmaps the
- * __xip_sparse_page when found at pgoff.
- */
-static void
-__xip_unmap (struct address_space * mapping,
- unsigned long pgoff)
-{
- struct vm_area_struct *vma;
- struct mm_struct *mm;
- unsigned long address;
- pte_t *pte;
- pte_t pteval;
- spinlock_t *ptl;
- struct page *page;
- unsigned count;
- int locked = 0;
-
- count = read_seqcount_begin(&xip_sparse_seq);
-
- page = __xip_sparse_page;
- if (!page)
- return;
-
-retry:
- mutex_lock(&mapping->i_mmap_mutex);
- vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
- mm = vma->vm_mm;
- address = vma->vm_start +
- ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
- BUG_ON(address < vma->vm_start || address >= vma->vm_end);
- pte = page_check_address(page, mm, address, &ptl, 1);
- if (pte) {
- /* Nuke the page table entry. */
- flush_cache_page(vma, address, pte_pfn(*pte));
- pteval = ptep_clear_flush(vma, address, pte);
- page_remove_rmap(page);
- dec_mm_counter(mm, MM_FILEPAGES);
- BUG_ON(pte_dirty(pteval));
- pte_unmap_unlock(pte, ptl);
- /* must invalidate_page _before_ freeing the page */
- mmu_notifier_invalidate_page(mm, address);
- page_cache_release(page);
- }
- }
- mutex_unlock(&mapping->i_mmap_mutex);
-
- if (locked) {
- mutex_unlock(&xip_sparse_mutex);
- } else if (read_seqcount_retry(&xip_sparse_seq, count)) {
- mutex_lock(&xip_sparse_mutex);
- locked = 1;
- goto retry;
- }
-}
-
-/*
- * xip_fault() is invoked via the vma operations vector for a
- * mapped memory region to read in file data during a page fault.
- *
- * This function is derived from filemap_fault, but used for execute in place
- */
-static int xip_file_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
-{
- struct file *file = vma->vm_file;
- struct address_space *mapping = file->f_mapping;
- struct inode *inode = mapping->host;
- pgoff_t size;
- void *xip_mem;
- unsigned long xip_pfn;
- struct page *page;
- int error;
-
- /* XXX: are VM_FAULT_ codes OK? */
-again:
- size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
- if (vmf->pgoff >= size)
- return VM_FAULT_SIGBUS;
-
- error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 0,
- &xip_mem, &xip_pfn);
- if (likely(!error))
- goto found;
- if (error != -ENODATA)
- return VM_FAULT_OOM;
-
- /* sparse block */
- if ((vma->vm_flags & (VM_WRITE | VM_MAYWRITE)) &&
- (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) &&
- (!(mapping->host->i_sb->s_flags & MS_RDONLY))) {
- int err;
-
- /* maybe shared writable, allocate new block */
- mutex_lock(&xip_sparse_mutex);
- error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 1,
- &xip_mem, &xip_pfn);
- mutex_unlock(&xip_sparse_mutex);
- if (error)
- return VM_FAULT_SIGBUS;
- /* unmap sparse mappings at pgoff from all other vmas */
- __xip_unmap(mapping, vmf->pgoff);
-
-found:
- /* We must recheck i_size under i_mmap_mutex */
- mutex_lock(&mapping->i_mmap_mutex);
- size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
- PAGE_CACHE_SHIFT;
- if (unlikely(vmf->pgoff >= size)) {
- mutex_unlock(&mapping->i_mmap_mutex);
- return VM_FAULT_SIGBUS;
- }
- err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
- xip_pfn);
- mutex_unlock(&mapping->i_mmap_mutex);
- if (err == -ENOMEM)
- return VM_FAULT_OOM;
- /*
- * err == -EBUSY is fine, we've raced against another thread
- * that faulted-in the same page
- */
- if (err != -EBUSY)
- BUG_ON(err);
- return VM_FAULT_NOPAGE;
- } else {
- int err, ret = VM_FAULT_OOM;
-
- mutex_lock(&xip_sparse_mutex);
- write_seqcount_begin(&xip_sparse_seq);
- error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 0,
- &xip_mem, &xip_pfn);
- if (unlikely(!error)) {
- write_seqcount_end(&xip_sparse_seq);
- mutex_unlock(&xip_sparse_mutex);
- goto again;
- }
- if (error != -ENODATA)
- goto out;
-
- /* We must recheck i_size under i_mmap_mutex */
- mutex_lock(&mapping->i_mmap_mutex);
- size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
- PAGE_CACHE_SHIFT;
- if (unlikely(vmf->pgoff >= size)) {
- ret = VM_FAULT_SIGBUS;
- goto unlock;
- }
- /* not shared and writable, use xip_sparse_page() */
- page = xip_sparse_page();
- if (!page)
- goto unlock;
- err = vm_insert_page(vma, (unsigned long)vmf->virtual_address,
- page);
- if (err == -ENOMEM)
- goto unlock;
-
- ret = VM_FAULT_NOPAGE;
-unlock:
- mutex_unlock(&mapping->i_mmap_mutex);
-out:
- write_seqcount_end(&xip_sparse_seq);
- mutex_unlock(&xip_sparse_mutex);
-
- return ret;
- }
-}
-
-static const struct vm_operations_struct xip_file_vm_ops = {
- .fault = xip_file_fault,
- .page_mkwrite = filemap_page_mkwrite,
- .remap_pages = generic_file_remap_pages,
-};
-
-int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
-{
- BUG_ON(!file->f_mapping->a_ops->get_xip_mem);
-
- file_accessed(file);
- vma->vm_ops = &xip_file_vm_ops;
- vma->vm_flags |= VM_MIXEDMAP;
- return 0;
-}
-EXPORT_SYMBOL_GPL(xip_file_mmap);
-
-/*
* truncate a page used for execute in place
* functionality is analog to block_truncate_page but does use get_xip_mem
* to get the page instead of page cache
--
2.1.0

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Key fingerprint: 2A0B 4ED9 15F2 D3FA 45F5 B162 1728 0A97 8118 6ACF


Matthew Wilcox

2014-10-16 21:29:23 UTC

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+/*
+ * The user has performed a load from a hole in the file. Allocating
+ * a new page in the file would cause excessive storage usage for
+ * workloads with sparse files. We allocate a page cache page instead.
+ * We'll kick it out of the page cache if it's ever written to,
+ * otherwise it will simply fall out of the page cache under memory
+ * pressure without ever having been dirtied.

Nice trick :)

It's basically what the page cache does. Unfortunately, I had to step
out of the room while Calvin detailed his trick for doing it differently,
but if his patch goes in, we should follow suit.
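For readers who want to see the user-visible side of the hole case, here is a minimal userspace sketch. It is not DAX-specific and does not exercise dax_load_hole(); it only shows the behaviour that the comment describes, namely that a load from an unallocated range of a mapped file returns zeroes without the program ever writing to the file. The function name and the /tmp path template are assumptions made for this illustration.

```c
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a two-page file that consists entirely of a hole, map it
 * read-only and check that every load returns zero.
 * Returns 1 if all bytes are zero, 0 if a non-zero byte is seen,
 * and -1 if the setup fails.
 */
int hole_reads_as_zero(void)
{
	char path[] = "/tmp/hole-demo-XXXXXX";	/* hypothetical location */
	long page = sysconf(_SC_PAGESIZE);
	const unsigned char *p;
	size_t len, i;
	int fd, ok = 1;

	if (page <= 0)
		return -1;
	len = (size_t)page * 2;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);		/* the file lives until fd is closed */

	/* Extend the file without writing any data: the range is a hole. */
	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return -1;
	}

	p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return -1;
	}

	/* The first touch of each page is a read fault on the hole. */
	for (i = 0; i < len; i++)
		if (p[i] != 0)
			ok = 0;

	munmap((void *)p, len);
	close(fd);
	return ok;
}
```

Calling hole_reads_as_zero() from a test program should return 1 on a filesystem that supports mmap; whether the zeroes come from a page cache page or from some other mechanism is up to the kernel and cannot be observed from here.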

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+ if (!page) {
+ mutex_lock(&mapping->i_mmap_mutex);
+ /* Check we didn't race with truncate */
+ size = (i_size_read(inode) + PAGE_SIZE - 1) >>
+ PAGE_SHIFT;
+ if (vmf->pgoff >= size) {
+ mutex_unlock(&mapping->i_mmap_mutex);
+ error = -EIO;
+ goto out;
+ }
+ }

If page is non-NULL, is it possible that we return VM_FAULT_LOCKED
without actually holding i_mmap_mutex? Is that on purpose?

Post by Matthew Wilcox
+ return VM_FAULT_LOCKED;
+ }

That's right; this is the original meaning of VM_FAULT_LOCKED, that the
page lock is held. We took it before the call to get_block(), ensuring
that we don't hit the truncate race. Er ... hang on. At some point in
the revising of patches, I dropped the stanza where we re-check i_size
after grabbing the page lock. Sod ... a v12 of this patchset will have
to be forthcoming!
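The i_size check discussed here is the same round-up computation that appears throughout the patch. A small standalone sketch of it follows, assuming a 4096-byte page; the DEMO_ constants and function names are invented for the illustration and are not kernel symbols. It shows why the check has to be repeated once the lock is held: the same page offset can pass before a truncate and fail after it.

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12			/* assumption: 4 KiB pages */
#define DEMO_PAGE_SIZE  (1ULL << DEMO_PAGE_SHIFT)

/* Number of pages needed to cover i_size bytes, rounding up. */
uint64_t demo_size_in_pages(uint64_t i_size)
{
	return (i_size + DEMO_PAGE_SIZE - 1) >> DEMO_PAGE_SHIFT;
}

/* Non-zero if a fault at page offset pgoff lies at or beyond EOF. */
int demo_fault_beyond_eof(uint64_t pgoff, uint64_t i_size)
{
	return pgoff >= demo_size_in_pages(i_size);
}
```

With a 16384-byte file, a fault at page offset 3 is inside the file; if a truncate to 8192 bytes lands between the first check and taking the lock, the same offset is now past EOF, which is the case the re-check is meant to catch.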

Matthew Wilcox

2014-09-25 20:33:31 UTC

Replace ext2_use_xip() with test_opt(XIP), which expands to the same code.

Signed-off-by: Matthew Wilcox <***@intel.com>
---
fs/ext2/ext2.h | 4 ++++
fs/ext2/inode.c | 2 +-
fs/ext2/namei.c | 4 ++--
3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index d9a17d0..5ecf570 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -380,7 +380,11 @@ struct ext2_inode {
#define EXT2_MOUNT_NO_UID32 0x000200 /* Disable 32-bit UIDs */
#define EXT2_MOUNT_XATTR_USER 0x004000 /* Extended user attributes */
#define EXT2_MOUNT_POSIX_ACL 0x008000 /* POSIX Access Control Lists */
+#ifdef CONFIG_FS_XIP
#define EXT2_MOUNT_XIP 0x010000 /* Execute in place */
+#else
+#define EXT2_MOUNT_XIP 0
+#endif
#define EXT2_MOUNT_USRQUOTA 0x020000 /* user quota */
#define EXT2_MOUNT_GRPQUOTA 0x040000 /* group quota */
#define EXT2_MOUNT_RESERVATION 0x080000 /* Preallocation */
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 59d6c7d..cba3833 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1394,7 +1394,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)

if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext2_file_inode_operations;
- if (ext2_use_xip(inode->i_sb)) {
+ if (test_opt(inode->i_sb, XIP)) {
inode->i_mapping->a_ops = &ext2_aops_xip;
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index c268d0a..846c356 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
return PTR_ERR(inode);

inode->i_op = &ext2_file_inode_operations;
- if (ext2_use_xip(inode->i_sb)) {
+ if (test_opt(inode->i_sb, XIP)) {
inode->i_mapping->a_ops = &ext2_aops_xip;
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
@@ -126,7 +126,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
return PTR_ERR(inode);

inode->i_op = &ext2_file_inode_operations;
- if (ext2_use_xip(inode->i_sb)) {
+ if (test_opt(inode->i_sb, XIP)) {
inode->i_mapping->a_ops = &ext2_aops_xip;
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {

--
2.1.0


Mathieu Desnoyers

2014-10-16 12:20:51 UTC

Post by Matthew Wilcox
Replace ext2_use_xip() with test_opt(XIP) which expands to the same code
---
fs/ext2/ext2.h | 4 ++++
fs/ext2/inode.c | 2 +-
fs/ext2/namei.c | 4 ++--
3 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index d9a17d0..5ecf570 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -380,7 +380,11 @@ struct ext2_inode {
#define EXT2_MOUNT_NO_UID32 0x000200 /* Disable 32-bit UIDs */
#define EXT2_MOUNT_XATTR_USER 0x004000 /* Extended user attributes */
#define EXT2_MOUNT_POSIX_ACL 0x008000 /* POSIX Access Control Lists */
+#ifdef CONFIG_FS_XIP
#define EXT2_MOUNT_XIP 0x010000 /* Execute in place */
+#else
+#define EXT2_MOUNT_XIP 0
+#endif
#define EXT2_MOUNT_USRQUOTA 0x020000 /* user quota */
#define EXT2_MOUNT_GRPQUOTA 0x040000 /* group quota */
#define EXT2_MOUNT_RESERVATION 0x080000 /* Preallocation */
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 59d6c7d..cba3833 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1394,7 +1394,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext2_file_inode_operations;
- if (ext2_use_xip(inode->i_sb)) {
+ if (test_opt(inode->i_sb, XIP)) {
inode->i_mapping->a_ops = &ext2_aops_xip;
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index c268d0a..846c356 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
return PTR_ERR(inode);
inode->i_op = &ext2_file_inode_operations;
- if (ext2_use_xip(inode->i_sb)) {
+ if (test_opt(inode->i_sb, XIP)) {
inode->i_mapping->a_ops = &ext2_aops_xip;
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
@@ -126,7 +126,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
return PTR_ERR(inode);
inode->i_op = &ext2_file_inode_operations;
- if (ext2_use_xip(inode->i_sb)) {
+ if (test_opt(inode->i_sb, XIP)) {
inode->i_mapping->a_ops = &ext2_aops_xip;
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
--
2.1.0

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Key fingerprint: 2A0B 4ED9 15F2 D3FA 45F5 B162 1728 0A97 8118 6ACF

Matthew Wilcox

2014-09-25 20:33:21 UTC

Currently COW of an XIP file is done by first bringing in a read-only
mapping, then retrying the fault and copying the page. It is much more
efficient to tell the fault handler that a COW is being attempted (by
passing in the pre-allocated page in the vm_fault structure), and allow
the handler to perform the COW operation itself.

The handler cannot insert the page itself if there is already a read-only
mapping at that address, so allow the handler to return VM_FAULT_LOCKED
and set the fault_page to be NULL. This indicates to the MM code that
the i_mmap_mutex is held instead of the page lock.

Signed-off-by: Matthew Wilcox <***@intel.com>
Acked-by: Kirill A. Shutemov <***@linux.intel.com>
---
include/linux/mm.h | 1 +
mm/memory.c | 33 ++++++++++++++++++++++++---------
2 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8981cc8..0a47817 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -208,6 +208,7 @@ struct vm_fault {
pgoff_t pgoff; /* Logical page offset based on vma */
void __user *virtual_address; /* Faulting virtual address */

+ struct page *cow_page; /* Handler may choose to COW */
struct page *page; /* ->fault handlers should return a
* page here, unless VM_FAULT_NOPAGE
* is set (which is also implied by
diff --git a/mm/memory.c b/mm/memory.c
index adeac30..3368785 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2000,6 +2000,7 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
vmf.pgoff = page->index;
vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
vmf.page = page;
+ vmf.cow_page = NULL;

ret = vma->vm_ops->page_mkwrite(vma, &vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
@@ -2698,7 +2699,8 @@ oom:
* See filemap_fault() and __lock_page_retry().
*/
static int __do_fault(struct vm_area_struct *vma, unsigned long address,
- pgoff_t pgoff, unsigned int flags, struct page **page)
+ pgoff_t pgoff, unsigned int flags,
+ struct page *cow_page, struct page **page)
{
struct vm_fault vmf;
int ret;
@@ -2707,10 +2709,13 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
vmf.pgoff = pgoff;
vmf.flags = flags;
vmf.page = NULL;
+ vmf.cow_page = cow_page;

ret = vma->vm_ops->fault(vma, &vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
+ if (!vmf.page)
+ goto out;

if (unlikely(PageHWPoison(vmf.page))) {
if (ret & VM_FAULT_LOCKED)
@@ -2724,6 +2729,7 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
else
VM_BUG_ON_PAGE(!PageLocked(vmf.page), vmf.page);

+ out:
*page = vmf.page;
return ret;
}
@@ -2897,7 +2903,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
pte_unmap_unlock(pte, ptl);
}

- ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+ ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;

@@ -2937,26 +2943,35 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
}

- ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+ ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
goto uncharge_out;

- copy_user_highpage(new_page, fault_page, address, vma);
+ if (fault_page)
+ copy_user_highpage(new_page, fault_page, address, vma);
__SetPageUptodate(new_page);

pte = pte_offset_map_lock(mm, pmd, address, &ptl);
if (unlikely(!pte_same(*pte, orig_pte))) {
pte_unmap_unlock(pte, ptl);
- unlock_page(fault_page);
- page_cache_release(fault_page);
+ if (fault_page) {
+ unlock_page(fault_page);
+ page_cache_release(fault_page);
+ } else {
+ mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+ }
goto uncharge_out;
}
do_set_pte(vma, address, new_page, pte, true, true);
mem_cgroup_commit_charge(new_page, memcg, false);
lru_cache_add_active_or_unevictable(new_page, vma);
pte_unmap_unlock(pte, ptl);
- unlock_page(fault_page);
- page_cache_release(fault_page);
+ if (fault_page) {
+ unlock_page(fault_page);
+ page_cache_release(fault_page);
+ } else {
+ mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+ }
return ret;
uncharge_out:
mem_cgroup_cancel_charge(new_page, memcg);
@@ -2975,7 +2990,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
int dirtied = 0;
int ret, tmp;

- ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+ ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;

--
2.1.0


Mathieu Desnoyers

2014-10-16 09:12:22 UTC

Post by Matthew Wilcox
Currently COW of an XIP file is done by first bringing in a read-only
mapping, then retrying the fault and copying the page. It is much more
efficient to tell the fault handler that a COW is being attempted (by
passing in the pre-allocated page in the vm_fault structure), and allow
the handler to perform the COW operation itself.
The handler cannot insert the page itself if there is already a read-only
mapping at that address, so allow the handler to return VM_FAULT_LOCKED
and set the fault_page to be NULL. This indicates to the MM code that
the i_mmap_mutex is held instead of the page lock.

Why test the value of the fault_page pointer rather than just testing the
return flags to detect in which state the callee left i_mmap_mutex?

Post by Matthew Wilcox
---
include/linux/mm.h | 1 +
mm/memory.c | 33 ++++++++++++++++++++++++---------
2 files changed, 25 insertions(+), 9 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8981cc8..0a47817 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -208,6 +208,7 @@ struct vm_fault {
pgoff_t pgoff; /* Logical page offset based on vma */
void __user *virtual_address; /* Faulting virtual address */
+ struct page *cow_page; /* Handler may choose to COW */

Since the page fault handler is very performance-sensitive, I wonder
whether it would be better to move cow_page near the end of
struct vm_fault, so that the "page" field can stay on the first
cache line.
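To make the layout question concrete, here is a small sketch that compares the offset of "page" under the two field orders. The structs are simplified stand-ins, not the real struct vm_fault (which has further members), and all demo_ names are invented for this example. It only reports offsets; whether a given offset ends up on a different cache line depends on the line size and on how the structure is aligned where it is allocated.

```c
#include <stddef.h>

/* Field order as in the patch: cow_page is inserted ahead of page. */
struct demo_vmf_patch {
	unsigned int flags;
	unsigned long pgoff;
	void *virtual_address;
	void *cow_page;
	void *page;
};

/* Alternative order from the review comment: cow_page after page. */
struct demo_vmf_alt {
	unsigned int flags;
	unsigned long pgoff;
	void *virtual_address;
	void *page;
	void *cow_page;
};

/* Byte offset of "page" with the patch's field order. */
size_t demo_page_offset_patch(void)
{
	return offsetof(struct demo_vmf_patch, page);
}

/* Byte offset of "page" with cow_page moved behind it. */
size_t demo_page_offset_alt(void)
{
	return offsetof(struct demo_vmf_alt, page);
}
```

In this simplified layout, moving cow_page behind page pulls "page" forward by exactly one pointer width.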

Post by Matthew Wilcox
struct page *page; /* ->fault handlers should return a
* page here, unless VM_FAULT_NOPAGE
* is set (which is also implied by
diff --git a/mm/memory.c b/mm/memory.c
index adeac30..3368785 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2000,6 +2000,7 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
vmf.pgoff = page->index;
vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
vmf.page = page;
+ vmf.cow_page = NULL;

Could we add a FAULT_FLAG_COW_PAGE to vmf.flags, so we don't have to set
cow_page to NULL in the common case (when it is not used)?
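A rough sketch of what such a flag-gated field could look like, purely as an illustration of the idea: the flag name, its value, the struct and the accessor are all hypothetical and do not correspond to anything in the patch or in the kernel.

```c
#include <stddef.h>

#define DEMO_FAULT_FLAG_WRITE    0x01u
#define DEMO_FAULT_FLAG_COW_PAGE 0x02u	/* hypothetical: cow_page is valid */

struct demo_fault {
	unsigned int flags;
	void *cow_page;	/* meaningful only with DEMO_FAULT_FLAG_COW_PAGE */
};

/*
 * Return the caller-supplied COW page, or NULL when the caller did not
 * pass one. Callers that never set the flag need not initialise
 * cow_page, because it is never read without the flag.
 */
void *demo_fault_cow_page(const struct demo_fault *f)
{
	if (f->flags & DEMO_FAULT_FLAG_COW_PAGE)
		return f->cow_page;
	return NULL;
}
```

The trade-off is that every reader of the field has to go through the flag test instead of a plain NULL check on the pointer.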

Thanks,

Mathieu

Post by Matthew Wilcox
ret = vma->vm_ops->page_mkwrite(vma, &vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
@@ -2698,7 +2699,8 @@ oom:
* See filemap_fault() and __lock_page_retry().
*/
static int __do_fault(struct vm_area_struct *vma, unsigned long address,
- pgoff_t pgoff, unsigned int flags, struct page **page)
+ pgoff_t pgoff, unsigned int flags,
+ struct page *cow_page, struct page **page)
{
struct vm_fault vmf;
int ret;
@@ -2707,10 +2709,13 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
vmf.pgoff = pgoff;
vmf.flags = flags;
vmf.page = NULL;
+ vmf.cow_page = cow_page;
ret = vma->vm_ops->fault(vma, &vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
+ if (!vmf.page)
+ goto out;
if (unlikely(PageHWPoison(vmf.page))) {
if (ret & VM_FAULT_LOCKED)
@@ -2724,6 +2729,7 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
else
VM_BUG_ON_PAGE(!PageLocked(vmf.page), vmf.page);
*page = vmf.page;
return ret;
}
@@ -2897,7 +2903,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
pte_unmap_unlock(pte, ptl);
}
- ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+ ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
@@ -2937,26 +2943,35 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
}
- ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+ ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
goto uncharge_out;
- copy_user_highpage(new_page, fault_page, address, vma);
+ if (fault_page)
+ copy_user_highpage(new_page, fault_page, address, vma);
__SetPageUptodate(new_page);
pte = pte_offset_map_lock(mm, pmd, address, &ptl);
if (unlikely(!pte_same(*pte, orig_pte))) {
pte_unmap_unlock(pte, ptl);
- unlock_page(fault_page);
- page_cache_release(fault_page);
+ if (fault_page) {
+ unlock_page(fault_page);
+ page_cache_release(fault_page);
+ } else {
+ mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+ }
goto uncharge_out;
}
do_set_pte(vma, address, new_page, pte, true, true);
mem_cgroup_commit_charge(new_page, memcg, false);
lru_cache_add_active_or_unevictable(new_page, vma);
pte_unmap_unlock(pte, ptl);
- unlock_page(fault_page);
- page_cache_release(fault_page);
+ if (fault_page) {
+ unlock_page(fault_page);
+ page_cache_release(fault_page);
+ } else {
+ mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+ }
return ret;
mem_cgroup_cancel_charge(new_page, memcg);
@@ -2975,7 +2990,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
int dirtied = 0;
int ret, tmp;
- ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+ ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
--
2.1.0

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Key fingerprint: 2A0B 4ED9 15F2 D3FA 45F5 B162 1728 0A97 8118 6ACF


Matthew Wilcox

2014-10-16 19:48:15 UTC

Post by Mathieu Desnoyers

Post by Matthew Wilcox
Currently COW of an XIP file is done by first bringing in a read-only
mapping, then retrying the fault and copying the page. It is much more
efficient to tell the fault handler that a COW is being attempted (by
passing in the pre-allocated page in the vm_fault structure), and allow
the handler to perform the COW operation itself.
The handler cannot insert the page itself if there is already a read-only
mapping at that address, so allow the handler to return VM_FAULT_LOCKED
and set the fault_page to be NULL. This indicates to the MM code that
the i_mmap_mutex is held instead of the page lock.

Why test the value of fault_page pointer rather than just test return
flags to detect in which state the callee left i_mmap_mutex ?

Maybe my changelog isn't clear enough to a non-mm expert. Which would
include me. Usually page fault handlers return with the page lock
held and VM_FAULT_LOCKED set. This patch adds the ability to return
with VM_FAULT_LOCKED set and a NULL page. This indicates to the VM the
new possibility that the i_mmap_mutex is held instead of the page lock
(since there is no page, we cannot possibly be holding the page lock).

But we have to hold some kind of lock here, or we run the risk of a
truncate operation coming in and removing the page from the file that we
just found. The i_mmap_mutex is not ideal (since it may become heavily
contended), but it does fix the race, and some people have interesting
ideas on how to fix the scalability problem.
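
A minimal userspace sketch of the convention described above, for readers who want to see the unlock decision in isolation. The mock_page and mock_mapping types and the release_fault_lock() helper are hypothetical stand-ins invented for illustration; they are not kernel definitions, and the real code uses unlock_page(), page_cache_release() and mutex_unlock() as shown in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct page and the address_space lock. */
struct mock_page { int locked; };
struct mock_mapping { int i_mmap_mutex_held; };

/*
 * Release whichever lock the fault handler left held when it returned
 * VM_FAULT_LOCKED: a non-NULL fault_page means the page lock is held,
 * a NULL fault_page means the mapping's i_mmap_mutex is held instead.
 */
static void release_fault_lock(struct mock_page *fault_page,
                               struct mock_mapping *mapping)
{
	if (fault_page) {
		assert(fault_page->locked);
		fault_page->locked = 0;
	} else {
		assert(mapping->i_mmap_mutex_held);
		mapping->i_mmap_mutex_held = 0;
	}
}
```

The point of the sketch is only that the NULL test on the page pointer selects the lock; no extra return flag is needed to tell the two cases apart.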

Post by Mathieu Desnoyers

Post by Matthew Wilcox
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8981cc8..0a47817 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -208,6 +208,7 @@ struct vm_fault {
pgoff_t pgoff; /* Logical page offset based on vma */
void __user *virtual_address; /* Faulting virtual address */
+ struct page *cow_page; /* Handler may choose to COW */

The page fault handler being very much performance sensitive, I'm
wondering if it would not be better to move cow_page near the end of
struct vm_fault, so that the "page" field can stay on the first
cache line.

I think your mental arithmetic has an "off by double" there:

struct vm_fault {
unsigned int flags; /* 0 4 */

/* XXX 4 bytes hole, try to pack */

long unsigned int pgoff; /* 8 8 */
void * virtual_address; /* 16 8 */
struct page * cow_page; /* 24 8 */
struct page * page; /* 32 8 */
long unsigned int max_pgoff; /* 40 8 */
pte_t * pte; /* 48 8 */

/* size: 56, cachelines: 1, members: 7 */
/* sum members: 52, holes: 1, sum holes: 4 */
/* last cacheline: 56 bytes */
};
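
The offsets in that dump can be cross-checked in userspace with offsetof(). The mock below copies only the field order from the dump and replaces struct page * and pte_t * with void *; the struct name and the helper are made up for this illustration, and the numbers in the dump assume an LP64 target.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace mock with the same field order as the dump; not the kernel type. */
struct vm_fault_mock {
	unsigned int flags;
	unsigned long pgoff;
	void *virtual_address;
	void *cow_page;
	void *page;
	unsigned long max_pgoff;
	void *pte;
};

/* Returns 1 if the whole 'page' field lies within the first line_size bytes. */
static int page_field_in_first_line(size_t line_size)
{
	assert(line_size > 0);
	return offsetof(struct vm_fault_mock, page) + sizeof(void *) <= line_size;
}
```

On LP64 the page field occupies bytes 32 to 39, so it sits inside a 64-byte line but would start the second line if lines were 32 bytes.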

Post by Mathieu Desnoyers

Post by Matthew Wilcox
@@ -2000,6 +2000,7 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
vmf.pgoff = page->index;
vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
vmf.page = page;
+ vmf.cow_page = NULL;

Could we add a FAULT_FLAG_COW_PAGE to vmf.flags, so we don't have to set
cow_page to NULL in the common case (when it is not used) ?

I don't think we're short on bits, so I'm not opposed. Any MM people
want to weigh in before I make this change?


Mathieu Desnoyers

2014-10-17 15:35:01 UTC

----- Original Message -----

Sent: Thursday, October 16, 2014 9:48:15 PM
Subject: Re: [PATCH v11 04/21] mm: Allow page fault handlers to perform the COW

[...]

Post by Mathieu Desnoyers

Post by Matthew Wilcox
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8981cc8..0a47817 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -208,6 +208,7 @@ struct vm_fault {
pgoff_t pgoff; /* Logical page offset based on vma */
void __user *virtual_address; /* Faulting virtual address */
+ struct page *cow_page; /* Handler may choose to COW */

The page fault handler being very much performance sensitive, I'm
wondering if it would not be better to move cow_page near the end of
struct vm_fault, so that the "page" field can stay on the first
cache line.

struct vm_fault {
unsigned int flags; /* 0 4 */
/* XXX 4 bytes hole, try to pack */
long unsigned int pgoff; /* 8 8 */
void * virtual_address; /* 16 8 */
struct page * cow_page; /* 24 8 */
struct page * page; /* 32 8 */
long unsigned int max_pgoff; /* 40 8 */
pte_t * pte; /* 48 8 */
/* size: 56, cachelines: 1, members: 7 */
/* sum members: 52, holes: 1, sum holes: 4 */
/* last cacheline: 56 bytes */
};

Although it's pretty much always true that recent architectures' L2 cache
lines are 64 bytes, I was thinking more about L1 cache lines, which are,
at least on moderately old Intel Pentium HW, 32 bytes in size (AFAIK
Pentium II and III).

It remains to be seen whether we care about performance that much on this
kind of HW though.

Post by Mathieu Desnoyers

Post by Matthew Wilcox
@@ -2000,6 +2000,7 @@ static int do_page_mkwrite(struct vm_area_struct
*vma, struct page *page,
vmf.pgoff = page->index;
vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
vmf.page = page;
+ vmf.cow_page = NULL;

Could we add a FAULT_FLAG_COW_PAGE to vmf.flags, so we don't have to set
cow_page to NULL in the common case (when it is not used) ?

I don't think we're short on bits, so I'm not opposed. Any MM people
want to weigh in before I make this change?

Well, since new HW seems to have standardized on 64-byte L1 cache lines
(recent Intel and ARM Cortex A7 and A15), perhaps it's not worth it. However,
I'd be curious whether there are other architectures out there we care about
performance-wise that still have 32-byte cache lines.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Matthew Wilcox

2014-10-18 17:22:07 UTC

Post by Mathieu Desnoyers

Post by Mathieu Desnoyers
The page fault handler being very much performance sensitive, I'm
wondering if it would not be better to move cow_page near the end of
struct vm_fault, so that the "page" field can stay on the first
cache line.

Although it's pretty much always true that recent architectures' L2 cache
lines are 64 bytes, I was thinking more about L1 cache lines, which are,
at least on moderately old Intel Pentium HW, 32 bytes in size (AFAIK
Pentium II and III).
It remains to be seen whether we care about performance that much on this
kind of HW though.

Oh, I just remembered ... this data structure is on the stack, so if it's
not cache-hot, something has gone horribly wrong.

Matthew Wilcox

2014-09-25 20:33:19 UTC

In order to support accesses to larger chunks of memory, pass in a
'size' parameter (counted in bytes), and return the amount available at
that address.

Add a new helper function, bdev_direct_access(), to handle common
functionality including partition handling, checking the length requested
is positive, checking for the sector being page-aligned, and checking
the length of the request does not pass the end of the partition.
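
As a rough illustration of the order of those checks, here is a userspace model. It is a sketch, not the helper itself: struct model_part, model_direct_access() and the driver_avail parameter (standing in for whatever the driver's direct_access method would return) are hypothetical, and a 4 KiB page size is assumed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE   4096L	/* assumption: 4 KiB pages */
#define MODEL_SECTOR_SIZE 512L

/* Hypothetical partition description, in 512-byte sectors. */
struct model_part {
	unsigned long long start_sect;
	unsigned long long nr_sects;
};

/*
 * Model of the common checks: non-negative size, request inside the
 * partition, page-aligned absolute sector, then clamp to what the
 * driver reports as available (driver_avail may be a negative errno).
 */
static long model_direct_access(const struct model_part *part,
				unsigned long long sector, long size,
				long driver_avail)
{
	unsigned long long nsect;

	assert(part != NULL);
	if (size < 0)
		return size;
	nsect = ((unsigned long long)size + MODEL_SECTOR_SIZE - 1) /
		MODEL_SECTOR_SIZE;
	if (sector + nsect > part->nr_sects)
		return -ERANGE;
	sector += part->start_sect;	/* partition-relative to absolute */
	if (sector % (MODEL_PAGE_SIZE / MODEL_SECTOR_SIZE))
		return -EINVAL;
	if (!driver_avail)
		return -ERANGE;
	return driver_avail < size ? driver_avail : size;
}
```

Note that in this model the alignment test is applied after the partition start has been added, so a partition that does not begin on a page boundary shifts which relative sectors are accepted.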

Signed-off-by: Matthew Wilcox <***@intel.com>
Reviewed-by: Jan Kara <***@suse.cz>
Reviewed-by: Boaz Harrosh <***@plexistor.com>
---
Documentation/filesystems/xip.txt | 15 +++++++++------
arch/powerpc/sysdev/axonram.c | 17 ++++-------------
drivers/block/brd.c | 12 +++++-------
drivers/s390/block/dcssblk.c | 21 +++++++++-----------
fs/block_dev.c | 40 +++++++++++++++++++++++++++++++++++++++
fs/ext2/xip.c | 31 +++++++++++++-----------------
include/linux/blkdev.h | 6 ++++--
7 files changed, 84 insertions(+), 58 deletions(-)

diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
index 0466ee5..b774729 100644
--- a/Documentation/filesystems/xip.txt
+++ b/Documentation/filesystems/xip.txt
@@ -28,12 +28,15 @@ Implementation
Execute-in-place is implemented in three steps: block device operation,
address space operation, and file operations.

-A block device operation named direct_access is used to retrieve a
-reference (pointer) to a block on-disk. The reference is supposed to be
-cpu-addressable, physical address and remain valid until the release operation
-is performed. A struct block_device reference is used to address the device,
-and a sector_t argument is used to identify the individual block. As an
-alternative, memory technology devices can be used for this.
+A block device operation named direct_access is used to translate the
+block device sector number to a page frame number (pfn) that identifies
+the physical page for the memory. It also returns a kernel virtual
+address that can be used to access the memory.
+
+The direct_access method takes a 'size' parameter that indicates the
+number of bytes being requested. The function should return the number
+of bytes that can be contiguously accessed at that offset. It may also
+return a negative errno if an error occurs.

The block device operation is optional, these block devices support it as of
today:
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 830edc8..8709b9f 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -139,26 +139,17 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
* axon_ram_direct_access - direct_access() method for block device
* @device, @sector, @data: see block_device_operations method
*/
-static int
+static long
axon_ram_direct_access(struct block_device *device, sector_t sector,
- void **kaddr, unsigned long *pfn)
+ void **kaddr, unsigned long *pfn, long size)
{
struct axon_ram_bank *bank = device->bd_disk->private_data;
- loff_t offset;
-
- offset = sector;
- if (device->bd_part != NULL)
- offset += device->bd_part->start_sect;
- offset <<= AXON_RAM_SECTOR_SHIFT;
- if (offset >= bank->size) {
- dev_err(&bank->device->dev, "Access outside of address space\n");
- return -ERANGE;
- }
+ loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;

*kaddr = (void *)(bank->ph_addr + offset);
*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;

- return 0;
+ return bank->size - offset;
}

static const struct block_device_operations axon_ram_devops = {
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 3598110..78fe510 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -370,25 +370,23 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
}

#ifdef CONFIG_BLK_DEV_XIP
-static int brd_direct_access(struct block_device *bdev, sector_t sector,
- void **kaddr, unsigned long *pfn)
+static long brd_direct_access(struct block_device *bdev, sector_t sector,
+ void **kaddr, unsigned long *pfn, long size)
{
struct brd_device *brd = bdev->bd_disk->private_data;
struct page *page;

if (!brd)
return -ENODEV;
- if (sector & (PAGE_SECTORS-1))
- return -EINVAL;
- if (sector + PAGE_SECTORS > get_capacity(bdev->bd_disk))
- return -ERANGE;
page = brd_insert_page(brd, sector);
if (!page)
return -ENOSPC;
*kaddr = page_address(page);
*pfn = page_to_pfn(page);

- return 0;
+ /* If size > PAGE_SIZE, we could look to see if the next page in the
+ * file happens to be mapped to the next page of physical RAM */
+ return PAGE_SIZE;
}
#endif

diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 0f47175..96bc411 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -28,8 +28,8 @@
static int dcssblk_open(struct block_device *bdev, fmode_t mode);
static void dcssblk_release(struct gendisk *disk, fmode_t mode);
static void dcssblk_make_request(struct request_queue *q, struct bio *bio);
-static int dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
- void **kaddr, unsigned long *pfn);
+static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
+ void **kaddr, unsigned long *pfn, long size);

static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";

@@ -866,25 +866,22 @@ fail:
bio_io_error(bio);
}

-static int
+static long
dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
- void **kaddr, unsigned long *pfn)
+ void **kaddr, unsigned long *pfn, long size)
{
struct dcssblk_dev_info *dev_info;
- unsigned long pgoff;
+ unsigned long offset, dev_sz;

dev_info = bdev->bd_disk->private_data;
if (!dev_info)
return -ENODEV;
- if (secnum % (PAGE_SIZE/512))
- return -EINVAL;
- pgoff = secnum / (PAGE_SIZE / 512);
- if ((pgoff+1)*PAGE_SIZE-1 > dev_info->end - dev_info->start)
- return -ERANGE;
- *kaddr = (void *) (dev_info->start+pgoff*PAGE_SIZE);
+ dev_sz = dev_info->end - dev_info->start;
+ offset = secnum * 512;
+ *kaddr = (void *) (dev_info->start + offset);
*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;

- return 0;
+ return dev_sz - offset;
}

static void
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 6d72746..ffe0761 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -427,6 +427,46 @@ int bdev_write_page(struct block_device *bdev, sector_t sector,
}
EXPORT_SYMBOL_GPL(bdev_write_page);

+/**
+ * bdev_direct_access() - Get the address for directly-accessibly memory
+ * @bdev: The device containing the memory
+ * @sector: The offset within the device
+ * @addr: Where to put the address of the memory
+ * @pfn: The Page Frame Number for the memory
+ * @size: The number of bytes requested
+ *
+ * If a block device is made up of directly addressable memory, this function
+ * will tell the caller the PFN and the address of the memory. The address
+ * may be directly dereferenced within the kernel without the need to call
+ * ioremap(), kmap() or similar. The PFN is suitable for inserting into
+ * page tables.
+ *
+ * Return: negative errno if an error occurs, otherwise the number of bytes
+ * accessible at this address.
+ */
+long bdev_direct_access(struct block_device *bdev, sector_t sector,
+ void **addr, unsigned long *pfn, long size)
+{
+ long avail;
+ const struct block_device_operations *ops = bdev->bd_disk->fops;
+
+ if (size < 0)
+ return size;
+ if (!ops->direct_access)
+ return -EOPNOTSUPP;
+ if ((sector + DIV_ROUND_UP(size, 512)) >
+ part_nr_sects_read(bdev->bd_part))
+ return -ERANGE;
+ sector += get_start_sect(bdev);
+ if (sector % (PAGE_SIZE / 512))
+ return -EINVAL;
+ avail = ops->direct_access(bdev, sector, addr, pfn, size);
+ if (!avail)
+ return -ERANGE;
+ return min(avail, size);
+}
+EXPORT_SYMBOL_GPL(bdev_direct_access);
+
/*
* pseudo-fs
*/
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index e98171a..bbc5fec 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,18 +13,12 @@
#include "ext2.h"
#include "xip.h"

-static inline int
-__inode_direct_access(struct inode *inode, sector_t block,
- void **kaddr, unsigned long *pfn)
+static inline long __inode_direct_access(struct inode *inode, sector_t block,
+ void **kaddr, unsigned long *pfn, long size)
{
struct block_device *bdev = inode->i_sb->s_bdev;
- const struct block_device_operations *ops = bdev->bd_disk->fops;
- sector_t sector;
-
- sector = block * (PAGE_SIZE / 512); /* ext2 block to bdev sector */
-
- BUG_ON(!ops->direct_access);
- return ops->direct_access(bdev, sector, kaddr, pfn);
+ sector_t sector = block * (PAGE_SIZE / 512);
+ return bdev_direct_access(bdev, sector, kaddr, pfn, size);
}

static inline int
@@ -53,12 +47,13 @@ ext2_clear_xip_target(struct inode *inode, sector_t block)
{
void *kaddr;
unsigned long pfn;
- int rc;
+ long size;

- rc = __inode_direct_access(inode, block, &kaddr, &pfn);
- if (!rc)
- clear_page(kaddr);
- return rc;
+ size = __inode_direct_access(inode, block, &kaddr, &pfn, PAGE_SIZE);
+ if (size < 0)
+ return size;
+ clear_page(kaddr);
+ return 0;
}

void ext2_xip_verify_sb(struct super_block *sb)
@@ -77,7 +72,7 @@ void ext2_xip_verify_sb(struct super_block *sb)
int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
void **kmem, unsigned long *pfn)
{
- int rc;
+ long rc;
sector_t block;

/* first, retrieve the sector number */
@@ -86,6 +81,6 @@ int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
return rc;

/* retrieve address of the target data */
- rc = __inode_direct_access(mapping->host, block, kmem, pfn);
- return rc;
+ rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE);
+ return (rc < 0) ? rc : 0;
}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 518b465..ac25166 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1615,8 +1615,8 @@ struct block_device_operations {
int (*rw_page)(struct block_device *, sector_t, struct page *, int rw);
int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
- int (*direct_access) (struct block_device *, sector_t,
- void **, unsigned long *);
+ long (*direct_access)(struct block_device *, sector_t,
+ void **, unsigned long *pfn, long size);
unsigned int (*check_events) (struct gendisk *disk,
unsigned int clearing);
/* ->media_changed() is DEPRECATED, use ->check_events() instead */
@@ -1634,6 +1634,8 @@ extern int __blkdev_driver_ioctl(struct block_device *, fmode_t, unsigned int,
extern int bdev_read_page(struct block_device *, sector_t, struct page *);
extern int bdev_write_page(struct block_device *, sector_t, struct page *,
struct writeback_control *);
+extern long bdev_direct_access(struct block_device *, sector_t, void **addr,
+ unsigned long *pfn, long size);
#else /* CONFIG_BLOCK */

struct block_device;

--
2.1.0


Mathieu Desnoyers

2014-10-16 08:45:50 UTC

Post by Matthew Wilcox
In order to support accesses to larger chunks of memory, pass in a
'size' parameter (counted in bytes), and return the amount available at
that address.
Add a new helper function, bdev_direct_access(), to handle common
functionality including partition handling, checking the length requested
is positive, checking for the sector being page-aligned, and checking
the length of the request does not pass the end of the partition.
---
Documentation/filesystems/xip.txt | 15 +++++++++------
arch/powerpc/sysdev/axonram.c | 17 ++++-------------
drivers/block/brd.c | 12 +++++-------
drivers/s390/block/dcssblk.c | 21 +++++++++-----------
fs/block_dev.c | 40 +++++++++++++++++++++++++++++++++++++++
fs/ext2/xip.c | 31 +++++++++++++-----------------
include/linux/blkdev.h | 6 ++++--
7 files changed, 84 insertions(+), 58 deletions(-)
diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
index 0466ee5..b774729 100644
--- a/Documentation/filesystems/xip.txt
+++ b/Documentation/filesystems/xip.txt
@@ -28,12 +28,15 @@ Implementation
Execute-in-place is implemented in three steps: block device operation,
address space operation, and file operations.
-A block device operation named direct_access is used to retrieve a
-reference (pointer) to a block on-disk. The reference is supposed to be
-cpu-addressable, physical address and remain valid until the release operation
-is performed. A struct block_device reference is used to address the device,
-and a sector_t argument is used to identify the individual block. As an
-alternative, memory technology devices can be used for this.
+A block device operation named direct_access is used to translate the
+block device sector number to a page frame number (pfn) that identifies
+the physical page for the memory. It also returns a kernel virtual
+address that can be used to access the memory.
+
+The direct_access method takes a 'size' parameter that indicates the
+number of bytes being requested. The function should return the number
+of bytes that can be contiguously accessed at that offset. It may also
+return a negative errno if an error occurs.
The block device operation is optional, these block devices support it as of
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 830edc8..8709b9f 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -139,26 +139,17 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
* axon_ram_direct_access - direct_access() method for block device
*/
-static int
+static long
axon_ram_direct_access(struct block_device *device, sector_t sector,
- void **kaddr, unsigned long *pfn)
+ void **kaddr, unsigned long *pfn, long size)

Why "long" as the type for size? What is the intent of having it signed, and
why use a 32-bit type on 32-bit architectures rather than a 64-bit one?
Can we run into issues if we try to map a >2GB file on 32-bit
architectures?

Post by Matthew Wilcox
{
struct axon_ram_bank *bank = device->bd_disk->private_data;
- loff_t offset;
-
- offset = sector;
- if (device->bd_part != NULL)
- offset += device->bd_part->start_sect;
- offset <<= AXON_RAM_SECTOR_SHIFT;
- if (offset >= bank->size) {
- dev_err(&bank->device->dev, "Access outside of address space\n");
- return -ERANGE;
- }
+ loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
*kaddr = (void *)(bank->ph_addr + offset);
*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
- return 0;
+ return bank->size - offset;
}
static const struct block_device_operations axon_ram_devops = {
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 3598110..78fe510 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -370,25 +370,23 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
}
#ifdef CONFIG_BLK_DEV_XIP
-static int brd_direct_access(struct block_device *bdev, sector_t sector,
- void **kaddr, unsigned long *pfn)
+static long brd_direct_access(struct block_device *bdev, sector_t sector,
+ void **kaddr, unsigned long *pfn, long size)
{
struct brd_device *brd = bdev->bd_disk->private_data;
struct page *page;
if (!brd)
return -ENODEV;
- if (sector & (PAGE_SECTORS-1))
- return -EINVAL;
- if (sector + PAGE_SECTORS > get_capacity(bdev->bd_disk))
- return -ERANGE;
page = brd_insert_page(brd, sector);
if (!page)
return -ENOSPC;
*kaddr = page_address(page);
*pfn = page_to_pfn(page);
- return 0;
+ /* If size > PAGE_SIZE, we could look to see if the next page in the
+ * file happens to be mapped to the next page of physical RAM */

The style for this comment should be:

/*
* ....
*/

Perhaps with a "TODO" ?

Post by Matthew Wilcox
+ return PAGE_SIZE;
}
#endif
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 0f47175..96bc411 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -28,8 +28,8 @@
static int dcssblk_open(struct block_device *bdev, fmode_t mode);
static void dcssblk_release(struct gendisk *disk, fmode_t mode);
static void dcssblk_make_request(struct request_queue *q, struct bio *bio);
-static int dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
- void **kaddr, unsigned long *pfn);
+static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
+ void **kaddr, unsigned long *pfn, long size);
static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
bio_io_error(bio);
}
-static int
+static long
dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
- void **kaddr, unsigned long *pfn)
+ void **kaddr, unsigned long *pfn, long size)
{
struct dcssblk_dev_info *dev_info;
- unsigned long pgoff;
+ unsigned long offset, dev_sz;
dev_info = bdev->bd_disk->private_data;
if (!dev_info)
return -ENODEV;
- if (secnum % (PAGE_SIZE/512))
- return -EINVAL;
- pgoff = secnum / (PAGE_SIZE / 512);
- if ((pgoff+1)*PAGE_SIZE-1 > dev_info->end - dev_info->start)
- return -ERANGE;
- *kaddr = (void *) (dev_info->start+pgoff*PAGE_SIZE);
+ dev_sz = dev_info->end - dev_info->start;
+ offset = secnum * 512;
+ *kaddr = (void *) (dev_info->start + offset);
*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
- return 0;
+ return dev_sz - offset;
}
static void
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 6d72746..ffe0761 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -427,6 +427,46 @@ int bdev_write_page(struct block_device *bdev, sector_t sector,
}
EXPORT_SYMBOL_GPL(bdev_write_page);
+/**
+ * bdev_direct_access() - Get the address for directly-accessibly memory
+ *
+ * If a block device is made up of directly addressable memory, this function
+ * will tell the caller the PFN and the address of the memory. The address
+ * may be directly dereferenced within the kernel without the need to call
+ * ioremap(), kmap() or similar. The PFN is suitable for inserting into
+ * page tables.
+ *
+ * Return: negative errno if an error occurs, otherwise the number of bytes
+ * accessible at this address.
+ */
+long bdev_direct_access(struct block_device *bdev, sector_t sector,
+ void **addr, unsigned long *pfn, long size)
+{
+ long avail;
+ const struct block_device_operations *ops = bdev->bd_disk->fops;
+
+ if (size < 0)
+ return size;

I'm wondering how we should handle size == 0 here. Should it be accepted
or refused ?

Thanks,

Mathieu

Post by Matthew Wilcox
+ if (!ops->direct_access)
+ return -EOPNOTSUPP;
+ if ((sector + DIV_ROUND_UP(size, 512)) >
+ part_nr_sects_read(bdev->bd_part))
+ return -ERANGE;
+ sector += get_start_sect(bdev);
+ if (sector % (PAGE_SIZE / 512))
+ return -EINVAL;
+ avail = ops->direct_access(bdev, sector, addr, pfn, size);
+ if (!avail)
+ return -ERANGE;
+ return min(avail, size);
+}
+EXPORT_SYMBOL_GPL(bdev_direct_access);
+
/*
* pseudo-fs
*/
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index e98171a..bbc5fec 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,18 +13,12 @@
#include "ext2.h"
#include "xip.h"
-static inline int
-__inode_direct_access(struct inode *inode, sector_t block,
- void **kaddr, unsigned long *pfn)
+static inline long __inode_direct_access(struct inode *inode, sector_t block,
+ void **kaddr, unsigned long *pfn, long size)
{
struct block_device *bdev = inode->i_sb->s_bdev;
- const struct block_device_operations *ops = bdev->bd_disk->fops;
- sector_t sector;
-
- sector = block * (PAGE_SIZE / 512); /* ext2 block to bdev sector */
-
- BUG_ON(!ops->direct_access);
- return ops->direct_access(bdev, sector, kaddr, pfn);
+ sector_t sector = block * (PAGE_SIZE / 512);
+ return bdev_direct_access(bdev, sector, kaddr, pfn, size);
}
static inline int
@@ -53,12 +47,13 @@ ext2_clear_xip_target(struct inode *inode, sector_t block)
{
void *kaddr;
unsigned long pfn;
- int rc;
+ long size;
- rc = __inode_direct_access(inode, block, &kaddr, &pfn);
- if (!rc)
- clear_page(kaddr);
- return rc;
+ size = __inode_direct_access(inode, block, &kaddr, &pfn, PAGE_SIZE);
+ if (size < 0)
+ return size;
+ clear_page(kaddr);
+ return 0;
}
void ext2_xip_verify_sb(struct super_block *sb)
@@ -77,7 +72,7 @@ void ext2_xip_verify_sb(struct super_block *sb)
int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
void **kmem, unsigned long *pfn)
{
- int rc;
+ long rc;
sector_t block;
/* first, retrieve the sector number */
@@ -86,6 +81,6 @@ int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
return rc;
/* retrieve address of the target data */
- rc = __inode_direct_access(mapping->host, block, kmem, pfn);
- return rc;
+ rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE);
+ return (rc < 0) ? rc : 0;
}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 518b465..ac25166 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1615,8 +1615,8 @@ struct block_device_operations {
int (*rw_page)(struct block_device *, sector_t, struct page *, int rw);
int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
- int (*direct_access) (struct block_device *, sector_t,
- void **, unsigned long *);
+ long (*direct_access)(struct block_device *, sector_t,
+ void **, unsigned long *pfn, long size);
unsigned int (*check_events) (struct gendisk *disk,
unsigned int clearing);
/* ->media_changed() is DEPRECATED, use ->check_events() instead */
@@ -1634,6 +1634,8 @@ extern int __blkdev_driver_ioctl(struct block_device *, fmode_t, unsigned int,
extern int bdev_read_page(struct block_device *, sector_t, struct page *);
extern int bdev_write_page(struct block_device *, sector_t, struct page *,
struct writeback_control *);
+extern long bdev_direct_access(struct block_device *, sector_t, void **addr,
+ unsigned long *pfn, long size);
#else /* CONFIG_BLOCK */
struct block_device;
--
2.1.0

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Key fingerprint: 2A0B 4ED9 15F2 D3FA 45F5 B162 1728 0A97 8118 6ACF


Matthew Wilcox

2014-10-16 19:39:21 UTC

Post by Mathieu Desnoyers

Post by Matthew Wilcox
-static int
+static long
axon_ram_direct_access(struct block_device *device, sector_t sector,
- void **kaddr, unsigned long *pfn)
+ void **kaddr, unsigned long *pfn, long size)

Why "long" as the type for size? What is the intent of having it signed, and
why use a 32-bit type on 32-bit architectures rather than a 64-bit one?
Can we run into issues if we try to map a >2GB file on 32-bit
architectures?

The interface requires that the entirety of the pmem be mapped at
all times (see the void **kaddr). So the total amount of pmem in the
system can't be larger than 4GB on a 32-bit system. On x86-32, that's
actually limited to 1GB (because we give userspace 3GB), so the problem
doesn't come up. Maybe this would be more of a potential problem on
other architectures.

As noted elsewhere in the thread, it would be possible, and maybe
desirable, to remove the need to have all of pmem mapped into the kernel
address space at all times, but I'm not looking to solve that problem
with this patch-set.

The intent of having it signed is that users pass in the size they want
to have and are returned the size they actually got. Since the function
must be able to return an error, keeping size signed is natural.
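
A minimal userspace sketch of that calling convention, assuming a hypothetical demo_direct_access() that stands in for the driver method: the caller passes the number of bytes it wants, and the return value is either a negative errno or the number of bytes actually usable at the returned address. The names demo_direct_access, DEMO_DEV_BYTES and demo_mem are illustrative only and do not come from the patch.

```c
#include <errno.h>
#include <stddef.h>

#define DEMO_DEV_BYTES    8192L         /* hypothetical device size */
#define DEMO_SECTOR_SHIFT 9             /* 512-byte sectors */

static char demo_mem[DEMO_DEV_BYTES];   /* stands in for the pmem mapping */

/*
 * Returns a negative errno on failure, otherwise the number of bytes
 * usable at *addr (never more than the caller asked for).
 */
static long demo_direct_access(unsigned long sector, void **addr, long size)
{
	long offset = (long)(sector << DEMO_SECTOR_SHIFT);
	long avail;

	if (size < 0)
		return size;            /* a negative size is handed straight back */
	if (offset >= DEMO_DEV_BYTES)
		return -ERANGE;         /* assumption: out-of-range error code */

	*addr = demo_mem + offset;
	avail = DEMO_DEV_BYTES - offset;
	return avail < size ? avail : size;
}
```

Because one signed return value carries both outcomes, a caller only has to test for a negative result before treating it as a length.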

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+long bdev_direct_access(struct block_device *bdev, sector_t sector,
+ void **addr, unsigned long *pfn, long size)
+{
+ long avail;
+ const struct block_device_operations *ops = bdev->bd_disk->fops;
+
+ if (size < 0)
+ return size;

I'm wondering how we should handle size == 0 here. Should it be accepted
or refused ?

It is a bit of a bizarre case. I'm inclined to the current behaviour
of saying "this is the address where you can access zero bytes" :-)

But maybe it indicates a bug in the caller, and being noisy about it
would result in the caller getting fixed.


Matthew Wilcox

2014-09-25 20:33:36 UTC

This new function allows us to support hole-punch for DAX files by zeroing
a partial page, as opposed to the dax_truncate_page() function which can
only truncate to the end of the page. Reimplement dax_truncate_page() to
call dax_zero_page_range().

Signed-off-by: Matthew Wilcox <***@intel.com>
[ported to 3.13-rc2]
Signed-off-by: Ross Zwisler <***@linux.intel.com>
---
Documentation/filesystems/dax.txt | 1 +
fs/dax.c | 36 +++++++++++++++++++++++++++++++-----
include/linux/fs.h | 7 +++++++
3 files changed, 39 insertions(+), 5 deletions(-)

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index 635adaa..ebcd97f 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -62,6 +62,7 @@ Filesystem support consists of
for fault and page_mkwrite (which should probably call dax_fault() and
dax_mkwrite(), passing the appropriate get_block() callback)
- calling dax_truncate_page() instead of block_truncate_page() for DAX files
+- calling dax_zero_page_range() instead of zero_user() for DAX files
- ensuring that there is sufficient locking between reads, writes,
truncates and page faults

diff --git a/fs/dax.c b/fs/dax.c
index 6801be7..91b7561 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -462,13 +462,16 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
EXPORT_SYMBOL_GPL(dax_fault);

/**
- * dax_truncate_page - handle a partial page being truncated in a DAX file
+ * dax_zero_page_range - zero a range within a page of a DAX file
* @inode: The file being truncated
* @from: The file offset that is being truncated to
+ * @length: The number of bytes to zero
* @get_block: The filesystem method used to translate file offsets to blocks
*
- * Similar to block_truncate_page(), this function can be called by a
- * filesystem when it is truncating an DAX file to handle the partial page.
+ * This function can be called by a filesystem when it is zeroing part of a
+ * page in a DAX file. This is intended for hole-punch operations. If
+ * you are truncating a file, the helper function dax_truncate_page() may be
+ * more convenient.
*
* We work in terms of PAGE_CACHE_SIZE here for commonality with
* block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
@@ -476,17 +479,18 @@ EXPORT_SYMBOL_GPL(dax_fault);
* block size is smaller than PAGE_SIZE, we have to zero the rest of the page
* since the file might be mmaped.
*/
-int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
+int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
+ get_block_t get_block)
{
struct buffer_head bh;
pgoff_t index = from >> PAGE_CACHE_SHIFT;
unsigned offset = from & (PAGE_CACHE_SIZE-1);
- unsigned length = PAGE_CACHE_ALIGN(from) - from;
int err;

/* Block boundary? Nothing to do */
if (!length)
return 0;
+ BUG_ON((offset + length) > PAGE_CACHE_SIZE);

memset(&bh, 0, sizeof(bh));
bh.b_size = PAGE_CACHE_SIZE;
@@ -503,4 +507,26 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)

return 0;
}
+EXPORT_SYMBOL_GPL(dax_zero_page_range);
+
+/**
+ * dax_truncate_page - handle a partial page being truncated in a DAX file
+ * @inode: The file being truncated
+ * @from: The file offset that is being truncated to
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * Similar to block_truncate_page(), this function can be called by a
+ * filesystem when it is truncating an DAX file to handle the partial page.
+ *
+ * We work in terms of PAGE_CACHE_SIZE here for commonality with
+ * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
+ * took care of disposing of the unnecessary blocks. Even if the filesystem
+ * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
+ * since the file might be mmaped.
+ */
+int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
+{
+ unsigned length = PAGE_CACHE_ALIGN(from) - from;
+ return dax_zero_page_range(inode, from, length, get_block);
+}
EXPORT_SYMBOL_GPL(dax_truncate_page);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e6b48cc..105d0f0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2490,6 +2490,7 @@ extern int nonseekable_open(struct inode * inode, struct file * filp);

#ifdef CONFIG_FS_DAX
int dax_clear_blocks(struct inode *, sector_t block, long size);
+int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
int dax_truncate_page(struct inode *, loff_t from, get_block_t);
ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
loff_t, get_block_t, dio_iodone_t, int flags);
@@ -2506,6 +2507,12 @@ static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
return 0;
}

+static inline int dax_zero_page_range(struct inode *i, loff_t frm,
+ unsigned len, get_block_t gb)
+{
+ return 0;
+}
+
static inline ssize_t dax_do_io(int rw, struct kiocb *iocb,
struct inode *inode, struct iov_iter *iter, loff_t pos,
get_block_t get_block, dio_iodone_t end_io, int flags)

--
2.1.0
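
The length that dax_truncate_page() hands to dax_zero_page_range() in the patch above is simply the distance from the new end of file to the end of its page. A small sketch of that arithmetic, assuming 4 KiB pages; demo_page_align() and demo_truncate_length() are hypothetical stand-ins for PAGE_CACHE_ALIGN() and the expression in the patch:

```c
#define DEMO_PAGE_SIZE 4096UL   /* assumption: 4 KiB pages */

/* Round x up to the next multiple of the page size. */
static unsigned long demo_page_align(unsigned long x)
{
	return (x + DEMO_PAGE_SIZE - 1) & ~(DEMO_PAGE_SIZE - 1);
}

/* Bytes from 'from' to the end of its page; 0 when 'from' is page aligned. */
static unsigned demo_truncate_length(unsigned long from)
{
	return (unsigned)(demo_page_align(from) - from);
}
```

An aligned offset yields a length of 0, which is the case the "Block boundary? Nothing to do" early return covers.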


Mathieu Desnoyers

2014-10-16 12:38:24 UTC

Post by Matthew Wilcox
This new function allows us to support hole-punch for DAX files by zeroing
a partial page, as opposed to the dax_truncate_page() function which can
only truncate to the end of the page. Reimplement dax_truncate_page() to
call dax_zero_page_range().
[ported to 3.13-rc2]
---
Documentation/filesystems/dax.txt | 1 +
fs/dax.c | 36 +++++++++++++++++++++++++++++++-----
include/linux/fs.h | 7 +++++++
3 files changed, 39 insertions(+), 5 deletions(-)
diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index 635adaa..ebcd97f 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -62,6 +62,7 @@ Filesystem support consists of
for fault and page_mkwrite (which should probably call dax_fault() and
dax_mkwrite(), passing the appropriate get_block() callback)
- calling dax_truncate_page() instead of block_truncate_page() for DAX files
+- calling dax_zero_page_range() instead of zero_user() for DAX files
- ensuring that there is sufficient locking between reads, writes,
truncates and page faults
diff --git a/fs/dax.c b/fs/dax.c
index 6801be7..91b7561 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -462,13 +462,16 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
EXPORT_SYMBOL_GPL(dax_fault);
/**
- * dax_truncate_page - handle a partial page being truncated in a DAX file
+ * dax_zero_page_range - zero a range within a page of a DAX file
*
- * Similar to block_truncate_page(), this function can be called by a
- * filesystem when it is truncating an DAX file to handle the partial page.
+ * This function can be called by a filesystem when it is zeroing part of a
+ * page in a DAX file. This is intended for hole-punch operations. If
+ * you are truncating a file, the helper function dax_truncate_page() may be
+ * more convenient.
*
* We work in terms of PAGE_CACHE_SIZE here for commonality with
* block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
@@ -476,17 +479,18 @@ EXPORT_SYMBOL_GPL(dax_fault);
* block size is smaller than PAGE_SIZE, we have to zero the rest of the page
* since the file might be mmaped.
*/
-int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
+int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,

nit: unsigned -> unsigned int ?

Do we want an unsigned int or unsigned long here ?

Post by Matthew Wilcox
+ get_block_t get_block)
{
struct buffer_head bh;
pgoff_t index = from >> PAGE_CACHE_SHIFT;
unsigned offset = from & (PAGE_CACHE_SIZE-1);
- unsigned length = PAGE_CACHE_ALIGN(from) - from;
int err;
/* Block boundary? Nothing to do */
if (!length)
return 0;
+ BUG_ON((offset + length) > PAGE_CACHE_SIZE);

Isn't it a bit extreme to BUG_ON this condition ? We could return an
error to the caller, and perhaps WARN_ON_ONCE(), but BUG_ON() appears to
be slightly too strong here.

Post by Matthew Wilcox
memset(&bh, 0, sizeof(bh));
bh.b_size = PAGE_CACHE_SIZE;
@@ -503,4 +507,26 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
return 0;
}
+EXPORT_SYMBOL_GPL(dax_zero_page_range);
+
+/**
+ * dax_truncate_page - handle a partial page being truncated in a DAX file
+ *
+ * Similar to block_truncate_page(), this function can be called by a
+ * filesystem when it is truncating an DAX file to handle the partial page.

an DAX -> a DAX

Post by Matthew Wilcox
+ *
+ * We work in terms of PAGE_CACHE_SIZE here for commonality with
+ * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
+ * took care of disposing of the unnecessary blocks. Even if the filesystem
+ * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
+ * since the file might be mmaped.
+ */
+int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
+{
+ unsigned length = PAGE_CACHE_ALIGN(from) - from;

unsigned -> unsigned int.

Same question about "unsigned long" as above.

Post by Matthew Wilcox
+ return dax_zero_page_range(inode, from, length, get_block);
+}
EXPORT_SYMBOL_GPL(dax_truncate_page);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e6b48cc..105d0f0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2490,6 +2490,7 @@ extern int nonseekable_open(struct inode * inode, struct file * filp);
#ifdef CONFIG_FS_DAX
int dax_clear_blocks(struct inode *, sector_t block, long size);
+int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
int dax_truncate_page(struct inode *, loff_t from, get_block_t);
ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
loff_t, get_block_t, dio_iodone_t, int flags);
@@ -2506,6 +2507,12 @@ static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
return 0;
}
+static inline int dax_zero_page_range(struct inode *i, loff_t frm,
+ unsigned len, get_block_t gb)
+{
+ return 0;

Should we return 0 or -ENOSYS here ?

Thanks,

Mathieu

Post by Matthew Wilcox
+}
+
static inline ssize_t dax_do_io(int rw, struct kiocb *iocb,
struct inode *inode, struct iov_iter *iter, loff_t pos,
get_block_t get_block, dio_iodone_t end_io, int flags)
--
2.1.0

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Key fingerprint: 2A0B 4ED9 15F2 D3FA 45F5 B162 1728 0A97 8118 6ACF


Matthew Wilcox

2014-10-16 22:01:26 UTC

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,

nit: unsigned -> unsigned int ?
Do we want an unsigned int or unsigned long here ?

It's supposed to be for a fragment of a page, so until we see a machine
with PAGE_SIZE > 4GB, we're good to use an unsigned int.

Post by Mathieu Desnoyers

Post by Matthew Wilcox
if (!length)
return 0;
+ BUG_ON((offset + length) > PAGE_CACHE_SIZE);

Isn't it a bit extreme to BUG_ON this condition ? We could return an
error to the caller, and perhaps WARN_ON_ONCE(), but BUG_ON() appears to
be slightly too strong here.

Dave Chinner asked for it :-) The filesystem is supposed to be doing
this clamping (until the last version, I had this function doing the
clamping, and I was told off for "leaving landmines lying around").
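
A rough userspace sketch of what caller-side clamping could look like, under the assumption of 4 KiB pages and a small in-memory "file". demo_zero_page_range() mimics the helper's contract (it asserts instead of BUG_ON), and demo_punch_hole() is a hypothetical caller that splits a byte range into per-page fragments. A real filesystem would free whole blocks rather than zero them, so this only illustrates the clamping.

```c
#include <assert.h>
#include <string.h>

#define DEMO_PAGE_SIZE  4096UL
#define DEMO_FILE_PAGES 4

static unsigned char demo_file[DEMO_FILE_PAGES * DEMO_PAGE_SIZE];

/* Zero part of one page; refuses to cross a page boundary. */
static int demo_zero_page_range(unsigned long from, unsigned length)
{
	unsigned long offset = from & (DEMO_PAGE_SIZE - 1);

	if (!length)
		return 0;
	assert(offset + length <= DEMO_PAGE_SIZE);  /* stands in for BUG_ON() */
	memset(demo_file + from, 0, length);
	return 0;
}

/* Caller-side clamping: feed the helper one page fragment at a time. */
static int demo_punch_hole(unsigned long from, unsigned long len)
{
	while (len) {
		unsigned long room = DEMO_PAGE_SIZE - (from & (DEMO_PAGE_SIZE - 1));
		unsigned chunk = (unsigned)(len < room ? len : room);
		int err = demo_zero_page_range(from, chunk);

		if (err)
			return err;
		from += chunk;
		len -= chunk;
	}
	return 0;
}
```

With the clamping in the caller, a range that crosses a page boundary is a visible caller bug rather than something the helper silently trims.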

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+static inline int dax_zero_page_range(struct inode *i, loff_t frm,
+ unsigned len, get_block_t gb)
+{
+ return 0;

Should we return 0 or -ENOSYS here ?

I kind of wonder if we shouldn't just declare the function. It's called
like this:

if (IS_DAX(inode))
return dax_zero_page_range(inode, from, length, ext4_get_block);
return __ext4_block_zero_page_range(handle, mapping, from, length);

and if CONFIG_FS_DAX is not set, IS_DAX evaluates to 0 at compile time, so
the compiler will optimise out the call to dax_zero_page_range() anyway.

Mathieu Desnoyers

2014-10-17 15:49:39 UTC

----- Original Message -----

Sent: Friday, October 17, 2014 12:01:26 AM
Subject: Re: [PATCH v11 19/21] dax: Add dax_zero_page_range

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,

nit: unsigned -> unsigned int ?
Do we want an unsigned int or unsigned long here ?

It's supposed to be for a fragment of a page, so until we see a machine
with PAGE_SIZE > 4GB, we're good to use an unsigned int.

OK

Post by Mathieu Desnoyers

Post by Matthew Wilcox
if (!length)
return 0;
+ BUG_ON((offset + length) > PAGE_CACHE_SIZE);

Isn't it a bit extreme to BUG_ON this condition ? We could return an
error to the caller, and perhaps WARN_ON_ONCE(), but BUG_ON() appears to
be slightly too strong here.

Dave Chinner asked for it :-) The filesystem is supposed to be doing
this clamping (until the last version, I had this function doing the
clamping, and I was told off for "leaving landmines lying around").

Makes sense,

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+static inline int dax_zero_page_range(struct inode *i, loff_t frm,
+ unsigned len, get_block_t gb)
+{
+ return 0;

Should we return 0 or -ENOSYS here ?

I kind of wonder if we shouldn't just declare the function. It's called
if (IS_DAX(inode))
return dax_zero_page_range(inode, from, length,
ext4_get_block);
return __ext4_block_zero_page_range(handle, mapping, from, length);
and if CONFIG_DAX is not set, IS_DAX evaluates to 0 at compile time, so
the compiler will optimise out the call to dax_zero_page_range() anyway.

I strongly prefer to implement "unimplemented stub" as static inlines
rather than defining to 0, because the compiler can check that the types
passed to the function are valid, even in the #else configuration which
uses the stubs.

The only reason why I have not pointed this out for some of your other
patches was because it was clear that the local style of those files was
to define stubbed functions as 0. But I still dislike it.

Thanks,

Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Matthew Wilcox

2014-10-18 17:41:00 UTC

Post by Mathieu Desnoyers

Post by Matthew Wilcox
I kind of wonder if we shouldn't just declare the function. It's called
if (IS_DAX(inode))
return dax_zero_page_range(inode, from, length,
ext4_get_block);
return __ext4_block_zero_page_range(handle, mapping, from, length);
and if CONFIG_DAX is not set, IS_DAX evaluates to 0 at compile time, so
the compiler will optimise out the call to dax_zero_page_range() anyway.

I strongly prefer to implement "unimplemented stub" as static inlines
rather than defining to 0, because the compiler can check that the types
passed to the function are valid, even in the #else configuration which
uses the stubs.

I think my explanation was unclear. This is what I meant:

+++ b/include/linux/fs.h
@@ -2473,7 +2473,6 @@ extern loff_t fixed_size_llseek(struct file *file, loff_t
offset,
extern int generic_file_open(struct inode * inode, struct file * filp);
extern int nonseekable_open(struct inode * inode, struct file * filp);

-#ifdef CONFIG_FS_DAX
int dax_clear_blocks(struct inode *, sector_t block, long size);
int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t)
;
int dax_truncate_page(struct inode *, loff_t from, get_block_t);
#define dax_mkwrite(vma, vmf, gb) dax_fault(vma, vmf, gb)
-#else
-static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
-{
- return 0;
-}
-
-static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
-{
- return 0;
-}
-
-static inline int dax_zero_page_range(struct inode *i, loff_t frm,
- unsigned len, get_block_t gb)
-{
- return 0;
-}
-
-static inline ssize_t dax_do_io(int rw, struct kiocb *iocb,
- struct inode *inode, struct iov_iter *iter, loff_t pos,
- get_block_t get_block, dio_iodone_t end_io, int flags)
-{
- return -ENOTTY;
-}
-#endif

#ifdef CONFIG_BLOCK
typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,

So after the preprocessor has run, the compiler will see:

if (0)
return dax_zero_page_range(inode, from, length, ext4_get_block);

and it will still do type checking on the call, even though it will eliminate
the call.

I think what you're really complaining about is that the argument to
IS_DAX() is not checked for being an inode.

We could solve that this way:

#ifdef CONFIG_FS_DAX
#define S_DAX 8192
#else
#define S_DAX 0
#endif
...
#define IS_DAX(inode) ((inode)->i_flags & S_DAX)

After preprocessing, the compiler then sees:

if (((inode)->i_flags & 0))
return dax_zero_page_range(inode, from, length, ext4_get_block);

and successfully deduces that the condition evaluates to 0, and still
elides the reference to dax_zero_page_range (checked with 'nm').
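
A self-contained sketch of the same idea outside the kernel. All DEMO_ and demo_ names are hypothetical. Unlike the proposal above, the callee is defined here so that the sketch links at any optimisation level; in the kernel case only a declaration would exist, and the build would rely on the compiler dropping the dead call.

```c
struct demo_inode {
	unsigned int i_flags;
};

/* Assumption: the feature is configured out, so the flag is defined as 0. */
#ifdef DEMO_CONFIG_FS_DAX
#define DEMO_S_DAX 8192
#else
#define DEMO_S_DAX 0
#endif

#define DEMO_IS_DAX(inode) ((inode)->i_flags & DEMO_S_DAX)

static int demo_dax_calls;	/* counts how often the DAX path runs */

static int demo_dax_zero_page_range(struct demo_inode *inode, long from,
				    unsigned len)
{
	(void)inode; (void)from; (void)len;
	demo_dax_calls++;
	return 0;
}

static int demo_zero_range(struct demo_inode *inode, long from, unsigned len)
{
	/* Arguments are still type-checked; the condition is constant 0. */
	if (DEMO_IS_DAX(inode))
		return demo_dax_zero_page_range(inode, from, len);
	return 1;	/* stands in for the non-DAX path */
}
```

Because the macro dereferences its argument, passing something that is not a pointer to a structure with an i_flags member fails to compile even when the flag is 0.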


Mathieu Desnoyers

2014-10-18 21:16:23 UTC

----- Original Message -----

Sent: Saturday, October 18, 2014 7:41:00 PM
Subject: Re: [PATCH v11 19/21] dax: Add dax_zero_page_range

Post by Mathieu Desnoyers

Post by Matthew Wilcox
I kind of wonder if we shouldn't just declare the function. It's called
if (IS_DAX(inode))
return dax_zero_page_range(inode, from, length,
ext4_get_block);
return __ext4_block_zero_page_range(handle, mapping, from, length);
and if CONFIG_DAX is not set, IS_DAX evaluates to 0 at compile time, so
the compiler will optimise out the call to dax_zero_page_range() anyway.

I strongly prefer to implement "unimplemented stub" as static inlines
rather than defining to 0, because the compiler can check that the types
passed to the function are valid, even in the #else configuration which
uses the stubs.

+++ b/include/linux/fs.h
@@ -2473,7 +2473,6 @@ extern loff_t fixed_size_llseek(struct file *file, loff_t
offset,
extern int generic_file_open(struct inode * inode, struct file * filp);
extern int nonseekable_open(struct inode * inode, struct file * filp);
-#ifdef CONFIG_FS_DAX
int dax_clear_blocks(struct inode *, sector_t block, long size);
int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t)
;
int dax_truncate_page(struct inode *, loff_t from, get_block_t);
#define dax_mkwrite(vma, vmf, gb) dax_fault(vma, vmf, gb)
-#else
-static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
-{
- return 0;
-}
-
-static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
-{
- return 0;
-}
-
-static inline int dax_zero_page_range(struct inode *i, loff_t frm,
- unsigned len, get_block_t gb)
-{
- return 0;
-}
-
-static inline ssize_t dax_do_io(int rw, struct kiocb *iocb,
- struct inode *inode, struct iov_iter *iter, loff_t pos,
- get_block_t get_block, dio_iodone_t end_io, int flags)
-{
- return -ENOTTY;
-}
-#endif
#ifdef CONFIG_BLOCK
typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,
if (0)
return dax_zero_page_range(inode, from, length, ext4_get_block);
and it will still do type checking on the call, even though it will eliminate
the call.

Indeed, since Linux is always compiled in O2 or Os, it will work.

I think what you're really complaining about is that the argument to
IS_DAX() is not checked for being an inode.
#ifdef CONFIG_FS_DAX
#define S_DAX 8192
#else
#define S_DAX 0
#endif
...
#define IS_DAX(inode) ((inode)->i_flags & S_DAX)
if (((inode)->i_flags & 0))
return dax_zero_page_range(inode, from, length, ext4_get_block);
and successfully deduces that the condition evaluates to 0, and still
elides the reference to dax_zero_page_range (checked with 'nm').

Sounds good,

Thanks,

Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Matthew Wilcox

2014-09-25 20:33:20 UTC

Pagecache faults recheck i_size after taking the page lock to ensure that
the fault didn't race against a truncate. We don't have a page to lock
in the XIP case, so use the i_mmap_mutex instead. It is locked in the
truncate path in unmap_mapping_range() after updating i_size. So while
we hold it in the fault path, we are guaranteed that either i_size has
already been updated in the truncate path, or that the truncate will
subsequently call zap_page_range_single() and so remove the mapping we
have just inserted.

There is a window of time in which i_size has been reduced and the
thread has a mapping to a page which will be removed from the file,
but this is harmless as the page will not be allocated to a different
purpose before the thread's access to it is revoked.

Signed-off-by: Matthew Wilcox <***@intel.com>
Reviewed-by: Jan Kara <***@suse.cz>
Acked-by: Kirill A. Shutemov <***@linux.intel.com>
---
mm/filemap_xip.c | 24 ++++++++++++++++++++++--
1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index d8d9fe3..c8d23e9 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -260,8 +260,17 @@ again:
__xip_unmap(mapping, vmf->pgoff);

found:
+ /* We must recheck i_size under i_mmap_mutex */
+ mutex_lock(&mapping->i_mmap_mutex);
+ size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
+ PAGE_CACHE_SHIFT;
+ if (unlikely(vmf->pgoff >= size)) {
+ mutex_unlock(&mapping->i_mmap_mutex);
+ return VM_FAULT_SIGBUS;
+ }
err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
xip_pfn);
+ mutex_unlock(&mapping->i_mmap_mutex);
if (err == -ENOMEM)
return VM_FAULT_OOM;
/*
@@ -285,16 +294,27 @@ found:
}
if (error != -ENODATA)
goto out;
+
+ /* We must recheck i_size under i_mmap_mutex */
+ mutex_lock(&mapping->i_mmap_mutex);
+ size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
+ PAGE_CACHE_SHIFT;
+ if (unlikely(vmf->pgoff >= size)) {
+ ret = VM_FAULT_SIGBUS;
+ goto unlock;
+ }
/* not shared and writable, use xip_sparse_page() */
page = xip_sparse_page();
if (!page)
- goto out;
+ goto unlock;
err = vm_insert_page(vma, (unsigned long)vmf->virtual_address,
page);
if (err == -ENOMEM)
- goto out;
+ goto unlock;

ret = VM_FAULT_NOPAGE;
+unlock:
+ mutex_unlock(&mapping->i_mmap_mutex);
out:
write_seqcount_end(&xip_sparse_seq);
mutex_unlock(&xip_sparse_mutex);

--
2.1.0
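
The locking argument in the patch description above can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: a pthread mutex plays the role of i_mmap_mutex, an atomic variable plays the role of i_size, and a small array records which pages are "mapped". None of the demo_ names exist in the kernel.

```c
#include <pthread.h>
#include <stdatomic.h>

#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  (1UL << DEMO_PAGE_SHIFT)
#define DEMO_NPAGES     8

struct demo_mapping {
	pthread_mutex_t lock;        /* plays the role of i_mmap_mutex */
	atomic_ulong i_size;         /* file size in bytes */
	int mapped[DEMO_NPAGES];     /* 1 while a page is mapped */
};

static unsigned long demo_size_in_pages(struct demo_mapping *m)
{
	return (atomic_load(&m->i_size) + DEMO_PAGE_SIZE - 1) >> DEMO_PAGE_SHIFT;
}

/* Fault path: recheck the size under the lock before inserting a mapping. */
static int demo_fault(struct demo_mapping *m, unsigned long pgoff)
{
	int ret = -1;                /* stands in for VM_FAULT_SIGBUS */

	pthread_mutex_lock(&m->lock);
	if (pgoff < demo_size_in_pages(m) && pgoff < DEMO_NPAGES) {
		m->mapped[pgoff] = 1;
		ret = 0;
	}
	pthread_mutex_unlock(&m->lock);
	return ret;
}

/* Truncate path: update the size first, then unmap under the same lock. */
static void demo_truncate(struct demo_mapping *m, unsigned long newsize)
{
	unsigned long pgoff;

	atomic_store(&m->i_size, newsize);
	pthread_mutex_lock(&m->lock);
	for (pgoff = demo_size_in_pages(m); pgoff < DEMO_NPAGES; pgoff++)
		m->mapped[pgoff] = 0;
	pthread_mutex_unlock(&m->lock);
}
```

Either the fault takes the lock after the size update and sees the new size, or it takes the lock first and the truncate's unmap pass removes the mapping afterwards.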

Mathieu Desnoyers

2014-10-16 08:56:42 UTC

Post by Matthew Wilcox
Pagecache faults recheck i_size after taking the page lock to ensure that
the fault didn't race against a truncate. We don't have a page to lock
in the XIP case, so use the i_mmap_mutex instead. It is locked in the
truncate path in unmap_mapping_range() after updating i_size. So while
we hold it in the fault path, we are guaranteed that either i_size has
already been updated in the truncate path, or that the truncate will
subsequently call zap_page_range_single() and so remove the mapping we
have just inserted.
There is a window of time in which i_size has been reduced and the
thread has a mapping to a page which will be removed from the file,
but this is harmless as the page will not be allocated to a different
purpose before the thread's access to it is revoked.
---
mm/filemap_xip.c | 24 ++++++++++++++++++++++--
1 file changed, 22 insertions(+), 2 deletions(-)
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index d8d9fe3..c8d23e9 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
__xip_unmap(mapping, vmf->pgoff);
+ /* We must recheck i_size under i_mmap_mutex */
+ mutex_lock(&mapping->i_mmap_mutex);
+ size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
+ PAGE_CACHE_SHIFT;
+ if (unlikely(vmf->pgoff >= size)) {
+ mutex_unlock(&mapping->i_mmap_mutex);
+ return VM_FAULT_SIGBUS;
+ }
err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
xip_pfn);
+ mutex_unlock(&mapping->i_mmap_mutex);
if (err == -ENOMEM)
return VM_FAULT_OOM;
/*
}
if (error != -ENODATA)
goto out;
+
+ /* We must recheck i_size under i_mmap_mutex */
+ mutex_lock(&mapping->i_mmap_mutex);
+ size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
+ PAGE_CACHE_SHIFT;
+ if (unlikely(vmf->pgoff >= size)) {
+ ret = VM_FAULT_SIGBUS;
+ goto unlock;
+ }
/* not shared and writable, use xip_sparse_page() */
page = xip_sparse_page();
if (!page)
- goto out;
+ goto unlock;
err = vm_insert_page(vma, (unsigned long)vmf->virtual_address,
page);
if (err == -ENOMEM)
- goto out;
+ goto unlock;
ret = VM_FAULT_NOPAGE;
+ mutex_unlock(&mapping->i_mmap_mutex);
write_seqcount_end(&xip_sparse_seq);
mutex_unlock(&xip_sparse_mutex);
--
2.1.0

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Key fingerprint: 2A0B 4ED9 15F2 D3FA 45F5 B162 1728 0A97 8118 6ACF


Matthew Wilcox

2014-09-25 20:33:30 UTC

Jan Kara pointed out that calling ext2_xip_verify_sb() in ext2_remount()
doesn't make sense, since changing the XIP option on remount isn't
allowed. It also doesn't make sense to re-check whether blocksize is
supported since it can't change between mounts.

Replace the call to ext2_xip_verify_sb() in ext2_fill_super() with the
equivalent check and delete the definition.

Signed-off-by: Matthew Wilcox <***@intel.com>
---
fs/ext2/super.c | 33 ++++++++++++---------------------
fs/ext2/xip.c | 12 ------------
fs/ext2/xip.h | 2 --
3 files changed, 12 insertions(+), 35 deletions(-)

diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index b88edc0..d862031 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -868,9 +868,6 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
((EXT2_SB(sb)->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ?
MS_POSIXACL : 0);

- ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
- EXT2_MOUNT_XIP if not */
-
if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV &&
(EXT2_HAS_COMPAT_FEATURE(sb, ~0U) ||
EXT2_HAS_RO_COMPAT_FEATURE(sb, ~0U) ||
@@ -900,11 +897,17 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)

blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);

- if (ext2_use_xip(sb) && blocksize != PAGE_SIZE) {
- if (!silent)
+ if (sbi->s_mount_opt & EXT2_MOUNT_XIP) {
+ if (blocksize != PAGE_SIZE) {
ext2_msg(sb, KERN_ERR,
- "error: unsupported blocksize for xip");
- goto failed_mount;
+ "error: unsupported blocksize for xip");
+ goto failed_mount;
+ }
+ if (!sb->s_bdev->bd_disk->fops->direct_access) {
+ ext2_msg(sb, KERN_ERR,
+ "error: device does not support xip");
+ goto failed_mount;
+ }
}

/* If the blocksize doesn't match, re-read the thing.. */
@@ -1249,7 +1252,6 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
{
struct ext2_sb_info * sbi = EXT2_SB(sb);
struct ext2_super_block * es;
- unsigned long old_mount_opt = sbi->s_mount_opt;
struct ext2_mount_options old_opts;
unsigned long old_sb_flags;
int err;
@@ -1274,22 +1276,11 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0);

- ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
- EXT2_MOUNT_XIP if not */
-
- if ((ext2_use_xip(sb)) && (sb->s_blocksize != PAGE_SIZE)) {
- ext2_msg(sb, KERN_WARNING,
- "warning: unsupported blocksize for xip");
- err = -EINVAL;
- goto restore_opts;
- }
-
es = sbi->s_es;
- if ((sbi->s_mount_opt ^ old_mount_opt) & EXT2_MOUNT_XIP) {
+ if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) {
ext2_msg(sb, KERN_WARNING, "warning: refusing change of "
"xip flag with busy inodes while remounting");
- sbi->s_mount_opt &= ~EXT2_MOUNT_XIP;
- sbi->s_mount_opt |= old_mount_opt & EXT2_MOUNT_XIP;
+ sbi->s_mount_opt ^= EXT2_MOUNT_XIP;
}
if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY)) {
spin_unlock(&sbi->s_lock);
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index 132d4da..66ca113 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,15 +13,3 @@
#include "ext2.h"
#include "xip.h"

-void ext2_xip_verify_sb(struct super_block *sb)
-{
- struct ext2_sb_info *sbi = EXT2_SB(sb);
-
- if ((sbi->s_mount_opt & EXT2_MOUNT_XIP) &&
- !sb->s_bdev->bd_disk->fops->direct_access) {
- sbi->s_mount_opt &= (~EXT2_MOUNT_XIP);
- ext2_msg(sb, KERN_WARNING,
- "warning: ignoring xip option - "
- "not supported by bdev");
- }
-}
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index e7b9f0a..87eeb04 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -6,13 +6,11 @@
*/

#ifdef CONFIG_EXT2_FS_XIP
-extern void ext2_xip_verify_sb (struct super_block *);
static inline int ext2_use_xip (struct super_block *sb)
{
struct ext2_sb_info *sbi = EXT2_SB(sb);
return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
}
#else
-#define ext2_xip_verify_sb(sb) do { } while (0)
#define ext2_use_xip(sb) 0
#endif

--
2.1.0
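
The remount hunk above replaces a clear-then-copy sequence with a single XOR. A short sketch of why the two forms agree, assuming a hypothetical bit value DEMO_MOUNT_XIP (the real EXT2_MOUNT_XIP value is not shown in this patch): inside the branch the bit is already known to differ between the current and saved options, so flipping it restores the saved state.

```c
#define DEMO_MOUNT_XIP 0x000010UL   /* hypothetical bit value */

/* Old form: clear the bit, then copy it back from the saved options. */
static unsigned long demo_restore_old(unsigned long cur, unsigned long old)
{
	if ((cur ^ old) & DEMO_MOUNT_XIP) {
		cur &= ~DEMO_MOUNT_XIP;
		cur |= old & DEMO_MOUNT_XIP;
	}
	return cur;
}

/* New form: the bit is known to differ, so flipping it restores it. */
static unsigned long demo_restore_new(unsigned long cur, unsigned long old)
{
	if ((cur ^ old) & DEMO_MOUNT_XIP)
		cur ^= DEMO_MOUNT_XIP;
	return cur;
}
```

Both helpers leave every other option bit untouched, so they return the same value for any pair of inputs.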


Mathieu Desnoyers

2014-10-16 12:18:02 UTC

Post by Matthew Wilcox
Jan Kara pointed out that calling ext2_xip_verify_sb() in ext2_remount()
doesn't make sense, since changing the XIP option on remount isn't
allowed. It also doesn't make sense to re-check whether blocksize is
supported since it can't change between mounts.

By "doesn't make sense", do you mean that it is never actually used, or that
a current user can trigger issues by changing the XIP option on remount? If
the latter, then this patch should probably be flagged as a "Fix".

Thanks,

Mathieu

Post by Matthew Wilcox
Replace the call to ext2_xip_verify_sb() in ext2_fill_super() with the
equivalent check and delete the definition.
---
fs/ext2/super.c | 33 ++++++++++++---------------------
fs/ext2/xip.c | 12 ------------
fs/ext2/xip.h | 2 --
3 files changed, 12 insertions(+), 35 deletions(-)
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index b88edc0..d862031 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -868,9 +868,6 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
((EXT2_SB(sb)->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ?
MS_POSIXACL : 0);
- ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
- EXT2_MOUNT_XIP if not */
-
if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV &&
(EXT2_HAS_COMPAT_FEATURE(sb, ~0U) ||
EXT2_HAS_RO_COMPAT_FEATURE(sb, ~0U) ||
@@ -900,11 +897,17 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
- if (ext2_use_xip(sb) && blocksize != PAGE_SIZE) {
- if (!silent)
+ if (sbi->s_mount_opt & EXT2_MOUNT_XIP) {
+ if (blocksize != PAGE_SIZE) {
ext2_msg(sb, KERN_ERR,
- "error: unsupported blocksize for xip");
- goto failed_mount;
+ "error: unsupported blocksize for xip");
+ goto failed_mount;
+ }
+ if (!sb->s_bdev->bd_disk->fops->direct_access) {
+ ext2_msg(sb, KERN_ERR,
+ "error: device does not support xip");
+ goto failed_mount;
+ }
}
/* If the blocksize doesn't match, re-read the thing.. */
@@ -1249,7 +1252,6 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
{
struct ext2_sb_info * sbi = EXT2_SB(sb);
struct ext2_super_block * es;
- unsigned long old_mount_opt = sbi->s_mount_opt;
struct ext2_mount_options old_opts;
unsigned long old_sb_flags;
int err;
@@ -1274,22 +1276,11 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0);
- ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
- EXT2_MOUNT_XIP if not */
-
- if ((ext2_use_xip(sb)) && (sb->s_blocksize != PAGE_SIZE)) {
- ext2_msg(sb, KERN_WARNING,
- "warning: unsupported blocksize for xip");
- err = -EINVAL;
- goto restore_opts;
- }
-
es = sbi->s_es;
- if ((sbi->s_mount_opt ^ old_mount_opt) & EXT2_MOUNT_XIP) {
+ if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) {
ext2_msg(sb, KERN_WARNING, "warning: refusing change of "
"xip flag with busy inodes while remounting");
- sbi->s_mount_opt &= ~EXT2_MOUNT_XIP;
- sbi->s_mount_opt |= old_mount_opt & EXT2_MOUNT_XIP;
+ sbi->s_mount_opt ^= EXT2_MOUNT_XIP;
}
if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY)) {
spin_unlock(&sbi->s_lock);
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index 132d4da..66ca113 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,15 +13,3 @@
#include "ext2.h"
#include "xip.h"
-void ext2_xip_verify_sb(struct super_block *sb)
-{
- struct ext2_sb_info *sbi = EXT2_SB(sb);
-
- if ((sbi->s_mount_opt & EXT2_MOUNT_XIP) &&
- !sb->s_bdev->bd_disk->fops->direct_access) {
- sbi->s_mount_opt &= (~EXT2_MOUNT_XIP);
- ext2_msg(sb, KERN_WARNING,
- "warning: ignoring xip option - "
- "not supported by bdev");
- }
-}
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index e7b9f0a..87eeb04 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -6,13 +6,11 @@
*/
#ifdef CONFIG_EXT2_FS_XIP
-extern void ext2_xip_verify_sb (struct super_block *);
static inline int ext2_use_xip (struct super_block *sb)
{
struct ext2_sb_info *sbi = EXT2_SB(sb);
return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
}
#else
-#define ext2_xip_verify_sb(sb) do { } while (0)
#define ext2_use_xip(sb) 0
#endif
--
2.1.0

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Key fingerprint: 2A0B 4ED9 15F2 D3FA 45F5 B162 1728 0A97 8118 6ACF


Matthew Wilcox

2014-10-16 21:45:07 UTC

Post by Mathieu Desnoyers

Post by Matthew Wilcox
Jan Kara pointed out that calling ext2_xip_verify_sb() in ext2_remount()
doesn't make sense, since changing the XIP option on remount isn't
allowed. It also doesn't make sense to re-check whether blocksize is
supported since it can't change between mounts.

By "doesn't make sense", do you mean that it is never actually used, or that
a current user can trigger issues by changing the XIP option on remount? If
the latter, then this patch should probably be flagged as a "Fix".

I mean that we're checking for a condition that can't actually happen,
so it's safe to just delete the check.
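
For readers following the remount hunk in the patch under discussion: it replaces the two-step "clear the bit, then copy the old bit back" with a single XOR. The userspace sketch below illustrates why that is equivalent. It is only an illustration, not kernel code: DEMO_MOUNT_XIP and demo_refuse_xip_change() are hypothetical names, and the flag value is made up.

```c
#include <assert.h>

/* Hypothetical stand-in for EXT2_MOUNT_XIP; the value is illustrative only. */
#define DEMO_MOUNT_XIP 0x010000UL

/*
 * Mirrors the remount logic in the patch: if the new option word differs
 * from the old one in the XIP bit, flip that bit back so the old setting
 * wins. All other bits of new_opt are left untouched.
 */
unsigned long demo_refuse_xip_change(unsigned long new_opt,
				     unsigned long old_opt)
{
	if ((new_opt ^ old_opt) & DEMO_MOUNT_XIP)
		new_opt ^= DEMO_MOUNT_XIP;

	/* The XIP bit now always matches the pre-remount state. */
	assert(((new_opt ^ old_opt) & DEMO_MOUNT_XIP) == 0);
	return new_opt;
}
```

Because the XOR only runs when the two words disagree in that bit, flipping it necessarily restores the old value, which is what the removed mask-and-or pair did.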

Matthew Wilcox

2014-09-25 20:33:37 UTC

From: Ross Zwisler <***@linux.intel.com>

This is a port of the DAX functionality found in the current version of
ext2.

Signed-off-by: Ross Zwisler <***@linux.intel.com>
Reviewed-by: Andreas Dilger <***@intel.com>
[heavily tweaked]
Signed-off-by: Matthew Wilcox <***@intel.com>
---
Documentation/filesystems/dax.txt | 1 +
Documentation/filesystems/ext4.txt | 2 +
fs/ext4/ext4.h | 6 +++
fs/ext4/file.c | 49 ++++++++++++++++++++-
fs/ext4/indirect.c | 18 +++++---
fs/ext4/inode.c | 89 ++++++++++++++++++++++++++------------
fs/ext4/namei.c | 10 ++++-
fs/ext4/super.c | 39 ++++++++++++++++-
8 files changed, 177 insertions(+), 37 deletions(-)

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index ebcd97f..be376d9 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -73,6 +73,7 @@ or a write()) work correctly.

These filesystems may be used for inspiration:
- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
+- ext4: the fourth extended filesystem, see Documentation/filesystems/ext4.txt

Shortcomings
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 919a329..9c511c4 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -386,6 +386,8 @@ max_dir_size_kb=n This limits the size of directories so that any
i_version Enable 64-bit inode version support. This option is
off by default.

+dax Use direct access if possible
+
Data Mode
=========
There are 3 different data modes:
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b0c225c..5b38569 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -969,6 +969,11 @@ struct ext4_inode_info {
#define EXT4_MOUNT_ERRORS_MASK 0x00070
#define EXT4_MOUNT_MINIX_DF 0x00080 /* Mimics the Minix statfs */
#define EXT4_MOUNT_NOLOAD 0x00100 /* Don't use existing journal*/
+#ifdef CONFIG_FS_DAX
+#define EXT4_MOUNT_DAX 0x00200 /* Execute in place */
+#else
+#define EXT4_MOUNT_DAX 0
+#endif
#define EXT4_MOUNT_DATA_FLAGS 0x00C00 /* Mode for data writes: */
#define EXT4_MOUNT_JOURNAL_DATA 0x00400 /* Write data to journal */
#define EXT4_MOUNT_ORDERED_DATA 0x00800 /* Flush data before commit */
@@ -2574,6 +2579,7 @@ extern const struct file_operations ext4_dir_operations;
/* file.c */
extern const struct inode_operations ext4_file_inode_operations;
extern const struct file_operations ext4_file_operations;
+extern const struct file_operations ext4_dax_file_operations;
extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin);

/* inline.c */
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index aca7b24..9c7bde5 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -95,7 +95,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
struct inode *inode = file_inode(iocb->ki_filp);
struct mutex *aio_mutex = NULL;
struct blk_plug plug;
- int o_direct = file->f_flags & O_DIRECT;
+ int o_direct = io_is_direct(file);
int overwrite = 0;
size_t length = iov_iter_count(from);
ssize_t ret;
@@ -191,6 +191,27 @@ errout:
return ret;
}

+#ifdef CONFIG_FS_DAX
+static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ return dax_fault(vma, vmf, ext4_get_block);
+ /* Is this the right get_block? */
+}
+
+static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ return dax_mkwrite(vma, vmf, ext4_get_block);
+}
+
+static const struct vm_operations_struct ext4_dax_vm_ops = {
+ .fault = ext4_dax_fault,
+ .page_mkwrite = ext4_dax_mkwrite,
+ .remap_pages = generic_file_remap_pages,
+};
+#else
+#define ext4_dax_vm_ops ext4_file_vm_ops
+#endif
+
static const struct vm_operations_struct ext4_file_vm_ops = {
.fault = filemap_fault,
.map_pages = filemap_map_pages,
@@ -201,7 +222,12 @@ static const struct vm_operations_struct ext4_file_vm_ops = {
static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
{
file_accessed(file);
- vma->vm_ops = &ext4_file_vm_ops;
+ if (IS_DAX(file_inode(file))) {
+ vma->vm_ops = &ext4_dax_vm_ops;
+ vma->vm_flags |= VM_MIXEDMAP;
+ } else {
+ vma->vm_ops = &ext4_file_vm_ops;
+ }
return 0;
}

@@ -600,6 +626,25 @@ const struct file_operations ext4_file_operations = {
.fallocate = ext4_fallocate,
};

+#ifdef CONFIG_FS_DAX
+const struct file_operations ext4_dax_file_operations = {
+ .llseek = ext4_llseek,
+ .read = new_sync_read,
+ .write = new_sync_write,
+ .read_iter = generic_file_read_iter,
+ .write_iter = ext4_file_write_iter,
+ .unlocked_ioctl = ext4_ioctl,
+#ifdef CONFIG_COMPAT
+ .compat_ioctl = ext4_compat_ioctl,
+#endif
+ .mmap = ext4_file_mmap,
+ .open = ext4_file_open,
+ .release = ext4_release_file,
+ .fsync = ext4_sync_file,
+ .fallocate = ext4_fallocate,
+};
+#endif
+
const struct inode_operations ext4_file_inode_operations = {
.setattr = ext4_setattr,
.getattr = ext4_getattr,
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index e75f840..fa9ec8d 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -691,14 +691,22 @@ retry:
inode_dio_done(inode);
goto locked;
}
- ret = __blockdev_direct_IO(rw, iocb, inode,
- inode->i_sb->s_bdev, iter, offset,
- ext4_get_block, NULL, NULL, 0);
+ if (IS_DAX(inode))
+ ret = dax_do_io(rw, iocb, inode, iter, offset,
+ ext4_get_block, NULL, 0);
+ else
+ ret = __blockdev_direct_IO(rw, iocb, inode,
+ inode->i_sb->s_bdev, iter, offset,
+ ext4_get_block, NULL, NULL, 0);
inode_dio_done(inode);
} else {
locked:
- ret = blockdev_direct_IO(rw, iocb, inode, iter,
- offset, ext4_get_block);
+ if (IS_DAX(inode))
+ ret = dax_do_io(rw, iocb, inode, iter, offset,
+ ext4_get_block, NULL, DIO_LOCKING);
+ else
+ ret = blockdev_direct_IO(rw, iocb, inode, iter,
+ offset, ext4_get_block);

if (unlikely((rw & WRITE) && ret < 0)) {
loff_t isize = i_size_read(inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3aa26e9..542205f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -676,6 +676,18 @@ has_zeroout:
return retval;
}

+static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)
+{
+ struct inode *inode = bh->b_assoc_map->host;
+ /* XXX: breaks on 32-bit > 16GB. Is that even supported? */
+ loff_t offset = (loff_t)(uintptr_t)bh->b_private << inode->i_blkbits;
+ int err;
+ if (!uptodate)
+ return;
+ WARN_ON(!buffer_unwritten(bh));
+ err = ext4_convert_unwritten_extents(NULL, inode, offset, bh->b_size);
+}
+
/* Maximum number of blocks we map for direct IO at once. */
#define DIO_MAX_BLOCKS 4096

@@ -713,6 +725,11 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,

map_bh(bh, inode->i_sb, map.m_pblk);
bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags;
+ if (IS_DAX(inode) && buffer_unwritten(bh) && !io_end) {
+ bh->b_assoc_map = inode->i_mapping;
+ bh->b_private = (void *)(unsigned long)iblock;
+ bh->b_end_io = ext4_end_io_unwritten;
+ }
if (io_end && io_end->flag & EXT4_IO_END_UNWRITTEN)
set_buffer_defer_completion(bh);
bh->b_size = inode->i_sb->s_blocksize * map.m_len;
@@ -3043,13 +3060,14 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
get_block_func = ext4_get_block_write;
dio_flags = DIO_LOCKING;
}
- ret = __blockdev_direct_IO(rw, iocb, inode,
- inode->i_sb->s_bdev, iter,
- offset,
- get_block_func,
- ext4_end_io_dio,
- NULL,
- dio_flags);
+ if (IS_DAX(inode))
+ ret = dax_do_io(rw, iocb, inode, iter, offset, get_block_func,
+ ext4_end_io_dio, dio_flags);
+ else
+ ret = __blockdev_direct_IO(rw, iocb, inode,
+ inode->i_sb->s_bdev, iter, offset,
+ get_block_func,
+ ext4_end_io_dio, NULL, dio_flags);

/*
* Put our reference to io_end. This can free the io_end structure e.g.
@@ -3213,19 +3231,12 @@ void ext4_set_aops(struct inode *inode)
inode->i_mapping->a_ops = &ext4_aops;
}

-/*
- * ext4_block_zero_page_range() zeros out a mapping of length 'length'
- * starting from file offset 'from'. The range to be zero'd must
- * be contained with in one block. If the specified range exceeds
- * the end of the block it will be shortened to end of the block
- * that cooresponds to 'from'
- */
-static int ext4_block_zero_page_range(handle_t *handle,
+static int __ext4_block_zero_page_range(handle_t *handle,
struct address_space *mapping, loff_t from, loff_t length)
{
ext4_fsblk_t index = from >> PAGE_CACHE_SHIFT;
unsigned offset = from & (PAGE_CACHE_SIZE-1);
- unsigned blocksize, max, pos;
+ unsigned blocksize, pos;
ext4_lblk_t iblock;
struct inode *inode = mapping->host;
struct buffer_head *bh;
@@ -3238,14 +3249,6 @@ static int ext4_block_zero_page_range(handle_t *handle,
return -ENOMEM;

blocksize = inode->i_sb->s_blocksize;
- max = blocksize - (offset & (blocksize - 1));
-
- /*
- * correct length if it does not fall between
- * 'from' and the end of the block
- */
- if (length > max || length < 0)
- length = max;

iblock = index << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);

@@ -3311,6 +3314,33 @@ unlock:
}

/*
+ * ext4_block_zero_page_range() zeros out a mapping of length 'length'
+ * starting from file offset 'from'. The range to be zero'd must
+ * be contained with in one block. If the specified range exceeds
+ * the end of the block it will be shortened to end of the block
+ * that cooresponds to 'from'
+ */
+static int ext4_block_zero_page_range(handle_t *handle,
+ struct address_space *mapping, loff_t from, loff_t length)
+{
+ struct inode *inode = mapping->host;
+ unsigned offset = from & (PAGE_CACHE_SIZE-1);
+ unsigned blocksize = inode->i_sb->s_blocksize;
+ unsigned max = blocksize - (offset & (blocksize - 1));
+
+ /*
+ * correct length if it does not fall between
+ * 'from' and the end of the block
+ */
+ if (length > max || length < 0)
+ length = max;
+
+ if (IS_DAX(inode))
+ return dax_zero_page_range(inode, from, length, ext4_get_block);
+ return __ext4_block_zero_page_range(handle, mapping, from, length);
+}
+
+/*
* ext4_block_truncate_page() zeroes out a mapping from file offset `from'
* up to the end of the block which corresponds to `from'.
* This required during truncate. We need to physically zero the tail end
@@ -3831,8 +3861,10 @@ void ext4_set_inode_flags(struct inode *inode)
new_fl |= S_NOATIME;
if (flags & EXT4_DIRSYNC_FL)
new_fl |= S_DIRSYNC;
+ if (test_opt(inode->i_sb, DAX))
+ new_fl |= S_DAX;
inode_set_flags(inode, new_fl,
- S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+ S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_DAX);
}

/* Propagate flags from i_flags to EXT4_I(inode)->i_flags */
@@ -4086,7 +4118,10 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)

if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext4_file_inode_operations;
- inode->i_fop = &ext4_file_operations;
+ if (test_opt(inode->i_sb, DAX))
+ inode->i_fop = &ext4_dax_file_operations;
+ else
+ inode->i_fop = &ext4_file_operations;
ext4_set_aops(inode);
} else if (S_ISDIR(inode->i_mode)) {
inode->i_op = &ext4_dir_inode_operations;
@@ -4556,7 +4591,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
* Truncate pagecache after we've waited for commit
* in data=journal mode to make pages freeable.
*/
- truncate_pagecache(inode, inode->i_size);
+ truncate_pagecache(inode, inode->i_size);
}
/*
* We want to call ext4_truncate() even if attr->ia_size ==
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 603e4eb..8d744a5 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2264,7 +2264,10 @@ retry:
err = PTR_ERR(inode);
if (!IS_ERR(inode)) {
inode->i_op = &ext4_file_inode_operations;
- inode->i_fop = &ext4_file_operations;
+ if (test_opt(inode->i_sb, DAX))
+ inode->i_fop = &ext4_dax_file_operations;
+ else
+ inode->i_fop = &ext4_file_operations;
ext4_set_aops(inode);
err = ext4_add_nondir(handle, dentry, inode);
if (!err && IS_DIRSYNC(dir))
@@ -2328,7 +2331,10 @@ retry:
err = PTR_ERR(inode);
if (!IS_ERR(inode)) {
inode->i_op = &ext4_file_inode_operations;
- inode->i_fop = &ext4_file_operations;
+ if (test_opt(inode->i_sb, DAX))
+ inode->i_fop = &ext4_dax_file_operations;
+ else
+ inode->i_fop = &ext4_file_operations;
ext4_set_aops(inode);
d_tmpfile(dentry, inode);
err = ext4_orphan_add(handle, inode);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 0b28b36..b94b6b9 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1162,7 +1162,7 @@ enum {
Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_jqfmt_vfsv1, Opt_quota,
Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
- Opt_usrquota, Opt_grpquota, Opt_i_version,
+ Opt_usrquota, Opt_grpquota, Opt_i_version, Opt_dax,
Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
Opt_inode_readahead_blks, Opt_journal_ioprio,
@@ -1224,6 +1224,7 @@ static const match_table_t tokens = {
{Opt_barrier, "barrier"},
{Opt_nobarrier, "nobarrier"},
{Opt_i_version, "i_version"},
+ {Opt_dax, "dax"},
{Opt_stripe, "stripe=%u"},
{Opt_delalloc, "delalloc"},
{Opt_nodelalloc, "nodelalloc"},
@@ -1406,6 +1407,7 @@ static const struct mount_opts {
{Opt_min_batch_time, 0, MOPT_GTE0},
{Opt_inode_readahead_blks, 0, MOPT_GTE0},
{Opt_init_itable, 0, MOPT_GTE0},
+ {Opt_dax, EXT4_MOUNT_DAX, MOPT_SET},
{Opt_stripe, 0, MOPT_GTE0},
{Opt_resuid, 0, MOPT_GTE0},
{Opt_resgid, 0, MOPT_GTE0},
@@ -1642,6 +1644,11 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
}
sbi->s_jquota_fmt = m->mount_opt;
#endif
+#ifndef CONFIG_FS_DAX
+ } else if (token == Opt_dax) {
+ ext4_msg(sb, KERN_INFO, "dax option not supported");
+ return -1;
+#endif
} else {
if (!args->from)
arg = 1;
@@ -3572,6 +3579,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
"both data=journal and dioread_nolock");
goto failed_mount;
}
+ if (test_opt(sb, DAX)) {
+ ext4_msg(sb, KERN_ERR, "can't mount with "
+ "both data=journal and dax");
+ goto failed_mount;
+ }
if (test_opt(sb, DELALLOC))
clear_opt(sb, DELALLOC);
}
@@ -3635,6 +3647,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
goto failed_mount;
}

+ if (sbi->s_mount_opt & EXT4_MOUNT_DAX) {
+ if (blocksize != PAGE_SIZE) {
+ ext4_msg(sb, KERN_ERR,
+ "error: unsupported blocksize for dax");
+ goto failed_mount;
+ }
+ if (!sb->s_bdev->bd_disk->fops->direct_access) {
+ ext4_msg(sb, KERN_ERR,
+ "error: device does not support dax");
+ goto failed_mount;
+ }
+ }
+
if (sb->s_blocksize != blocksize) {
/* Validate the filesystem blocksize */
if (!sb_set_blocksize(sb, blocksize)) {
@@ -4837,6 +4862,18 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
err = -EINVAL;
goto restore_opts;
}
+ if (test_opt(sb, DAX)) {
+ ext4_msg(sb, KERN_ERR, "can't mount with "
+ "both data=journal and dax");
+ err = -EINVAL;
+ goto restore_opts;
+ }
+ }
+
+ if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT4_MOUNT_DAX) {
+ ext4_msg(sb, KERN_WARNING, "warning: refusing change of "
+ "dax flag with busy inodes while remounting");
+ sbi->s_mount_opt ^= EXT4_MOUNT_DAX;
}

if (sbi->s_mount_flags & EXT4_MF_FS_ABORTED)

--
2.1.0


Mathieu Desnoyers

2014-10-16 12:56:25 UTC

Post by Matthew Wilcox
This is a port of the DAX functionality found in the current version of
ext2.
[heavily tweaked]
---
Documentation/filesystems/dax.txt | 1 +
Documentation/filesystems/ext4.txt | 2 +
fs/ext4/ext4.h | 6 +++
fs/ext4/file.c | 49 ++++++++++++++++++++-
fs/ext4/indirect.c | 18 +++++---
fs/ext4/inode.c | 89 ++++++++++++++++++++++++++------------
fs/ext4/namei.c | 10 ++++-
fs/ext4/super.c | 39 ++++++++++++++++-
8 files changed, 177 insertions(+), 37 deletions(-)
diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index ebcd97f..be376d9 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -73,6 +73,7 @@ or a write()) work correctly.
- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
+- ext4: the fourth extended filesystem, see Documentation/filesystems/ext4.txt
Shortcomings
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 919a329..9c511c4 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -386,6 +386,8 @@ max_dir_size_kb=n This limits the size of directories so that any
i_version Enable 64-bit inode version support. This option is
off by default.
+dax Use direct access if possible
+
Data Mode
=========
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b0c225c..5b38569 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -969,6 +969,11 @@ struct ext4_inode_info {
#define EXT4_MOUNT_ERRORS_MASK 0x00070
#define EXT4_MOUNT_MINIX_DF 0x00080 /* Mimics the Minix statfs */
#define EXT4_MOUNT_NOLOAD 0x00100 /* Don't use existing journal*/
+#ifdef CONFIG_FS_DAX
+#define EXT4_MOUNT_DAX 0x00200 /* Execute in place */

The "Execute in place" comment above should read "Direct Access".

Post by Matthew Wilcox
+#else
+#define EXT4_MOUNT_DAX 0
+#endif
#define EXT4_MOUNT_DATA_FLAGS 0x00C00 /* Mode for data writes: */
#define EXT4_MOUNT_JOURNAL_DATA 0x00400 /* Write data to journal */
#define EXT4_MOUNT_ORDERED_DATA 0x00800 /* Flush data before commit */
@@ -2574,6 +2579,7 @@ extern const struct file_operations ext4_dir_operations;
/* file.c */
extern const struct inode_operations ext4_file_inode_operations;
extern const struct file_operations ext4_file_operations;
+extern const struct file_operations ext4_dax_file_operations;
extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin);
/* inline.c */
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index aca7b24..9c7bde5 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -95,7 +95,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
struct inode *inode = file_inode(iocb->ki_filp);
struct mutex *aio_mutex = NULL;
struct blk_plug plug;
- int o_direct = file->f_flags & O_DIRECT;
+ int o_direct = io_is_direct(file);
int overwrite = 0;
size_t length = iov_iter_count(from);
ssize_t ret;
return ret;
}
+#ifdef CONFIG_FS_DAX
+static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ return dax_fault(vma, vmf, ext4_get_block);
+ /* Is this the right get_block? */

Perhaps this needs a TODO, FIXME or XXX marker to make sure an ext4
maintainer does not miss this question.

Post by Matthew Wilcox
+}
+
+static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ return dax_mkwrite(vma, vmf, ext4_get_block);
+}
+
+static const struct vm_operations_struct ext4_dax_vm_ops = {
+ .fault = ext4_dax_fault,
+ .page_mkwrite = ext4_dax_mkwrite,
+ .remap_pages = generic_file_remap_pages,
+};
+#else
+#define ext4_dax_vm_ops ext4_file_vm_ops
+#endif
+
static const struct vm_operations_struct ext4_file_vm_ops = {
.fault = filemap_fault,
.map_pages = filemap_map_pages,
@@ -201,7 +222,12 @@ static const struct vm_operations_struct ext4_file_vm_ops = {
static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
{
file_accessed(file);
- vma->vm_ops = &ext4_file_vm_ops;
+ if (IS_DAX(file_inode(file))) {
+ vma->vm_ops = &ext4_dax_vm_ops;
+ vma->vm_flags |= VM_MIXEDMAP;
+ } else {
+ vma->vm_ops = &ext4_file_vm_ops;
+ }
return 0;
}
@@ -600,6 +626,25 @@ const struct file_operations ext4_file_operations = {
.fallocate = ext4_fallocate,
};
+#ifdef CONFIG_FS_DAX
+const struct file_operations ext4_dax_file_operations = {
+ .llseek = ext4_llseek,
+ .read = new_sync_read,
+ .write = new_sync_write,
+ .read_iter = generic_file_read_iter,
+ .write_iter = ext4_file_write_iter,
+ .unlocked_ioctl = ext4_ioctl,
+#ifdef CONFIG_COMPAT
+ .compat_ioctl = ext4_compat_ioctl,
+#endif
+ .mmap = ext4_file_mmap,
+ .open = ext4_file_open,
+ .release = ext4_release_file,
+ .fsync = ext4_sync_file,
+ .fallocate = ext4_fallocate,

Perhaps adding comments saying that .splice_read and .splice_write are
unavailable here would help understanding why we need a different file
operations structure.

Post by Matthew Wilcox
+};
+#endif
+
const struct inode_operations ext4_file_inode_operations = {
.setattr = ext4_setattr,
.getattr = ext4_getattr,
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index e75f840..fa9ec8d 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
inode_dio_done(inode);
goto locked;
}
- ret = __blockdev_direct_IO(rw, iocb, inode,
- inode->i_sb->s_bdev, iter, offset,
- ext4_get_block, NULL, NULL, 0);
+ if (IS_DAX(inode))
+ ret = dax_do_io(rw, iocb, inode, iter, offset,
+ ext4_get_block, NULL, 0);
+ else
+ ret = __blockdev_direct_IO(rw, iocb, inode,
+ inode->i_sb->s_bdev, iter, offset,
+ ext4_get_block, NULL, NULL, 0);
inode_dio_done(inode);
} else {
- ret = blockdev_direct_IO(rw, iocb, inode, iter,
- offset, ext4_get_block);
+ if (IS_DAX(inode))
+ ret = dax_do_io(rw, iocb, inode, iter, offset,
+ ext4_get_block, NULL, DIO_LOCKING);
+ else
+ ret = blockdev_direct_IO(rw, iocb, inode, iter,
+ offset, ext4_get_block);
if (unlikely((rw & WRITE) && ret < 0)) {
loff_t isize = i_size_read(inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3aa26e9..542205f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
return retval;
}
+static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)
+{
+ struct inode *inode = bh->b_assoc_map->host;
+ /* XXX: breaks on 32-bit > 16GB. Is that even supported? */

Good question! It would be interesting to get an answer :)
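
As a rough aid to the question, the sketch below models the arithmetic in userspace. It assumes a 32-bit pointer (so b_private can hold only 32 bits of the block number) and lets the caller pick the block size shift; demo_offset_via_32bit_private() is a hypothetical name, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of a 32-bit kernel: the block number is squeezed through a
 * 32-bit pointer-sized integer (as b_private would be) and then shifted
 * back into a byte offset, as the quoted code does.
 */
int64_t demo_offset_via_32bit_private(uint64_t iblock, unsigned blkbits)
{
	uint32_t stored = (uint32_t)iblock;	/* what a 32-bit void * can hold */

	assert(blkbits < 31);
	return (int64_t)stored << blkbits;
}
```

With 4 KiB blocks (blkbits = 12), block numbers up to 2^32 - 1 survive the round trip, so under these assumptions the wrap-around happens at byte offset 2^44 (16 TiB) rather than 16 GB. The sketch does not settle whether such a configuration is supported.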

Post by Matthew Wilcox
+ loff_t offset = (loff_t)(uintptr_t)bh->b_private << inode->i_blkbits;
+ int err;

Missing blank line after the declarations.

Post by Matthew Wilcox
+ if (!uptodate)
+ return;
+ WARN_ON(!buffer_unwritten(bh));
+ err = ext4_convert_unwritten_extents(NULL, inode, offset, bh->b_size);

err is simply unused here; that does not look good (silent failure).
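
One possible shape for not dropping the result is sketched below in userspace C. Everything here is hypothetical: demo_convert() stands in for ext4_convert_unwritten_extents(), and the counter and message stand in for whatever reporting mechanism the ext4 maintainers would prefer. It only illustrates the point that the failure should be made visible.

```c
#include <stdio.h>

/* Count of conversion failures seen so far (illustrative bookkeeping). */
int demo_failures;

/* Hypothetical stand-in for the conversion call: 0 on success, negative on error. */
int demo_convert(int should_fail)
{
	return should_fail ? -5 : 0;
}

/* Same control flow as the quoted helper, but the error is reported. */
void demo_end_io(int uptodate, int should_fail)
{
	int err;

	if (!uptodate)
		return;

	err = demo_convert(should_fail);
	if (err) {
		fprintf(stderr, "demo: unwritten conversion failed: %d\n", err);
		demo_failures++;
	}
}
```

Since the callback returns void, reporting (rather than propagating) is the only option available inside it.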

Post by Matthew Wilcox
+}
+
/* Maximum number of blocks we map for direct IO at once. */
#define DIO_MAX_BLOCKS 4096
@@ -713,6 +725,11 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
map_bh(bh, inode->i_sb, map.m_pblk);
bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags;
+ if (IS_DAX(inode) && buffer_unwritten(bh) && !io_end) {
+ bh->b_assoc_map = inode->i_mapping;
+ bh->b_private = (void *)(unsigned long)iblock;
+ bh->b_end_io = ext4_end_io_unwritten;
+ }
if (io_end && io_end->flag & EXT4_IO_END_UNWRITTEN)
set_buffer_defer_completion(bh);
bh->b_size = inode->i_sb->s_blocksize * map.m_len;
@@ -3043,13 +3060,14 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
get_block_func = ext4_get_block_write;
dio_flags = DIO_LOCKING;
}
- ret = __blockdev_direct_IO(rw, iocb, inode,
- inode->i_sb->s_bdev, iter,
- offset,
- get_block_func,
- ext4_end_io_dio,
- NULL,
- dio_flags);
+ if (IS_DAX(inode))
+ ret = dax_do_io(rw, iocb, inode, iter, offset, get_block_func,
+ ext4_end_io_dio, dio_flags);
+ else
+ ret = __blockdev_direct_IO(rw, iocb, inode,
+ inode->i_sb->s_bdev, iter, offset,
+ get_block_func,
+ ext4_end_io_dio, NULL, dio_flags);
/*
* Put our reference to io_end. This can free the io_end structure e.g.
@@ -3213,19 +3231,12 @@ void ext4_set_aops(struct inode *inode)
inode->i_mapping->a_ops = &ext4_aops;
}
-/*
- * ext4_block_zero_page_range() zeros out a mapping of length 'length'
- * starting from file offset 'from'. The range to be zero'd must
- * be contained with in one block. If the specified range exceeds
- * the end of the block it will be shortened to end of the block
- * that cooresponds to 'from'
- */
-static int ext4_block_zero_page_range(handle_t *handle,
+static int __ext4_block_zero_page_range(handle_t *handle,
struct address_space *mapping, loff_t from, loff_t length)
{
ext4_fsblk_t index = from >> PAGE_CACHE_SHIFT;
unsigned offset = from & (PAGE_CACHE_SIZE-1);
- unsigned blocksize, max, pos;
+ unsigned blocksize, pos;
ext4_lblk_t iblock;
struct inode *inode = mapping->host;
struct buffer_head *bh;
@@ -3238,14 +3249,6 @@ static int ext4_block_zero_page_range(handle_t *handle,
return -ENOMEM;
blocksize = inode->i_sb->s_blocksize;
- max = blocksize - (offset & (blocksize - 1));
-
- /*
- * correct length if it does not fall between
- * 'from' and the end of the block
- */
- if (length > max || length < 0)
- length = max;
iblock = index << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);
}
/*
+ * ext4_block_zero_page_range() zeros out a mapping of length 'length'
+ * starting from file offset 'from'. The range to be zero'd must
+ * be contained with in one block. If the specified range exceeds
+ * the end of the block it will be shortened to end of the block
+ * that cooresponds to 'from'
+ */
+static int ext4_block_zero_page_range(handle_t *handle,
+ struct address_space *mapping, loff_t from, loff_t length)
+{
+ struct inode *inode = mapping->host;
+ unsigned offset = from & (PAGE_CACHE_SIZE-1);
+ unsigned blocksize = inode->i_sb->s_blocksize;
+ unsigned max = blocksize - (offset & (blocksize - 1));
+
+ /*
+ * correct length if it does not fall between
+ * 'from' and the end of the block
+ */

Shouldn't a length < 0 be treated as an error instead?
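
For reference, the clamping behaviour being questioned can be modelled in userspace as below. This is a sketch of the arithmetic in the quoted hunk, not the kernel function: demo_clamp_to_block() is a hypothetical name, and it masks 'from' with the block size directly, which matches the patch only when the block size is a power of two no larger than the page size.

```c
#include <assert.h>

/*
 * Clamp 'length' so that [from, from + length) stays inside the block
 * containing 'from'. A negative or oversized length is silently replaced
 * by "up to the end of the block", as in the quoted code.
 */
long long demo_clamp_to_block(long long from, long long length,
			      unsigned blocksize)
{
	unsigned offset_in_block;
	unsigned max;

	/* blocksize must be a non-zero power of two. */
	assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);

	offset_in_block = (unsigned)(from & (blocksize - 1));
	max = blocksize - offset_in_block;

	if (length > max || length < 0)
		length = max;
	return length;
}
```

As written, a caller passing a negative length gets the remainder of the block zeroed instead of an error, which is the behaviour the question above is about.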

Post by Matthew Wilcox
+ if (length > max || length < 0)
+ length = max;
+
+ if (IS_DAX(inode))
+ return dax_zero_page_range(inode, from, length, ext4_get_block);
+ return __ext4_block_zero_page_range(handle, mapping, from, length);
+}
+
+/*
* ext4_block_truncate_page() zeroes out a mapping from file offset `from'
* up to the end of the block which corresponds to `from'.
* This required during truncate. We need to physically zero the tail end
@@ -3831,8 +3861,10 @@ void ext4_set_inode_flags(struct inode *inode)
new_fl |= S_NOATIME;
if (flags & EXT4_DIRSYNC_FL)
new_fl |= S_DIRSYNC;
+ if (test_opt(inode->i_sb, DAX))
+ new_fl |= S_DAX;
inode_set_flags(inode, new_fl,
- S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+ S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_DAX);
}
/* Propagate flags from i_flags to EXT4_I(inode)->i_flags */
@@ -4086,7 +4118,10 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext4_file_inode_operations;
- inode->i_fop = &ext4_file_operations;
+ if (test_opt(inode->i_sb, DAX))
+ inode->i_fop = &ext4_dax_file_operations;
+ else
+ inode->i_fop = &ext4_file_operations;
ext4_set_aops(inode);
} else if (S_ISDIR(inode->i_mode)) {
inode->i_op = &ext4_dir_inode_operations;
@@ -4556,7 +4591,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
* Truncate pagecache after we've waited for commit
* in data=journal mode to make pages freeable.
*/
- truncate_pagecache(inode, inode->i_size);
+ truncate_pagecache(inode, inode->i_size);
}
/*
* We want to call ext4_truncate() even if attr->ia_size ==
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 603e4eb..8d744a5 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
err = PTR_ERR(inode);
if (!IS_ERR(inode)) {
inode->i_op = &ext4_file_inode_operations;
- inode->i_fop = &ext4_file_operations;
+ if (test_opt(inode->i_sb, DAX))
+ inode->i_fop = &ext4_dax_file_operations;
+ else
+ inode->i_fop = &ext4_file_operations;
ext4_set_aops(inode);
err = ext4_add_nondir(handle, dentry, inode);
if (!err && IS_DIRSYNC(dir))
err = PTR_ERR(inode);
if (!IS_ERR(inode)) {
inode->i_op = &ext4_file_inode_operations;
- inode->i_fop = &ext4_file_operations;
+ if (test_opt(inode->i_sb, DAX))
+ inode->i_fop = &ext4_dax_file_operations;
+ else
+ inode->i_fop = &ext4_file_operations;
ext4_set_aops(inode);
d_tmpfile(dentry, inode);
err = ext4_orphan_add(handle, inode);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 0b28b36..b94b6b9 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1162,7 +1162,7 @@ enum {
Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_jqfmt_vfsv1, Opt_quota,
Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
- Opt_usrquota, Opt_grpquota, Opt_i_version,
+ Opt_usrquota, Opt_grpquota, Opt_i_version, Opt_dax,
Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
Opt_inode_readahead_blks, Opt_journal_ioprio,
@@ -1224,6 +1224,7 @@ static const match_table_t tokens = {
{Opt_barrier, "barrier"},
{Opt_nobarrier, "nobarrier"},
{Opt_i_version, "i_version"},
+ {Opt_dax, "dax"},
{Opt_stripe, "stripe=%u"},
{Opt_delalloc, "delalloc"},
{Opt_nodelalloc, "nodelalloc"},
@@ -1406,6 +1407,7 @@ static const struct mount_opts {
{Opt_min_batch_time, 0, MOPT_GTE0},
{Opt_inode_readahead_blks, 0, MOPT_GTE0},
{Opt_init_itable, 0, MOPT_GTE0},
+ {Opt_dax, EXT4_MOUNT_DAX, MOPT_SET},
{Opt_stripe, 0, MOPT_GTE0},
{Opt_resuid, 0, MOPT_GTE0},
{Opt_resgid, 0, MOPT_GTE0},
@@ -1642,6 +1644,11 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
}
sbi->s_jquota_fmt = m->mount_opt;
#endif
+#ifndef CONFIG_FS_DAX
+ } else if (token == Opt_dax) {
+ ext4_msg(sb, KERN_INFO, "dax option not supported");
+ return -1;
+#endif
} else {
if (!args->from)
arg = 1;
@@ -3572,6 +3579,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
"both data=journal and dioread_nolock");
goto failed_mount;
}
+ if (test_opt(sb, DAX)) {
+ ext4_msg(sb, KERN_ERR, "can't mount with "
+ "both data=journal and dax");

This limitation regarding ext4 and dax should be documented in dax
Documentation.

Thanks,

Mathieu

Post by Matthew Wilcox
+ goto failed_mount;
+ }
if (test_opt(sb, DELALLOC))
clear_opt(sb, DELALLOC);
}
@@ -3635,6 +3647,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
goto failed_mount;
}
+ if (sbi->s_mount_opt & EXT4_MOUNT_DAX) {
+ if (blocksize != PAGE_SIZE) {
+ ext4_msg(sb, KERN_ERR,
+ "error: unsupported blocksize for dax");
+ goto failed_mount;
+ }
+ if (!sb->s_bdev->bd_disk->fops->direct_access) {
+ ext4_msg(sb, KERN_ERR,
+ "error: device does not support dax");
+ goto failed_mount;
+ }
+ }
+
if (sb->s_blocksize != blocksize) {
/* Validate the filesystem blocksize */
if (!sb_set_blocksize(sb, blocksize)) {
@@ -4837,6 +4862,18 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
err = -EINVAL;
goto restore_opts;
}
+ if (test_opt(sb, DAX)) {
+ ext4_msg(sb, KERN_ERR, "can't mount with "
+ "both data=journal and dax");
+ err = -EINVAL;
+ goto restore_opts;
+ }
+ }
+
+ if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT4_MOUNT_DAX) {
+ ext4_msg(sb, KERN_WARNING, "warning: refusing change of "
+ "dax flag with busy inodes while remounting");
+ sbi->s_mount_opt ^= EXT4_MOUNT_DAX;
}
if (sbi->s_mount_flags & EXT4_MF_FS_ABORTED)
--
2.1.0

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Key fingerprint: 2A0B 4ED9 15F2 D3FA 45F5 B162 1728 0A97 8118 6ACF


Matthew Wilcox

2014-10-16 22:16:24 UTC

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+#define EXT4_MOUNT_DAX 0x00200 /* Execute in place */

Execute in place -> Direct Access stuff... (comment above)

Thanks! Fixed.

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ return dax_fault(vma, vmf, ext4_get_block);
+ /* Is this the right get_block? */

perhaps this needs a TODO or FIXME or XXX to make sure an ext4
maintainer does not miss this question.

Maybe I can ambush Ted in the halls tomorrow and find out? :-)

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+ .fsync = ext4_sync_file,
+ .fallocate = ext4_fallocate,

Perhaps adding comments saying that .splice_read and .splice_write are
unavailable here would help understanding why we need a different file
operations structure.

Good idea. Done.

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)
+{
+ struct inode *inode = bh->b_assoc_map->host;
+ /* XXX: breaks on 32-bit > 16GB. Is that even supported? */

Good question! It would be interesting to get an answer :)

Another thing to check tomorrow ...

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+ if (!uptodate)
+ return;
+ WARN_ON(!buffer_unwritten(bh));
+ err = ext4_convert_unwritten_extents(NULL, inode, offset, bh->b_size);

err is simply unused here, that does not look good (silent failure).

I don't think I can do more than WARN_ON here. Maybe we can change
b_end_io() to return an int instead of void ... I think Dave Chinner has
grand plans for changes in this area as part of replacing the buffer_head
abstraction.

Post by Mathieu Desnoyers

Post by Matthew Wilcox
@@ -3238,14 +3249,6 @@ static int ext4_block_zero_page_range(handle_t *handle,
return -ENOMEM;
blocksize = inode->i_sb->s_blocksize;
- max = blocksize - (offset & (blocksize - 1));
-
- /*
- * correct length if it does not fall between
- * 'from' and the end of the block
- */
- if (length > max || length < 0)
- length = max;
iblock = index << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);

[...]

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+
+ /*
+ * correct length if it does not fall between
+ * 'from' and the end of the block
+ */

Shouldn't a length < 0 be treated as an error instead ?

Post by Matthew Wilcox
+ if (length > max || length < 0)
+ length = max;

Monkey see code in wrong place. Monkey move code. Monkey not understand
code.
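
For anyone following along, the clamp in question boils down to the
arithmetic below. This is only a userspace sketch: clamp_zero_length()
is a made-up name, and blocksize is assumed to be a power of two.

```c
/*
 * Hypothetical userspace helper mirroring the clamp in the quoted hunk:
 * limit 'length' so that [offset, offset + length) stays within the
 * block containing 'offset'. blocksize is assumed to be a power of two.
 */
static long clamp_zero_length(unsigned long offset, long length,
			      unsigned long blocksize)
{
	/* Bytes from 'offset' to the end of its block. */
	long max = (long)(blocksize - (offset & (blocksize - 1)));

	/* Out-of-range (including negative) lengths fall back to 'max'. */
	if (length > max || length < 0)
		length = max;
	return length;
}
```

With a 4096-byte block, an offset of 4000 leaves 96 bytes to the end of
the block, so a request for 200 bytes (or a negative length) is cut down
to 96.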

Post by Mathieu Desnoyers

Post by Matthew Wilcox
@@ -3572,6 +3579,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
"both data=journal and dioread_nolock");
goto failed_mount;
}
+ if (test_opt(sb, DAX)) {
+ ext4_msg(sb, KERN_ERR, "can't mount with "
+ "both data=journal and dax");

This limitation regarding ext4 and dax should be documented in dax
Documentation.

Maybe the ext4 documentation too? It seems kind of obvious to me that if
you're enabling in-place updates that you can't journal the data you're
updating (well ... you could implement undo-log journalling, I suppose,
which would be quite a change for ext4)
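
To make the incompatibility concrete, the check amounts to something
like the following sketch. The flag names and values here are invented
for illustration and are not ext4's real mount bits.

```c
#include <errno.h>

/* Hypothetical flag values, for illustration only; not ext4's real bits. */
#define DEMO_MOUNT_JOURNAL_DATA 0x1u
#define DEMO_MOUNT_DAX          0x2u

/*
 * Reject the combination discussed above: journalling file data
 * while also updating that data in place through DAX.
 */
static int demo_check_mount_opts(unsigned int opts)
{
	if ((opts & DEMO_MOUNT_JOURNAL_DATA) && (opts & DEMO_MOUNT_DAX))
		return -EINVAL;
	return 0;
}
```

Either option on its own passes the check; only the pair is refused.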

Mathieu Desnoyers

2014-10-17 15:42:02 UTC

----- Original Message -----

Sent: Friday, October 17, 2014 12:16:24 AM
Subject: Re: [PATCH v11 20/21] ext4: Add DAX functionality

[...]

Post by Mathieu Desnoyers

Post by Matthew Wilcox
@@ -3572,6 +3579,11 @@ static int ext4_fill_super(struct super_block *sb,
void *data, int silent)
"both data=journal and dioread_nolock");
goto failed_mount;
}
+ if (test_opt(sb, DAX)) {
+ ext4_msg(sb, KERN_ERR, "can't mount with "
+ "both data=journal and dax");

This limitation regarding ext4 and dax should be documented in dax
Documentation.

Maybe the ext4 documentation too? It seems kind of obvious to me that if
you're enabling in-place updates that you can't journal the data you're
updating (well ... you could implement undo-log journalling, I suppose,
which would be quite a change for ext4)

Yes, we could document this limitation in general for all journalling
filesystems in the DAX documentation, and then document it specifically
per filesystem in each filesystem's documentation.

Thanks,

Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Matthew Wilcox

2014-09-25 20:33:25 UTC

This is practically generic code; other filesystems will want to call
it from other places, but there's nothing ext2-specific about it.

Make it a little more generic by allowing it to take a count of the number
of bytes to zero rather than fixing it to a single page. Thanks to Dave
Hansen for suggesting that I need to call cond_resched() if zeroing more
than one page.

Signed-off-by: Matthew Wilcox <***@intel.com>
---
fs/dax.c | 35 +++++++++++++++++++++++++++++++++++
fs/ext2/inode.c | 8 +++++---
fs/ext2/xip.c | 14 --------------
fs/ext2/xip.h | 3 ---
include/linux/fs.h | 6 ++++++
5 files changed, 46 insertions(+), 20 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 108c68e..02e226f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -20,8 +20,43 @@
#include <linux/fs.h>
#include <linux/genhd.h>
#include <linux/mutex.h>
+#include <linux/sched.h>
#include <linux/uio.h>

+int dax_clear_blocks(struct inode *inode, sector_t block, long size)
+{
+ struct block_device *bdev = inode->i_sb->s_bdev;
+ sector_t sector = block << (inode->i_blkbits - 9);
+
+ might_sleep();
+ do {
+ void *addr;
+ unsigned long pfn;
+ long count;
+
+ count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
+ if (count < 0)
+ return count;
+ while (count > 0) {
+ unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
+ if (pgsz > count)
+ pgsz = count;
+ if (pgsz < PAGE_SIZE)
+ memset(addr, 0, pgsz);
+ else
+ clear_page(addr);
+ addr += pgsz;
+ size -= pgsz;
+ count -= pgsz;
+ sector += pgsz / 512;
+ cond_resched();
+ }
+ } while (size);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(dax_clear_blocks);
+
static long dax_get_addr(struct buffer_head *bh, void **addr, unsigned blkbits)
{
unsigned long pfn;
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 3ccd5fd..52978b8 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -733,10 +733,12 @@ static int ext2_get_blocks(struct inode *inode,

if (IS_DAX(inode)) {
/*
- * we need to clear the block
+ * block must be initialised before we put it in the tree
+ * so that it's not found by another thread before it's
+ * initialised
*/
- err = ext2_clear_xip_target (inode,
- le32_to_cpu(chain[depth-1].key));
+ err = dax_clear_blocks(inode, le32_to_cpu(chain[depth-1].key),
+ 1 << inode->i_blkbits);
if (err) {
mutex_unlock(&ei->truncate_mutex);
goto cleanup;
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index bbc5fec..8cfca3a 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -42,20 +42,6 @@ __ext2_get_block(struct inode *inode, pgoff_t pgoff, int create,
return rc;
}

-int
-ext2_clear_xip_target(struct inode *inode, sector_t block)
-{
- void *kaddr;
- unsigned long pfn;
- long size;
-
- size = __inode_direct_access(inode, block, &kaddr, &pfn, PAGE_SIZE);
- if (size < 0)
- return size;
- clear_page(kaddr);
- return 0;
-}
-
void ext2_xip_verify_sb(struct super_block *sb)
{
struct ext2_sb_info *sbi = EXT2_SB(sb);
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 29be737..b2592f2 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -7,8 +7,6 @@

#ifdef CONFIG_EXT2_FS_XIP
extern void ext2_xip_verify_sb (struct super_block *);
-extern int ext2_clear_xip_target (struct inode *, sector_t);
-
static inline int ext2_use_xip (struct super_block *sb)
{
struct ext2_sb_info *sbi = EXT2_SB(sb);
@@ -19,6 +17,5 @@ int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
#else
#define ext2_xip_verify_sb(sb) do { } while (0)
#define ext2_use_xip(sb) 0
-#define ext2_clear_xip_target(inode, chain) 0
#define ext2_get_xip_mem NULL
#endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 45839e8..c04d371 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2490,11 +2490,17 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
extern int nonseekable_open(struct inode * inode, struct file * filp);

#ifdef CONFIG_FS_XIP
+int dax_clear_blocks(struct inode *, sector_t block, long size);
extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
extern int xip_truncate_page(struct address_space *mapping, loff_t from);
ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
loff_t, get_block_t, dio_iodone_t, int flags);
#else
+static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
+{
+ return 0;
+}
+
static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
{
return 0;

--
2.1.0


Mathieu Desnoyers

2014-10-16 10:05:25 UTC

Post by Matthew Wilcox
This is practically generic code; other filesystems will want to call
it from other places, but there's nothing ext2-specific about it.
Make it a little more generic by allowing it to take a count of the number
of bytes to zero rather than fixing it to a single page. Thanks to Dave
Hansen for suggesting that I need to call cond_resched() if zeroing more
than one page.
---
fs/dax.c | 35 +++++++++++++++++++++++++++++++++++
fs/ext2/inode.c | 8 +++++---
fs/ext2/xip.c | 14 --------------
fs/ext2/xip.h | 3 ---
include/linux/fs.h | 6 ++++++
5 files changed, 46 insertions(+), 20 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 108c68e..02e226f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -20,8 +20,43 @@
#include <linux/fs.h>
#include <linux/genhd.h>
#include <linux/mutex.h>
+#include <linux/sched.h>
#include <linux/uio.h>
+int dax_clear_blocks(struct inode *inode, sector_t block, long size)
+{
+ struct block_device *bdev = inode->i_sb->s_bdev;
+ sector_t sector = block << (inode->i_blkbits - 9);

Is there a define e.g. SECTOR_SHIFT rather than using this hardcoded "9"
value ?

Post by Matthew Wilcox
+
+ might_sleep();
+ do {
+ void *addr;
+ unsigned long pfn;
+ long count;
+
+ count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
+ if (count < 0)
+ return count;
+ while (count > 0) {
+ unsigned pgsz = PAGE_SIZE - offset_in_page(addr);

unsigned -> unsigned int

add a newline between variable declaration and following code.

Post by Matthew Wilcox
+ if (pgsz > count)
+ pgsz = count;
+ if (pgsz < PAGE_SIZE)
+ memset(addr, 0, pgsz);
+ else
+ clear_page(addr);
+ addr += pgsz;
+ size -= pgsz;
+ count -= pgsz;
+ sector += pgsz / 512;

Also wondering about those 512 constants everywhere in the code
(including prior patches). Perhaps it calls for a SECTOR_SIZE define ?

Post by Matthew Wilcox
+ cond_resched();
+ }
+ } while (size);

Just to stay on the safe side, can we do while (size > 0) ? Just in case
an unforeseen issue makes size negative, and gets us in a very long loop.

Thanks,

Mathieu

Post by Matthew Wilcox
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(dax_clear_blocks);
+
static long dax_get_addr(struct buffer_head *bh, void **addr, unsigned blkbits)
{
unsigned long pfn;
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 3ccd5fd..52978b8 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -733,10 +733,12 @@ static int ext2_get_blocks(struct inode *inode,
if (IS_DAX(inode)) {
/*
- * we need to clear the block
+ * block must be initialised before we put it in the tree
+ * so that it's not found by another thread before it's
+ * initialised
*/
- err = ext2_clear_xip_target (inode,
- le32_to_cpu(chain[depth-1].key));
+ err = dax_clear_blocks(inode, le32_to_cpu(chain[depth-1].key),
+ 1 << inode->i_blkbits);
if (err) {
mutex_unlock(&ei->truncate_mutex);
goto cleanup;
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index bbc5fec..8cfca3a 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -42,20 +42,6 @@ __ext2_get_block(struct inode *inode, pgoff_t pgoff, int create,
return rc;
}
-int
-ext2_clear_xip_target(struct inode *inode, sector_t block)
-{
- void *kaddr;
- unsigned long pfn;
- long size;
-
- size = __inode_direct_access(inode, block, &kaddr, &pfn, PAGE_SIZE);
- if (size < 0)
- return size;
- clear_page(kaddr);
- return 0;
-}
-
void ext2_xip_verify_sb(struct super_block *sb)
{
struct ext2_sb_info *sbi = EXT2_SB(sb);
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 29be737..b2592f2 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -7,8 +7,6 @@
#ifdef CONFIG_EXT2_FS_XIP
extern void ext2_xip_verify_sb (struct super_block *);
-extern int ext2_clear_xip_target (struct inode *, sector_t);
-
static inline int ext2_use_xip (struct super_block *sb)
{
struct ext2_sb_info *sbi = EXT2_SB(sb);
@@ -19,6 +17,5 @@ int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
#else
#define ext2_xip_verify_sb(sb) do { } while (0)
#define ext2_use_xip(sb) 0
-#define ext2_clear_xip_target(inode, chain) 0
#define ext2_get_xip_mem NULL
#endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 45839e8..c04d371 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2490,11 +2490,17 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
extern int nonseekable_open(struct inode * inode, struct file * filp);
#ifdef CONFIG_FS_XIP
+int dax_clear_blocks(struct inode *, sector_t block, long size);
extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
extern int xip_truncate_page(struct address_space *mapping, loff_t from);
ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
loff_t, get_block_t, dio_iodone_t, int flags);
#else
+static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
+{
+ return 0;
+}
+
static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
{
return 0;
--
2.1.0

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Key fingerprint: 2A0B 4ED9 15F2 D3FA 45F5 B162 1728 0A97 8118 6ACF


Matthew Wilcox

2014-10-16 21:22:34 UTC

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+int dax_clear_blocks(struct inode *inode, sector_t block, long size)
+{
+ struct block_device *bdev = inode->i_sb->s_bdev;
+ sector_t sector = block << (inode->i_blkbits - 9);

Is there a define e.g. SECTOR_SHIFT rather than using this hardcoded "9"
value ?

Yeah ... in half a dozen drivers, so introducing them globally spews
warnings about redefining macros. The '9' and '512' are sprinkled all
over the storage parts of the kernel, it's a complete flustercluck that
I wasn't about to try to unscrew.
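
For illustration, the conversions behind the bare '9' and '512' look
like this with named constants. The DEMO_ names are hypothetical and
exist only in this sketch; they are not proposed kernel macros.

```c
#include <stdint.h>

/* Illustrative names only; they do not correspond to kernel macros. */
#define DEMO_SECTOR_SHIFT 9
#define DEMO_SECTOR_SIZE  (1u << DEMO_SECTOR_SHIFT)	/* 512 bytes */

/* Convert a filesystem block number to a 512-byte sector number. */
static uint64_t demo_block_to_sector(uint64_t block, unsigned int blkbits)
{
	return block << (blkbits - DEMO_SECTOR_SHIFT);
}

/* Convert a byte count to whole 512-byte sectors. */
static uint64_t demo_bytes_to_sectors(uint64_t bytes)
{
	return bytes >> DEMO_SECTOR_SHIFT;
}
```

With 4096-byte blocks (blkbits of 12), block 3 starts at sector 24, and
one such block spans 8 sectors.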

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+ while (count > 0) {
+ unsigned pgsz = PAGE_SIZE - offset_in_page(addr);

unsigned -> unsigned int

Any particular reason? Omitting it in some places helps stay within
the 80-column limit without sacrificing readability.

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+ }
+ } while (size);

Just to stay on the safe side, can we do while (size > 0) ? Just in case
an unforeseen issue makes size negative, and gets us in a very long loop.

If size < 0, we should BUG, because that means we've zeroed more than
we were asked to do, which is data corruption.

There's probably some other hardening we should do for this loop.
For example, if 'count' is < 512, it can go into an infinite loop.

do {
void *addr;
unsigned long pfn;
long count;

count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
if (count < 0)
return count;
while (count > 0) {
unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
if (pgsz > count)
pgsz = count;
if (pgsz < PAGE_SIZE)
memset(addr, 0, pgsz);
else
clear_page(addr);
addr += pgsz;
size -= pgsz;
count -= pgsz;
BUG_ON(pgsz & 511);
sector += pgsz / 512;
cond_resched();
}
BUG_ON(size < 0);
} while (size);

I think that should do the job ... ?
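
To try the hardened loop outside the kernel, a rough userspace model
could look like the following. Everything prefixed demo_ is a
hypothetical stand-in (demo_direct_access() for bdev_direct_access(),
assert() for BUG_ON()), the page size is assumed to be 4096, and
memset() is used for every piece rather than clear_page().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096u

/* Hypothetical stand-in for a memory-backed device. */
struct demo_dev {
	unsigned char *base;	/* start of device memory, treated as page-aligned */
	long bytes;		/* device size in bytes */
};

/*
 * Stand-in for bdev_direct_access(): returns how many bytes are
 * addressable starting at 'sector' (at most 'size'), or -1 on error.
 */
static long demo_direct_access(struct demo_dev *dev, uint64_t sector,
			       unsigned char **addr, long size)
{
	long off = (long)(sector * 512);

	if (off >= dev->bytes)
		return -1;
	*addr = dev->base + off;
	return size < dev->bytes - off ? size : dev->bytes - off;
}

/* Zero 'size' bytes starting at 'sector', one page-sized piece at a time. */
static int demo_clear_blocks(struct demo_dev *dev, uint64_t sector, long size)
{
	do {
		unsigned char *addr;
		long count = demo_direct_access(dev, sector, &addr, size);

		if (count < 0)
			return (int)count;
		while (count > 0) {
			/* Bytes left in the page that 'addr' points into. */
			long pgsz = (long)(DEMO_PAGE_SIZE -
				((size_t)(addr - dev->base) % DEMO_PAGE_SIZE));

			if (pgsz > count)
				pgsz = count;
			memset(addr, 0, (size_t)pgsz);
			addr += pgsz;
			size -= pgsz;
			count -= pgsz;
			/* Plays the role of BUG_ON(pgsz & 511). */
			assert((pgsz & 511) == 0);
			sector += (uint64_t)pgsz / 512;
		}
		/* Plays the role of BUG_ON(size < 0). */
		assert(size >= 0);
	} while (size);
	return 0;
}
```

Clearing 4096 + 512 bytes from sector 8 of a three-page buffer should
leave the first page and everything past byte 8703 untouched.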

Mathieu Desnoyers

2014-10-17 15:45:42 UTC

----- Original Message -----

Sent: Thursday, October 16, 2014 11:22:34 PM
Subject: Re: [PATCH v11 08/21] dax,ext2: Replace ext2_clear_xip_target with dax_clear_blocks

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+int dax_clear_blocks(struct inode *inode, sector_t block, long size)
+{
+ struct block_device *bdev = inode->i_sb->s_bdev;
+ sector_t sector = block << (inode->i_blkbits - 9);

Is there a define e.g. SECTOR_SHIFT rather than using this hardcoded "9"
value ?

Yeah ... in half a dozen drivers, so introducing them globally spews
warnings about redefining macros. The '9' and '512' are sprinkled all
over the storage parts of the kernel, it's a complete flustercluck that
I wasn't about to try to unscrew.

Fair enough.

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+ while (count > 0) {
+ unsigned pgsz = PAGE_SIZE - offset_in_page(addr);

unsigned -> unsigned int

Any particular reason? Omitting it in some places helps stay within
the 80-column limit without sacrificing readability.

It looks like FS code often uses "unsigned", so I'm not too concerned.

It's just that I'm used to the Linux core kernel style, which tends to
use "unsigned int".

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+ }
+ } while (size);

Just to stay on the safe side, can we do while (size > 0) ? Just in case
an unforeseen issue makes size negative, and gets us in a very long loop.

If size < 0, we should BUG, because that means we've zeroed more than
we were asked to do, which is data corruption.
There's probably some other hardening we should do for this loop.
For example, if 'count' is < 512, it can go into an infinite loop.
do {
void *addr;
unsigned long pfn;
long count;
count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
if (count < 0)
return count;
while (count > 0) {
unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
if (pgsz > count)
pgsz = count;
if (pgsz < PAGE_SIZE)
memset(addr, 0, pgsz);
else
clear_page(addr);
addr += pgsz;
size -= pgsz;
count -= pgsz;
BUG_ON(pgsz & 511);
sector += pgsz / 512;
cond_resched();
}
BUG_ON(size < 0);
} while (size);
I think that should do the job ... ?

Yep. I love defensive programming, especially for filesystems. :)

Thanks,

Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Matthew Wilcox

2014-09-25 20:33:33 UTC

The fewer Kconfig options we have the better. Use the generic
CONFIG_FS_DAX to enable XIP support in ext2 as well as in the core.

Signed-off-by: Matthew Wilcox <***@intel.com>
---
fs/Kconfig | 21 ++++++++++++++-------
fs/Makefile | 2 +-
fs/ext2/Kconfig | 11 -----------
fs/ext2/ext2.h | 2 +-
fs/ext2/file.c | 4 ++--
fs/ext2/super.c | 4 ++--
include/linux/fs.h | 4 ++--
7 files changed, 22 insertions(+), 26 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 312393f..a9eb53d 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -13,13 +13,6 @@ if BLOCK
source "fs/ext2/Kconfig"
source "fs/ext3/Kconfig"
source "fs/ext4/Kconfig"
-
-config FS_XIP
-# execute in place
- bool
- depends on EXT2_FS_XIP
- default y
-
source "fs/jbd/Kconfig"
source "fs/jbd2/Kconfig"

@@ -40,6 +33,20 @@ source "fs/ocfs2/Kconfig"
source "fs/btrfs/Kconfig"
source "fs/nilfs2/Kconfig"

+config FS_DAX
+ bool "Direct Access support"
+ depends on MMU
+ help
+ Direct Access (DAX) can be used on memory-backed block devices.
+ If the block device supports DAX and the filesystem supports DAX,
+ then you can avoid using the pagecache to buffer I/Os. Turning
+ on this option will compile in support for DAX; you will need to
+ mount the filesystem using the -o xip option.
+
+ If you do not have a block device that is capable of using this,
+ or if unsure, say N. Saying Y will increase the size of the kernel
+ by about 2kB.
+
endif # BLOCK

# Posix ACL utility routines
diff --git a/fs/Makefile b/fs/Makefile
index 0325ec3..df4a4cf 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -28,7 +28,7 @@ obj-$(CONFIG_SIGNALFD) += signalfd.o
obj-$(CONFIG_TIMERFD) += timerfd.o
obj-$(CONFIG_EVENTFD) += eventfd.o
obj-$(CONFIG_AIO) += aio.o
-obj-$(CONFIG_FS_XIP) += dax.o
+obj-$(CONFIG_FS_DAX) += dax.o
obj-$(CONFIG_FILE_LOCKING) += locks.o
obj-$(CONFIG_COMPAT) += compat.o compat_ioctl.o
obj-$(CONFIG_BINFMT_AOUT) += binfmt_aout.o
diff --git a/fs/ext2/Kconfig b/fs/ext2/Kconfig
index 14a6780..c634874e 100644
--- a/fs/ext2/Kconfig
+++ b/fs/ext2/Kconfig
@@ -42,14 +42,3 @@ config EXT2_FS_SECURITY

If you are not using a security module that requires using
extended attributes for file security labels, say N.
-
-config EXT2_FS_XIP
- bool "Ext2 execute in place support"
- depends on EXT2_FS && MMU
- help
- Execute in place can be used on memory-backed block devices. If you
- enable this option, you can select to mount block devices which are
- capable of this feature without using the page cache.
-
- If you do not use a block device that is capable of using this,
- or if unsure, say N.
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 5ecf570..b30c3bd 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -380,7 +380,7 @@ struct ext2_inode {
#define EXT2_MOUNT_NO_UID32 0x000200 /* Disable 32-bit UIDs */
#define EXT2_MOUNT_XATTR_USER 0x004000 /* Extended user attributes */
#define EXT2_MOUNT_POSIX_ACL 0x008000 /* POSIX Access Control Lists */
-#ifdef CONFIG_FS_XIP
+#ifdef CONFIG_FS_DAX
#define EXT2_MOUNT_XIP 0x010000 /* Execute in place */
#else
#define EXT2_MOUNT_XIP 0
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index da8dc64..46b333d 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -25,7 +25,7 @@
#include "xattr.h"
#include "acl.h"

-#ifdef CONFIG_EXT2_FS_XIP
+#ifdef CONFIG_FS_DAX
static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
return dax_fault(vma, vmf, ext2_get_block);
@@ -109,7 +109,7 @@ const struct file_operations ext2_file_operations = {
.splice_write = iter_file_splice_write,
};

-#ifdef CONFIG_EXT2_FS_XIP
+#ifdef CONFIG_FS_DAX
const struct file_operations ext2_xip_file_operations = {
.llseek = generic_file_llseek,
.read = new_sync_read,
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 0393c6d..feb53d8 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -287,7 +287,7 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root)
seq_puts(seq, ",grpquota");
#endif

-#if defined(CONFIG_EXT2_FS_XIP)
+#ifdef CONFIG_FS_DAX
if (sbi->s_mount_opt & EXT2_MOUNT_XIP)
seq_puts(seq, ",xip");
#endif
@@ -549,7 +549,7 @@ static int parse_options(char *options, struct super_block *sb)
break;
#endif
case Opt_xip:
-#ifdef CONFIG_EXT2_FS_XIP
+#ifdef CONFIG_FS_DAX
set_opt (sbi->s_mount_opt, XIP);
#else
ext2_msg(sb, KERN_INFO, "xip option not supported");
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d73db11..e6b48cc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1642,7 +1642,7 @@ struct super_operations {
#define IS_IMA(inode) ((inode)->i_flags & S_IMA)
#define IS_AUTOMOUNT(inode) ((inode)->i_flags & S_AUTOMOUNT)
#define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC)
-#ifdef CONFIG_FS_XIP
+#ifdef CONFIG_FS_DAX
#define IS_DAX(inode) ((inode)->i_flags & S_DAX)
#else
#define IS_DAX(inode) 0
@@ -2488,7 +2488,7 @@ extern loff_t fixed_size_llseek(struct file *file, loff_t offset,
extern int generic_file_open(struct inode * inode, struct file * filp);
extern int nonseekable_open(struct inode * inode, struct file * filp);

-#ifdef CONFIG_FS_XIP
+#ifdef CONFIG_FS_DAX
int dax_clear_blocks(struct inode *, sector_t block, long size);
int dax_truncate_page(struct inode *, loff_t from, get_block_t);
ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,

--
2.1.0


Mathieu Desnoyers

2014-10-16 12:26:18 UTC

Post by Matthew Wilcox
The fewer Kconfig options we have the better. Use the generic
CONFIG_FS_DAX to enable XIP support in ext2 as well as in the core.
---
fs/Kconfig | 21 ++++++++++++++-------
fs/Makefile | 2 +-
fs/ext2/Kconfig | 11 -----------
fs/ext2/ext2.h | 2 +-
fs/ext2/file.c | 4 ++--
fs/ext2/super.c | 4 ++--
include/linux/fs.h | 4 ++--
7 files changed, 22 insertions(+), 26 deletions(-)
diff --git a/fs/Kconfig b/fs/Kconfig
index 312393f..a9eb53d 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -13,13 +13,6 @@ if BLOCK
source "fs/ext2/Kconfig"
source "fs/ext3/Kconfig"
source "fs/ext4/Kconfig"
-
-config FS_XIP
-# execute in place
- bool
- depends on EXT2_FS_XIP
- default y
-
source "fs/jbd/Kconfig"
source "fs/jbd2/Kconfig"
@@ -40,6 +33,20 @@ source "fs/ocfs2/Kconfig"
source "fs/btrfs/Kconfig"
source "fs/nilfs2/Kconfig"
+config FS_DAX
+ bool "Direct Access support"
+ depends on MMU
+ help
+ Direct Access (DAX) can be used on memory-backed block devices.
+ If the block device supports DAX and the filesystem supports DAX,
+ then you can avoid using the pagecache to buffer I/Os. Turning
+ on this option will compile in support for DAX; you will need to
+ mount the filesystem using the -o xip option.

There is a mismatch between the documentation file (earlier patch): -o
dax, and this config description: -o xip.

I guess we might want to switch the mount option to "-o dax" and
document it as such, and since it should be usable transparently for the
same use-cases "-o xip" was enabling, we might want to keep parsing of
"-o xip" in the code for backward compatibility.

Thoughts ?
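
A minimal sketch of that idea, in userspace C with made-up names (this
is not the ext2 option parser): both spellings map to the same token,
so existing "-o xip" users keep working.

```c
#include <string.h>

enum demo_opt { DEMO_OPT_UNKNOWN = 0, DEMO_OPT_DAX };

/* Hypothetical token table: both spellings select the same option. */
static const struct {
	const char *name;
	enum demo_opt opt;
} demo_tokens[] = {
	{ "dax", DEMO_OPT_DAX },
	{ "xip", DEMO_OPT_DAX },	/* legacy alias kept for compatibility */
};

/* Map one mount option string to its token. */
static enum demo_opt demo_parse_opt(const char *s)
{
	size_t i;

	for (i = 0; i < sizeof(demo_tokens) / sizeof(demo_tokens[0]); i++)
		if (strcmp(s, demo_tokens[i].name) == 0)
			return demo_tokens[i].opt;
	return DEMO_OPT_UNKNOWN;
}
```

The alias costs one extra table entry, and the rest of the code only
ever sees the single DAX token.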

Thanks,

Mathieu

Post by Matthew Wilcox
+
+ If you do not have a block device that is capable of using this,
+ or if unsure, say N. Saying Y will increase the size of the kernel
+ by about 2kB.
+
endif # BLOCK
# Posix ACL utility routines
diff --git a/fs/Makefile b/fs/Makefile
index 0325ec3..df4a4cf 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -28,7 +28,7 @@ obj-$(CONFIG_SIGNALFD) += signalfd.o
obj-$(CONFIG_TIMERFD) += timerfd.o
obj-$(CONFIG_EVENTFD) += eventfd.o
obj-$(CONFIG_AIO) += aio.o
-obj-$(CONFIG_FS_XIP) += dax.o
+obj-$(CONFIG_FS_DAX) += dax.o
obj-$(CONFIG_FILE_LOCKING) += locks.o
obj-$(CONFIG_COMPAT) += compat.o compat_ioctl.o
obj-$(CONFIG_BINFMT_AOUT) += binfmt_aout.o
diff --git a/fs/ext2/Kconfig b/fs/ext2/Kconfig
index 14a6780..c634874e 100644
--- a/fs/ext2/Kconfig
+++ b/fs/ext2/Kconfig
@@ -42,14 +42,3 @@ config EXT2_FS_SECURITY
If you are not using a security module that requires using
extended attributes for file security labels, say N.
-
-config EXT2_FS_XIP
- bool "Ext2 execute in place support"
- depends on EXT2_FS && MMU
- help
- Execute in place can be used on memory-backed block devices. If you
- enable this option, you can select to mount block devices which are
- capable of this feature without using the page cache.
-
- If you do not use a block device that is capable of using this,
- or if unsure, say N.
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 5ecf570..b30c3bd 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -380,7 +380,7 @@ struct ext2_inode {
#define EXT2_MOUNT_NO_UID32 0x000200 /* Disable 32-bit UIDs */
#define EXT2_MOUNT_XATTR_USER 0x004000 /* Extended user attributes */
#define EXT2_MOUNT_POSIX_ACL 0x008000 /* POSIX Access Control Lists */
-#ifdef CONFIG_FS_XIP
+#ifdef CONFIG_FS_DAX
#define EXT2_MOUNT_XIP 0x010000 /* Execute in place */
#else
#define EXT2_MOUNT_XIP 0
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index da8dc64..46b333d 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -25,7 +25,7 @@
#include "xattr.h"
#include "acl.h"
-#ifdef CONFIG_EXT2_FS_XIP
+#ifdef CONFIG_FS_DAX
static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
return dax_fault(vma, vmf, ext2_get_block);
@@ -109,7 +109,7 @@ const struct file_operations ext2_file_operations = {
.splice_write = iter_file_splice_write,
};
-#ifdef CONFIG_EXT2_FS_XIP
+#ifdef CONFIG_FS_DAX
const struct file_operations ext2_xip_file_operations = {
.llseek = generic_file_llseek,
.read = new_sync_read,
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 0393c6d..feb53d8 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -287,7 +287,7 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root)
seq_puts(seq, ",grpquota");
#endif
-#if defined(CONFIG_EXT2_FS_XIP)
+#ifdef CONFIG_FS_DAX
if (sbi->s_mount_opt & EXT2_MOUNT_XIP)
seq_puts(seq, ",xip");
#endif
@@ -549,7 +549,7 @@ static int parse_options(char *options, struct super_block *sb)
break;
#endif
-#ifdef CONFIG_EXT2_FS_XIP
+#ifdef CONFIG_FS_DAX
set_opt (sbi->s_mount_opt, XIP);
#else
ext2_msg(sb, KERN_INFO, "xip option not supported");
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d73db11..e6b48cc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1642,7 +1642,7 @@ struct super_operations {
#define IS_IMA(inode) ((inode)->i_flags & S_IMA)
#define IS_AUTOMOUNT(inode) ((inode)->i_flags & S_AUTOMOUNT)
#define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC)
-#ifdef CONFIG_FS_XIP
+#ifdef CONFIG_FS_DAX
#define IS_DAX(inode) ((inode)->i_flags & S_DAX)
#else
#define IS_DAX(inode) 0
@@ -2488,7 +2488,7 @@ extern loff_t fixed_size_llseek(struct file *file, loff_t offset,
extern int generic_file_open(struct inode * inode, struct file * filp);
extern int nonseekable_open(struct inode * inode, struct file * filp);
-#ifdef CONFIG_FS_XIP
+#ifdef CONFIG_FS_DAX
int dax_clear_blocks(struct inode *, sector_t block, long size);
int dax_truncate_page(struct inode *, loff_t from, get_block_t);
ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
--
2.1.0

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Key fingerprint: 2A0B 4ED9 15F2 D3FA 45F5 B162 1728 0A97 8118 6ACF


Matthew Wilcox

2014-10-16 21:52:56 UTC

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+ bool "Direct Access support"
+ depends on MMU
+ help
+ Direct Access (DAX) can be used on memory-backed block devices.
+ If the block device supports DAX and the filesystem supports DAX,
+ then you can avoid using the pagecache to buffer I/Os. Turning
+ on this option will compile in support for DAX; you will need to
+ mount the filesystem using the -o xip option.

There is a mismatch between the documentation file (earlier patch): -o
dax, and this config description: -o xip.

Whoops! Good catch.

Post by Mathieu Desnoyers
I guess we might want to switch the mount option to "-o dax" and
document it as such, and since it should be usable transparently for the
same use-cases "-o xip" was enabling, we might want to keep parsing of
"-o xip" in the code for backward compatibility.
Thoughts ?

That's exactly what we do for ext2. For ext4, we force people to use
the new -o dax option. We stop documenting that -o xip exists, and we
print a message to tell people to switch over to -o dax.
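
The ext2 behaviour described above can be sketched in userspace C as follows. This is only an illustration under stated assumptions, not the kernel code: parse_one() and show_options() are hypothetical helpers, MOUNT_XIP and MOUNT_DAX are stand-in names whose values are copied from the EXT2_MOUNT_XIP and EXT2_MOUNT_DAX defines in the ext2 patch, and CONFIG_FS_DAX is assumed to be enabled.

```c
#include <stdio.h>
#include <string.h>

/* Stand-in flag values, copied from EXT2_MOUNT_XIP / EXT2_MOUNT_DAX. */
#define MOUNT_XIP 0x010000u
#define MOUNT_DAX 0x100000u

/*
 * Hypothetical helper: parse one mount option token.
 * Returns 0 on success, -1 if the token is not recognised.
 */
static int parse_one(const char *tok, unsigned int *opts)
{
	if (strcmp(tok, "xip") == 0) {
		/* Legacy spelling: warn, remember it, then behave like "dax". */
		fprintf(stderr, "use dax instead of xip\n");
		*opts |= MOUNT_XIP | MOUNT_DAX;
		return 0;
	}
	if (strcmp(tok, "dax") == 0) {
		*opts |= MOUNT_DAX;
		return 0;
	}
	return -1;
}

/*
 * Hypothetical helper: format the option string the way
 * ext2_show_options() reports it in /proc/mounts.
 */
static void show_options(unsigned int opts, char *buf, size_t len)
{
	snprintf(buf, len, "%s%s",
		 (opts & MOUNT_XIP) ? ",xip" : "",
		 (opts & MOUNT_DAX) ? ",dax" : "");
}
```

Keeping a separate XIP bit is what lets the old spelling still be reported in /proc/mounts, while every code path that decides behaviour only tests the DAX bit.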

Matthew Wilcox

2014-09-25 20:33:35 UTC

To help people transition, accept the 'xip' mount option (and report it
in /proc/mounts), but print a message encouraging people to switch over
to the 'dax' option.
---
fs/ext2/ext2.h | 13 +++++++------
fs/ext2/file.c | 2 +-
fs/ext2/inode.c | 6 +++---
fs/ext2/namei.c | 8 ++++----
fs/ext2/super.c | 25 ++++++++++++++++---------
5 files changed, 31 insertions(+), 23 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index b8b1c11..46133a0 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -380,14 +380,15 @@ struct ext2_inode {
#define EXT2_MOUNT_NO_UID32 0x000200 /* Disable 32-bit UIDs */
#define EXT2_MOUNT_XATTR_USER 0x004000 /* Extended user attributes */
#define EXT2_MOUNT_POSIX_ACL 0x008000 /* POSIX Access Control Lists */
-#ifdef CONFIG_FS_DAX
-#define EXT2_MOUNT_XIP 0x010000 /* Execute in place */
-#else
-#define EXT2_MOUNT_XIP 0
-#endif
+#define EXT2_MOUNT_XIP 0x010000 /* Obsolete, use DAX */
#define EXT2_MOUNT_USRQUOTA 0x020000 /* user quota */
#define EXT2_MOUNT_GRPQUOTA 0x040000 /* group quota */
#define EXT2_MOUNT_RESERVATION 0x080000 /* Preallocation */
+#ifdef CONFIG_FS_DAX
+#define EXT2_MOUNT_DAX 0x100000 /* Direct Access */
+#else
+#define EXT2_MOUNT_DAX 0
+#endif

#define clear_opt(o, opt) o &= ~EXT2_MOUNT_##opt
@@ -789,7 +790,7 @@ extern int ext2_fsync(struct file *file, loff_t start, loff_t end,
int datasync);
extern const struct inode_operations ext2_file_inode_operations;
extern const struct file_operations ext2_file_operations;
-extern const struct file_operations ext2_xip_file_operations;
+extern const struct file_operations ext2_dax_file_operations;

/* inode.c */
extern const struct address_space_operations ext2_aops;
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 46b333d..5b8cab5 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -110,7 +110,7 @@ const struct file_operations ext2_file_operations = {
};

#ifdef CONFIG_FS_DAX
-const struct file_operations ext2_xip_file_operations = {
+const struct file_operations ext2_dax_file_operations = {
.llseek = generic_file_llseek,
.read = new_sync_read,
.write = new_sync_write,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 034fd42..6434bc0 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1286,7 +1286,7 @@ void ext2_set_inode_flags(struct inode *inode)
inode->i_flags |= S_NOATIME;
if (flags & EXT2_DIRSYNC_FL)
inode->i_flags |= S_DIRSYNC;
- if (test_opt(inode->i_sb, XIP))
+ if (test_opt(inode->i_sb, DAX))
inode->i_flags |= S_DAX;
}

@@ -1388,9 +1388,9 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)

if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext2_file_inode_operations;
- if (test_opt(inode->i_sb, XIP)) {
+ if (test_opt(inode->i_sb, DAX)) {
inode->i_mapping->a_ops = &ext2_aops;
- inode->i_fop = &ext2_xip_file_operations;
+ inode->i_fop = &ext2_dax_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
inode->i_mapping->a_ops = &ext2_nobh_aops;
inode->i_fop = &ext2_file_operations;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 0db888c..148f6e3 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -104,9 +104,9 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
return PTR_ERR(inode);

inode->i_op = &ext2_file_inode_operations;
- if (test_opt(inode->i_sb, XIP)) {
+ if (test_opt(inode->i_sb, DAX)) {
inode->i_mapping->a_ops = &ext2_aops;
- inode->i_fop = &ext2_xip_file_operations;
+ inode->i_fop = &ext2_dax_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
inode->i_mapping->a_ops = &ext2_nobh_aops;
inode->i_fop = &ext2_file_operations;
@@ -125,9 +125,9 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
return PTR_ERR(inode);

inode->i_op = &ext2_file_inode_operations;
- if (test_opt(inode->i_sb, XIP)) {
+ if (test_opt(inode->i_sb, DAX)) {
inode->i_mapping->a_ops = &ext2_aops;
- inode->i_fop = &ext2_xip_file_operations;
+ inode->i_fop = &ext2_dax_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
inode->i_mapping->a_ops = &ext2_nobh_aops;
inode->i_fop = &ext2_file_operations;
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index feb53d8..8b9debf 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -290,6 +290,8 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root)
#ifdef CONFIG_FS_DAX
if (sbi->s_mount_opt & EXT2_MOUNT_XIP)
seq_puts(seq, ",xip");
+ if (sbi->s_mount_opt & EXT2_MOUNT_DAX)
+ seq_puts(seq, ",dax");
#endif

if (!test_opt(sb, RESERVATION))
@@ -393,7 +395,7 @@ enum {
Opt_resgid, Opt_resuid, Opt_sb, Opt_err_cont, Opt_err_panic,
Opt_err_ro, Opt_nouid32, Opt_nocheck, Opt_debug,
Opt_oldalloc, Opt_orlov, Opt_nobh, Opt_user_xattr, Opt_nouser_xattr,
- Opt_acl, Opt_noacl, Opt_xip, Opt_ignore, Opt_err, Opt_quota,
+ Opt_acl, Opt_noacl, Opt_xip, Opt_dax, Opt_ignore, Opt_err, Opt_quota,
Opt_usrquota, Opt_grpquota, Opt_reservation, Opt_noreservation
};

@@ -422,6 +424,7 @@ static const match_table_t tokens = {
{Opt_acl, "acl"},
{Opt_noacl, "noacl"},
{Opt_xip, "xip"},
+ {Opt_dax, "dax"},
{Opt_grpquota, "grpquota"},
{Opt_ignore, "noquota"},
{Opt_quota, "quota"},
@@ -549,10 +552,14 @@ static int parse_options(char *options, struct super_block *sb)
break;
#endif
case Opt_xip:
+ ext2_msg(sb, KERN_INFO, "use dax instead of xip");
+ set_opt(sbi->s_mount_opt, XIP);
+ /* Fall through */
+ case Opt_dax:
#ifdef CONFIG_FS_DAX
- set_opt (sbi->s_mount_opt, XIP);
+ set_opt(sbi->s_mount_opt, DAX);
#else
- ext2_msg(sb, KERN_INFO, "xip option not supported");
+ ext2_msg(sb, KERN_INFO, "dax option not supported");
#endif
break;

@@ -896,15 +903,15 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)

blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);

- if (sbi->s_mount_opt & EXT2_MOUNT_XIP) {
+ if (sbi->s_mount_opt & EXT2_MOUNT_DAX) {
if (blocksize != PAGE_SIZE) {
ext2_msg(sb, KERN_ERR,
- "error: unsupported blocksize for xip");
+ "error: unsupported blocksize for dax");
goto failed_mount;
}
if (!sb->s_bdev->bd_disk->fops->direct_access) {
ext2_msg(sb, KERN_ERR,
- "error: device does not support xip");
+ "error: device does not support dax");
goto failed_mount;
}
}
@@ -1276,10 +1283,10 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0);

es = sbi->s_es;
- if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) {
+ if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_DAX) {
ext2_msg(sb, KERN_WARNING, "warning: refusing change of "
- "xip flag with busy inodes while remounting");
- sbi->s_mount_opt ^= EXT2_MOUNT_XIP;
+ "dax flag with busy inodes while remounting");
+ sbi->s_mount_opt ^= EXT2_MOUNT_DAX;
}
if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY)) {
spin_unlock(&sbi->s_lock);

--
2.1.0


Mathieu Desnoyers

2014-10-16 12:32:53 UTC

Post by Matthew Wilcox
To help people transition, accept the 'xip' mount option (and report it
in /proc/mounts), but print a message encouraging people to switch over
to the 'dax' option.
---
fs/ext2/ext2.h | 13 +++++++------
fs/ext2/file.c | 2 +-
fs/ext2/inode.c | 6 +++---
fs/ext2/namei.c | 8 ++++----
fs/ext2/super.c | 25 ++++++++++++++++---------
5 files changed, 31 insertions(+), 23 deletions(-)
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index b8b1c11..46133a0 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -380,14 +380,15 @@ struct ext2_inode {
#define EXT2_MOUNT_NO_UID32 0x000200 /* Disable 32-bit UIDs */
#define EXT2_MOUNT_XATTR_USER 0x004000 /* Extended user attributes */
#define EXT2_MOUNT_POSIX_ACL 0x008000 /* POSIX Access Control Lists */
-#ifdef CONFIG_FS_DAX
-#define EXT2_MOUNT_XIP 0x010000 /* Execute in place */
-#else
-#define EXT2_MOUNT_XIP 0
-#endif
+#define EXT2_MOUNT_XIP 0x010000 /* Obsolete, use DAX */
#define EXT2_MOUNT_USRQUOTA 0x020000 /* user quota */
#define EXT2_MOUNT_GRPQUOTA 0x040000 /* group quota */
#define EXT2_MOUNT_RESERVATION 0x080000 /* Preallocation */
+#ifdef CONFIG_FS_DAX
+#define EXT2_MOUNT_DAX 0x100000 /* Direct Access */
+#else
+#define EXT2_MOUNT_DAX 0
+#endif
#define clear_opt(o, opt) o &= ~EXT2_MOUNT_##opt
@@ -789,7 +790,7 @@ extern int ext2_fsync(struct file *file, loff_t start, loff_t end,
int datasync);
extern const struct inode_operations ext2_file_inode_operations;
extern const struct file_operations ext2_file_operations;
-extern const struct file_operations ext2_xip_file_operations;
+extern const struct file_operations ext2_dax_file_operations;
/* inode.c */
extern const struct address_space_operations ext2_aops;
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 46b333d..5b8cab5 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -110,7 +110,7 @@ const struct file_operations ext2_file_operations = {
};
#ifdef CONFIG_FS_DAX
-const struct file_operations ext2_xip_file_operations = {
+const struct file_operations ext2_dax_file_operations = {
.llseek = generic_file_llseek,
.read = new_sync_read,
.write = new_sync_write,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 034fd42..6434bc0 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1286,7 +1286,7 @@ void ext2_set_inode_flags(struct inode *inode)
inode->i_flags |= S_NOATIME;
if (flags & EXT2_DIRSYNC_FL)
inode->i_flags |= S_DIRSYNC;
- if (test_opt(inode->i_sb, XIP))
+ if (test_opt(inode->i_sb, DAX))
inode->i_flags |= S_DAX;
}
@@ -1388,9 +1388,9 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext2_file_inode_operations;
- if (test_opt(inode->i_sb, XIP)) {
+ if (test_opt(inode->i_sb, DAX)) {
inode->i_mapping->a_ops = &ext2_aops;
- inode->i_fop = &ext2_xip_file_operations;
+ inode->i_fop = &ext2_dax_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
inode->i_mapping->a_ops = &ext2_nobh_aops;
inode->i_fop = &ext2_file_operations;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 0db888c..148f6e3 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -104,9 +104,9 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
return PTR_ERR(inode);
inode->i_op = &ext2_file_inode_operations;
- if (test_opt(inode->i_sb, XIP)) {
+ if (test_opt(inode->i_sb, DAX)) {
inode->i_mapping->a_ops = &ext2_aops;
- inode->i_fop = &ext2_xip_file_operations;
+ inode->i_fop = &ext2_dax_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
inode->i_mapping->a_ops = &ext2_nobh_aops;
inode->i_fop = &ext2_file_operations;
@@ -125,9 +125,9 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
return PTR_ERR(inode);
inode->i_op = &ext2_file_inode_operations;
- if (test_opt(inode->i_sb, XIP)) {
+ if (test_opt(inode->i_sb, DAX)) {
inode->i_mapping->a_ops = &ext2_aops;
- inode->i_fop = &ext2_xip_file_operations;
+ inode->i_fop = &ext2_dax_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
inode->i_mapping->a_ops = &ext2_nobh_aops;
inode->i_fop = &ext2_file_operations;
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index feb53d8..8b9debf 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -290,6 +290,8 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root)
#ifdef CONFIG_FS_DAX
if (sbi->s_mount_opt & EXT2_MOUNT_XIP)
seq_puts(seq, ",xip");
+ if (sbi->s_mount_opt & EXT2_MOUNT_DAX)
+ seq_puts(seq, ",dax");
#endif
if (!test_opt(sb, RESERVATION))
@@ -393,7 +395,7 @@ enum {
Opt_resgid, Opt_resuid, Opt_sb, Opt_err_cont, Opt_err_panic,
Opt_err_ro, Opt_nouid32, Opt_nocheck, Opt_debug,
Opt_oldalloc, Opt_orlov, Opt_nobh, Opt_user_xattr, Opt_nouser_xattr,
- Opt_acl, Opt_noacl, Opt_xip, Opt_ignore, Opt_err, Opt_quota,
+ Opt_acl, Opt_noacl, Opt_xip, Opt_dax, Opt_ignore, Opt_err, Opt_quota,
Opt_usrquota, Opt_grpquota, Opt_reservation, Opt_noreservation
};
@@ -422,6 +424,7 @@ static const match_table_t tokens = {
{Opt_acl, "acl"},
{Opt_noacl, "noacl"},
{Opt_xip, "xip"},
+ {Opt_dax, "dax"},
{Opt_grpquota, "grpquota"},
{Opt_ignore, "noquota"},
{Opt_quota, "quota"},
@@ -549,10 +552,14 @@ static int parse_options(char *options, struct super_block *sb)
break;
#endif
case Opt_xip:
+ ext2_msg(sb, KERN_INFO, "use dax instead of xip");
+ set_opt(sbi->s_mount_opt, XIP);
+ /* Fall through */
+ case Opt_dax:
#ifdef CONFIG_FS_DAX
- set_opt (sbi->s_mount_opt, XIP);
+ set_opt(sbi->s_mount_opt, DAX);
#else
- ext2_msg(sb, KERN_INFO, "xip option not supported");
+ ext2_msg(sb, KERN_INFO, "dax option not supported");
#endif
break;
@@ -896,15 +903,15 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
- if (sbi->s_mount_opt & EXT2_MOUNT_XIP) {
+ if (sbi->s_mount_opt & EXT2_MOUNT_DAX) {
if (blocksize != PAGE_SIZE) {
ext2_msg(sb, KERN_ERR,
- "error: unsupported blocksize for xip");
+ "error: unsupported blocksize for dax");
goto failed_mount;
}
if (!sb->s_bdev->bd_disk->fops->direct_access) {
ext2_msg(sb, KERN_ERR,
- "error: device does not support xip");
+ "error: device does not support dax");
goto failed_mount;
}
}
@@ -1276,10 +1283,10 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0);
es = sbi->s_es;
- if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) {
+ if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_DAX) {
ext2_msg(sb, KERN_WARNING, "warning: refusing change of "
- "xip flag with busy inodes while remounting");
- sbi->s_mount_opt ^= EXT2_MOUNT_XIP;
+ "dax flag with busy inodes while remounting");
+ sbi->s_mount_opt ^= EXT2_MOUNT_DAX;
}
if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY)) {
spin_unlock(&sbi->s_lock);
--
2.1.0

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Key fingerprint: 2A0B 4ED9 15F2 D3FA 45F5 B162 1728 0A97 8118 6ACF


Matthew Wilcox

2014-09-25 20:33:27 UTC

dax_truncate_page() takes a get_block parameter, just like
nobh_truncate_page() and block_truncate_page().

Signed-off-by: Matthew Wilcox <***@intel.com>
---
fs/dax.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
fs/ext2/inode.c | 2 +-
include/linux/fs.h | 4 ++--
mm/filemap_xip.c | 40 ----------------------------------------
4 files changed, 47 insertions(+), 43 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index ac5d3a6..6801be7 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -460,3 +460,47 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
return result;
}
EXPORT_SYMBOL_GPL(dax_fault);
+
+/**
+ * dax_truncate_page - handle a partial page being truncated in a DAX file
+ * @inode: The file being truncated
+ * @from: The file offset that is being truncated to
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * Similar to block_truncate_page(), this function can be called by a
+ * filesystem when it is truncating a DAX file to handle the partial page.
+ *
+ * We work in terms of PAGE_CACHE_SIZE here for commonality with
+ * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
+ * took care of disposing of the unnecessary blocks. Even if the filesystem
+ * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
+ * since the file might be mmaped.
+ */
+int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
+{
+ struct buffer_head bh;
+ pgoff_t index = from >> PAGE_CACHE_SHIFT;
+ unsigned offset = from & (PAGE_CACHE_SIZE-1);
+ unsigned length = PAGE_CACHE_ALIGN(from) - from;
+ int err;
+
+ /* Block boundary? Nothing to do */
+ if (!length)
+ return 0;
+
+ memset(&bh, 0, sizeof(bh));
+ bh.b_size = PAGE_CACHE_SIZE;
+ err = get_block(inode, index, &bh, 0);
+ if (err < 0)
+ return err;
+ if (buffer_written(&bh)) {
+ void *addr;
+ err = dax_get_addr(&bh, &addr, inode->i_blkbits);
+ if (err < 0)
+ return err;
+ memset(addr + offset, 0, length);
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(dax_truncate_page);
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 52978b8..5ac0a34 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1210,7 +1210,7 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
inode_dio_wait(inode);

if (IS_DAX(inode))
- error = xip_truncate_page(inode->i_mapping, newsize);
+ error = dax_truncate_page(inode, newsize, ext2_get_block);
else if (test_opt(inode->i_sb, NOBH))
error = nobh_truncate_page(inode->i_mapping,
newsize, ext2_get_block);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 338f04b..eee848d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2492,7 +2492,7 @@ extern int nonseekable_open(struct inode * inode, struct file * filp);

#ifdef CONFIG_FS_XIP
int dax_clear_blocks(struct inode *, sector_t block, long size);
-extern int xip_truncate_page(struct address_space *mapping, loff_t from);
+int dax_truncate_page(struct inode *, loff_t from, get_block_t);
ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
loff_t, get_block_t, dio_iodone_t, int flags);
int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
@@ -2503,7 +2503,7 @@ static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
return 0;
}

-static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
+static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
{
return 0;
}
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index 9dd45f3..6316578 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -21,43 +21,3 @@
#include <asm/tlbflush.h>
#include <asm/io.h>

-/*
- * truncate a page used for execute in place
- * functionality is analog to block_truncate_page but does use get_xip_mem
- * to get the page instead of page cache
- */
-int
-xip_truncate_page(struct address_space *mapping, loff_t from)
-{
- pgoff_t index = from >> PAGE_CACHE_SHIFT;
- unsigned offset = from & (PAGE_CACHE_SIZE-1);
- unsigned blocksize;
- unsigned length;
- void *xip_mem;
- unsigned long xip_pfn;
- int err;
-
- BUG_ON(!mapping->a_ops->get_xip_mem);
-
- blocksize = 1 << mapping->host->i_blkbits;
- length = offset & (blocksize - 1);
-
- /* Block boundary? Nothing to do */
- if (!length)
- return 0;
-
- length = blocksize - length;
-
- err = mapping->a_ops->get_xip_mem(mapping, index, 0,
- &xip_mem, &xip_pfn);
- if (unlikely(err)) {
- if (err == -ENODATA)
- /* Hole? No need to truncate */
- return 0;
- else
- return err;
- }
- memset(xip_mem + offset, 0, length);
- return 0;
-}
-EXPORT_SYMBOL_GPL(xip_truncate_page);

--
2.1.0
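
The change in how much of the page gets zeroed can be seen by putting the two length computations side by side. The following is a userspace sketch, not the kernel code: SKETCH_PAGE_SIZE, struct tail_range, dax_tail() and xip_tail() are hypothetical names, a 4096-byte page is assumed, and PAGE_CACHE_SIZE is assumed to equal PAGE_SIZE.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size, for illustration only */

struct tail_range {
	unsigned int offset;	/* first byte to zero within the page */
	unsigned int length;	/* number of bytes to zero */
};

/* As in dax_truncate_page(): zero from 'from' to the end of the page. */
static struct tail_range dax_tail(uint64_t from)
{
	struct tail_range r;
	uint64_t mask = SKETCH_PAGE_SIZE - 1;
	uint64_t aligned = (from + mask) & ~mask;	/* PAGE_CACHE_ALIGN(from) */

	r.offset = (unsigned int)(from & mask);
	r.length = (unsigned int)(aligned - from);
	return r;
}

/* As in the removed xip_truncate_page(): zero only to the end of the block. */
static struct tail_range xip_tail(uint64_t from, unsigned int blkbits)
{
	struct tail_range r;
	unsigned int blocksize = 1u << blkbits;
	unsigned int partial;

	r.offset = (unsigned int)(from & (SKETCH_PAGE_SIZE - 1));
	partial = r.offset & (blocksize - 1);
	r.length = partial ? blocksize - partial : 0;
	return r;
}
```

With 1 KiB blocks and from = 5000, xip_tail() covers 120 bytes while dax_tail() covers 3192 bytes. With from = 5120 (a block boundary that is not a page boundary), the old computation zeroes nothing and the new one zeroes the remaining 3072 bytes of the page, which matches the comment about having to zero the rest of the page because the file might be mmaped.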


Mathieu Desnoyers

2014-10-16 10:28:26 UTC

Post by Matthew Wilcox
dax_truncate_page() takes a get_block parameter, just like
nobh_truncate_page() and block_truncate_page().
---
fs/dax.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
fs/ext2/inode.c | 2 +-
include/linux/fs.h | 4 ++--
mm/filemap_xip.c | 40 ----------------------------------------
4 files changed, 47 insertions(+), 43 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index ac5d3a6..6801be7 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -460,3 +460,47 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
return result;
}
EXPORT_SYMBOL_GPL(dax_fault);
+
+/**
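+ * dax_truncate_page - handle a partial page being truncated in a DAX file
+ * @inode: The file being truncated
+ * @from: The file offset that is being truncated to
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *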
+ * Similar to block_truncate_page(), this function can be called by a
+ * filesystem when it is truncating a DAX file to handle the partial page.
+ *
+ * We work in terms of PAGE_CACHE_SIZE here for commonality with
+ * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
+ * took care of disposing of the unnecessary blocks. Even if the filesystem
+ * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
+ * since the file might be mmaped.
+ */
+int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
+{
+ struct buffer_head bh;
+ pgoff_t index = from >> PAGE_CACHE_SHIFT;
+ unsigned offset = from & (PAGE_CACHE_SIZE-1);
+ unsigned length = PAGE_CACHE_ALIGN(from) - from;

Nit: unsigned -> unsigned int (I'm starting to think that FS code
perhaps has a different coding style than kernel/ core code)

Post by Matthew Wilcox
+ int err;
+
+ /* Block boundary? Nothing to do */
+ if (!length)
+ return 0;
+
+ memset(&bh, 0, sizeof(bh));
+ bh.b_size = PAGE_CACHE_SIZE;
+ err = get_block(inode, index, &bh, 0);
+ if (err < 0)
+ return err;
+ if (buffer_written(&bh)) {
+ void *addr;

Missing blank line after the declaration.

Other than that:

Reviewed-by: Mathieu Desnoyers <***@efficios.com>

Thanks,

Mathieu

Post by Matthew Wilcox
+ err = dax_get_addr(&bh, &addr, inode->i_blkbits);
+ if (err < 0)
+ return err;
+ memset(addr + offset, 0, length);
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(dax_truncate_page);
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 52978b8..5ac0a34 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1210,7 +1210,7 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
inode_dio_wait(inode);
if (IS_DAX(inode))
- error = xip_truncate_page(inode->i_mapping, newsize);
+ error = dax_truncate_page(inode, newsize, ext2_get_block);
else if (test_opt(inode->i_sb, NOBH))
error = nobh_truncate_page(inode->i_mapping,
newsize, ext2_get_block);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 338f04b..eee848d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2492,7 +2492,7 @@ extern int nonseekable_open(struct inode * inode, struct file * filp);
#ifdef CONFIG_FS_XIP
int dax_clear_blocks(struct inode *, sector_t block, long size);
-extern int xip_truncate_page(struct address_space *mapping, loff_t from);
+int dax_truncate_page(struct inode *, loff_t from, get_block_t);
ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
loff_t, get_block_t, dio_iodone_t, int flags);
int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
@@ -2503,7 +2503,7 @@ static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
return 0;
}
-static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
+static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
{
return 0;
}
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index 9dd45f3..6316578 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -21,43 +21,3 @@
#include <asm/tlbflush.h>
#include <asm/io.h>
-/*
- * truncate a page used for execute in place
- * functionality is analog to block_truncate_page but does use get_xip_mem
- * to get the page instead of page cache
- */
-int
-xip_truncate_page(struct address_space *mapping, loff_t from)
-{
- pgoff_t index = from >> PAGE_CACHE_SHIFT;
- unsigned offset = from & (PAGE_CACHE_SIZE-1);
- unsigned blocksize;
- unsigned length;
- void *xip_mem;
- unsigned long xip_pfn;
- int err;
-
- BUG_ON(!mapping->a_ops->get_xip_mem);
-
- blocksize = 1 << mapping->host->i_blkbits;
- length = offset & (blocksize - 1);
-
- /* Block boundary? Nothing to do */
- if (!length)
- return 0;
-
- length = blocksize - length;
-
- err = mapping->a_ops->get_xip_mem(mapping, index, 0,
- &xip_mem, &xip_pfn);
- if (unlikely(err)) {
- if (err == -ENODATA)
- /* Hole? No need to truncate */
- return 0;
- else
- return err;
- }
- memset(xip_mem + offset, 0, length);
- return 0;
-}
-EXPORT_SYMBOL_GPL(xip_truncate_page);
--
2.1.0

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Key fingerprint: 2A0B 4ED9 15F2 D3FA 45F5 B162 1728 0A97 8118 6ACF

Matthew Wilcox

2014-09-25 20:33:24 UTC

Use the generic AIO infrastructure instead of custom read and write
methods. In addition to giving us support for AIO, this adds the missing
locking between read() and truncate().

Signed-off-by: Matthew Wilcox <***@intel.com>
Reviewed-by: Ross Zwisler <***@linux.intel.com>
Reviewed-by: Jan Kara <***@suse.cz>
---
MAINTAINERS | 6 ++
fs/Makefile | 1 +
fs/dax.c | 195 ++++++++++++++++++++++++++++++++++++++++++++
fs/ext2/file.c | 6 +-
fs/ext2/inode.c | 8 +-
include/linux/fs.h | 18 ++++-
mm/filemap.c | 6 +-
mm/filemap_xip.c | 234 -----------------------------------------------------
8 files changed, 229 insertions(+), 245 deletions(-)
create mode 100644 fs/dax.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 670b3dc..c96056b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2934,6 +2934,12 @@ L: linux-***@vger.kernel.org
S: Maintained
F: drivers/i2c/busses/i2c-diolan-u2c.c

+DIRECT ACCESS (DAX)
+M: Matthew Wilcox <***@linux.intel.com>
+L: linux-***@vger.kernel.org
+S: Supported
+F: fs/dax.c
+
DIRECTORY NOTIFICATION (DNOTIFY)
M: Eric Paris <***@parisplace.org>
S: Maintained
diff --git a/fs/Makefile b/fs/Makefile
index 90c8852..0325ec3 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -28,6 +28,7 @@ obj-$(CONFIG_SIGNALFD) += signalfd.o
obj-$(CONFIG_TIMERFD) += timerfd.o
obj-$(CONFIG_EVENTFD) += eventfd.o
obj-$(CONFIG_AIO) += aio.o
+obj-$(CONFIG_FS_XIP) += dax.o
obj-$(CONFIG_FILE_LOCKING) += locks.o
obj-$(CONFIG_COMPAT) += compat.o compat_ioctl.o
obj-$(CONFIG_BINFMT_AOUT) += binfmt_aout.o
diff --git a/fs/dax.c b/fs/dax.c
new file mode 100644
index 0000000..108c68e
--- /dev/null
+++ b/fs/dax.c
@@ -0,0 +1,195 @@
+/*
+ * fs/dax.c - Direct Access filesystem code
+ * Copyright (c) 2013-2014 Intel Corporation
+ * Author: Matthew Wilcox <***@intel.com>
+ * Author: Ross Zwisler <***@linux.intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/atomic.h>
+#include <linux/blkdev.h>
+#include <linux/buffer_head.h>
+#include <linux/fs.h>
+#include <linux/genhd.h>
+#include <linux/mutex.h>
+#include <linux/uio.h>
+
+static long dax_get_addr(struct buffer_head *bh, void **addr, unsigned blkbits)
+{
+ unsigned long pfn;
+ sector_t sector = bh->b_blocknr << (blkbits - 9);
+ return bdev_direct_access(bh->b_bdev, sector, addr, &pfn, bh->b_size);
+}
+
+static void dax_new_buf(void *addr, unsigned size, unsigned first, loff_t pos,
+ loff_t end)
+{
+ loff_t final = end - pos + first; /* The final byte of the buffer */
+
+ if (first > 0)
+ memset(addr, 0, first);
+ if (final < size)
+ memset(addr + final, 0, size - final);
+}
+
+static bool buffer_written(struct buffer_head *bh)
+{
+ return buffer_mapped(bh) && !buffer_unwritten(bh);
+}
+
+/*
+ * When ext4 encounters a hole, it returns without modifying the buffer_head
+ * which means that we can't trust b_size. To cope with this, we set b_state
+ * to 0 before calling get_block and, if any bit is set, we know we can trust
+ * b_size. Unfortunate, really, since ext4 knows precisely how long a hole is
+ * and would save us time calling get_block repeatedly.
+ */
+static bool buffer_size_valid(struct buffer_head *bh)
+{
+ return bh->b_state != 0;
+}
+
+static ssize_t dax_io(int rw, struct inode *inode, struct iov_iter *iter,
+ loff_t start, loff_t end, get_block_t get_block,
+ struct buffer_head *bh)
+{
+ ssize_t retval = 0;
+ loff_t pos = start;
+ loff_t max = start;
+ loff_t bh_max = start;
+ void *addr;
+ bool hole = false;
+
+ if (rw != WRITE)
+ end = min(end, i_size_read(inode));
+
+ while (pos < end) {
+ unsigned len;
+ if (pos == max) {
+ unsigned blkbits = inode->i_blkbits;
+ sector_t block = pos >> blkbits;
+ unsigned first = pos - (block << blkbits);
+ long size;
+
+ if (pos == bh_max) {
+ bh->b_size = PAGE_ALIGN(end - pos);
+ bh->b_state = 0;
+ retval = get_block(inode, block, bh,
+ rw == WRITE);
+ if (retval)
+ break;
+ if (!buffer_size_valid(bh))
+ bh->b_size = 1 << blkbits;
+ bh_max = pos - first + bh->b_size;
+ } else {
+ unsigned done = bh->b_size -
+ (bh_max - (pos - first));
+ bh->b_blocknr += done >> blkbits;
+ bh->b_size -= done;
+ }
+ if (rw == WRITE) {
+ if (!buffer_mapped(bh)) {
+ retval = -EIO;
+ /* FIXME: fall back to buffered I/O */
+ break;
+ }
+ hole = false;
+ } else {
+ hole = !buffer_written(bh);
+ }
+
+ if (hole) {
+ addr = NULL;
+ size = bh->b_size - first;
+ } else {
+ retval = dax_get_addr(bh, &addr, blkbits);
+ if (retval < 0)
+ break;
+ if (buffer_unwritten(bh) || buffer_new(bh))
+ dax_new_buf(addr, retval, first, pos,
+ end);
+ addr += first;
+ size = retval - first;
+ }
+ max = min(pos + size, end);
+ }
+
+ if (rw == WRITE)
+ len = copy_from_iter(addr, max - pos, iter);
+ else if (!hole)
+ len = copy_to_iter(addr, max - pos, iter);
+ else
+ len = iov_iter_zero(max - pos, iter);
+
+ if (!len)
+ break;
+
+ pos += len;
+ addr += len;
+ }
+
+ return (pos == start) ? retval : pos - start;
+}
+
+/**
+ * dax_do_io - Perform I/O to a DAX file
+ * @rw: READ to read or WRITE to write
+ * @iocb: The control block for this I/O
+ * @inode: The file which the I/O is directed at
+ * @iter: The addresses to do I/O from or to
+ * @pos: The file offset where the I/O starts
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ * @end_io: A filesystem callback for I/O completion
+ * @flags: See below
+ *
+ * This function uses the same locking scheme as do_blockdev_direct_IO:
+ * If @flags has DIO_LOCKING set, we assume that the i_mutex is held by the
+ * caller for writes. For reads, we take and release the i_mutex ourselves.
+ * If DIO_LOCKING is not set, the filesystem takes care of its own locking.
+ * As with do_blockdev_direct_IO(), we increment i_dio_count while the I/O
+ * is in progress.
+ */
+ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
+ struct iov_iter *iter, loff_t pos,
+ get_block_t get_block, dio_iodone_t end_io, int flags)
+{
+ struct buffer_head bh;
+ ssize_t retval = -EINVAL;
+ loff_t end = pos + iov_iter_count(iter);
+
+ memset(&bh, 0, sizeof(bh));
+
+ if ((flags & DIO_LOCKING) && (rw == READ)) {
+ struct address_space *mapping = inode->i_mapping;
+ mutex_lock(&inode->i_mutex);
+ retval = filemap_write_and_wait_range(mapping, pos, end - 1);
+ if (retval) {
+ mutex_unlock(&inode->i_mutex);
+ goto out;
+ }
+ }
+
+ /* Protects against truncate */
+ atomic_inc(&inode->i_dio_count);
+
+ retval = dax_io(rw, inode, iter, pos, end, get_block, &bh);
+
+ if ((flags & DIO_LOCKING) && (rw == READ))
+ mutex_unlock(&inode->i_mutex);
+
+ if ((retval > 0) && end_io)
+ end_io(iocb, pos, retval, bh.b_private);
+
+ inode_dio_done(inode);
+ out:
+ return retval;
+}
+EXPORT_SYMBOL_GPL(dax_do_io);
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 7c87b22..a247123 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -81,8 +81,10 @@ const struct file_operations ext2_file_operations = {
#ifdef CONFIG_EXT2_FS_XIP
const struct file_operations ext2_xip_file_operations = {
.llseek = generic_file_llseek,
- .read = xip_file_read,
- .write = xip_file_write,
+ .read = new_sync_read,
+ .write = new_sync_write,
+ .read_iter = generic_file_read_iter,
+ .write_iter = generic_file_write_iter,
.unlocked_ioctl = ext2_ioctl,
#ifdef CONFIG_COMPAT
.compat_ioctl = ext2_compat_ioctl,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 0cb0448..3ccd5fd 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -859,7 +859,12 @@ ext2_direct_IO(int rw, struct kiocb *iocb, struct iov_iter *iter,
size_t count = iov_iter_count(iter);
ssize_t ret;

- ret = blockdev_direct_IO(rw, iocb, inode, iter, offset, ext2_get_block);
+ if (IS_DAX(inode))
+ ret = dax_do_io(rw, iocb, inode, iter, offset, ext2_get_block,
+ NULL, DIO_LOCKING);
+ else
+ ret = blockdev_direct_IO(rw, iocb, inode, iter, offset,
+ ext2_get_block);
if (ret < 0 && (rw & WRITE))
ext2_write_failed(mapping, offset + count);
return ret;
@@ -888,6 +893,7 @@ const struct address_space_operations ext2_aops = {
const struct address_space_operations ext2_aops_xip = {
.bmap = ext2_bmap,
.get_xip_mem = ext2_get_xip_mem,
+ .direct_IO = ext2_direct_IO,
};

const struct address_space_operations ext2_nobh_aops = {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e99e5c4..45839e8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2490,17 +2490,22 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
extern int nonseekable_open(struct inode * inode, struct file * filp);

#ifdef CONFIG_FS_XIP
-extern ssize_t xip_file_read(struct file *filp, char __user *buf, size_t len,
- loff_t *ppos);
extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
-extern ssize_t xip_file_write(struct file *filp, const char __user *buf,
- size_t len, loff_t *ppos);
extern int xip_truncate_page(struct address_space *mapping, loff_t from);
+ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
+ loff_t, get_block_t, dio_iodone_t, int flags);
#else
static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
{
return 0;
}
+
+static inline ssize_t dax_do_io(int rw, struct kiocb *iocb,
+ struct inode *inode, struct iov_iter *iter, loff_t pos,
+ get_block_t get_block, dio_iodone_t end_io, int flags)
+{
+ return -ENOTTY;
+}
#endif

#ifdef CONFIG_BLOCK
@@ -2657,6 +2662,11 @@ extern int generic_show_options(struct seq_file *m, struct dentry *root);
extern void save_mount_options(struct super_block *sb, char *options);
extern void replace_mount_options(struct super_block *sb, char *options);

+static inline bool io_is_direct(struct file *filp)
+{
+ return (filp->f_flags & O_DIRECT) || IS_DAX(file_inode(filp));
+}
+
static inline ino_t parent_ino(struct dentry *dentry)
{
ino_t res;
diff --git a/mm/filemap.c b/mm/filemap.c
index fec4db9..e69b586 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1690,8 +1690,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
loff_t *ppos = &iocb->ki_pos;
loff_t pos = *ppos;

- /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
- if (file->f_flags & O_DIRECT) {
+ if (io_is_direct(file)) {
struct address_space *mapping = file->f_mapping;
struct inode *inode = mapping->host;
size_t count = iov_iter_count(iter);
@@ -2581,8 +2580,7 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (err)
goto out;

- /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
- if (unlikely(file->f_flags & O_DIRECT)) {
+ if (io_is_direct(file)) {
loff_t endbyte;

written = generic_file_direct_write(iocb, from, pos);
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index c8d23e9..f7c37a1 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -42,119 +42,6 @@ static struct page *xip_sparse_page(void)
}

/*
- * This is a file read routine for execute in place files, and uses
- * the mapping->a_ops->get_xip_mem() function for the actual low-level
- * stuff.
- *
- * Note the struct file* is not used at all. It may be NULL.
- */
-static ssize_t
-do_xip_mapping_read(struct address_space *mapping,
- struct file_ra_state *_ra,
- struct file *filp,
- char __user *buf,
- size_t len,
- loff_t *ppos)
-{
- struct inode *inode = mapping->host;
- pgoff_t index, end_index;
- unsigned long offset;
- loff_t isize, pos;
- size_t copied = 0, error = 0;
-
- BUG_ON(!mapping->a_ops->get_xip_mem);
-
- pos = *ppos;
- index = pos >> PAGE_CACHE_SHIFT;
- offset = pos & ~PAGE_CACHE_MASK;
-
- isize = i_size_read(inode);
- if (!isize)
- goto out;
-
- end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
- do {
- unsigned long nr, left;
- void *xip_mem;
- unsigned long xip_pfn;
- int zero = 0;
-
- /* nr is the maximum number of bytes to copy from this page */
- nr = PAGE_CACHE_SIZE;
- if (index >= end_index) {
- if (index > end_index)
- goto out;
- nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
- if (nr <= offset) {
- goto out;
- }
- }
- nr = nr - offset;
- if (nr > len - copied)
- nr = len - copied;
-
- error = mapping->a_ops->get_xip_mem(mapping, index, 0,
- &xip_mem, &xip_pfn);
- if (unlikely(error)) {
- if (error == -ENODATA) {
- /* sparse */
- zero = 1;
- } else
- goto out;
- }
-
- /* If users can be writing to this page using arbitrary
- * virtual addresses, take care about potential aliasing
- * before reading the page on the kernel side.
- */
- if (mapping_writably_mapped(mapping))
- /* address based flush */ ;
-
- /*
- * Ok, we have the mem, so now we can copy it to user space...
- *
- * The actor routine returns how many bytes were actually used..
- * NOTE! This may not be the same as how much of a user buffer
- * we filled up (we may be padding etc), so we can only update
- * "pos" here (the actor routine has to update the user buffer
- * pointers and the remaining count).
- */
- if (!zero)
- left = __copy_to_user(buf+copied, xip_mem+offset, nr);
- else
- left = __clear_user(buf + copied, nr);
-
- if (left) {
- error = -EFAULT;
- goto out;
- }
-
- copied += (nr - left);
- offset += (nr - left);
- index += offset >> PAGE_CACHE_SHIFT;
- offset &= ~PAGE_CACHE_MASK;
- } while (copied < len);
-
-out:
- *ppos = pos + copied;
- if (filp)
- file_accessed(filp);
-
- return (copied ? copied : error);
-}
-
-ssize_t
-xip_file_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
-{
- if (!access_ok(VERIFY_WRITE, buf, len))
- return -EFAULT;
-
- return do_xip_mapping_read(filp->f_mapping, &filp->f_ra, filp,
- buf, len, ppos);
-}
-EXPORT_SYMBOL_GPL(xip_file_read);
-
-/*
* __xip_unmap is invoked from xip_unmap and
* xip_write
*
@@ -340,127 +227,6 @@ int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
}
EXPORT_SYMBOL_GPL(xip_file_mmap);

-static ssize_t
-__xip_file_write(struct file *filp, const char __user *buf,
- size_t count, loff_t pos, loff_t *ppos)
-{
- struct address_space * mapping = filp->f_mapping;
- const struct address_space_operations *a_ops = mapping->a_ops;
- struct inode *inode = mapping->host;
- long status = 0;
- size_t bytes;
- ssize_t written = 0;
-
- BUG_ON(!mapping->a_ops->get_xip_mem);
-
- do {
- unsigned long index;
- unsigned long offset;
- size_t copied;
- void *xip_mem;
- unsigned long xip_pfn;
-
- offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
- index = pos >> PAGE_CACHE_SHIFT;
- bytes = PAGE_CACHE_SIZE - offset;
- if (bytes > count)
- bytes = count;
-
- status = a_ops->get_xip_mem(mapping, index, 0,
- &xip_mem, &xip_pfn);
- if (status == -ENODATA) {
- /* we allocate a new page unmap it */
- mutex_lock(&xip_sparse_mutex);
- status = a_ops->get_xip_mem(mapping, index, 1,
- &xip_mem, &xip_pfn);
- mutex_unlock(&xip_sparse_mutex);
- if (!status)
- /* unmap page at pgoff from all other vmas */
- __xip_unmap(mapping, index);
- }
-
- if (status)
- break;
-
- copied = bytes -
- __copy_from_user_nocache(xip_mem + offset, buf, bytes);
-
- if (likely(copied > 0)) {
- status = copied;
-
- if (status >= 0) {
- written += status;
- count -= status;
- pos += status;
- buf += status;
- }
- }
- if (unlikely(copied != bytes))
- if (status >= 0)
- status = -EFAULT;
- if (status < 0)
- break;
- } while (count);
- *ppos = pos;
- /*
- * No need to use i_size_read() here, the i_size
- * cannot change under us because we hold i_mutex.
- */
- if (pos > inode->i_size) {
- i_size_write(inode, pos);
- mark_inode_dirty(inode);
- }
-
- return written ? written : status;
-}
-
-ssize_t
-xip_file_write(struct file *filp, const char __user *buf, size_t len,
- loff_t *ppos)
-{
- struct address_space *mapping = filp->f_mapping;
- struct inode *inode = mapping->host;
- size_t count;
- loff_t pos;
- ssize_t ret;
-
- mutex_lock(&inode->i_mutex);
-
- if (!access_ok(VERIFY_READ, buf, len)) {
- ret=-EFAULT;
- goto out_up;
- }
-
- pos = *ppos;
- count = len;
-
- /* We can write back this queue in page reclaim */
- current->backing_dev_info = mapping->backing_dev_info;
-
- ret = generic_write_checks(filp, &pos, &count, S_ISBLK(inode->i_mode));
- if (ret)
- goto out_backing;
- if (count == 0)
- goto out_backing;
-
- ret = file_remove_suid(filp);
- if (ret)
- goto out_backing;
-
- ret = file_update_time(filp);
- if (ret)
- goto out_backing;
-
- ret = __xip_file_write (filp, buf, count, pos, ppos);
-
- out_backing:
- current->backing_dev_info = NULL;
- out_up:
- mutex_unlock(&inode->i_mutex);
- return ret;
-}
-EXPORT_SYMBOL_GPL(xip_file_write);
-
/*
* truncate a page used for execute in place
* functionality is analog to block_truncate_page but does use get_xip_mem

--
2.1.0
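
The zeroing done by dax_new_buf() in the patch above can be modelled in user space. The following is a minimal sketch, not the kernel code: the name sketch_new_buf and the use of plain C types are assumptions made for illustration, and only the parameter names are taken from the patch. It zeroes the head of a fresh block (before the I/O starts) and the tail (after the I/O ends), leaving the middle for the data copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * User-space model of dax_new_buf(): zero the parts of a freshly
 * allocated block that the caller is not about to overwrite.
 * Hypothetical helper; only the parameter names follow the patch.
 */
static void sketch_new_buf(unsigned char *addr, size_t size, size_t first,
			   long long pos, long long end)
{
	/* Offset within the block one past the last byte the I/O covers */
	long long final = end - pos + (long long)first;

	assert(first <= size && pos <= end);

	if (first > 0)
		memset(addr, 0, first);		/* head: before the I/O */
	if (final < (long long)size)		/* tail: after the I/O */
		memset(addr + final, 0, size - (size_t)final);
}
```

For a 16-byte block with first = 4 and an 8-byte write, bytes 0-3 and 12-15 are cleared and bytes 4-11 are left for the subsequent copy.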


Mathieu Desnoyers

2014-10-16 09:50:27 UTC

Post by Matthew Wilcox
Use the generic AIO infrastructure instead of custom read and write
methods. In addition to giving us support for AIO, this adds the missing
locking between read() and truncate().
---
MAINTAINERS | 6 ++
fs/Makefile | 1 +
fs/dax.c | 195 ++++++++++++++++++++++++++++++++++++++++++++
fs/ext2/file.c | 6 +-
fs/ext2/inode.c | 8 +-
include/linux/fs.h | 18 ++++-
mm/filemap.c | 6 +-
mm/filemap_xip.c | 234 -----------------------------------------------------
8 files changed, 229 insertions(+), 245 deletions(-)
create mode 100644 fs/dax.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 670b3dc..c96056b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2934,6 +2934,12 @@ L: linux-***@vger.kernel.org
S: Maintained
F: drivers/i2c/busses/i2c-diolan-u2c.c
+DIRECT ACCESS (DAX)
+S: Supported
+F: fs/dax.c
+
DIRECTORY NOTIFICATION (DNOTIFY)
S: Maintained
diff --git a/fs/Makefile b/fs/Makefile
index 90c8852..0325ec3 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -28,6 +28,7 @@ obj-$(CONFIG_SIGNALFD) += signalfd.o
obj-$(CONFIG_TIMERFD) += timerfd.o
obj-$(CONFIG_EVENTFD) += eventfd.o
obj-$(CONFIG_AIO) += aio.o
+obj-$(CONFIG_FS_XIP) += dax.o
obj-$(CONFIG_FILE_LOCKING) += locks.o
obj-$(CONFIG_COMPAT) += compat.o compat_ioctl.o
obj-$(CONFIG_BINFMT_AOUT) += binfmt_aout.o
diff --git a/fs/dax.c b/fs/dax.c
new file mode 100644
index 0000000..108c68e
--- /dev/null
+++ b/fs/dax.c
@@ -0,0 +1,195 @@
+/*
+ * fs/dax.c - Direct Access filesystem code
+ * Copyright (c) 2013-2014 Intel Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/atomic.h>
+#include <linux/blkdev.h>
+#include <linux/buffer_head.h>
+#include <linux/fs.h>
+#include <linux/genhd.h>
+#include <linux/mutex.h>
+#include <linux/uio.h>
+
+static long dax_get_addr(struct buffer_head *bh, void **addr, unsigned blkbits)
+{
+ unsigned long pfn;
+ sector_t sector = bh->b_blocknr << (blkbits - 9);
+ return bdev_direct_access(bh->b_bdev, sector, addr, &pfn, bh->b_size);
+}
+
+static void dax_new_buf(void *addr, unsigned size, unsigned first, loff_t pos,
+ loff_t end)
+{
+ loff_t final = end - pos + first; /* The final byte of the buffer */
+
+ if (first > 0)
+ memset(addr, 0, first);
+ if (final < size)
+ memset(addr + final, 0, size - final);
+}
+
+static bool buffer_written(struct buffer_head *bh)
+{
+ return buffer_mapped(bh) && !buffer_unwritten(bh);
+}
+
+/*
+ * When ext4 encounters a hole, it returns without modifying the buffer_head
+ * which means that we can't trust b_size. To cope with this, we set b_state
+ * to 0 before calling get_block and, if any bit is set, we know we can trust
+ * b_size. Unfortunate, really, since ext4 knows precisely how long a hole is
+ * and would save us time calling get_block repeatedly.
+ */
+static bool buffer_size_valid(struct buffer_head *bh)
+{
+ return bh->b_state != 0;
+}
+
+static ssize_t dax_io(int rw, struct inode *inode, struct iov_iter *iter,
+ loff_t start, loff_t end, get_block_t get_block,
+ struct buffer_head *bh)
+{
+ ssize_t retval = 0;
+ loff_t pos = start;
+ loff_t max = start;
+ loff_t bh_max = start;
+ void *addr;
+ bool hole = false;
+
+ if (rw != WRITE)
+ end = min(end, i_size_read(inode));
+
+ while (pos < end) {
+ unsigned len;
+ if (pos == max) {
+ unsigned blkbits = inode->i_blkbits;
+ sector_t block = pos >> blkbits;
+ unsigned first = pos - (block << blkbits);
+ long size;
+
+ if (pos == bh_max) {
+ bh->b_size = PAGE_ALIGN(end - pos);
+ bh->b_state = 0;
+ retval = get_block(inode, block, bh,
+ rw == WRITE);
+ if (retval)
+ break;
+ if (!buffer_size_valid(bh))
+ bh->b_size = 1 << blkbits;
+ bh_max = pos - first + bh->b_size;
+ } else {
+ unsigned done = bh->b_size -
+ (bh_max - (pos - first));
+ bh->b_blocknr += done >> blkbits;
+ bh->b_size -= done;
+ }
+ if (rw == WRITE) {
+ if (!buffer_mapped(bh)) {
+ retval = -EIO;
+ /* FIXME: fall back to buffered I/O */

Falling back to buffered I/O would void the guarantee that data is stored
in persistent memory once write() returns. I'm not sure we actually want
that.

Thanks,

Mathieu

Post by Matthew Wilcox
+ break;
+ }
+ hole = false;
+ } else {
+ hole = !buffer_written(bh);
+ }
+
+ if (hole) {
+ addr = NULL;
+ size = bh->b_size - first;
+ } else {
+ retval = dax_get_addr(bh, &addr, blkbits);
+ if (retval < 0)
+ break;
+ if (buffer_unwritten(bh) || buffer_new(bh))
+ dax_new_buf(addr, retval, first, pos,
+ end);
+ addr += first;
+ size = retval - first;
+ }
+ max = min(pos + size, end);
+ }
+
+ if (rw == WRITE)
+ len = copy_from_iter(addr, max - pos, iter);
+ else if (!hole)
+ len = copy_to_iter(addr, max - pos, iter);
+ else
+ len = iov_iter_zero(max - pos, iter);
+
+ if (!len)
+ break;
+
+ pos += len;
+ addr += len;
+ }
+
+ return (pos == start) ? retval : pos - start;
+}
+
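+/**
+ * dax_do_io - Perform I/O to a DAX file
+ * @rw: READ to read or WRITE to write
+ * @iocb: The control block for this I/O
+ * @inode: The file which the I/O is directed at
+ * @iter: The addresses to do I/O from or to
+ * @pos: The file offset where the I/O starts
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ * @end_io: A filesystem callback for I/O completion
+ * @flags: See below
+ *
+ * This function uses the same locking scheme as do_blockdev_direct_IO:
+ * If @flags has DIO_LOCKING set, we assume that the i_mutex is held by the
+ * caller for writes. For reads, we take and release the i_mutex ourselves.
+ * If DIO_LOCKING is not set, the filesystem takes care of its own locking.
+ * As with do_blockdev_direct_IO(), we increment i_dio_count while the I/O
+ * is in progress.
+ */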
+ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
+ struct iov_iter *iter, loff_t pos,
+ get_block_t get_block, dio_iodone_t end_io, int flags)
+{
+ struct buffer_head bh;
+ ssize_t retval = -EINVAL;
+ loff_t end = pos + iov_iter_count(iter);
+
+ memset(&bh, 0, sizeof(bh));
+
+ if ((flags & DIO_LOCKING) && (rw == READ)) {
+ struct address_space *mapping = inode->i_mapping;
+ mutex_lock(&inode->i_mutex);
+ retval = filemap_write_and_wait_range(mapping, pos, end - 1);
+ if (retval) {
+ mutex_unlock(&inode->i_mutex);
+ goto out;
+ }
+ }
+
+ /* Protects against truncate */
+ atomic_inc(&inode->i_dio_count);
+
+ retval = dax_io(rw, inode, iter, pos, end, get_block, &bh);
+
+ if ((flags & DIO_LOCKING) && (rw == READ))
+ mutex_unlock(&inode->i_mutex);
+
+ if ((retval > 0) && end_io)
+ end_io(iocb, pos, retval, bh.b_private);
+
+ inode_dio_done(inode);
+ out:
+ return retval;
+}
+EXPORT_SYMBOL_GPL(dax_do_io);
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 7c87b22..a247123 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -81,8 +81,10 @@ const struct file_operations ext2_file_operations = {
#ifdef CONFIG_EXT2_FS_XIP
const struct file_operations ext2_xip_file_operations = {
.llseek = generic_file_llseek,
- .read = xip_file_read,
- .write = xip_file_write,
+ .read = new_sync_read,
+ .write = new_sync_write,
+ .read_iter = generic_file_read_iter,
+ .write_iter = generic_file_write_iter,
.unlocked_ioctl = ext2_ioctl,
#ifdef CONFIG_COMPAT
.compat_ioctl = ext2_compat_ioctl,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 0cb0448..3ccd5fd 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -859,7 +859,12 @@ ext2_direct_IO(int rw, struct kiocb *iocb, struct iov_iter *iter,
size_t count = iov_iter_count(iter);
ssize_t ret;
- ret = blockdev_direct_IO(rw, iocb, inode, iter, offset, ext2_get_block);
+ if (IS_DAX(inode))
+ ret = dax_do_io(rw, iocb, inode, iter, offset, ext2_get_block,
+ NULL, DIO_LOCKING);
+ else
+ ret = blockdev_direct_IO(rw, iocb, inode, iter, offset,
+ ext2_get_block);
if (ret < 0 && (rw & WRITE))
ext2_write_failed(mapping, offset + count);
return ret;
@@ -888,6 +893,7 @@ const struct address_space_operations ext2_aops = {
const struct address_space_operations ext2_aops_xip = {
.bmap = ext2_bmap,
.get_xip_mem = ext2_get_xip_mem,
+ .direct_IO = ext2_direct_IO,
};
const struct address_space_operations ext2_nobh_aops = {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e99e5c4..45839e8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2490,17 +2490,22 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
extern int nonseekable_open(struct inode * inode, struct file * filp);
#ifdef CONFIG_FS_XIP
-extern ssize_t xip_file_read(struct file *filp, char __user *buf, size_t len,
- loff_t *ppos);
extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
-extern ssize_t xip_file_write(struct file *filp, const char __user *buf,
- size_t len, loff_t *ppos);
extern int xip_truncate_page(struct address_space *mapping, loff_t from);
+ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
+ loff_t, get_block_t, dio_iodone_t, int flags);
#else
static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
{
return 0;
}
+
+static inline ssize_t dax_do_io(int rw, struct kiocb *iocb,
+ struct inode *inode, struct iov_iter *iter, loff_t pos,
+ get_block_t get_block, dio_iodone_t end_io, int flags)
+{
+ return -ENOTTY;
+}
#endif
#ifdef CONFIG_BLOCK
@@ -2657,6 +2662,11 @@ extern int generic_show_options(struct seq_file *m, struct dentry *root);
extern void save_mount_options(struct super_block *sb, char *options);
extern void replace_mount_options(struct super_block *sb, char *options);
+static inline bool io_is_direct(struct file *filp)
+{
+ return (filp->f_flags & O_DIRECT) || IS_DAX(file_inode(filp));
+}
+
static inline ino_t parent_ino(struct dentry *dentry)
{
ino_t res;
diff --git a/mm/filemap.c b/mm/filemap.c
index fec4db9..e69b586 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1690,8 +1690,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
loff_t *ppos = &iocb->ki_pos;
loff_t pos = *ppos;
- /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
- if (file->f_flags & O_DIRECT) {
+ if (io_is_direct(file)) {
struct address_space *mapping = file->f_mapping;
struct inode *inode = mapping->host;
size_t count = iov_iter_count(iter);
@@ -2581,8 +2580,7 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (err)
goto out;
- /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
- if (unlikely(file->f_flags & O_DIRECT)) {
+ if (io_is_direct(file)) {
loff_t endbyte;
written = generic_file_direct_write(iocb, from, pos);
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index c8d23e9..f7c37a1 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -42,119 +42,6 @@ static struct page *xip_sparse_page(void)
}
/*
- * This is a file read routine for execute in place files, and uses
- * the mapping->a_ops->get_xip_mem() function for the actual low-level
- * stuff.
- *
- * Note the struct file* is not used at all. It may be NULL.
- */
-static ssize_t
-do_xip_mapping_read(struct address_space *mapping,
- struct file_ra_state *_ra,
- struct file *filp,
- char __user *buf,
- size_t len,
- loff_t *ppos)
-{
- struct inode *inode = mapping->host;
- pgoff_t index, end_index;
- unsigned long offset;
- loff_t isize, pos;
- size_t copied = 0, error = 0;
-
- BUG_ON(!mapping->a_ops->get_xip_mem);
-
- pos = *ppos;
- index = pos >> PAGE_CACHE_SHIFT;
- offset = pos & ~PAGE_CACHE_MASK;
-
- isize = i_size_read(inode);
- if (!isize)
- goto out;
-
- end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
- do {
- unsigned long nr, left;
- void *xip_mem;
- unsigned long xip_pfn;
- int zero = 0;
-
- /* nr is the maximum number of bytes to copy from this page */
- nr = PAGE_CACHE_SIZE;
- if (index >= end_index) {
- if (index > end_index)
- goto out;
- nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
- if (nr <= offset) {
- goto out;
- }
- }
- nr = nr - offset;
- if (nr > len - copied)
- nr = len - copied;
-
- error = mapping->a_ops->get_xip_mem(mapping, index, 0,
- &xip_mem, &xip_pfn);
- if (unlikely(error)) {
- if (error == -ENODATA) {
- /* sparse */
- zero = 1;
- } else
- goto out;
- }
-
- /* If users can be writing to this page using arbitrary
- * virtual addresses, take care about potential aliasing
- * before reading the page on the kernel side.
- */
- if (mapping_writably_mapped(mapping))
- /* address based flush */ ;
-
- /*
- * Ok, we have the mem, so now we can copy it to user space...
- *
- * The actor routine returns how many bytes were actually used..
- * NOTE! This may not be the same as how much of a user buffer
- * we filled up (we may be padding etc), so we can only update
- * "pos" here (the actor routine has to update the user buffer
- * pointers and the remaining count).
- */
- if (!zero)
- left = __copy_to_user(buf+copied, xip_mem+offset, nr);
- else
- left = __clear_user(buf + copied, nr);
-
- if (left) {
- error = -EFAULT;
- goto out;
- }
-
- copied += (nr - left);
- offset += (nr - left);
- index += offset >> PAGE_CACHE_SHIFT;
- offset &= ~PAGE_CACHE_MASK;
- } while (copied < len);
-
-out:
- *ppos = pos + copied;
- if (filp)
- file_accessed(filp);
-
- return (copied ? copied : error);
-}
-
-ssize_t
-xip_file_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
-{
- if (!access_ok(VERIFY_WRITE, buf, len))
- return -EFAULT;
-
- return do_xip_mapping_read(filp->f_mapping, &filp->f_ra, filp,
- buf, len, ppos);
-}
-EXPORT_SYMBOL_GPL(xip_file_read);
-
-/*
* __xip_unmap is invoked from xip_unmap and
* xip_write
*
@@ -340,127 +227,6 @@ int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
}
EXPORT_SYMBOL_GPL(xip_file_mmap);
-static ssize_t
-__xip_file_write(struct file *filp, const char __user *buf,
- size_t count, loff_t pos, loff_t *ppos)
-{
- struct address_space * mapping = filp->f_mapping;
- const struct address_space_operations *a_ops = mapping->a_ops;
- struct inode *inode = mapping->host;
- long status = 0;
- size_t bytes;
- ssize_t written = 0;
-
- BUG_ON(!mapping->a_ops->get_xip_mem);
-
- do {
- unsigned long index;
- unsigned long offset;
- size_t copied;
- void *xip_mem;
- unsigned long xip_pfn;
-
- offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
- index = pos >> PAGE_CACHE_SHIFT;
- bytes = PAGE_CACHE_SIZE - offset;
- if (bytes > count)
- bytes = count;
-
- status = a_ops->get_xip_mem(mapping, index, 0,
- &xip_mem, &xip_pfn);
- if (status == -ENODATA) {
- /* we allocate a new page unmap it */
- mutex_lock(&xip_sparse_mutex);
- status = a_ops->get_xip_mem(mapping, index, 1,
- &xip_mem, &xip_pfn);
- mutex_unlock(&xip_sparse_mutex);
- if (!status)
- /* unmap page at pgoff from all other vmas */
- __xip_unmap(mapping, index);
- }
-
- if (status)
- break;
-
- copied = bytes -
- __copy_from_user_nocache(xip_mem + offset, buf, bytes);
-
- if (likely(copied > 0)) {
- status = copied;
-
- if (status >= 0) {
- written += status;
- count -= status;
- pos += status;
- buf += status;
- }
- }
- if (unlikely(copied != bytes))
- if (status >= 0)
- status = -EFAULT;
- if (status < 0)
- break;
- } while (count);
- *ppos = pos;
- /*
- * No need to use i_size_read() here, the i_size
- * cannot change under us because we hold i_mutex.
- */
- if (pos > inode->i_size) {
- i_size_write(inode, pos);
- mark_inode_dirty(inode);
- }
-
- return written ? written : status;
-}
-
-ssize_t
-xip_file_write(struct file *filp, const char __user *buf, size_t len,
- loff_t *ppos)
-{
- struct address_space *mapping = filp->f_mapping;
- struct inode *inode = mapping->host;
- size_t count;
- loff_t pos;
- ssize_t ret;
-
- mutex_lock(&inode->i_mutex);
-
- if (!access_ok(VERIFY_READ, buf, len)) {
- ret=-EFAULT;
- goto out_up;
- }
-
- pos = *ppos;
- count = len;
-
- /* We can write back this queue in page reclaim */
- current->backing_dev_info = mapping->backing_dev_info;
-
- ret = generic_write_checks(filp, &pos, &count, S_ISBLK(inode->i_mode));
- if (ret)
- goto out_backing;
- if (count == 0)
- goto out_backing;
-
- ret = file_remove_suid(filp);
- if (ret)
- goto out_backing;
-
- ret = file_update_time(filp);
- if (ret)
- goto out_backing;
-
- ret = __xip_file_write (filp, buf, count, pos, ppos);
-
- out_backing:
- current->backing_dev_info = NULL;
- out_up:
- mutex_unlock(&inode->i_mutex);
- return ret;
-}
-EXPORT_SYMBOL_GPL(xip_file_write);
-
/*
* truncate a page used for execute in place
* functionality is analog to block_truncate_page but does use get_xip_mem
--
2.1.0

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Key fingerprint: 2A0B 4ED9 15F2 D3FA 45F5 B162 1728 0A97 8118 6ACF


Matthew Wilcox

2014-10-16 19:51:12 UTC

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+ if (rw == WRITE) {
+ if (!buffer_mapped(bh)) {
+ retval = -EIO;
+ /* FIXME: fall back to buffered I/O */

Falling back to buffered I/O would void the guarantee that data is stored
in persistent memory once write() returns. I'm not sure we actually want
that.

Yeah, I think that comment is just stale. I can't see a way in which
buffered I/O would succeed after DAX I/O fails.

Matthew Wilcox

2014-10-16 22:33:31 UTC

Post by Matthew Wilcox

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+ if (rw == WRITE) {
+ if (!buffer_mapped(bh)) {
+ retval = -EIO;
+ /* FIXME: fall back to buffered I/O */

Falling back to buffered I/O would void the guarantee that data is stored
in persistent memory once write() returns. I'm not sure we actually want
that.

Yeah, I think that comment is just stale. I can't see a way in which
buffered I/O would succeed after DAX I/O fails.

On further consideration, I think the whole thing is just foolish.
I don't see how get_block(create == 1) can return success *and* a buffer
that is !mapped.

So I did this nice simplification:

- if (rw == WRITE) {
- if (!buffer_mapped(bh)) {
- retval = -EIO;
- /* FIXME: fall back to buffered I/O */
- break;
- }
- hole = false;
- } else {
- hole = !buffer_written(bh);
- }
+ hole = (rw != WRITE) && !buffer_written(bh);

(compile-tested only; I'm going to run all the changes through xfstests
next week when I'm back home before sending out a v12).
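
The simplification above can be checked against the v11 logic with a small user-space model. This is a sketch under stated assumptions: the sketch_* names, the two-flag struct and the enum are hypothetical stand-ins for the kernel's buffer_head state bits and READ/WRITE constants. It suggests that the two forms agree in every case except the unmapped write, which is the case argued above to be unreachable after a successful get_block(create == 1).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's READ/WRITE and buffer_head flags */
enum sketch_rw { SKETCH_READ = 0, SKETCH_WRITE = 1 };

struct sketch_bh {
	bool mapped;
	bool unwritten;
};

static bool sketch_buffer_written(const struct sketch_bh *bh)
{
	return bh->mapped && !bh->unwritten;
}

/* v11 logic: returns 0 and sets *hole, or -EIO for an unmapped write */
static int sketch_hole_v11(enum sketch_rw rw, const struct sketch_bh *bh,
			   bool *hole)
{
	if (rw == SKETCH_WRITE) {
		if (!bh->mapped)
			return -EIO;
		*hole = false;
	} else {
		*hole = !sketch_buffer_written(bh);
	}
	return 0;
}

/* Proposed one-line replacement */
static bool sketch_hole_simplified(enum sketch_rw rw,
				   const struct sketch_bh *bh)
{
	return (rw != SKETCH_WRITE) && !sketch_buffer_written(bh);
}
```

Enumerating all eight combinations of rw, mapped and unwritten shows the only divergence is the -EIO path, so the simplification is behaviour-preserving exactly when a successful write-side get_block always returns a mapped buffer.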

Mathieu Desnoyers

2014-10-17 15:52:14 UTC

----- Original Message -----

Sent: Friday, October 17, 2014 12:33:31 AM
Subject: Re: [PATCH v11 07/21] dax,ext2: Replace XIP read and write with DAX I/O

Post by Matthew Wilcox

Post by Mathieu Desnoyers

Post by Matthew Wilcox
+ if (rw == WRITE) {
+ if (!buffer_mapped(bh)) {
+ retval = -EIO;
+ /* FIXME: fall back to buffered I/O */

Falling back to buffered I/O would void the guarantee that data is stored
in persistent memory once write() returns. I'm not sure we actually want
that.

Yeah, I think that comment is just stale. I can't see a way in which
buffered I/O would succeed after DAX I/O fails.

On further consideration, I think the whole thing is just foolish.
I don't see how get_block(create == 1) can return success *and* a buffer
that is !mapped.

Perhaps a safe approach would be to add a BUG_ON() to check this assumption?

Thanks,

Mathieu

- if (rw == WRITE) {
- if (!buffer_mapped(bh)) {
- retval = -EIO;
- /* FIXME: fall back to buffered I/O */
- break;
- }
- hole = false;
- } else {
- hole = !buffer_written(bh);
- }
+ hole = (rw != WRITE) && !buffer_written(bh);
(compile-tested only; I'm going to run all the changes through xfstests
next week when I'm back home before sending out a v12).

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
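
The BUG_ON() suggested above might look roughly like the following in user space, with assert() standing in for BUG_ON(). This is only a sketch: sketch_buf, sketch_get_block and sketch_map_for_write are hypothetical names, and the callback is a toy that always maps the block when asked to create it, which is the assumption being checked.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced stand-in for struct buffer_head */
struct sketch_buf {
	bool mapped;
};

typedef int (*sketch_get_block_t)(long block, struct sketch_buf *bh,
				  int create);

/* Toy block mapper: maps the block whenever create is non-zero */
static int sketch_get_block(long block, struct sketch_buf *bh, int create)
{
	(void)block;
	bh->mapped = create != 0;
	return 0;
}

/* Map a block for writing and check the "success implies mapped" assumption */
static int sketch_map_for_write(sketch_get_block_t get_block, long block,
				struct sketch_buf *bh)
{
	int ret = get_block(block, bh, 1);

	if (ret)
		return ret;
	/* Stand-in for BUG_ON(!buffer_mapped(bh)) */
	assert(bh->mapped);
	return 0;
}
```

Placing the check right after the successful return keeps the simplified hole computation while still catching a filesystem whose get_block violates the assumption.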


Matthew Wilcox

2014-09-25 20:33:38 UTC

From: Matthew Wilcox <***@linux.intel.com>

Since this is relating to FS_XIP, not KERNEL_XIP, it should be called
DAX instead of XIP.

Signed-off-by: Matthew Wilcox <***@intel.com>
---
drivers/block/Kconfig | 13 +++++++------
drivers/block/brd.c | 14 +++++++-------
fs/Kconfig | 4 ++--
3 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 014a1cf..1b8094d 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -393,14 +393,15 @@ config BLK_DEV_RAM_SIZE
The default value is 4096 kilobytes. Only change this if you know
what you are doing.

-config BLK_DEV_XIP
- bool "Support XIP filesystems on RAM block device"
- depends on BLK_DEV_RAM
+config BLK_DEV_RAM_DAX
+ bool "Support Direct Access (DAX) to RAM block devices"
+ depends on BLK_DEV_RAM && FS_DAX
default n
help
- Support XIP filesystems (such as ext2 with XIP support on) on
- top of block ram device. This will slightly enlarge the kernel, and
- will prevent RAM block device backing store memory from being
+ Support filesystems using DAX to access RAM block devices. This
+ avoids double-buffering data in the page cache before copying it
+ to the block device. Answering Y will slightly enlarge the kernel,
+ and will prevent RAM block device backing store memory from being
allocated from highmem (only a problem for highmem systems).

config CDROM_PKTCDVD
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 78fe510..97c55db 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -97,13 +97,13 @@ static struct page *brd_insert_page(struct brd_device *brd, sector_t sector)
* Must use NOIO because we don't want to recurse back into the
* block or filesystem layers from page reclaim.
*
- * Cannot support XIP and highmem, because our ->direct_access
- * routine for XIP must return memory that is always addressable.
- * If XIP was reworked to use pfns and kmap throughout, this
+ * Cannot support DAX and highmem, because our ->direct_access
+ * routine for DAX must return memory that is always addressable.
+ * If DAX was reworked to use pfns and kmap throughout, this
* restriction might be able to be lifted.
*/
gfp_flags = GFP_NOIO | __GFP_ZERO;
-#ifndef CONFIG_BLK_DEV_XIP
+#ifndef CONFIG_BLK_DEV_RAM_DAX
gfp_flags |= __GFP_HIGHMEM;
#endif
page = alloc_page(gfp_flags);
@@ -369,7 +369,7 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
return err;
}

-#ifdef CONFIG_BLK_DEV_XIP
+#ifdef CONFIG_BLK_DEV_RAM_DAX
static long brd_direct_access(struct block_device *bdev, sector_t sector,
void **kaddr, unsigned long *pfn, long size)
{
@@ -388,6 +388,8 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
* file happens to be mapped to the next page of physical RAM */
return PAGE_SIZE;
}
+#else
+#define brd_direct_access NULL
#endif

static int brd_ioctl(struct block_device *bdev, fmode_t mode,
@@ -428,9 +430,7 @@ static const struct block_device_operations brd_fops = {
.owner = THIS_MODULE,
.rw_page = brd_rw_page,
.ioctl = brd_ioctl,
-#ifdef CONFIG_BLK_DEV_XIP
.direct_access = brd_direct_access,
-#endif
};

/*
diff --git a/fs/Kconfig b/fs/Kconfig
index a9eb53d..117900f 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -34,7 +34,7 @@ source "fs/btrfs/Kconfig"
source "fs/nilfs2/Kconfig"

config FS_DAX
- bool "Direct Access support"
+ bool "Direct Access (DAX) support"
depends on MMU
help
Direct Access (DAX) can be used on memory-backed block devices.
@@ -45,7 +45,7 @@ config FS_DAX

If you do not have a block device that is capable of using this,
or if unsure, say N. Saying Y will increase the size of the kernel
- by about 2kB.
+ by about 5kB.

endif # BLOCK

--
2.1.0

Mathieu Desnoyers

2014-10-16 13:00:16 UTC

Post by Matthew Wilcox
Since this is relating to FS_XIP, not KERNEL_XIP, it should be called
DAX instead of XIP.
---
drivers/block/Kconfig | 13 +++++++------
drivers/block/brd.c | 14 +++++++-------
fs/Kconfig | 4 ++--
3 files changed, 16 insertions(+), 15 deletions(-)
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 014a1cf..1b8094d 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -393,14 +393,15 @@ config BLK_DEV_RAM_SIZE
The default value is 4096 kilobytes. Only change this if you know
what you are doing.
-config BLK_DEV_XIP
- bool "Support XIP filesystems on RAM block device"
- depends on BLK_DEV_RAM
+config BLK_DEV_RAM_DAX
+ bool "Support Direct Access (DAX) to RAM block devices"
+ depends on BLK_DEV_RAM && FS_DAX
default n
help
- Support XIP filesystems (such as ext2 with XIP support on) on
- top of block ram device. This will slightly enlarge the kernel, and
- will prevent RAM block device backing store memory from being
+ Support filesystems using DAX to access RAM block devices. This
+ avoids double-buffering data in the page cache before copying it
+ to the block device. Answering Y will slightly enlarge the kernel,
+ and will prevent RAM block device backing store memory from being
allocated from highmem (only a problem for highmem systems).
config CDROM_PKTCDVD
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 78fe510..97c55db 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -97,13 +97,13 @@ static struct page *brd_insert_page(struct brd_device *brd, sector_t sector)
* Must use NOIO because we don't want to recurse back into the
* block or filesystem layers from page reclaim.
*
- * Cannot support XIP and highmem, because our ->direct_access
- * routine for XIP must return memory that is always addressable.
- * If XIP was reworked to use pfns and kmap throughout, this
+ * Cannot support DAX and highmem, because our ->direct_access
+ * routine for DAX must return memory that is always addressable.
+ * If DAX was reworked to use pfns and kmap throughout, this

So this might be an important limitation on x86-32 with PAE, am I
correct? It should eventually be investigated if anyone still cares, but
it does not appear to be a roadblock.

Thanks,

Mathieu

Post by Matthew Wilcox
* restriction might be able to be lifted.
*/
gfp_flags = GFP_NOIO | __GFP_ZERO;
-#ifndef CONFIG_BLK_DEV_XIP
+#ifndef CONFIG_BLK_DEV_RAM_DAX
gfp_flags |= __GFP_HIGHMEM;
#endif
page = alloc_page(gfp_flags);
@@ -369,7 +369,7 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
return err;
}
-#ifdef CONFIG_BLK_DEV_XIP
+#ifdef CONFIG_BLK_DEV_RAM_DAX
static long brd_direct_access(struct block_device *bdev, sector_t sector,
void **kaddr, unsigned long *pfn, long size)
{
@@ -388,6 +388,8 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
* file happens to be mapped to the next page of physical RAM */
return PAGE_SIZE;
}
+#else
+#define brd_direct_access NULL
#endif
static int brd_ioctl(struct block_device *bdev, fmode_t mode,
@@ -428,9 +430,7 @@ static const struct block_device_operations brd_fops = {
.owner = THIS_MODULE,
.rw_page = brd_rw_page,
.ioctl = brd_ioctl,
-#ifdef CONFIG_BLK_DEV_XIP
.direct_access = brd_direct_access,
-#endif
};
/*
diff --git a/fs/Kconfig b/fs/Kconfig
index a9eb53d..117900f 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -34,7 +34,7 @@ source "fs/btrfs/Kconfig"
source "fs/nilfs2/Kconfig"
config FS_DAX
- bool "Direct Access support"
+ bool "Direct Access (DAX) support"
depends on MMU
help
Direct Access (DAX) can be used on memory-backed block devices.
@@ -45,7 +45,7 @@ config FS_DAX
If you do not have a block device that is capable of using this,
or if unsure, say N. Saying Y will increase the size of the kernel
- by about 2kB.
+ by about 5kB.
endif # BLOCK
--
2.1.0

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Key fingerprint: 2A0B 4ED9 15F2 D3FA 45F5 B162 1728 0A97 8118 6ACF

Matthew Wilcox

2014-09-25 20:47:49 UTC

On Thu, Sep 25, 2014 at 04:33:17PM -0400, Matthew Wilcox wrote:
...

For those who want to see what changed between v10 and v11, here are the patches
that I applied to my internal git tree.

V***@vt.edu

2014-09-30 09:45:01 UTC

Patch 19 adds some DAX infrastructure to support ext4.
Patch 20 adds DAX support to ext4. It is broadly similar to ext2's DAX
support, but it is more efficient than ext2's due to its support for
unwritten extents.

I don't currently have a use case for NV-DIMM support.

However, it would be nice if this code could be leveraged to support
'force O_DIRECT on all I/O to this file' - that I *do* have a use
case for. Patch 20 looks to my untrained eye like it *almost* gets
there.

(And if in fact it *does* do the whole enchilada, the Changelog etc should
mention it :)

Matthew Wilcox

2014-09-30 14:48:54 UTC

Post by V***@vt.edu

Patch 19 adds some DAX infrastructure to support ext4.
Patch 20 adds DAX support to ext4. It is broadly similar to ext2's DAX
support, but it is more efficient than ext2's due to its support for
unwritten extents.

I don't currently have a use case for NV-DIMM support.
However, it would be nice if this code could be leveraged to support
'force O_DIRECT on all I/O to this file' - that I *do* have a use
case for. Patch 20 looks to my untrained eye like it *almost* gets
there.
(And if in fact it *does* do the whole enchilada, the Changelog etc should
mention it :)

No, it doesn't try to do that. Wouldn't you be better served with an
LD_PRELOAD that forces O_DIRECT on?

V***@vt.edu

2014-09-30 14:53:47 UTC

Post by Matthew Wilcox
No, it doesn't try to do that. Wouldn't you be better served with an
LD_PRELOAD that forces O_DIRECT on?

Not when you don't want it on every file, and users are creating and
deleting files once in a while. A chattr-like command is easier and
more scalable than rebuilding the LD_PRELOAD every time the list of
files gets changed....

Matthew Wilcox

2014-09-30 16:08:41 UTC

Post by V***@vt.edu

Post by Matthew Wilcox
No, it doesn't try to do that. Wouldn't you be better served with an
LD_PRELOAD that forces O_DIRECT on?

Not when you don't want it on every file, and users are creating and
deleting files once in a while. A chattr-like command is easier and
more scalable than rebuilding the LD_PRELOAD every time the list of
files gets changed....

The more I think about this, the more I think this is a bad idea.
When you have a file open with O_DIRECT, your I/O has to be done in
512-byte multiples, and it has to be aligned to 512-byte boundaries
in memory. If an unsuspecting application has O_DIRECT forced on it,
it isn't going to know to do that, and so all its I/Os will fail.
It'll also be horribly inefficient if a program has the file mmaped.

What problem are you really trying to solve? Some big files hogging
the page cache?
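
To illustrate the constraint described above, here is a minimal user-space sketch of what an O_DIRECT-aware program has to do that an unsuspecting one would not. It assumes the 512-byte alignment mentioned in this message (real devices and filesystems may require more), and the helper names dio_request_ok, dio_alloc and dio_read are invented for this example.

```c
#define _GNU_SOURCE		/* exposes O_DIRECT with glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define DIO_ALIGN 512		/* assumed alignment, per the discussion */

/* True if buffer address, length and file offset are all aligned. */
bool dio_request_ok(const void *buf, size_t len, off_t off)
{
	return ((uintptr_t)buf % DIO_ALIGN) == 0 &&
	       (len % DIO_ALIGN) == 0 &&
	       ((uintmax_t)off % DIO_ALIGN) == 0;
}

/* Allocate a buffer suitable for direct I/O; NULL on failure. */
void *dio_alloc(size_t len)
{
	void *p = NULL;

	if (posix_memalign(&p, DIO_ALIGN, len) != 0)
		return NULL;
	return p;
}

/* Read len bytes at off with O_DIRECT; returns bytes read or -1. */
ssize_t dio_read(const char *path, void *buf, size_t len, off_t off)
{
	ssize_t n;
	int fd, saved;

	if (!dio_request_ok(buf, len, off)) {
		errno = EINVAL;	/* the kernel would reject this too */
		return -1;
	}
	fd = open(path, O_RDONLY | O_DIRECT);
	if (fd < 0)
		return -1;
	n = pread(fd, buf, len, off);
	saved = errno;
	close(fd);
	errno = saved;
	return n;
}
```

A program that simply calls read() into a malloc()ed buffer with an arbitrary length satisfies none of these checks, which is why forcing O_DIRECT underneath it would make its I/O fail.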

Zuckerman, Boris

2014-09-30 17:10:26 UTC

Post by Matthew Wilcox
The more I think about this, the more I think this is a bad idea.
When you have a file open with O_DIRECT, your I/O has to be done in 512-byte
multiples, and it has to be aligned to 512-byte boundaries in memory. If an
unsuspecting application has O_DIRECT forced on it, it isn't going to know to do that,
and so all its I/Os will fail.
It'll also be horribly inefficient if a program has the file mmaped.
What problem are you really trying to solve? Some big files hogging the page cache?
--

Page cache? As another copy in RAM?
NV_DIMMs may be viewed as a caching device. This caching can be implemented on the level of NV block/offset or may have some hints from FS and applications. Temporary files is one example. They may not need to hit NV domain ever. Some transactional journals or DB files is another example. They may stay in RAM until power off.

Matthew Wilcox

2014-09-30 19:24:28 UTC

Post by Zuckerman, Boris

Post by Matthew Wilcox
The more I think about this, the more I think this is a bad idea.
When you have a file open with O_DIRECT, your I/O has to be done in 512-byte
multiples, and it has to be aligned to 512-byte boundaries in memory. If an
unsuspecting application has O_DIRECT forced on it, it isn't going to know to do that,
and so all its I/Os will fail.
It'll also be horribly inefficient if a program has the file mmaped.
What problem are you really trying to solve? Some big files hogging the page cache?
--

Page cache? As another copy in RAM?
NV_DIMMs may be viewed as a caching device. This caching can be implemented on the level of NV block/offset or may have some hints from FS and applications. Temporary files is one example. They may not need to hit NV domain ever. Some transactional journals or DB files is another example. They may stay in RAM until power off.

Boris, you're confused. Valdis is trying to solve an unrelated problem
(and hopes my DAX patches will do it for him). I'm explaining to him why
what he wants to do is a bad idea. This tangent is unrelated to NV-DIMMs.

Zuckerman, Boris

2014-09-30 19:31:55 UTC

I am trying to refocus this thread from a particular issue to more generic needs...

Regards, Boris

-----Original Message-----
Sent: Tuesday, September 30, 2014 3:24 PM
To: Zuckerman, Boris
Subject: Re: [PATCH v11 00/21] Add support for NV-DIMMs to ext4

Post by Zuckerman, Boris

Post by Matthew Wilcox
The more I think about this, the more I think this is a bad idea.
When you have a file open with O_DIRECT, your I/O has to be done in
512-byte multiples, and it has to be aligned to 512-byte boundaries
in memory. If an unsuspecting application has O_DIRECT forced on
it, it isn't going to know to do that, and so all its I/Os will fail.
It'll also be horribly inefficient if a program has the file mmaped.
What problem are you really trying to solve? Some big files hogging the page cache?
--

Page cache? As another copy in RAM?
NV_DIMMs may be viewed as a caching device. This caching can be implemented on the level of NV block/offset or may have some hints from FS and applications. Temporary files is one example. They may not need to hit NV domain ever. Some transactional journals or DB files is another example. They may stay in RAM until power off.

Boris, you're confused. Valdis is trying to solve an unrelated problem (and hopes my
DAX patches will do it for him). I'm explaining to him why what he wants to do is a bad
idea. This tangent is unrelated to NV-DIMMs.

V***@vt.edu

2014-09-30 20:37:56 UTC

Post by Matthew Wilcox
The more I think about this, the more I think this is a bad idea.
When you have a file open with O_DIRECT, your I/O has to be done in
512-byte multiples, and it has to be aligned to 512-byte boundaries
in memory. If an unsuspecting application has O_DIRECT forced on it,
it isn't going to know to do that, and so all its I/Os will fail.

I'm thinking of more than one place where that would be a feature, not a bug. :)

Post by Matthew Wilcox
What problem are you really trying to solve? Some big files hogging
the page cache?

I'm officially a storage admin. I mostly support HPC and research. As
such, I'm always looking to add tools to my toolkit. :)

(And yes, I fully recognize that *in general*, this is a Bad Idea. However,
when you've got That One Problem Data File that *should* always be accessed
via O_DIRECT, and *usually* is accessed via O_DIRECT, and bad things happen
if something accesses it without it (for instance, when the file is 1.5X the
actual RAM), you start looking for fixes. If you've got another, more
sustainable way to say "do not let file /X/Y/Z hog the page cache" (and
no, LD_PRELOAD isn't sustainable the way chattr is, in my book), feel free to
recommend something. :)

Andreas Dilger

2014-09-30 21:25:17 UTC

Post by V***@vt.edu

Post by Matthew Wilcox
The more I think about this, the more I think this is a bad idea.
When you have a file open with O_DIRECT, your I/O has to be done in
512-byte multiples, and it has to be aligned to 512-byte boundaries
in memory. If an unsuspecting application has O_DIRECT forced on it,
it isn't going to know to do that, and so all its I/Os will fail.

I'm thinking of more than one place where that would be a feature, not a bug. :)

We prototyped a feature like this for Lustre - so the admins could
turn IO into O_DIRECT, because the HPC compute nodes have relatively
small RAM per core and don't want to have file data cache consuming
RAM that the compute jobs need.

Unfortunately, the O_DIRECT semantics are a killer for poorly written
applications that end up doing small synchronous writes. We didn't
have any IO size problems, because Lustre clients have to copy the data
to the servers anyway, so arbitrary IO sizes are fine.

While this _might_ be OK for NVRAM mapped directly into the filesystem,
for local disk-based storage 512-byte writes at 100 IOPS yield only
50KB/s, instead of ~100MB/s for cached writes to a single disk.
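
The arithmetic behind that figure is simple enough to write down. The sketch below is only a back-of-the-envelope model that treats each synchronous write as one I/O operation; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per second when every I/O moves io_size bytes at the given IOPS. */
uint64_t sync_throughput_bytes(uint64_t io_size, uint64_t iops)
{
	return io_size * iops;
}

/* I/O size needed to reach target_bytes per second at the given IOPS. */
uint64_t io_size_for_rate(uint64_t target_bytes, uint64_t iops)
{
	if (iops == 0)
		return 0;	/* no operations, no throughput */
	return target_bytes / iops;
}
```

With 512-byte writes at 100 IOPS this gives 51200 bytes/s, i.e. 50KB/s. Reaching 100MB/s at the same 100 IOPS would need writes of about 1MB each, which is what the page cache effectively provides by batching small writes.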

I think you would be much better off having more aggressive "use once"
semantics in the page cache, so that page cache pages for streaming
writes are evicted more aggressively from cache rather than going down
the "automatic O_DIRECT" hole.

Cheers, Andreas

Post by V***@vt.edu

Post by Matthew Wilcox
What problem are you really trying to solve? Some big files hogging
the page cache?

I'm officially a storage admin. I mostly support HPC and research. As
such, I'm always looking to add tools to my toolkit. :)
(And yes, I fully recognize that *in general*, this is a Bad Idea. However,
when you've got That One Problem Data File that *should* always be accessed
via O_DIRECT, and *usually* is accessed via O_DIRECT, and bad things happen
if something accesses it without it (for instance, when the file is 1.5X the
actual RAM), you start looking for fixes. If you've got another, more
sustainable way to say "do not let file /X/Y/Z hog the page cache" (and
no, LD_PRELOAD isn't sustainable the way chattr is, in my book), feel free to
recommend something. :)

Cheers, Andreas

V***@vt.edu

2014-09-30 21:52:36 UTC

Post by Andreas Dilger
I think you would be much better off having more aggressive "use once"
semantics in the page cache, so that page cache pages for streaming
writes are evicted more aggressively from cache rather than going down
the "automatic O_DIRECT" hole.

Well, I'm open to convincing.. an inode bit that says "I/O for this file is
always first out of the page cache" would probably fix most of the thrashing
page cache problem (and avoid the "unexpected O_DIRECT kills the program"
issue), at the cost of a little more CPU when we turn around and evict it
from the page cache.

As long as we're at it, if we go that route we probably *also* want a
way for a program to specify it at open() time (for instance, for the
use of backup programs) - that should minimize the infamous "everything
runs like a pig after the backup finishes running because the *useful*
pages are all cache-cold".

(And yes, you really *do* want the ability in both places - one for a
program to be able to say "do this for any file I touch", and another for
the file to say "do this for any program that touches me").

Matthew - would that sort of approach make more sense to you? I admit
I originally posted only because I'd just finished fighting with a
similar issue, and code floated by that got filesystem pages into
core without trashing the page cache. I'm not at all tied to the specific
solution.. :)

Jeff Moyer

2014-10-01 15:45:47 UTC

Post by V***@vt.edu
As long as we're at it, if we go that route we probably *also* want a
way for a program to specify it at open() time (for instance, for the
use of backup programs) - that should minimize the infamous "everything
runs like a pig after the backup finishes running because the *useful*
pages are all cache-cold".

This sounds an awful lot like posix_fadvise' POSIX_FADV_NOREUSE flag.
Whether the implementation lives up to your expectations is another
matter, but at least the interface is already there.

Cheers,
Jeff

V***@vt.edu

2014-10-01 17:10:29 UTC

Post by Jeff Moyer
This sounds an awful lot like posix_fadvise' POSIX_FADV_NOREUSE flag.

Gaah. No wonder 'man madvise' worked but 'man fadvise' came up empty :)

-ENOCAFFIENE :)

V***@vt.edu

2014-10-01 17:17:21 UTC

Post by Jeff Moyer
This sounds an awful lot like posix_fadvise' POSIX_FADV_NOREUSE flag.

Gaah. Premature click. man posix_fadvise says this:

In kernels before 2.6.18, POSIX_FADV_NOREUSE had the same semantics as
POSIX_FADV_WILLNEED. This was probably a bug; since kernel 2.6.18,
this flag is a no-op.

and mm/fadvise.c says this:
switch (advice) {
case POSIX_FADV_NORMAL:
f.file->f_ra.ra_pages = bdi->ra_pages;
spin_lock(&f.file->f_lock);
f.file->f_mode &= ~FMODE_RANDOM;
spin_unlock(&f.file->f_lock);
...
*/
force_page_cache_readahead(mapping, f.file, start_index,
nrpages);
break;
case POSIX_FADV_NOREUSE:
break;
case POSIX_FADV_DONTNEED:
if (!bdi_write_congested(mapping->backing_dev_info))
__filemap_fdatawrite_range(mapping, offset, endbyte,
WB_SYNC_NONE);

/* First and last FULL page! */

So... not much interface there, actually. One wonders if removing the 'break;'
and allowing a fall-through would actually be an improvement....
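
Given that POSIX_FADV_NOREUSE is a no-op in the code quoted above, one thing a backup-style program can do with the existing interface is call POSIX_FADV_DONTNEED on each range once it has finished with it. The following is a rough sketch of that idea, not a recommendation from anyone in this thread; read_once is a hypothetical helper name, and the advice is only a hint that the kernel may ignore (for example for dirty or partial pages).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Stream a file once, asking the kernel to drop the cached pages of each
 * chunk after it has been consumed. Returns total bytes read, or -1.
 */
long long read_once(const char *path)
{
	static char buf[64 * 1024];	/* chunk size is a multiple of 4K pages */
	long long total = 0;
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	while ((n = read(fd, buf, sizeof(buf))) > 0) {
		/* ... hand buf to the consumer (e.g. the backup stream) ... */

		/* Advisory only: the return value is deliberately ignored. */
		posix_fadvise(fd, (off_t)total, (off_t)n, POSIX_FADV_DONTNEED);
		total += n;
	}
	close(fd);
	return n < 0 ? -1 : total;
}
```

This covers the "program says so at open() time" half of the request; it does nothing for the "file says so for every program" half, which is what the inode-bit idea was about.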

Mathieu Desnoyers

2014-10-16 07:39:08 UTC

We currently have two unrelated things inside the Linux kernel called
"XIP". One allows the kernel to run out of flash without being copied
into DRAM, the other allows executables to be run without copying them
into the page cache. The latter is almost the behaviour we want for
NV-DIMMs, except that we primarily want data to be accessed through
this filesystem, not executables. We deal with the confusion between
the two XIPs by renaming the second one to DAX (short for Direct Access).

Hi Matthew,

First of all, thanks a lot for this patchset! Secondly, I must voice out
that you really need to work on your marketing skills. What your
changelog does not show is that this feature is tremendously useful
*today* in the following use-case:

- On *any* platform for which you can teach the BIOS not to clear memory
on soft reboot,
- Use a kernel argument to restrict it to a portion of memory at boot
(e.g. 15GB out of 16GB),
- Create an ext4 or ext2 filesystem in this available memory area,
- Mount it with DAX flags,

From there, you can do lots of interesting stuff. In my use-case, I
would love to use it to mmap LTTng kernel/userspace tracer buffers, so
we can extract them after a soft reboot and analyze a system crash.
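
As a rough illustration of the user-space side of that use case, the sketch below maps a file MAP_SHARED and writes into it directly. Nothing here is LTTng code: map_trace_buffer is an invented helper, and the caller is assumed to pass a path on a filesystem mounted with the dax option. On such a mount the stores would go to the backing memory without a page cache copy; on an ordinary filesystem the same code still works but goes through the page cache. Whether the contents survive a soft reboot depends on the firmware not clearing memory, as described above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Map len bytes of the file at path for reading and writing, creating
 * and sizing the file if needed. Returns NULL on failure.
 */
void *map_trace_buffer(const char *path, size_t len)
{
	void *p;
	int fd = open(path, O_RDWR | O_CREAT, 0600);

	if (fd < 0)
		return NULL;
	if (ftruncate(fd, (off_t)len) != 0) {
		close(fd);
		return NULL;
	}
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);		/* the mapping remains valid after close */
	return p == MAP_FAILED ? NULL : p;
}
```

A tracer would write its records through the returned pointer; after the reboot, a post-mortem tool maps the same file again and reads the buffers back.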

My recommendation would be to rename this patchset as e.g.

"DAX: Page cache bypass for in-memory persistent filesystems"

which might attract more interest from reviewers and maintainers, since
they can try it out today on commodity hardware. Also, pointing out to
ext4 specifically in the patchset introduction title does not reflect
the content accurately, since there is also ext2 implementation within
the series.

DAX bears some resemblance to its ancestor XIP but fixes many races that
were not relevant for its original use case of storing executables.
The major design change is using the filesystem's get_block routine
instead of a special-purpose ->get_xip_mem() address_space operation.
Further enhancements are planned, such as supporting huge pages, but
this is a useful amount of work to merge before adding more functionality.

Getting the simple thing in seems like a sane approach. But IMHO it
really needs to be presented as something useful on existing commodity
hardware rather than something specific requiring vendor-specific memory
and CPU extensions that will only exist in 2 years from now.

This is not the only way to support NV-DIMMs, of course. People have
written new filesystems to support them, some of which have even seen
the light of day. We believe it is valuable to support traditional
filesystems such as ext4 and XFS on NV-DIMMs in a more efficient manner
than copying the contents of the NV-DIMM to DRAM.

Indeed, I think there is value in not reinventing the wheel: having the
data persistent across reboots makes it necessary to have the same set
of FS features and tools we currently have for block devices, e.g.
consistency of the filesystem when the OS crashes, and tools to repair
the FS such as fsck.

One thing I would really like to see is a Documentation file that
explains how to setup the kernel so it leaves a memory area free at the
end of the physical address space, and how to setup a filesystem into
it. Perhaps it already exists, in this case, pointing to it in the
patchset introduction changelog would be helpful. (IOW, answering the
question: how can someone test this today on commodity hardware ?).
Also, if there are ways to set up pstore or such to achieve something
similar on a wider range of systems, it would be nice to see
documentation (or links to doc) explaining how to configure this.

I'll try to review your patchset soon, however keeping in mind that it
would be best to have mm experts having a look into it.

Thanks,

Mathieu

Patch 1 is a bug fix. It is obviously correct, and should be included
into 3.18.
Patch 2 starts the transformation by changing how ->direct_access works.
Much code is moved from the drivers and filesystems into the block
layer, and we add the flexibility of being able to map more than one
page at a time. It would be good to get this patch into 3.18 as it is
useful for people who are pursuing non-DAX approaches to working with
persistent memory.
Patch 3 is also a bug fix, probably worth including in 3.18.
Patches 4-6 are infrastructure for DAX (note that patch 6 is in the
for-next branch of Al Viro's VFS tree).
Patches 7-11 replace the XIP code with its DAX equivalents, transforming
ext2 to use the DAX code as we go. Note that patch 11 is the
Documentation patch.
Patches 12-18 clean up after the XIP code, removing the infrastructure
that is no longer needed and renaming various XIP things to DAX.
Most of these patches were added after Jan found things he didn't
like in an earlier version of the ext4 patch ... that had been copied
from ext2. So ext2 is being transformed to do things the same way that
ext4 will later. The ability to mount ext2 filesystems with the 'xip'
option is retained, although the 'dax' option is now preferred.
Patch 19 adds some DAX infrastructure to support ext4.
Patch 20 adds DAX support to ext4. It is broadly similar to ext2's DAX
support, but it is more efficient than ext2's due to its support for
unwritten extents.
Patch 21 is another cleanup patch renaming XIP to DAX.
axonram: Fix bug in direct_access
block: Change direct_access calling convention
mm: Fix XIP fault vs truncate race
mm: Allow page fault handlers to perform the COW
vfs,ext2: Introduce IS_DAX(inode)
vfs: Add copy_to_iter(), copy_from_iter() and iov_iter_zero()
dax,ext2: Replace XIP read and write with DAX I/O
dax,ext2: Replace ext2_clear_xip_target with dax_clear_blocks
dax,ext2: Replace the XIP page fault handler with the DAX page fault
handler
dax,ext2: Replace xip_truncate_page with dax_truncate_page
dax: Replace XIP documentation with DAX documentation
vfs: Remove get_xip_mem
ext2: Remove ext2_xip_verify_sb()
ext2: Remove ext2_use_xip
ext2: Remove xip.c and xip.h
vfs,ext2: Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to
CONFIG_FS_DAX
ext2: Remove ext2_aops_xip
ext2: Get rid of most mentions of XIP in ext2
dax: Add dax_zero_page_range
brd: Rename XIP to DAX
ext4: Add DAX functionality
Documentation/filesystems/Locking | 3 -
Documentation/filesystems/dax.txt | 91 +++++++
Documentation/filesystems/ext4.txt | 2 +
Documentation/filesystems/xip.txt | 68 -----
MAINTAINERS | 6 +
arch/powerpc/sysdev/axonram.c | 19 +-
drivers/block/Kconfig | 13 +-
drivers/block/brd.c | 26 +-
drivers/s390/block/dcssblk.c | 21 +-
fs/Kconfig | 21 +-
fs/Makefile | 1 +
fs/block_dev.c | 40 +++
fs/dax.c | 532 +++++++++++++++++++++++++++++++++++++
fs/exofs/inode.c | 1 -
fs/ext2/Kconfig | 11 -
fs/ext2/Makefile | 1 -
fs/ext2/ext2.h | 10 +-
fs/ext2/file.c | 45 +++-
fs/ext2/inode.c | 38 +--
fs/ext2/namei.c | 13 +-
fs/ext2/super.c | 53 ++--
fs/ext2/xip.c | 91 -------
fs/ext2/xip.h | 26 --
fs/ext4/ext4.h | 6 +
fs/ext4/file.c | 49 +++-
fs/ext4/indirect.c | 18 +-
fs/ext4/inode.c | 89 +++++--
fs/ext4/namei.c | 10 +-
fs/ext4/super.c | 39 ++-
fs/open.c | 5 +-
include/linux/blkdev.h | 6 +-
include/linux/fs.h | 49 +++-
include/linux/mm.h | 1 +
include/linux/uio.h | 3 +
mm/Makefile | 1 -
mm/fadvise.c | 6 +-
mm/filemap.c | 25 +-
mm/filemap_xip.c | 483 ---------------------------------
mm/iov_iter.c | 237 ++++++++++++++++-
mm/madvise.c | 2 +-
mm/memory.c | 33 ++-
41 files changed, 1305 insertions(+), 889 deletions(-)
create mode 100644 Documentation/filesystems/dax.txt
delete mode 100644 Documentation/filesystems/xip.txt
create mode 100644 fs/dax.c
delete mode 100644 fs/ext2/xip.c
delete mode 100644 fs/ext2/xip.h
delete mode 100644 mm/filemap_xip.c
--
2.1.0

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Key fingerprint: 2A0B 4ED9 15F2 D3FA 45F5 B162 1728 0A97 8118 6ACF

Matthew Wilcox

2014-10-16 14:11:14 UTC

Post by Mathieu Desnoyers
First of all, thanks a lot for this patchset! Secondly, I must voice out
that you really need to work on your marketing skills. What your
changelog does not show is that this feature is tremendously useful
- On *any* platform for which you can teach the BIOS not to clear memory
on soft reboot,
- Use a kernel argument to restrict it to a portion of memory at boot
(e.g. 15GB out of 16GB),
- Create an ext4 or ext2 filesystem in this available memory area,
- Mount it with DAX flags,

Yes, I definitely suck at technical marketing. I was thinking that
"NV-DIMMs" were the new hotness, and definitely available today, and
so advertising support for them was the best way to go. I personally
do use your use case for testing DAX, but it didn't occur to me that
it would have real-world usages.

Post by Mathieu Desnoyers

From there, you can do lots of interesting stuff. In my use-case, I
would love to use it to mmap LTTng kernel/userspace tracer buffers, so
we can extract them after a soft reboot and analyze a system crash.
My recommendation would be to rename this patchset as e.g.
"DAX: Page cache bypass for in-memory persistent filesystems"
which might attract more interest from reviewers and maintainers, since
they can try it out today on commodity hardware. Also, pointing to
ext4 specifically in the patchset introduction title does not reflect
the content accurately, since there is also an ext2 implementation within
the series.
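
As an illustration of the tracer-buffer use case described above, here is a minimal userspace sketch in C. It is not LTTng code: the helper names map_trace_buffer() and write_record() are hypothetical, and the file path is whatever the caller passes in. On a filesystem mounted with DAX, the shared mapping would refer to the memory region directly rather than to page cache pages; on any other filesystem the same calls still work, only through the page cache.

```c
#define _XOPEN_SOURCE 700

#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: map `len` bytes of the file at `path` as a shared,
 * writable buffer, creating and sizing the file if needed.
 * Returns NULL on failure.
 */
static void *map_trace_buffer(const char *path, size_t len)
{
	int fd = open(path, O_RDWR | O_CREAT, 0600);
	void *buf;

	if (fd < 0)
		return NULL;

	/* Make sure the file is large enough to back the whole mapping. */
	if (ftruncate(fd, (off_t)len) != 0) {
		close(fd);
		return NULL;
	}

	/* MAP_SHARED so that stores reach the file, not a private copy. */
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping stays valid after the fd is closed */

	return buf == MAP_FAILED ? NULL : buf;
}

/*
 * Hypothetical helper: copy a NUL-terminated record to the start of the
 * buffer and flush it. Returns 0 on success, -1 on failure.
 */
static int write_record(void *buf, size_t len, const char *msg)
{
	size_t n = strlen(msg) + 1;

	assert(buf != NULL);
	if (n > len)
		return -1;

	memcpy(buf, msg, n);
	return msync(buf, len, MS_SYNC);
}
```

After a soft reboot, mapping the same file again with map_trace_buffer() would give access to whatever records were written before the crash, provided the memory region itself survived the reboot.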

Well ... ext2 already has the 'xip' implementation which probably works
well enough for enough of the time. Most people probably won't hit the
races it has.

Post by Mathieu Desnoyers
One thing I would really like to see is a Documentation file that
explains how to set up the kernel so it leaves a memory area free at the
end of the physical address space, and how to set up a filesystem in
it. Perhaps it already exists; in that case, pointing to it in the
patchset introduction changelog would be helpful. (IOW, answering the
question: how can someone test this today on commodity hardware?)
Also, if there are ways to set up pstore or such to achieve something
similar on a wider range of systems, it would be nice to see
documentation (or links to doc) explaining how to configure this.

I think that documentation properly belongs to the 'pmem' block driver that
Ross has been posting. Here's 1/4, which contains some documentation,
but I think you're after something more detailed:

http://marc.info/?l=linux-fsdevel&m=140917398012020&w=2

Post by Mathieu Desnoyers
I'll try to review your patchset soon, however keeping in mind that it
would be best to have mm experts having a look into it.

Yes, mm experts have many demands on their time, unfortunately :-(


Mathieu Desnoyers

2014-10-16 07:52:51 UTC

Post by Matthew Wilcox
The 'pfn' returned by axonram was completely bogus, and has been since
2008.

This should also be submitted for stable kernels. (CC

Post by Matthew Wilcox
---
arch/powerpc/sysdev/axonram.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 47b6b9f..830edc8 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -156,7 +156,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector,
}
*kaddr = (void *)(bank->ph_addr + offset);
- *pfn = virt_to_phys(kaddr) >> PAGE_SHIFT;
+ *pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
return 0;
}
--
2.1.0
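
For readers who want to see why the one-character change matters, here is a small userspace sketch of the bug. It is not the axonram driver: fake_virt_to_phys() is a hypothetical identity-mapping stand-in for the kernel's virt_to_phys(), PAGE_SHIFT is assumed to be 12, and the two direct_access_*() functions are simplified illustrations. The buggy variant translates `kaddr`, the address of the caller's pointer variable, while the fixed variant translates `*kaddr`, the address of the memory being handed out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages, for illustration only */

/* Hypothetical stand-in for virt_to_phys(): pretend the mapping is 1:1. */
static uintptr_t fake_virt_to_phys(const void *addr)
{
	return (uintptr_t)addr;
}

/* Mirrors the old code: translates the location of the caller's pointer. */
static void direct_access_buggy(char *base, size_t offset,
				void **kaddr, uintptr_t *pfn)
{
	*kaddr = base + offset;
	*pfn = fake_virt_to_phys(kaddr) >> PAGE_SHIFT;
}

/* Mirrors the patched code: translates the address stored in *kaddr. */
static void direct_access_fixed(char *base, size_t offset,
				void **kaddr, uintptr_t *pfn)
{
	*kaddr = base + offset;
	*pfn = fake_virt_to_phys(*kaddr) >> PAGE_SHIFT;
}

/* Print both results side by side for a caller-supplied memory bank. */
static void show_pfns(char *bank, size_t offset)
{
	void *kaddr = NULL;
	uintptr_t pfn = 0;

	direct_access_buggy(bank, offset, &kaddr, &pfn);
	printf("buggy: kaddr=%p pfn=%#lx (page holding &kaddr)\n",
	       kaddr, (unsigned long)pfn);

	direct_access_fixed(bank, offset, &kaddr, &pfn);
	printf("fixed: kaddr=%p pfn=%#lx (page holding *kaddr)\n",
	       kaddr, (unsigned long)pfn);
}
```

In the buggy variant the returned pfn describes wherever the caller happened to keep its pointer variable (typically its stack), which has nothing to do with the device memory.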

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Key fingerprint: 2A0B 4ED9 15F2 D3FA 45F5 B162 1728 0A97 8118 6ACF

