Discussion:
[RFC] block integrity: Fix write after checksum calculation problem
Darrick J. Wong
2011-02-22 02:00:22 UTC
Permalink
Hi all,

Last summer there was a long thread entitled "Wrong DIF guard tag on ext2
write" (http://marc.info/?l=linux-scsi&m=127530531808556&w=2) that started a
discussion about how to deal with the situation where one program tells the
kernel to write a block to disk, the kernel computes the checksum of that data,
and then a second program begins writing to that same block before the disk HBA
can DMA the memory block, thereby causing the disk to complain about being sent
invalid checksums.

By my recollection, several solutions were proposed, such as retrying the I/O
with a new checksum; using the VM to track and stop page writes during I/O;
creating shadow pages so that subsequent memory writes go to a different page;
punting the integrity failure to the app to let it deal with it; and only
allowing DIF mode when O_DIRECT is enabled.

Unfortunately, I didn't get a sense that any sort of consensus had been reached
(much less anything patched into the kernel) and since I was able to write a
trivial program to trigger the write problem, I'm pretty sure that this has not
been fixed upstream. (FYI, using O_DIRECT still seems fine.)

Below is a simple if naive solution: (ab)use the bounce buffering code to copy
the memory page just prior to calculating the checksum, and send the copy and
the checksum to the disk controller. With this patch applied, the invalid
guard tag messages go away. An optimization would be to perform the copy only
when memory contents change, but I wanted to ask peoples' opinions before
continuing. I don't imagine bounce buffering is particularly speedy, though I
haven't noticed any corruption errors or weirdness yet.

Anyway, I'm mostly wondering: what do people think of this as a starting point
to fixing the DIF checksum problem?

--
block integrity: Fix write after integrity checksum calculation problem

The kernel does not prohibit writes to a dirty page during a write IO. This
leads to the race where a memory write occurs after the integrity data has been
calculated but before the hardware initiates the DMA operation; when this
happens, the data does not match the checksum and the disk fails the write due
to an invalid checksum.

An awkward fix for now is to (ab)use the bounce buffering mechanism to snapshot
a page's contents just prior to calculating the checksum, and sending the copy
and its checksum to the hardware. This is probably terrible for performance.

Signed-off-by: Darrick J. Wong <***@us.ibm.com>
---

block/Kconfig | 1
block/blk-core.c | 2 -
block/blk-integrity.c | 2 +
fs/bio-integrity.c | 12 +++-
include/linux/bio.h | 4 +
include/linux/blkdev.h | 12 ++++
mm/bounce.c | 129 ++++++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 154 insertions(+), 8 deletions(-)

diff --git a/block/Kconfig b/block/Kconfig
index 60be1e0..ed89f19 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -66,6 +66,7 @@ config BLK_DEV_BSG
If unsure, say Y.

config BLK_DEV_INTEGRITY
+ depends on BOUNCE=y
bool "Block layer data integrity support"
---help---
Some storage devices allow extra information to be
diff --git a/block/blk-core.c b/block/blk-core.c
index 3cc3bed..81a4e50 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1503,7 +1503,7 @@ static inline void __generic_make_request(struct bio *bio)
*/
blk_partition_remap(bio);

- if (bio_integrity_enabled(bio) && bio_integrity_prep(bio))
+ if (bio_integrity_enabled(bio) && bio_integrity_prep(&bio))
goto end_io;

if (old_sector != -1)
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 54bcba6..24843f8 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -376,6 +376,8 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)

BUG_ON(disk == NULL);

+ init_integrity_pool();
+
if (disk->integrity == NULL) {
bi = kmem_cache_alloc(integrity_cachep,
GFP_KERNEL | __GFP_ZERO);
diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index df4fdfd..9eee904 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -393,8 +393,9 @@ static inline unsigned short blk_integrity_tuple_size(struct blk_integrity *bi)
* block device's integrity function. In the READ case, the buffer
* will be prepared for DMA and a suitable end_io handler set up.
*/
-int bio_integrity_prep(struct bio *bio)
+int bio_integrity_prep(struct bio **pbio)
{
+ struct bio *bio;
struct bio_integrity_payload *bip;
struct blk_integrity *bi;
struct request_queue *q;
@@ -404,10 +405,13 @@ int bio_integrity_prep(struct bio *bio)
unsigned int bytes, offset, i;
unsigned int sectors;

- bi = bdev_get_integrity(bio->bi_bdev);
- q = bdev_get_queue(bio->bi_bdev);
+ bi = bdev_get_integrity((*pbio)->bi_bdev);
+ q = bdev_get_queue((*pbio)->bi_bdev);
BUG_ON(bi == NULL);
- BUG_ON(bio_integrity(bio));
+ BUG_ON(bio_integrity(*pbio));
+
+ blk_queue_integrity_bounce(q, pbio);
+ bio = *pbio;

sectors = bio_integrity_hw_sectors(bi, bio_sectors(bio));

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 36d10f0..1834934 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -505,7 +505,7 @@ static inline struct bio *bio_list_get(struct bio_list *bl)
for_each_bio(_bio) \
bip_for_each_vec(_bvl, _bio->bi_integrity, _iter)

-#define bio_integrity(bio) (bio->bi_integrity != NULL)
+#define bio_integrity(bio) ((bio)->bi_integrity != NULL)

extern struct bio_integrity_payload *bio_integrity_alloc_bioset(struct bio *, gfp_t, unsigned int, struct bio_set *);
extern struct bio_integrity_payload *bio_integrity_alloc(struct bio *, gfp_t, unsigned int);
@@ -514,7 +514,7 @@ extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, uns
extern int bio_integrity_enabled(struct bio *bio);
extern int bio_integrity_set_tag(struct bio *, void *, unsigned int);
extern int bio_integrity_get_tag(struct bio *, void *, unsigned int);
-extern int bio_integrity_prep(struct bio *);
+extern int bio_integrity_prep(struct bio **);
extern void bio_integrity_endio(struct bio *, int);
extern void bio_integrity_advance(struct bio *, unsigned int);
extern void bio_integrity_trim(struct bio *, unsigned int, unsigned int);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8a082a5..1c9fc40 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -599,6 +599,18 @@ extern unsigned long blk_max_low_pfn, blk_max_pfn;
#define BLK_DEFAULT_SG_TIMEOUT (60 * HZ)
#define BLK_MIN_SG_TIMEOUT (7 * HZ)

+#ifdef CONFIG_BLK_DEV_INTEGRITY
+extern int init_integrity_pool(void);
+extern void blk_queue_integrity_bounce(struct request_queue *q,
+ struct bio **bio);
+#else
+#define init_integrity_pool() do { } while (0);
+static inline void blk_queue_integrity_bounce(struct request_queue *q,
+ struct bio **bio)
+{
+}
+#endif
+
#ifdef CONFIG_BOUNCE
extern int init_emergency_isa_pool(void);
extern void blk_queue_bounce(struct request_queue *q, struct bio **bio);
diff --git a/mm/bounce.c b/mm/bounce.c
index 1481de6..c0ce533 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -21,7 +21,19 @@
#define POOL_SIZE 64
#define ISA_POOL_SIZE 16

-static mempool_t *page_pool, *isa_page_pool;
+static mempool_t *integrity_pool, *page_pool, *isa_page_pool;
+
+int init_integrity_pool(void)
+{
+ if (integrity_pool)
+ return 0;
+
+ integrity_pool = mempool_create_page_pool(POOL_SIZE, 0);
+ BUG_ON(!integrity_pool);
+ printk(KERN_INFO "integrity bounce pool size: %d pages\n", POOL_SIZE);
+
+ return 0;
+}

#ifdef CONFIG_HIGHMEM
static __init int init_emergency_pool(void)
@@ -151,6 +163,11 @@ static void bounce_end_io_write(struct bio *bio, int err)
bounce_end_io(bio, page_pool, err);
}

+static void bounce_end_integrity_io_write(struct bio *bio, int err)
+{
+ bounce_end_io(bio, integrity_pool, err);
+}
+
static void bounce_end_io_write_isa(struct bio *bio, int err)
{

@@ -298,3 +315,113 @@ void blk_queue_bounce(struct request_queue *q, struct bio **bio_orig)
}

EXPORT_SYMBOL(blk_queue_bounce);
+
+/*
+ * Fix the problem of pages getting dirtied after the integrity checksum
+ * calculation by copying all writes to a shadow buffer prior to computing
+ * checksums.
+ */
+static void __blk_queue_integrity_bounce(struct request_queue *q,
+ struct bio **bio_orig,
+ mempool_t *pool)
+{
+ struct page *page;
+ struct bio *bio = NULL;
+ int i;
+ struct bio_vec *to, *from;
+
+ bio_for_each_segment(from, *bio_orig, i) {
+ char *vto, *vfrom;
+ page = from->bv_page;
+
+ /*
+ * irk, bounce it
+ */
+ if (!bio) {
+ unsigned int cnt = (*bio_orig)->bi_vcnt;
+
+ bio = bio_alloc(GFP_NOIO, cnt);
+ memset(bio->bi_io_vec, 0, cnt * sizeof(struct bio_vec));
+ }
+
+
+ to = bio->bi_io_vec + i;
+
+ to->bv_page = mempool_alloc(pool, q->bounce_gfp);
+ to->bv_len = from->bv_len;
+ to->bv_offset = from->bv_offset;
+ inc_zone_page_state(to->bv_page, NR_BOUNCE);
+
+ flush_dcache_page(from->bv_page);
+ vto = page_address(to->bv_page) + to->bv_offset;
+ vfrom = kmap(from->bv_page) + from->bv_offset;
+ memcpy(vto, vfrom, to->bv_len);
+ kunmap(from->bv_page);
+ }
+
+ /*
+ * no pages bounced
+ */
+ if (!bio)
+ return;
+
+ trace_block_bio_bounce(q, *bio_orig);
+
+ /*
+ * at least one page was bounced, fill in possible non-highmem
+ * pages
+ */
+ __bio_for_each_segment(from, *bio_orig, i, 0) {
+ to = bio_iovec_idx(bio, i);
+ if (!to->bv_page) {
+ to->bv_page = from->bv_page;
+ to->bv_len = from->bv_len;
+ to->bv_offset = from->bv_offset;
+ }
+ }
+
+ bio->bi_bdev = (*bio_orig)->bi_bdev;
+ bio->bi_flags |= (1 << BIO_BOUNCED);
+ bio->bi_sector = (*bio_orig)->bi_sector;
+ bio->bi_rw = (*bio_orig)->bi_rw;
+
+ bio->bi_vcnt = (*bio_orig)->bi_vcnt;
+ bio->bi_idx = (*bio_orig)->bi_idx;
+ bio->bi_size = (*bio_orig)->bi_size;
+
+ if (pool == integrity_pool)
+ bio->bi_end_io = bounce_end_integrity_io_write;
+ else
+ bio->bi_end_io = bounce_end_io_write_isa;
+
+ bio->bi_private = *bio_orig;
+ *bio_orig = bio;
+}
+
+void blk_queue_integrity_bounce(struct request_queue *q, struct bio **bio_orig)
+{
+ mempool_t *pool;
+
+ /* only do this for writes */
+ if (bio_data_dir(*bio_orig) != WRITE)
+ return;
+ /*
+ * Data-less bio, nothing to bounce
+ */
+ if (!bio_has_data(*bio_orig))
+ return;
+
+ if (!(q->bounce_gfp & GFP_DMA)) {
+ BUG_ON(!integrity_pool);
+ pool = integrity_pool;
+ } else {
+ BUG_ON(!isa_page_pool);
+ pool = isa_page_pool;
+ }
+
+ /*
+ * slow path
+ */
+ __blk_queue_integrity_bounce(q, bio_orig, pool);
+}
+EXPORT_SYMBOL(blk_queue_integrity_bounce);
Boaz Harrosh
2011-02-22 05:45:51 UTC
Permalink
Post by Darrick J. Wong
Hi all,
Last summer there was a long thread entitled "Wrong DIF guard tag on ext2
write" (http://marc.info/?l=linux-scsi&m=127530531808556&w=2) that started a
discussion about how to deal with the situation where one program tells the
kernel to write a block to disk, the kernel computes the checksum of that data,
and then a second program begins writing to that same block before the disk HBA
can DMA the memory block, thereby causing the disk to complain about being sent
invalid checksums.
The brokenness is in ext2/3 if you'll use btrfs, xfs and I think late versions
of ext4 it should work much better. (If you still have problems please report
them, those FSs advertise stable pages write-out)

This problem is easily fixed at the FS layer or even at VFS, by overriding mk_write
and syncing with write-out for example by taking the page-lock. Currently each
FS is to itself because in VFS it would force the behaviour on FSs that it does
not make sense to.

Note that the proper solution does not copy any data, just forces the app to
wait before changing write-out pages.

Boaz
Post by Darrick J. Wong
By my recollection, several solutions were proposed, such as retrying the I/O
with a new checksum; using the VM to track and stop page writes during I/O;
creating shadow pages so that subsequent memory writes go to a different page;
punting the integrity failure to the app to let it deal with it; and only
allowing DIF mode when O_DIRECT is enabled.
Unfortunately, I didn't get a sense that any sort of consensus had been reached
(much less anything patched into the kernel) and since I was able to write a
trivial program to trigger the write problem, I'm pretty sure that this has not
been fixed upstream. (FYI, using O_DIRECT still seems fine.)
Below is a simple if naive solution: (ab)use the bounce buffering code to copy
the memory page just prior to calculating the checksum, and send the copy and
the checksum to the disk controller. With this patch applied, the invalid
guard tag messages go away. An optimization would be to perform the copy only
when memory contents change, but I wanted to ask peoples' opinions before
continuing. I don't imagine bounce buffering is particularly speedy, though I
haven't noticed any corruption errors or weirdness yet.
Anyway, I'm mostly wondering: what do people think of this as a starting point
to fixing the DIF checksum problem?
--
block integrity: Fix write after integrity checksum calculation problem
The kernel does not prohibit writes to a dirty page during a write IO. This
leads to the race where a memory write occurs after the integrity data has been
calculated but before the hardware initiates the DMA operation; when this
happens, the data does not match the checksum and the disk fails the write due
to an invalid checksum.
An awkward fix for now is to (ab)use the bounce buffering mechanism to snapshot
a page's contents just prior to calculating the checksum, and sending the copy
and its checksum to the hardware. This is probably terrible for performance.
---
block/Kconfig | 1
block/blk-core.c | 2 -
block/blk-integrity.c | 2 +
fs/bio-integrity.c | 12 +++-
include/linux/bio.h | 4 +
include/linux/blkdev.h | 12 ++++
mm/bounce.c | 129 ++++++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 154 insertions(+), 8 deletions(-)
diff --git a/block/Kconfig b/block/Kconfig
index 60be1e0..ed89f19 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -66,6 +66,7 @@ config BLK_DEV_BSG
If unsure, say Y.
config BLK_DEV_INTEGRITY
+ depends on BOUNCE=y
bool "Block layer data integrity support"
---help---
Some storage devices allow extra information to be
diff --git a/block/blk-core.c b/block/blk-core.c
index 3cc3bed..81a4e50 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1503,7 +1503,7 @@ static inline void __generic_make_request(struct bio *bio)
*/
blk_partition_remap(bio);
- if (bio_integrity_enabled(bio) && bio_integrity_prep(bio))
+ if (bio_integrity_enabled(bio) && bio_integrity_prep(&bio))
goto end_io;
if (old_sector != -1)
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 54bcba6..24843f8 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -376,6 +376,8 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
BUG_ON(disk == NULL);
+ init_integrity_pool();
+
if (disk->integrity == NULL) {
bi = kmem_cache_alloc(integrity_cachep,
GFP_KERNEL | __GFP_ZERO);
diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index df4fdfd..9eee904 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -393,8 +393,9 @@ static inline unsigned short blk_integrity_tuple_size(struct blk_integrity *bi)
* block device's integrity function. In the READ case, the buffer
* will be prepared for DMA and a suitable end_io handler set up.
*/
-int bio_integrity_prep(struct bio *bio)
+int bio_integrity_prep(struct bio **pbio)
{
+ struct bio *bio;
struct bio_integrity_payload *bip;
struct blk_integrity *bi;
struct request_queue *q;
@@ -404,10 +405,13 @@ int bio_integrity_prep(struct bio *bio)
unsigned int bytes, offset, i;
unsigned int sectors;
- bi = bdev_get_integrity(bio->bi_bdev);
- q = bdev_get_queue(bio->bi_bdev);
+ bi = bdev_get_integrity((*pbio)->bi_bdev);
+ q = bdev_get_queue((*pbio)->bi_bdev);
BUG_ON(bi == NULL);
- BUG_ON(bio_integrity(bio));
+ BUG_ON(bio_integrity(*pbio));
+
+ blk_queue_integrity_bounce(q, pbio);
+ bio = *pbio;
sectors = bio_integrity_hw_sectors(bi, bio_sectors(bio));
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 36d10f0..1834934 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -505,7 +505,7 @@ static inline struct bio *bio_list_get(struct bio_list *bl)
for_each_bio(_bio) \
bip_for_each_vec(_bvl, _bio->bi_integrity, _iter)
-#define bio_integrity(bio) (bio->bi_integrity != NULL)
+#define bio_integrity(bio) ((bio)->bi_integrity != NULL)
extern struct bio_integrity_payload *bio_integrity_alloc_bioset(struct bio *, gfp_t, unsigned int, struct bio_set *);
extern struct bio_integrity_payload *bio_integrity_alloc(struct bio *, gfp_t, unsigned int);
@@ -514,7 +514,7 @@ extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, uns
extern int bio_integrity_enabled(struct bio *bio);
extern int bio_integrity_set_tag(struct bio *, void *, unsigned int);
extern int bio_integrity_get_tag(struct bio *, void *, unsigned int);
-extern int bio_integrity_prep(struct bio *);
+extern int bio_integrity_prep(struct bio **);
extern void bio_integrity_endio(struct bio *, int);
extern void bio_integrity_advance(struct bio *, unsigned int);
extern void bio_integrity_trim(struct bio *, unsigned int, unsigned int);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8a082a5..1c9fc40 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -599,6 +599,18 @@ extern unsigned long blk_max_low_pfn, blk_max_pfn;
#define BLK_DEFAULT_SG_TIMEOUT (60 * HZ)
#define BLK_MIN_SG_TIMEOUT (7 * HZ)
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+extern int init_integrity_pool(void);
+extern void blk_queue_integrity_bounce(struct request_queue *q,
+ struct bio **bio);
+#else
+#define init_integrity_pool() do { } while (0);
+static inline void blk_queue_integrity_bounce(struct request_queue *q,
+ struct bio **bio)
+{
+}
+#endif
+
#ifdef CONFIG_BOUNCE
extern int init_emergency_isa_pool(void);
extern void blk_queue_bounce(struct request_queue *q, struct bio **bio);
diff --git a/mm/bounce.c b/mm/bounce.c
index 1481de6..c0ce533 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -21,7 +21,19 @@
#define POOL_SIZE 64
#define ISA_POOL_SIZE 16
-static mempool_t *page_pool, *isa_page_pool;
+static mempool_t *integrity_pool, *page_pool, *isa_page_pool;
+
+int init_integrity_pool(void)
+{
+ if (integrity_pool)
+ return 0;
+
+ integrity_pool = mempool_create_page_pool(POOL_SIZE, 0);
+ BUG_ON(!integrity_pool);
+ printk(KERN_INFO "integrity bounce pool size: %d pages\n", POOL_SIZE);
+
+ return 0;
+}
#ifdef CONFIG_HIGHMEM
static __init int init_emergency_pool(void)
@@ -151,6 +163,11 @@ static void bounce_end_io_write(struct bio *bio, int err)
bounce_end_io(bio, page_pool, err);
}
+static void bounce_end_integrity_io_write(struct bio *bio, int err)
+{
+ bounce_end_io(bio, integrity_pool, err);
+}
+
static void bounce_end_io_write_isa(struct bio *bio, int err)
{
@@ -298,3 +315,113 @@ void blk_queue_bounce(struct request_queue *q, struct bio **bio_orig)
}
EXPORT_SYMBOL(blk_queue_bounce);
+
+/*
+ * Fix the problem of pages getting dirtied after the integrity checksum
+ * calculation by copying all writes to a shadow buffer prior to computing
+ * checksums.
+ */
+static void __blk_queue_integrity_bounce(struct request_queue *q,
+ struct bio **bio_orig,
+ mempool_t *pool)
+{
+ struct page *page;
+ struct bio *bio = NULL;
+ int i;
+ struct bio_vec *to, *from;
+
+ bio_for_each_segment(from, *bio_orig, i) {
+ char *vto, *vfrom;
+ page = from->bv_page;
+
+ /*
+ * irk, bounce it
+ */
+ if (!bio) {
+ unsigned int cnt = (*bio_orig)->bi_vcnt;
+
+ bio = bio_alloc(GFP_NOIO, cnt);
+ memset(bio->bi_io_vec, 0, cnt * sizeof(struct bio_vec));
+ }
+
+
+ to = bio->bi_io_vec + i;
+
+ to->bv_page = mempool_alloc(pool, q->bounce_gfp);
+ to->bv_len = from->bv_len;
+ to->bv_offset = from->bv_offset;
+ inc_zone_page_state(to->bv_page, NR_BOUNCE);
+
+ flush_dcache_page(from->bv_page);
+ vto = page_address(to->bv_page) + to->bv_offset;
+ vfrom = kmap(from->bv_page) + from->bv_offset;
+ memcpy(vto, vfrom, to->bv_len);
+ kunmap(from->bv_page);
+ }
+
+ /*
+ * no pages bounced
+ */
+ if (!bio)
+ return;
+
+ trace_block_bio_bounce(q, *bio_orig);
+
+ /*
+ * at least one page was bounced, fill in possible non-highmem
+ * pages
+ */
+ __bio_for_each_segment(from, *bio_orig, i, 0) {
+ to = bio_iovec_idx(bio, i);
+ if (!to->bv_page) {
+ to->bv_page = from->bv_page;
+ to->bv_len = from->bv_len;
+ to->bv_offset = from->bv_offset;
+ }
+ }
+
+ bio->bi_bdev = (*bio_orig)->bi_bdev;
+ bio->bi_flags |= (1 << BIO_BOUNCED);
+ bio->bi_sector = (*bio_orig)->bi_sector;
+ bio->bi_rw = (*bio_orig)->bi_rw;
+
+ bio->bi_vcnt = (*bio_orig)->bi_vcnt;
+ bio->bi_idx = (*bio_orig)->bi_idx;
+ bio->bi_size = (*bio_orig)->bi_size;
+
+ if (pool == integrity_pool)
+ bio->bi_end_io = bounce_end_integrity_io_write;
+ else
+ bio->bi_end_io = bounce_end_io_write_isa;
+
+ bio->bi_private = *bio_orig;
+ *bio_orig = bio;
+}
+
+void blk_queue_integrity_bounce(struct request_queue *q, struct bio **bio_orig)
+{
+ mempool_t *pool;
+
+ /* only do this for writes */
+ if (bio_data_dir(*bio_orig) != WRITE)
+ return;
+ /*
+ * Data-less bio, nothing to bounce
+ */
+ if (!bio_has_data(*bio_orig))
+ return;
+
+ if (!(q->bounce_gfp & GFP_DMA)) {
+ BUG_ON(!integrity_pool);
+ pool = integrity_pool;
+ } else {
+ BUG_ON(!isa_page_pool);
+ pool = isa_page_pool;
+ }
+
+ /*
+ * slow path
+ */
+ __blk_queue_integrity_bounce(q, bio_orig, pool);
+}
+EXPORT_SYMBOL(blk_queue_integrity_bounce);
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jan Kara
2011-02-22 11:42:22 UTC
Permalink
Hi Boaz,
Post by Boaz Harrosh
Post by Darrick J. Wong
Last summer there was a long thread entitled "Wrong DIF guard tag on ext2
write" (http://marc.info/?l=linux-scsi&m=127530531808556&w=2) that started a
discussion about how to deal with the situation where one program tells the
kernel to write a block to disk, the kernel computes the checksum of that data,
and then a second program begins writing to that same block before the disk HBA
can DMA the memory block, thereby causing the disk to complain about being sent
invalid checksums.
The brokenness is in ext2/3 if you'll use btrfs, xfs and I think late versions
of ext4 it should work much better. (If you still have problems please report
them, those FSs advertise stable pages write-out)
Do they? I've just checked ext4 and xfs and they don't seem to enforce
stable pages. They do lock the page (which implicitely happens in mm code
for any filesystem BTW) but this is not enough. You have to wait for
PageWriteback to get cleared and only btrfs does that.
Post by Boaz Harrosh
This problem is easily fixed at the FS layer or even at VFS, by overriding mk_write
and syncing with write-out for example by taking the page-lock. Currently each
FS is to itself because in VFS it would force the behaviour on FSs that it does
not make sense to.
Yes, it's easy to fix but at a performance cost for any application doing
frequent rewrites regardless whether integrity features are used or not.
And I don't think that's a good thing. I even remember someone measured the
hit last time this came up and it was rather noticeable.
Post by Boaz Harrosh
Note that the proper solution does not copy any data, just forces the app to
wait before changing write-out pages.
I think that's up for discussion. In fact what is going to be faster
depends pretty much on your system config. If you have enough CPU/RAM
bandwidth compared to storage speed, you're better of doing copying. If
you can barely saturate storage with your CPU/RAM, waiting is probably
better for you.

Moreover if you do data copyout, you push the performance cost only on
users of the integrity feature which is nice. But on the other hand users
of integrity take the cost even if they are not doing rewrites.

A solution which is technically plausible and penalizing only rewrites
of data-integrity protected pages would be a use of shadow pages as Darrick
describes below. So I'd lean towards that long term. But for now I think
Darrick's solution is OK to make the integrity feature actually useful and
later someone can try something more clever.

Honza
Post by Boaz Harrosh
Post by Darrick J. Wong
By my recollection, several solutions were proposed, such as retrying the I/O
with a new checksum; using the VM to track and stop page writes during I/O;
creating shadow pages so that subsequent memory writes go to a different page;
punting the integrity failure to the app to let it deal with it; and only
allowing DIF mode when O_DIRECT is enabled.
Unfortunately, I didn't get a sense that any sort of consensus had been reached
(much less anything patched into the kernel) and since I was able to write a
trivial program to trigger the write problem, I'm pretty sure that this has not
been fixed upstream. (FYI, using O_DIRECT still seems fine.)
Below is a simple if naive solution: (ab)use the bounce buffering code to copy
the memory page just prior to calculating the checksum, and send the copy and
the checksum to the disk controller. With this patch applied, the invalid
guard tag messages go away. An optimization would be to perform the copy only
when memory contents change, but I wanted to ask peoples' opinions before
continuing. I don't imagine bounce buffering is particularly speedy, though I
haven't noticed any corruption errors or weirdness yet.
Anyway, I'm mostly wondering: what do people think of this as a starting point
to fixing the DIF checksum problem?
--
block integrity: Fix write after integrity checksum calculation problem
The kernel does not prohibit writes to a dirty page during a write IO. This
leads to the race where a memory write occurs after the integrity data has been
calculated but before the hardware initiates the DMA operation; when this
happens, the data does not match the checksum and the disk fails the write due
to an invalid checksum.
An awkward fix for now is to (ab)use the bounce buffering mechanism to snapshot
a page's contents just prior to calculating the checksum, and sending the copy
and its checksum to the hardware. This is probably terrible for performance.
---
block/Kconfig | 1
block/blk-core.c | 2 -
block/blk-integrity.c | 2 +
fs/bio-integrity.c | 12 +++-
include/linux/bio.h | 4 +
include/linux/blkdev.h | 12 ++++
mm/bounce.c | 129 ++++++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 154 insertions(+), 8 deletions(-)
diff --git a/block/Kconfig b/block/Kconfig
index 60be1e0..ed89f19 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -66,6 +66,7 @@ config BLK_DEV_BSG
If unsure, say Y.
config BLK_DEV_INTEGRITY
+ depends on BOUNCE=y
bool "Block layer data integrity support"
---help---
Some storage devices allow extra information to be
diff --git a/block/blk-core.c b/block/blk-core.c
index 3cc3bed..81a4e50 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1503,7 +1503,7 @@ static inline void __generic_make_request(struct bio *bio)
*/
blk_partition_remap(bio);
- if (bio_integrity_enabled(bio) && bio_integrity_prep(bio))
+ if (bio_integrity_enabled(bio) && bio_integrity_prep(&bio))
goto end_io;
if (old_sector != -1)
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 54bcba6..24843f8 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -376,6 +376,8 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
BUG_ON(disk == NULL);
+ init_integrity_pool();
+
if (disk->integrity == NULL) {
bi = kmem_cache_alloc(integrity_cachep,
GFP_KERNEL | __GFP_ZERO);
diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index df4fdfd..9eee904 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -393,8 +393,9 @@ static inline unsigned short blk_integrity_tuple_size(struct blk_integrity *bi)
* block device's integrity function. In the READ case, the buffer
* will be prepared for DMA and a suitable end_io handler set up.
*/
-int bio_integrity_prep(struct bio *bio)
+int bio_integrity_prep(struct bio **pbio)
{
+ struct bio *bio;
struct bio_integrity_payload *bip;
struct blk_integrity *bi;
struct request_queue *q;
@@ -404,10 +405,13 @@ int bio_integrity_prep(struct bio *bio)
unsigned int bytes, offset, i;
unsigned int sectors;
- bi = bdev_get_integrity(bio->bi_bdev);
- q = bdev_get_queue(bio->bi_bdev);
+ bi = bdev_get_integrity((*pbio)->bi_bdev);
+ q = bdev_get_queue((*pbio)->bi_bdev);
BUG_ON(bi == NULL);
- BUG_ON(bio_integrity(bio));
+ BUG_ON(bio_integrity(*pbio));
+
+ blk_queue_integrity_bounce(q, pbio);
+ bio = *pbio;
sectors = bio_integrity_hw_sectors(bi, bio_sectors(bio));
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 36d10f0..1834934 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -505,7 +505,7 @@ static inline struct bio *bio_list_get(struct bio_list *bl)
for_each_bio(_bio) \
bip_for_each_vec(_bvl, _bio->bi_integrity, _iter)
-#define bio_integrity(bio) (bio->bi_integrity != NULL)
+#define bio_integrity(bio) ((bio)->bi_integrity != NULL)
extern struct bio_integrity_payload *bio_integrity_alloc_bioset(struct bio *, gfp_t, unsigned int, struct bio_set *);
extern struct bio_integrity_payload *bio_integrity_alloc(struct bio *, gfp_t, unsigned int);
@@ -514,7 +514,7 @@ extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, uns
extern int bio_integrity_enabled(struct bio *bio);
extern int bio_integrity_set_tag(struct bio *, void *, unsigned int);
extern int bio_integrity_get_tag(struct bio *, void *, unsigned int);
-extern int bio_integrity_prep(struct bio *);
+extern int bio_integrity_prep(struct bio **);
extern void bio_integrity_endio(struct bio *, int);
extern void bio_integrity_advance(struct bio *, unsigned int);
extern void bio_integrity_trim(struct bio *, unsigned int, unsigned int);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8a082a5..1c9fc40 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -599,6 +599,18 @@ extern unsigned long blk_max_low_pfn, blk_max_pfn;
#define BLK_DEFAULT_SG_TIMEOUT (60 * HZ)
#define BLK_MIN_SG_TIMEOUT (7 * HZ)
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+extern int init_integrity_pool(void);
+extern void blk_queue_integrity_bounce(struct request_queue *q,
+ struct bio **bio);
+#else
+#define init_integrity_pool() do { } while (0);
+static inline void blk_queue_integrity_bounce(struct request_queue *q,
+ struct bio **bio)
+{
+}
+#endif
+
#ifdef CONFIG_BOUNCE
extern int init_emergency_isa_pool(void);
extern void blk_queue_bounce(struct request_queue *q, struct bio **bio);
diff --git a/mm/bounce.c b/mm/bounce.c
index 1481de6..c0ce533 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -21,7 +21,19 @@
#define POOL_SIZE 64
#define ISA_POOL_SIZE 16
-static mempool_t *page_pool, *isa_page_pool;
+static mempool_t *integrity_pool, *page_pool, *isa_page_pool;
+
+int init_integrity_pool(void)
+{
+ if (integrity_pool)
+ return 0;
+
+ integrity_pool = mempool_create_page_pool(POOL_SIZE, 0);
+ BUG_ON(!integrity_pool);
+ printk(KERN_INFO "integrity bounce pool size: %d pages\n", POOL_SIZE);
+
+ return 0;
+}
#ifdef CONFIG_HIGHMEM
static __init int init_emergency_pool(void)
@@ -151,6 +163,11 @@ static void bounce_end_io_write(struct bio *bio, int err)
bounce_end_io(bio, page_pool, err);
}
+static void bounce_end_integrity_io_write(struct bio *bio, int err)
+{
+ bounce_end_io(bio, integrity_pool, err);
+}
+
static void bounce_end_io_write_isa(struct bio *bio, int err)
{
@@ -298,3 +315,113 @@ void blk_queue_bounce(struct request_queue *q, struct bio **bio_orig)
}
EXPORT_SYMBOL(blk_queue_bounce);
+
+/*
+ * Fix the problem of pages getting dirtied after the integrity checksum
+ * calculation by copying all writes to a shadow buffer prior to computing
+ * checksums.
+ */
+static void __blk_queue_integrity_bounce(struct request_queue *q,
+ struct bio **bio_orig,
+ mempool_t *pool)
+{
+ struct page *page;
+ struct bio *bio = NULL;
+ int i;
+ struct bio_vec *to, *from;
+
+ bio_for_each_segment(from, *bio_orig, i) {
+ char *vto, *vfrom;
+ page = from->bv_page;
+
+ /*
+ * irk, bounce it
+ */
+ if (!bio) {
+ unsigned int cnt = (*bio_orig)->bi_vcnt;
+
+ bio = bio_alloc(GFP_NOIO, cnt);
+ memset(bio->bi_io_vec, 0, cnt * sizeof(struct bio_vec));
+ }
+
+
+ to = bio->bi_io_vec + i;
+
+ to->bv_page = mempool_alloc(pool, q->bounce_gfp);
+ to->bv_len = from->bv_len;
+ to->bv_offset = from->bv_offset;
+ inc_zone_page_state(to->bv_page, NR_BOUNCE);
+
+ flush_dcache_page(from->bv_page);
+ vto = page_address(to->bv_page) + to->bv_offset;
+ vfrom = kmap(from->bv_page) + from->bv_offset;
+ memcpy(vto, vfrom, to->bv_len);
+ kunmap(from->bv_page);
+ }
+
+ /*
+ * no pages bounced
+ */
+ if (!bio)
+ return;
+
+ trace_block_bio_bounce(q, *bio_orig);
+
+ /*
+ * at least one page was bounced, fill in possible non-highmem
+ * pages
+ */
+ __bio_for_each_segment(from, *bio_orig, i, 0) {
+ to = bio_iovec_idx(bio, i);
+ if (!to->bv_page) {
+ to->bv_page = from->bv_page;
+ to->bv_len = from->bv_len;
+ to->bv_offset = from->bv_offset;
+ }
+ }
+
+ bio->bi_bdev = (*bio_orig)->bi_bdev;
+ bio->bi_flags |= (1 << BIO_BOUNCED);
+ bio->bi_sector = (*bio_orig)->bi_sector;
+ bio->bi_rw = (*bio_orig)->bi_rw;
+
+ bio->bi_vcnt = (*bio_orig)->bi_vcnt;
+ bio->bi_idx = (*bio_orig)->bi_idx;
+ bio->bi_size = (*bio_orig)->bi_size;
+
+ if (pool == integrity_pool)
+ bio->bi_end_io = bounce_end_integrity_io_write;
+ else
+ bio->bi_end_io = bounce_end_io_write_isa;
+
+ bio->bi_private = *bio_orig;
+ *bio_orig = bio;
+}
+
+void blk_queue_integrity_bounce(struct request_queue *q, struct bio **bio_orig)
+{
+ mempool_t *pool;
+
+ /* only do this for writes */
+ if (bio_data_dir(*bio_orig) != WRITE)
+ return;
+ /*
+ * Data-less bio, nothing to bounce
+ */
+ if (!bio_has_data(*bio_orig))
+ return;
+
+ if (!(q->bounce_gfp & GFP_DMA)) {
+ BUG_ON(!integrity_pool);
+ pool = integrity_pool;
+ } else {
+ BUG_ON(!isa_page_pool);
+ pool = isa_page_pool;
+ }
+
+ /*
+ * slow path
+ */
+ __blk_queue_integrity_bounce(q, bio_orig, pool);
+}
+EXPORT_SYMBOL(blk_queue_integrity_bounce);
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <***@suse.cz>
SUSE Labs, CR
Chris Mason
2011-02-22 13:02:10 UTC
Permalink
[ resend sorry if you get this twice ]
Post by Jan Kara
Hi Boaz,
Post by Boaz Harrosh
Post by Darrick J. Wong
Last summer there was a long thread entitled "Wrong DIF guard tag on ext2
write" (http://marc.info/?l=linux-scsi&m=127530531808556&w=2) that started a
discussion about how to deal with the situation where one program tells the
kernel to write a block to disk, the kernel computes the checksum of that data,
and then a second program begins writing to that same block before the disk HBA
can DMA the memory block, thereby causing the disk to complain about being sent
invalid checksums.
The brokenness is in ext2/3 if you'll use btrfs, xfs and I think late versions
of ext4 it should work much better. (If you still have problems please report
them, those FSs advertise stable pages write-out)
Do they? I've just checked ext4 and xfs and they don't seem to enforce
stable pages. They do lock the page (which implicitely happens in mm code
for any filesystem BTW) but this is not enough. You have to wait for
PageWriteback to get cleared and only btrfs does that.
Post by Boaz Harrosh
This problem is easily fixed at the FS layer or even at VFS, by overriding mk_write
and syncing with write-out for example by taking the page-lock. Currently each
FS is to itself because in VFS it would force the behaviour on FSs that it does
not make sense to.
Yes, it's easy to fix but at a performance cost for any application doing
frequent rewrites regardless whether integrity features are used or not.
And I don't think that's a good thing. I even remember someone measured the
hit last time this came up and it was rather noticeable.
Do you remember which workload this was? I do remember someone
mentioning a specific workload, but can't recall which one now. fsx is
definitely slower when we wait for writeback, but that's because it's
all evil inside.
Post by Jan Kara
Post by Boaz Harrosh
Note that the proper solution does not copy any data, just forces the app to
wait before changing write-out pages.
I think that's up for discussion. In fact what is going to be faster
depends pretty much on your system config. If you have enough CPU/RAM
bandwidth compared to storage speed, you're better of doing copying. If
you can barely saturate storage with your CPU/RAM, waiting is probably
better for you.
Moreover if you do data copyout, you push the performance cost only on
users of the integrity feature which is nice. But on the other hand users
of integrity take the cost even if they are not doing rewrites.
A solution which is technically plausible and penalizing only rewrites
of data-integrity protected pages would be a use of shadow pages as Darrick
describes below. So I'd lean towards that long term. But for now I think
Darrick's solution is OK to make the integrity feature actually useful and
later someone can try something more clever.
Rewrites in flight should be very rare though, and I think the bouncing
is going to have a big impact on the intended workloads. It's not just
the cost of the copy, it's also the increased time as we beat on the
page allocator.

We're working on adding stable pages to ext34 and other filesystems
missing it. When the work is done we can benchmark and decide on the
tradeoffs.

-chris
Boaz Harrosh
2011-02-22 19:13:59 UTC
Permalink
Post by Chris Mason
[ resend sorry if you get this twice ]
Post by Jan Kara
Hi Boaz,
Post by Boaz Harrosh
Post by Darrick J. Wong
Last summer there was a long thread entitled "Wrong DIF guard tag on ext2
write" (http://marc.info/?l=linux-scsi&m=127530531808556&w=2) that started a
discussion about how to deal with the situation where one program tells the
kernel to write a block to disk, the kernel computes the checksum of that data,
and then a second program begins writing to that same block before the disk HBA
can DMA the memory block, thereby causing the disk to complain about being sent
invalid checksums.
The brokenness is in ext2/3 if you'll use btrfs, xfs and I think late versions
of ext4 it should work much better. (If you still have problems please report
them, those FSs advertise stable pages write-out)
Do they? I've just checked ext4 and xfs and they don't seem to enforce
stable pages. They do lock the page (which implicitely happens in mm code
for any filesystem BTW) but this is not enough. You have to wait for
PageWriteback to get cleared and only btrfs does that.
Post by Boaz Harrosh
This problem is easily fixed at the FS layer or even at VFS, by overriding mk_write
and syncing with write-out for example by taking the page-lock. Currently each
FS is to itself because in VFS it would force the behaviour on FSs that it does
not make sense to.
Yes, it's easy to fix but at a performance cost for any application doing
frequent rewrites regardless whether integrity features are used or not.
And I don't think that's a good thing. I even remember someone measured the
hit last time this came up and it was rather noticeable.
Do you remember which workload this was? I do remember someone
mentioning a specific workload, but can't recall which one now. fsx is
definitely slower when we wait for writeback, but that's because it's
all evil inside.
Me too
I have been asking on many occasions on multiple mailing lists and LSF last
year, if any one did any benchmarks on this issue, and no one came forward.
So please if someone is hiding his resaults come and show us we want to see?

I've been Playing with this for my raid5 code and Actually the problem
is not that bad. I'm using tar to exercise that. What I can see that
I can actually wait on a single page every once in like 15-30 seconds
but never consecutively on multiple pages. Because usually you wait on
an IOed page but all the pages in an IO are completed together and the
reset of the modifications go through.

I could not find any measurable performance difference, it was all well inside
the margin of the test. But maybe tar is just bad test.
Post by Chris Mason
Post by Jan Kara
Post by Boaz Harrosh
Note that the proper solution does not copy any data, just forces the app to
wait before changing write-out pages.
I think that's up for discussion. In fact what is going to be faster
depends pretty much on your system config. If you have enough CPU/RAM
bandwidth compared to storage speed, you're better of doing copying. If
you can barely saturate storage with your CPU/RAM, waiting is probably
better for you.
Moreover if you do data copyout, you push the performance cost only on
users of the integrity feature which is nice. But on the other hand users
of integrity take the cost even if they are not doing rewrites.
A solution which is technically plausible and penalizing only rewrites
of data-integrity protected pages would be a use of shadow pages as Darrick
describes below. So I'd lean towards that long term. But for now I think
Darrick's solution is OK to make the integrity feature actually useful and
later someone can try something more clever.
Rewrites in flight should be very rare though, and I think the bouncing
is going to have a big impact on the intended workloads. It's not just
the cost of the copy, it's also the increased time as we beat on the
page allocator.
We're working on adding stable pages to ext34 and other filesystems
missing it. When the work is done we can benchmark and decide on the
tradeoffs.
Thanks, that would be nice. I bet the simple wait on write-out will beat
copy-everything , any day of the week.

Why don't we put some flags on the BDI that requests stable write-out
and devices with diff enabled or the likes of raid1/4/5/6 can turn
it on? (Also iscsi integrity checks as well)
Post by Chris Mason
-chris
Boaz
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Darrick J. Wong
2011-03-04 20:51:43 UTC
Permalink
Post by Jan Kara
Hi Boaz,
Post by Boaz Harrosh
Post by Darrick J. Wong
Last summer there was a long thread entitled "Wrong DIF guard tag on ext2
write" (http://marc.info/?l=linux-scsi&m=127530531808556&w=2) that started a
discussion about how to deal with the situation where one program tells the
kernel to write a block to disk, the kernel computes the checksum of that data,
and then a second program begins writing to that same block before the disk HBA
can DMA the memory block, thereby causing the disk to complain about being sent
invalid checksums.
The brokenness is in ext2/3 if you'll use btrfs, xfs and I think late versions
of ext4 it should work much better. (If you still have problems please report
them, those FSs advertise stable pages write-out)
Do they? I've just checked ext4 and xfs and they don't seem to enforce
stable pages. They do lock the page (which implicitely happens in mm code
for any filesystem BTW) but this is not enough. You have to wait for
PageWriteback to get cleared and only btrfs does that.
Post by Boaz Harrosh
This problem is easily fixed at the FS layer or even at VFS, by overriding mk_write
and syncing with write-out for example by taking the page-lock. Currently each
FS is to itself because in VFS it would force the behaviour on FSs that it does
not make sense to.
Yes, it's easy to fix but at a performance cost for any application doing
frequent rewrites regardless whether integrity features are used or not.
And I don't think that's a good thing. I even remember someone measured the
hit last time this came up and it was rather noticeable.
Post by Boaz Harrosh
Note that the proper solution does not copy any data, just forces the app to
wait before changing write-out pages.
I think that's up for discussion. In fact what is going to be faster
depends pretty much on your system config. If you have enough CPU/RAM
bandwidth compared to storage speed, you're better of doing copying. If
you can barely saturate storage with your CPU/RAM, waiting is probably
better for you.
Moreover if you do data copyout, you push the performance cost only on
users of the integrity feature which is nice. But on the other hand users
of integrity take the cost even if they are not doing rewrites.
A solution which is technically plausible and penalizing only rewrites
of data-integrity protected pages would be a use of shadow pages as Darrick
describes below. So I'd lean towards that long term. But for now I think
Darrick's solution is OK to make the integrity feature actually useful and
later someone can try something more clever.
Hmm. Any interest in pushing the page copy patch as an interim solution while
I work on getting the wait-on-writeback strategy to function? I agree it's not
the fastest solution, but at least it won't be running broken while I find the
faster solution(s).

(More on that writeback patch in a short while.)

--D
Christoph Hellwig
2011-03-04 20:53:15 UTC
Permalink
Post by Darrick J. Wong
Hmm. Any interest in pushing the page copy patch as an interim solution while
I work on getting the wait-on-writeback strategy to function? I agree it's not
the fastest solution, but at least it won't be running broken while I find the
faster solution(s).
It's not only slow but also butt ugly. NAK from me - we need to get
this completely right, not hack around it.
Andreas Dilger
2011-02-22 16:13:49 UTC
Permalink
Post by Darrick J. Wong
Last summer there was a long thread entitled "Wrong DIF guard tag on ext2
write" (http://marc.info/?l=linux-scsi&m=127530531808556&w=2) that started a
discussion about how to deal with the situation where one program tells the
kernel to write a block to disk, the kernel computes the checksum of that data,
and then a second program begins writing to that same block before the disk HBA
can DMA the memory block, thereby causing the disk to complain about being sent
invalid checksums.
I was able to write a
trivial program to trigger the write problem, I'm pretty sure that this has not
been fixed upstream. (FYI, using O_DIRECT still seems fine.)
Can you please attach your reproducer? IIRC it needed mmap() to hit this problem? Did you measure CPU usage during your testing?
Post by Darrick J. Wong
Below is a simple if naive solution: (ab)use the bounce buffering code to copy
the memory page just prior to calculating the checksum, and send the copy and
the checksum to the disk controller. With this patch applied, the invalid
guard tag messages go away. An optimization would be to perform the copy only
when memory contents change, but I wanted to ask peoples' opinions before
continuing. I don't imagine bounce buffering is particularly speedy, though I
haven't noticed any corruption errors or weirdness yet.
I don't like adding a data copy in the IO path at all. We are just looking to enable T10 DIF for Lustre and this would definitely hurt performance significantly, even though it isn't needed there at all (since the server side has proper locking of the pages to prevent multiple writers to the same page).
Post by Darrick J. Wong
Anyway, I'm mostly wondering: what do people think of this as a starting point
to fixing the DIF checksum problem?
I'd definitely prefer that the filesystem be in charge of deciding whether this is needed or not. If the use of the data copy can be constrained to only the minimum required cases (e.g. if fs checks for rewrite on page that is marked as Writeback and either copies or blocks until writeback is complete, as a mount option) that would be better. At that point we can compare on different hardware whether copying or blocking should be the default.

Cheers, Andreas
Martin K. Petersen
2011-02-22 16:40:51 UTC
Permalink
Andreas> I don't like adding a data copy in the IO path at all.

No thanks!


Andreas> I'd definitely prefer that the filesystem be in charge of
Andreas> deciding whether this is needed or not.

Absolutely. At the block layer we obviously have no idea whether the
filesystem is safe or not. So in my current tree the protection is only
generated if the relevant bio flag is set (unless the application
already added it, obviously).


Andreas> If the use of the data copy can be constrained to only the
Andreas> minimum required cases (e.g. if fs checks for rewrite on page
Andreas> that is marked as Writeback and either copies or blocks until
Andreas> writeback is complete, as a mount option) that would be
Andreas> better. At that point we can compare on different hardware
Andreas> whether copying or blocking should be the default.

Agreed.

As Chris mentioned we've got somebody on our team working through the
filesystem issues now. I'm hoping we can provide a status update at
LSF2011.
--
Martin K. Petersen Oracle Linux Engineering
Darrick J. Wong
2011-02-22 19:45:38 UTC
Permalink
Post by Andreas Dilger
Post by Darrick J. Wong
Last summer there was a long thread entitled "Wrong DIF guard tag on ext2
write" (http://marc.info/?l=linux-scsi&m=127530531808556&w=2) that started a
discussion about how to deal with the situation where one program tells the
kernel to write a block to disk, the kernel computes the checksum of that data,
and then a second program begins writing to that same block before the disk HBA
can DMA the memory block, thereby causing the disk to complain about being sent
invalid checksums.
I was able to write a
trivial program to trigger the write problem, I'm pretty sure that this has not
been fixed upstream. (FYI, using O_DIRECT still seems fine.)
Can you please attach your reproducer? IIRC it needed mmap() to hit this
problem? Did you measure CPU usage during your testing?
I didn't need mmap; a lot of threads using write() was enough. (The reproducer
program does have a mmap mode though). Basically it creates a lot of threads
to write small blobs to random offsets in a file, with optional mmap, dio, and
sync options.

CPU usage was 100% the entire time since there were 64 threads running on 4
CPUs. With just write() mode about half that was userspace and the other half
was kernel. With write and mmap the balance became closer to 80/20.

The program is attached below. It can be built with a simple "cc -o wac wac.c".

The invocation that I was using to produce the errors was:

./wac -l65536 -n64 -r /mnt/junk

This creates a file that is 64K (16 pages) long and starts 64 threads that will
write() a small buffer and then call sync_file_range to force IO to happen.
With that I get a steady stream of DIF checksum errors on the console. If -r
is omitted they happen about once every commit= seconds.
Post by Andreas Dilger
Post by Darrick J. Wong
Below is a simple if naive solution: (ab)use the bounce buffering code to copy
the memory page just prior to calculating the checksum, and send the copy and
the checksum to the disk controller. With this patch applied, the invalid
guard tag messages go away. An optimization would be to perform the copy only
when memory contents change, but I wanted to ask peoples' opinions before
continuing. I don't imagine bounce buffering is particularly speedy, though I
haven't noticed any corruption errors or weirdness yet.
I don't like adding a data copy in the IO path at all. We are just looking to
enable T10 DIF for Lustre and this would definitely hurt performance
significantly, even though it isn't needed there at all (since the server
side has proper locking of the pages to prevent multiple writers to the same
page).
Post by Darrick J. Wong
Anyway, I'm mostly wondering: what do people think of this as a starting point
to fixing the DIF checksum problem?
I'd definitely prefer that the filesystem be in charge of deciding whether
this is needed or not. If the use of the data copy can be constrained to only
the minimum required cases (e.g. if fs checks for rewrite on page that is
marked as Writeback and either copies or blocks until writeback is complete,
as a mount option) that would be better. At that point we can compare on
different hardware whether copying or blocking should be the default.
Agreed. I too am curious to study which circumstances favor copying vs
blocking.
Post by Andreas Dilger
Cheers, Andreas
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
/*
* Write-After-Checksum reproducer(?) program
* Copyright (C) 2011 IBM. All rights reserved.
* This program is licensed under the GPLv2.
*/
#define _XOPEN_SOURCE 600
#define _FILE_OFFSET_BITS 64
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <sys/mman.h>
#include <string.h>
#define __USE_GNU
#include <fcntl.h>
#include <errno.h>
#include <inttypes.h>

#define SYNC_RANGE 1
#define SYNC_FILE 2

#define DEFAULT_BUFSIZE 4096
static uint32_t bufsize = DEFAULT_BUFSIZE;

void help(const char *pname)
{
printf("Usage: %s [-n threads] [-m threads] [-d] [-b blocksize] [-r] [-f] -l filesize filename\n", pname);
printf("-b Size of a memory page.\n");
printf("-d Use direct I/O.\n");
printf("-l Desired file size.\n");
printf("-n Use this many write() threads.\n");
printf("-m Use this many mmap write threads.\n");
printf("-s Synchronous disk writes.\n");
printf("-r Use sync_file_range after write.\n");
printf("-f fsync after write.\n");
}

int seed_random(void) {
int fp;
long seed;

fp = open("/dev/urandom", O_RDONLY);
if (fp < 0) {
perror("/dev/urandom");
return 0;
}

if (read(fp, &seed, sizeof(seed)) != sizeof(seed)) {
perror("read random seed");
return 0;
}

close(fp);
srand(seed);

return 1;
}

uint64_t get_randnum(uint64_t min, uint64_t max) {
return (min + (uint64_t)((double)(max - min) * (rand() / (RAND_MAX + 1.0))));
}

static uint64_t get_randnum_align(uint64_t min, uint64_t max, uint64_t align) {
return (min + (uint64_t)((double)(max - min) * (rand() / (RAND_MAX + 1.0)))) &
~(align - 1);
}

int write_junk(const char *fname, int flags, int sync_options, uint64_t file_size)
{
int fd, len;
uint64_t offset, generation = 0;
char *buf;

len = posix_memalign((void **)&buf, bufsize, bufsize);
if (len) {
errno = len;
perror("alloc");
return 66;
}

fd = open(fname, flags | O_WRONLY);
if (fd < 1) {
perror(fname);
return 64;
}

while (1) {
len = snprintf(buf, bufsize - 1, "%d - %"PRIu64, getpid(), generation++);
if (flags & O_DIRECT) {
len = bufsize;
offset = get_randnum_align(0, file_size - len, bufsize);
} else {
offset = get_randnum(0, file_size - len);
}

if (pwrite(fd, buf, len, offset) < 0) {
perror("pwrite");
close(fd);
free(buf);
return 65;
}
if ((sync_options & SYNC_RANGE) && sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER) < 0) {
perror("sync_file_range");
close(fd);
free(buf);
return 67;
}
if ((sync_options & SYNC_FILE) && fsync(fd)) {
perror("fsync");
close(fd);
free(buf);
return 68;
}
}

return 0;
}

int mmap_junk(const char *fname, int flags, int sync_options, uint64_t file_size)
{
int fd, len;
uint64_t offset, generation = 0;
char *buf, *map;
long page_size;

page_size = sysconf(_SC_PAGESIZE);
if (page_size < 0) {
perror("_SC_PAGESIZE");
return 101;
}

fd = open(fname, flags | O_RDWR);
if (fd < 1) {
perror(fname);
return 96;
}

len = posix_memalign((void **)&buf, bufsize, bufsize);
if (len) {
errno = len;
perror("alloc");
return 102;
}

map = mmap(NULL, file_size, PROT_WRITE, MAP_SHARED, fd, 0);
if (map == MAP_FAILED) {
perror(fname);
return 97;
}

while (1) {
len = snprintf(buf, bufsize - 1, "%d - %"PRIu64, getpid(), generation++);
if (flags & O_DIRECT) {
len = bufsize;
offset = get_randnum_align(0, file_size - len, bufsize);
} else {
offset = get_randnum(0, file_size - len);
}

memcpy(map + offset, buf, len);
len += offset & (page_size - 1);
offset &= ~(page_size - 1);
if ((sync_options & SYNC_RANGE) && msync(map + offset, len, MS_SYNC | MS_INVALIDATE)) {
perror("msync");
munmap(map, file_size);
close(fd);
free(buf);
return 99;
}
if ((sync_options & SYNC_FILE) && fsync(fd)) {
perror("fsync");
munmap(map, file_size);
close(fd);
free(buf);
return 100;
}
}

return 0;
}

int main(int argc, char *argv[])
{
int opt, fd;
unsigned long i, mthreads = 0, nthreads = 1;
char *fname = NULL;
int flags = 0, sync_options = 0;
uint64_t file_size = 0;
pid_t pid;
int status;

while ((opt = getopt(argc, argv, "b:dn:l:srfm:")) != -1) {
switch (opt) {
case 'd':
flags |= O_DIRECT;
break;
case 'n':
nthreads = strtoul(optarg, NULL, 0);
break;
case 'm':
mthreads = strtoul(optarg, NULL, 0);
break;
case 'l':
file_size = strtoull(optarg, NULL, 0);
break;
case 's':
flags |= O_SYNC;
break;
case 'b':
bufsize = strtoul(optarg, NULL, 0);
break;
case 'r':
sync_options |= SYNC_RANGE;
break;
case 'f':
sync_options |= SYNC_FILE;
break;
default:
help(argv[0]);
return 4;
}
}

if (optind != argc - 1 || file_size < 1) {
help(argv[0]);
return 1;
}

fname = argv[optind];

if (!seed_random())
return 2;

// truncate first
fd = open(fname, flags | O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
if (fd < 0) {
perror(fname);
return 3;
}
close(fd);

// spawn threads and go to town
if (nthreads == 1)
return write_junk(fname, flags, sync_options, file_size);

for (i = 0; i < nthreads; i++) {
pid = fork();
if (!pid)
return write_junk(fname, flags, sync_options, file_size);
}

for (i = 0; i < mthreads; i++) {
pid = fork();
if (!pid)
return mmap_junk(fname, flags, sync_options, file_size);
}

for (i = 0; i < mthreads + nthreads; i++) {
pid = wait(&status);
if (WIFEXITED(status))
printf("Child %d exited with status %d\n", pid, WEXITSTATUS(status));
}

return 0;
}
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Dave Chinner
2011-02-22 22:53:51 UTC
Permalink
Post by Darrick J. Wong
Post by Andreas Dilger
Post by Darrick J. Wong
Last summer there was a long thread entitled "Wrong DIF guard tag on ext2
write" (http://marc.info/?l=linux-scsi&m=127530531808556&w=2) that started a
discussion about how to deal with the situation where one program tells the
kernel to write a block to disk, the kernel computes the checksum of that data,
and then a second program begins writing to that same block before the disk HBA
can DMA the memory block, thereby causing the disk to complain about being sent
invalid checksums.
I was able to write a
trivial program to trigger the write problem, I'm pretty sure that this has not
been fixed upstream. (FYI, using O_DIRECT still seems fine.)
Can you please attach your reproducer? IIRC it needed mmap() to hit this
problem? Did you measure CPU usage during your testing?
I didn't need mmap; a lot of threads using write() was enough. (The reproducer
program does have a mmap mode though). Basically it creates a lot of threads
to write small blobs to random offsets in a file, with optional mmap, dio, and
sync options.
*nod*

Both mmap and write paths need to be block on
wait_for_page_writeback(page) once they have a locked page ready for
modification. btrfs does this in btrfs_page_mkwrite() and
prepare_pages(), so adding similar calls into block_page_mkwrite()
and grab_cache_page_write_begin() would probably fix the problem for
the other major filesystems....
Post by Darrick J. Wong
Agreed. I too am curious to study which circumstances favor copying vs
blocking.
IMO blocking is generally preferable in high throughput threaded
workloads as there is always another thread that can do useful work
while we wait for IO to complete. Most use cases for DIF center
around high throughput environments....

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Martin K. Petersen
2011-02-23 16:24:50 UTC
Permalink
Post by Darrick J. Wong
Agreed. I too am curious to study which circumstances favor copying
vs blocking.
Dave> IMO blocking is generally preferable in high throughput threaded
Dave> workloads as there is always another thread that can do useful
Dave> work while we wait for IO to complete. Most use cases for DIF
Dave> center around high throughput environments....

Yeah.

A while back I did a bunch of tests with a liberal amount of
wait_on_page_writeback() calls added to (I think) ext2, ext3, and
XFS. For my regular workloads there was no measurable change (kernel
builds, random database and I/O tests). I'm sure we'll unearth some apps
that will suffer when DI is on but so far I'm not too worried about
blocking in the data path.

My main concern is wrt. metadata because that's where extN really
hurts. Simple test: Unpack a kernel tarball and watch the directory
block fireworks. Given how frequently those buffers get hit I'm sure
blocking would cause performance to tank completely. I looked into
fixing this in ext2 but I had to stop because my eyes were bleeding.
--
Martin K. Petersen Oracle Linux Engineering
Jan Kara
2011-02-24 16:43:35 UTC
Permalink
Post by Martin K. Petersen
Post by Darrick J. Wong
Agreed. I too am curious to study which circumstances favor copying
vs blocking.
Dave> IMO blocking is generally preferable in high throughput threaded
Dave> workloads as there is always another thread that can do useful
Dave> work while we wait for IO to complete. Most use cases for DIF
Dave> center around high throughput environments....
Yeah.
A while back I did a bunch of tests with a liberal amount of
wait_on_page_writeback() calls added to (I think) ext2, ext3, and
XFS. For my regular workloads there was no measurable change (kernel
builds, random database and I/O tests). I'm sure we'll unearth some apps
that will suffer when DI is on but so far I'm not too worried about
blocking in the data path.
My main concern is wrt. metadata because that's where extN really
hurts. Simple test: Unpack a kernel tarball and watch the directory
block fireworks. Given how frequently those buffers get hit I'm sure
blocking would cause performance to tank completely. I looked into
fixing this in ext2 but I had to stop because my eyes were bleeding.
Ext2 is problematic yes, but ext[34] should be OK because we do
metadata copy anyway because of journalling. So for ext[34] you shouldn't
need any additional metadata protection since JBD does it for you (apart
from nojournal mode of ext4 of course).

Honza
--
Jan Kara <***@suse.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Christoph Hellwig
2011-02-28 08:49:29 UTC
Permalink
Post by Andreas Dilger
Can you please attach your reproducer? IIRC it needed mmap() to hit this problem? Did you measure CPU usage during your testing?
Running XFSQA on a DIF target reproduces it. Even scsi_debug in the
right mode is enough.
Martin K. Petersen
2011-02-22 16:45:44 UTC
Permalink
Darrick> Unfortunately, I didn't get a sense that any sort of consensus
Darrick> had been reached

We had a session on this topic at the storage workshop in Auguest. The
consensus was that filesystems really need to handle this.

Also, DIX is only the tip of the iceberg. Many other impending
technologies feature checksums and require pages to be stable during I/O
due to checksumming, encryption and so on.

The VM is already trying to do the right thing. We just need the
relevant filesystems to catch up.
--
Martin K. Petersen Oracle Linux Engineering
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jan Kara
2011-02-24 16:47:58 UTC
Permalink
Post by Martin K. Petersen
Also, DIX is only the tip of the iceberg. Many other impending
technologies feature checksums and require pages to be stable during I/O
due to checksumming, encryption and so on.
The VM is already trying to do the right thing. We just need the
relevant filesystems to catch up.
ocfs2 handles stable metadata for its checksums when feeding
things to the journal. If we're doing pagecache-based I/O, is the
pagecache going to help here for data?
Data is much easier than metadata. All you really need is to wait on
writeback in file_write, wait on writeback in page_mkwrite, and make
sure you don't free blocks back to the allocator that are actively under
IO.
I expect the hard part to be jbd and metadata in ext34.
But JBD already has to do data copy if a buffer is going to be modified
before/while it is written to the journal. So we should alredy do all that
is needed for metadata. I don't say there aren't any bugs as they could be
triggered only by crashing at the wrong moment and observing fs corruption.
But most of the work should be there...

Honza
--
Jan Kara <***@suse.cz>
SUSE Labs, CR
Chris Mason
2011-02-24 17:37:53 UTC
Permalink
Post by Jan Kara
Post by Martin K. Petersen
Also, DIX is only the tip of the iceberg. Many other impending
technologies feature checksums and require pages to be stable during I/O
due to checksumming, encryption and so on.
The VM is already trying to do the right thing. We just need the
relevant filesystems to catch up.
ocfs2 handles stable metadata for its checksums when feeding
things to the journal. If we're doing pagecache-based I/O, is the
pagecache going to help here for data?
Data is much easier than metadata. All you really need is to wait on
writeback in file_write, wait on writeback in page_mkwrite, and make
sure you don't free blocks back to the allocator that are actively under
IO.
I expect the hard part to be jbd and metadata in ext34.
But JBD already has to do data copy if a buffer is going to be modified
before/while it is written to the journal. So we should alredy do all that
is needed for metadata. I don't say there aren't any bugs as they could be
triggered only by crashing at the wrong moment and observing fs corruption.
But most of the work should be there...
Most of it is there, but there are always little bits and pieces. The
ext4 journal csumming code was one semi-recent example where we found
metadata changing in flight.

A big part of testing this is getting some way to detect the bugs
without dif/dix. With btrfs I have patches to do set_memory_ro on
pages once I've don the crc, hopefully we can generalize that idea or
some up with something smarter.

-chris
Darrick J. Wong
2011-02-24 18:27:32 UTC
Permalink
Post by Chris Mason
Post by Jan Kara
Post by Martin K. Petersen
Also, DIX is only the tip of the iceberg. Many other impending
technologies feature checksums and require pages to be stable during I/O
due to checksumming, encryption and so on.
The VM is already trying to do the right thing. We just need the
relevant filesystems to catch up.
ocfs2 handles stable metadata for its checksums when feeding
things to the journal. If we're doing pagecache-based I/O, is the
pagecache going to help here for data?
Data is much easier than metadata. All you really need is to wait on
writeback in file_write, wait on writeback in page_mkwrite, and make
sure you don't free blocks back to the allocator that are actively under
IO.
I expect the hard part to be jbd and metadata in ext34.
But JBD already has to do data copy if a buffer is going to be modified
before/while it is written to the journal. So we should alredy do all that
is needed for metadata. I don't say there aren't any bugs as they could be
triggered only by crashing at the wrong moment and observing fs corruption.
But most of the work should be there...
Most of it is there, but there are always little bits and pieces. The
ext4 journal csumming code was one semi-recent example where we found
metadata changing in flight.
A big part of testing this is getting some way to detect the bugs
without dif/dix. With btrfs I have patches to do set_memory_ro on
pages once I've don the crc, hopefully we can generalize that idea or
some up with something smarter.
Right now I'm faking it with modprobe scsi_debug ato=1 guard=1 dif=3 dix=199.

Hm, would you mind sharing those patches? I've been working on a second patch
to do the wait-on-writeback per everyone's suggestions, but I still see the
occasional corruption error as soon as I enable the mmap write case and covet
some more debugging tools. It does seem to be working for the pure pwrite()
case. :)

--D
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Chris Mason
2011-02-28 12:54:05 UTC
Permalink
Post by Darrick J. Wong
Post by Chris Mason
Post by Jan Kara
Post by Martin K. Petersen
Also, DIX is only the tip of the iceberg. Many other impending
technologies feature checksums and require pages to be stable during I/O
due to checksumming, encryption and so on.
The VM is already trying to do the right thing. We just need the
relevant filesystems to catch up.
ocfs2 handles stable metadata for its checksums when feeding
things to the journal. If we're doing pagecache-based I/O, is the
pagecache going to help here for data?
Data is much easier than metadata. All you really need is to wait on
writeback in file_write, wait on writeback in page_mkwrite, and make
sure you don't free blocks back to the allocator that are actively under
IO.
I expect the hard part to be jbd and metadata in ext34.
But JBD already has to do data copy if a buffer is going to be modified
before/while it is written to the journal. So we should alredy do all that
is needed for metadata. I don't say there aren't any bugs as they could be
triggered only by crashing at the wrong moment and observing fs corruption.
But most of the work should be there...
Most of it is there, but there are always little bits and pieces. The
ext4 journal csumming code was one semi-recent example where we found
metadata changing in flight.
A big part of testing this is getting some way to detect the bugs
without dif/dix. With btrfs I have patches to do set_memory_ro on
pages once I've don the crc, hopefully we can generalize that idea or
some up with something smarter.
Right now I'm faking it with modprobe scsi_debug ato=1 guard=1 dif=3 dix=199.
Hm, would you mind sharing those patches? I've been working on a second patch
to do the wait-on-writeback per everyone's suggestions, but I still see the
occasional corruption error as soon as I enable the mmap write case and covet
some more debugging tools. It does seem to be working for the pure pwrite()
case. :)
Here's an ext4 version of the debugging patch. It's a few years old but
it'll give you the idea. This only covers metadata pages.

Looks like I hacked the btrfs version up and didn't keep the original,
I'll have to rework it, I was trying to use it for the big corruption I
fixed recently and made a bunch of changes.

For data if mmap is giving you trouble you need to wait on writeback in
page_mkwrite, with the page locked. fs/btrfs/inode.c has our
page_mkwrite, which uses wait_on_page_writeback() and also the btrfs
ordered write code. But for the other filesystems, waiting on writeback
should be enough.

-chris

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index d4cfd6d..9f278db 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -28,6 +28,16 @@
#include <linux/blkdev.h>
#include <trace/events/jbd2.h>

+int set_memory_ro(unsigned long addr, int numpages);
+
+static int set_page_ro(struct page *page)
+{
+ unsigned long addr = (unsigned long)page_address(page);
+ if (PageHighMem(page))
+ return 0;
+ return set_memory_ro(addr, 1);
+}
+
/*
* Default IO end handler for temporary BJ_IO buffer_heads.
*/
@@ -639,6 +654,8 @@ void jbd2_journal_commit_transaction(journal_t *journal)
set_bit(BH_JWrite, &jh2bh(new_jh)->b_state);
wbuf[bufs++] = jh2bh(new_jh);

+ set_page_ro(jh2bh(new_jh)->b_page);
+
/* Record the new block's tag in the current descriptor
buffer */

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index a051270..153888e 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -28,6 +28,15 @@
#include <linux/hrtimer.h>

static void __jbd2_journal_temp_unlink_buffer(struct journal_head *jh);
+int set_memory_rw(unsigned long addr, int numpages);
+static int set_page_rw(struct page *page)
+{
+ unsigned long addr = (unsigned long)page_address(page);
+ if (PageHighMem(page))
+ return 0;
+ return set_memory_rw(addr, 1);
+}
+

/*
* jbd2_get_transaction: obtain a new transaction_t object.
@@ -1474,9 +1487,11 @@ void __jbd2_journal_temp_unlink_buffer(struct journal_head *jh)
break;
case BJ_IO:
list = &transaction->t_iobuf_list;
+ set_page_rw(bh->b_page);
break;
case BJ_Shadow:
list = &transaction->t_shadow_list;
+ set_page_rw(bh->b_page);
break;
case BJ_LogCtl:
list = &transaction->t_log_list;
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Darrick J. Wong
2011-03-04 21:07:24 UTC
Permalink
Post by Chris Mason
Post by Darrick J. Wong
Post by Chris Mason
Post by Jan Kara
Post by Martin K. Petersen
Also, DIX is only the tip of the iceberg. Many other impending
technologies feature checksums and require pages to be stable during I/O
due to checksumming, encryption and so on.
The VM is already trying to do the right thing. We just need the
relevant filesystems to catch up.
ocfs2 handles stable metadata for its checksums when feeding
things to the journal. If we're doing pagecache-based I/O, is the
pagecache going to help here for data?
Data is much easier than metadata. All you really need is to wait on
writeback in file_write, wait on writeback in page_mkwrite, and make
sure you don't free blocks back to the allocator that are actively under
IO.
I expect the hard part to be jbd and metadata in ext34.
But JBD already has to do data copy if a buffer is going to be modified
before/while it is written to the journal. So we should alredy do all that
is needed for metadata. I don't say there aren't any bugs as they could be
triggered only by crashing at the wrong moment and observing fs corruption.
But most of the work should be there...
Most of it is there, but there are always little bits and pieces. The
ext4 journal csumming code was one semi-recent example where we found
metadata changing in flight.
A big part of testing this is getting some way to detect the bugs
without dif/dix. With btrfs I have patches to do set_memory_ro on
pages once I've don the crc, hopefully we can generalize that idea or
some up with something smarter.
Right now I'm faking it with modprobe scsi_debug ato=1 guard=1 dif=3 dix=199.
Hm, would you mind sharing those patches? I've been working on a second patch
to do the wait-on-writeback per everyone's suggestions, but I still see the
occasional corruption error as soon as I enable the mmap write case and covet
some more debugging tools. It does seem to be working for the pure pwrite()
case. :)
Here's an ext4 version of the debugging patch. It's a few years old but
it'll give you the idea. This only covers metadata pages.
Looks like I hacked the btrfs version up and didn't keep the original,
I'll have to rework it, I was trying to use it for the big corruption I
fixed recently and made a bunch of changes.
For data if mmap is giving you trouble you need to wait on writeback in
page_mkwrite, with the page locked. fs/btrfs/inode.c has our
page_mkwrite, which uses wait_on_page_writeback() and also the btrfs
ordered write code. But for the other filesystems, waiting on writeback
should be enough.
Ok, here's what I have so far. I took everyone's suggestions of where to add
calls to wait_on_page_writeback, which seems to handle the multiple-write case
adequately. Unfortunately, it is still possible to generate checksum errors by
scribbling furiously on a mmap'd region, even after adding the writeback wait
in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
removing its wait_for_page_writeback call, so I suspect there's a bit more
going on in btrfs than I've been able to figure out.

The set_memory_ro debugging trick didn't ferret out any write paths that I
didn't catch... though it did have the effect of causing occasional fsync()
deadlocks. I suppose I could sprinkle in a few more of those write calls to
see what happens.

Either way, I'm emailing to ask everyone's advice since I've run out of ideas.
Or: Did I miss something?

Thanks all for the feedback so far!

--
fs: Wait for page writeback when rewrite detected

Signed-off-by: Darrick J. Wong <***@us.ibm.com>
---

fs/buffer.c | 4 +++-
fs/ext4/inode.c | 3 +++
mm/filemap.c | 15 +++++++++++++--
3 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 2219a76..39e934c 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2379,8 +2379,10 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
ret = VM_FAULT_OOM;
else /* -ENOSPC, -EIO, etc */
ret = VM_FAULT_SIGBUS;
- } else
+ } else {
+ wait_on_page_writeback(page);
ret = VM_FAULT_LOCKED;
+ }

out:
return ret;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9f7f9e4..2364704 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2730,12 +2730,15 @@ static int ext4_writepage(struct page *page,
struct inode *inode = page->mapping->host;

trace_ext4_writepage(inode, page);
+lock_page(page);
size = i_size_read(inode);
if (page->index == size >> PAGE_CACHE_SHIFT)
len = size & ~PAGE_CACHE_MASK;
else
len = PAGE_CACHE_SIZE;

+wait_on_page_writeback(page);
+
/*
* If the page does not have buffers (for whatever reason),
* try to create them using __block_write_begin. If this
diff --git a/mm/filemap.c b/mm/filemap.c
index 83a45d3..f201d80 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2217,8 +2217,8 @@ EXPORT_SYMBOL(generic_file_direct_write);
* Find or create a page at the given pagecache position. Return the locked
* page. This function is specifically for buffered writes.
*/
-struct page *grab_cache_page_write_begin(struct address_space *mapping,
- pgoff_t index, unsigned flags)
+struct page *__grab_cache_page_write_begin(struct address_space *mapping,
+ pgoff_t index, unsigned flags)
{
int status;
struct page *page;
@@ -2243,6 +2243,17 @@ repeat:
}
return page;
}
+struct page *grab_cache_page_write_begin(struct address_space *mapping,
+ pgoff_t index, unsigned flags)
+{
+ struct page *p;
+
+ p = __grab_cache_page_write_begin(mapping, index, flags);
+ if (p)
+ wait_on_page_writeback(p);
+
+ return p;
+}
EXPORT_SYMBOL(grab_cache_page_write_begin);

static ssize_t generic_perform_write(struct file *file,
Andreas Dilger
2011-03-04 22:22:30 UTC
Permalink
Post by Darrick J. Wong
Ok, here's what I have so far. I took everyone's suggestions of where to add
calls to wait_on_page_writeback, which seems to handle the multiple-write case
adequately. Unfortunately, it is still possible to generate checksum errors by
scribbling furiously on a mmap'd region, even after adding the writeback wait
in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
removing its wait_for_page_writeback call, so I suspect there's a bit more
going on in btrfs than I've been able to figure out.
Did you add this to ext4_page_mkwrite() or only ext4_writepage()? It wasn't clear from your description above.
Post by Darrick J. Wong
The set_memory_ro debugging trick didn't ferret out any write paths that I
didn't catch... though it did have the effect of causing occasional fsync()
deadlocks. I suppose I could sprinkle in a few more of those write calls to
see what happens.
Either way, I'm emailing to ask everyone's advice since I've run out of ideas.
Or: Did I miss something?
Thanks all for the feedback so far!
--
fs: Wait for page writeback when rewrite detected
---
fs/buffer.c | 4 +++-
fs/ext4/inode.c | 3 +++
mm/filemap.c | 15 +++++++++++++--
3 files changed, 19 insertions(+), 3 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 2219a76..39e934c 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2379,8 +2379,10 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
ret = VM_FAULT_OOM;
else /* -ENOSPC, -EIO, etc */
ret = VM_FAULT_SIGBUS;
- } else
+ } else {
+ wait_on_page_writeback(page);
ret = VM_FAULT_LOCKED;
+ }
return ret;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9f7f9e4..2364704 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2730,12 +2730,15 @@ static int ext4_writepage(struct page *page,
struct inode *inode = page->mapping->host;
trace_ext4_writepage(inode, page);
+lock_page(page);
size = i_size_read(inode);
if (page->index == size >> PAGE_CACHE_SHIFT)
len = size & ~PAGE_CACHE_MASK;
else
len = PAGE_CACHE_SIZE;
+wait_on_page_writeback(page);
+
/*
* If the page does not have buffers (for whatever reason),
* try to create them using __block_write_begin. If this
diff --git a/mm/filemap.c b/mm/filemap.c
index 83a45d3..f201d80 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2217,8 +2217,8 @@ EXPORT_SYMBOL(generic_file_direct_write);
* Find or create a page at the given pagecache position. Return the locked
* page. This function is specifically for buffered writes.
*/
-struct page *grab_cache_page_write_begin(struct address_space *mapping,
- pgoff_t index, unsigned flags)
+struct page *__grab_cache_page_write_begin(struct address_space *mapping,
+ pgoff_t index, unsigned flags)
{
int status;
struct page *page;
}
return page;
}
+struct page *grab_cache_page_write_begin(struct address_space *mapping,
+ pgoff_t index, unsigned flags)
+{
+ struct page *p;
+
+ p = __grab_cache_page_write_begin(mapping, index, flags);
+ if (p)
+ wait_on_page_writeback(p);
+
+ return p;
+}
EXPORT_SYMBOL(grab_cache_page_write_begin);
static ssize_t generic_perform_write(struct file *file,
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Cheers, Andreas
Chris Mason
2011-03-07 21:12:18 UTC
Permalink
Post by Darrick J. Wong
Post by Chris Mason
Post by Darrick J. Wong
Post by Chris Mason
Post by Jan Kara
Post by Martin K. Petersen
Also, DIX is only the tip of the iceberg. Many other impending
technologies feature checksums and require pages to be stable during I/O
due to checksumming, encryption and so on.
The VM is already trying to do the right thing. We just need the
relevant filesystems to catch up.
ocfs2 handles stable metadata for its checksums when feeding
things to the journal. If we're doing pagecache-based I/O, is the
pagecache going to help here for data?
Data is much easier than metadata. All you really need is to wait on
writeback in file_write, wait on writeback in page_mkwrite, and make
sure you don't free blocks back to the allocator that are actively under
IO.
I expect the hard part to be jbd and metadata in ext34.
But JBD already has to do data copy if a buffer is going to be modified
before/while it is written to the journal. So we should alredy do all that
is needed for metadata. I don't say there aren't any bugs as they could be
triggered only by crashing at the wrong moment and observing fs corruption.
But most of the work should be there...
Most of it is there, but there are always little bits and pieces. The
ext4 journal csumming code was one semi-recent example where we found
metadata changing in flight.
A big part of testing this is getting some way to detect the bugs
without dif/dix. With btrfs I have patches to do set_memory_ro on
pages once I've don the crc, hopefully we can generalize that idea or
some up with something smarter.
Right now I'm faking it with modprobe scsi_debug ato=1 guard=1 dif=3 dix=199.
Hm, would you mind sharing those patches? I've been working on a second patch
to do the wait-on-writeback per everyone's suggestions, but I still see the
occasional corruption error as soon as I enable the mmap write case and covet
some more debugging tools. It does seem to be working for the pure pwrite()
case. :)
Here's an ext4 version of the debugging patch. It's a few years old but
it'll give you the idea. This only covers metadata pages.
Looks like I hacked the btrfs version up and didn't keep the original,
I'll have to rework it, I was trying to use it for the big corruption I
fixed recently and made a bunch of changes.
For data if mmap is giving you trouble you need to wait on writeback in
page_mkwrite, with the page locked. fs/btrfs/inode.c has our
page_mkwrite, which uses wait_on_page_writeback() and also the btrfs
ordered write code. But for the other filesystems, waiting on writeback
should be enough.
Ok, here's what I have so far. I took everyone's suggestions of where to add
calls to wait_on_page_writeback, which seems to handle the multiple-write case
adequately. Unfortunately, it is still possible to generate checksum errors by
scribbling furiously on a mmap'd region, even after adding the writeback wait
in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
removing its wait_for_page_writeback call, so I suspect there's a bit more
going on in btrfs than I've been able to figure out.
Hi, thanks for working on this.

Btrfs waits for page writeback but then it also waits for its own
ordered extents, which is basically tracks writeback and some extra
btrfs accounting. If you take out the btrfs_lookup_ordered_extent bits,
and the waiting on page writeback, you'll see fireworks.

Your patch changes ext4_writepage, but by the time writepage is called
we should already have waited for PageWriteback. So that should be
BUG_ON(PageWriteback(page)). And ext4_writepage should be called with
the page locked...so your extra lock_page call should be deadlocking.
Maybe all the writes are going through writepages instead?

You really want to be changing ext4_page_mkwrite, that's where all the
magic for mmap really happens. The block_page_mkwrite call is a good
start, but ext4 doesn't use it.

-chris
Dave Chinner
2011-03-08 04:56:26 UTC
Permalink
Post by Darrick J. Wong
Post by Chris Mason
Post by Darrick J. Wong
Post by Chris Mason
Post by Jan Kara
Post by Martin K. Petersen
Also, DIX is only the tip of the iceberg. Many other impending
technologies feature checksums and require pages to be stable during I/O
due to checksumming, encryption and so on.
The VM is already trying to do the right thing. We just need the
relevant filesystems to catch up.
ocfs2 handles stable metadata for its checksums when feeding
things to the journal. If we're doing pagecache-based I/O, is the
pagecache going to help here for data?
Data is much easier than metadata. All you really need is to wait on
writeback in file_write, wait on writeback in page_mkwrite, and make
sure you don't free blocks back to the allocator that are actively under
IO.
I expect the hard part to be jbd and metadata in ext34.
But JBD already has to do data copy if a buffer is going to be modified
before/while it is written to the journal. So we should alredy do all that
is needed for metadata. I don't say there aren't any bugs as they could be
triggered only by crashing at the wrong moment and observing fs corruption.
But most of the work should be there...
Most of it is there, but there are always little bits and pieces. The
ext4 journal csumming code was one semi-recent example where we found
metadata changing in flight.
A big part of testing this is getting some way to detect the bugs
without dif/dix. With btrfs I have patches to do set_memory_ro on
pages once I've don the crc, hopefully we can generalize that idea or
some up with something smarter.
Right now I'm faking it with modprobe scsi_debug ato=1 guard=1 dif=3 dix=199.
Hm, would you mind sharing those patches? I've been working on a second patch
to do the wait-on-writeback per everyone's suggestions, but I still see the
occasional corruption error as soon as I enable the mmap write case and covet
some more debugging tools. It does seem to be working for the pure pwrite()
case. :)
Here's an ext4 version of the debugging patch. It's a few years old but
it'll give you the idea. This only covers metadata pages.
Looks like I hacked the btrfs version up and didn't keep the original,
I'll have to rework it, I was trying to use it for the big corruption I
fixed recently and made a bunch of changes.
For data if mmap is giving you trouble you need to wait on writeback in
page_mkwrite, with the page locked. fs/btrfs/inode.c has our
page_mkwrite, which uses wait_on_page_writeback() and also the btrfs
ordered write code. But for the other filesystems, waiting on writeback
should be enough.
Ok, here's what I have so far. I took everyone's suggestions of where to add
calls to wait_on_page_writeback, which seems to handle the multiple-write case
adequately. Unfortunately, it is still possible to generate checksum errors by
scribbling furiously on a mmap'd region, even after adding the writeback wait
in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
removing its wait_for_page_writeback call, so I suspect there's a bit more
going on in btrfs than I've been able to figure out.
The set_memory_ro debugging trick didn't ferret out any write paths that I
didn't catch... though it did have the effect of causing occasional fsync()
deadlocks. I suppose I could sprinkle in a few more of those write calls to
see what happens.
Either way, I'm emailing to ask everyone's advice since I've run out of ideas.
Or: Did I miss something?
Thanks all for the feedback so far!
--
fs: Wait for page writeback when rewrite detected
---
fs/buffer.c | 4 +++-
fs/ext4/inode.c | 3 +++
mm/filemap.c | 15 +++++++++++++--
3 files changed, 19 insertions(+), 3 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 2219a76..39e934c 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2379,8 +2379,10 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
ret = VM_FAULT_OOM;
else /* -ENOSPC, -EIO, etc */
ret = VM_FAULT_SIGBUS;
- } else
+ } else {
+ wait_on_page_writeback(page);
ret = VM_FAULT_LOCKED;
+ }
I think this needs to wait before the __block_write_begin() call,
not after it. i.e. wait before the page is mapped, not afterwards.

....
Post by Darrick J. Wong
diff --git a/mm/filemap.c b/mm/filemap.c
index 83a45d3..f201d80 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2217,8 +2217,8 @@ EXPORT_SYMBOL(generic_file_direct_write);
* Find or create a page at the given pagecache position. Return the locked
* page. This function is specifically for buffered writes.
*/
-struct page *grab_cache_page_write_begin(struct address_space *mapping,
- pgoff_t index, unsigned flags)
+struct page *__grab_cache_page_write_begin(struct address_space *mapping,
+ pgoff_t index, unsigned flags)
{
int status;
struct page *page;
}
return page;
}
+struct page *grab_cache_page_write_begin(struct address_space *mapping,
+ pgoff_t index, unsigned flags)
+{
+ struct page *p;
+
+ p = __grab_cache_page_write_begin(mapping, index, flags);
+ if (p)
+ wait_on_page_writeback(p);
+
+ return p;
+}
EXPORT_SYMBOL(grab_cache_page_write_begin);
Not much point in add in a wrapper when nothing else calls
__grab_cache_page_write_begin(), which should also be static....

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Darrick J. Wong
2011-03-10 23:57:22 UTC
Permalink
Post by Martin K. Petersen
Also, DIX is only the tip of the iceberg. Many other impending
technologies feature checksums and require pages to be stable during I/O
due to checksumming, encryption and so on.
The VM is already trying to do the right thing. We just need the
relevant filesystems to catch up.
ocfs2 handles stable metadata for its checksums when feeding
things to the journal. If we're doing pagecache-based I/O, is the
pagecache going to help here for data?
Data is much easier than metadata. All you really need is to wait on
writeback in file_write, wait on writeback in page_mkwrite, and make
Hrm... I've been looking for a file_write in ext4; was the aio_write function
pointer what you had in mind here?

Adding a wait_on_page_writeback to ext4_page_mkwrite eliminated most of the DIF
checksum errors in the mmap case. Unfortunately, I still see them, usually
within the first few seconds of kicking off the not-first run. I suspect that
may be related to the test program truncating the file after each run causing
blocks to come and go, so I'll investigate that part of ext4 next.

I also noticed that testing the directio+mmap case exhibits the same symptoms.

Anyhow, just attaching the latest hack of mine in case anyone wants to have a
look.
--
fs: Wait for page writeback when rewrite detected
---

fs/buffer.c | 2 ++
fs/ext4/inode.c | 6 +++++-
mm/filemap.c | 19 +++++++++++++++++--
mm/memory.c | 3 +++
4 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 2219a76..f3639f2 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2369,6 +2369,8 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
else
end = PAGE_CACHE_SIZE;

+ WARN_ON(!PageLocked(page));
+ wait_on_page_writeback(page);
ret = __block_write_begin(page, 0, end, get_block);
if (!ret)
ret = block_commit_write(page, 0, end);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9f7f9e4..ba45b31 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2762,6 +2762,8 @@ static int ext4_writepage(struct page *page,
*/
goto redirty_page;
}
+ WARN_ON(!PageLocked(page));
+ wait_on_page_writeback(page);
if (commit_write)
/* now mark the buffer_heads as dirty and uptodate */
block_commit_write(page, 0, len);
@@ -5865,12 +5867,14 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (PageMappedToDisk(page))
goto out_unlock;

+ lock_page(page);
+ /* this one seems to handle mmap */
+ wait_on_page_writeback(page);
if (page->index == size >> PAGE_CACHE_SHIFT)
len = size & ~PAGE_CACHE_MASK;
else
len = PAGE_CACHE_SIZE;

- lock_page(page);
/*
* return if we have all the buffers mapped. This avoid
* the need to call write_begin/write_end which does a
diff --git a/mm/filemap.c b/mm/filemap.c
index 83a45d3..61af700 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2217,8 +2217,9 @@ EXPORT_SYMBOL(generic_file_direct_write);
* Find or create a page at the given pagecache position. Return the locked
* page. This function is specifically for buffered writes.
*/
-struct page *grab_cache_page_write_begin(struct address_space *mapping,
- pgoff_t index, unsigned flags)
+static struct page *
+__grab_cache_page_write_begin(struct address_space *mapping, pgoff_t index,
+ unsigned flags)
{
int status;
struct page *page;
@@ -2243,6 +2244,20 @@ repeat:
}
return page;
}
+
+struct page *grab_cache_page_write_begin(struct address_space *mapping,
+ pgoff_t index, unsigned flags)
+{
+ struct page *p;
+
+ p = __grab_cache_page_write_begin(mapping, index, flags);
+ if (p) {
+ WARN_ON(!PageLocked(p));
+ wait_on_page_writeback(p);
+ }
+
+ return p;
+}
EXPORT_SYMBOL(grab_cache_page_write_begin);

static ssize_t generic_perform_write(struct file *file,
Chris Mason
2011-03-11 16:34:05 UTC
Permalink
Post by Darrick J. Wong
Post by Martin K. Petersen
Also, DIX is only the tip of the iceberg. Many other impending
technologies feature checksums and require pages to be stable during I/O
due to checksumming, encryption and so on.
The VM is already trying to do the right thing. We just need the
relevant filesystems to catch up.
ocfs2 handles stable metadata for its checksums when feeding
things to the journal. If we're doing pagecache-based I/O, is the
pagecache going to help here for data?
Data is much easier than metadata. All you really need is to wait on
writeback in file_write, wait on writeback in page_mkwrite, and make
Hrm... I've been looking for a file_write in ext4; was the aio_write function
pointer what you had in mind here?
Your change to grab_cache_page_write_begin looks good to me, at least
for ext4. For ext3 you have to actually go in and wait for each of the
buffer heads in the page, since ext3 (and reiserfs) will write the buffer heads
directly without using writepage.

Have you confirmed by looking at the block mapping that your crc errors
are from data blocks?

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Darrick J. Wong
2011-03-11 18:51:18 UTC
Permalink
Post by Chris Mason
Post by Darrick J. Wong
Post by Martin K. Petersen
Also, DIX is only the tip of the iceberg. Many other impending
technologies feature checksums and require pages to be stable during I/O
due to checksumming, encryption and so on.
The VM is already trying to do the right thing. We just need the
relevant filesystems to catch up.
ocfs2 handles stable metadata for its checksums when feeding
things to the journal. If we're doing pagecache-based I/O, is the
pagecache going to help here for data?
Data is much easier than metadata. All you really need is to wait on
writeback in file_write, wait on writeback in page_mkwrite, and make
Hrm... I've been looking for a file_write in ext4; was the aio_write function
pointer what you had in mind here?
Your change to grab_cache_page_write_begin looks good to me, at least
for ext4. For ext3 you have to actually go in and wait for each of the
buffer heads in the page, since ext3 (and reiserfs) will write the buffer heads
directly without using writepage.
Heh, I hadn't really given much thought to ext2/3 yet. I'm still trying to
figure out what causes the occasional ext4 failures. :)
Post by Chris Mason
Have you confirmed by looking at the block mapping that your crc errors
are from data blocks?
Yes, debugfs confirms that they are data blocks.

--D
Darrick J. Wong
2011-03-19 00:07:55 UTC
Permalink
Post by Dave Chinner
Post by Darrick J. Wong
Post by Chris Mason
Post by Darrick J. Wong
Post by Chris Mason
Post by Jan Kara
Post by Martin K. Petersen
Also, DIX is only the tip of the iceberg. Many other impending
technologies feature checksums and require pages to be stable during I/O
due to checksumming, encryption and so on.
The VM is already trying to do the right thing. We just need the
relevant filesystems to catch up.
ocfs2 handles stable metadata for its checksums when feeding
things to the journal. If we're doing pagecache-based I/O, is the
pagecache going to help here for data?
Data is much easier than metadata. All you really need is to wait on
writeback in file_write, wait on writeback in page_mkwrite, and make
sure you don't free blocks back to the allocator that are actively under
IO.
I expect the hard part to be jbd and metadata in ext34.
But JBD already has to do data copy if a buffer is going to be modified
before/while it is written to the journal. So we should alredy do all that
is needed for metadata. I don't say there aren't any bugs as they could be
triggered only by crashing at the wrong moment and observing fs corruption.
But most of the work should be there...
Most of it is there, but there are always little bits and pieces. The
ext4 journal csumming code was one semi-recent example where we found
metadata changing in flight.
A big part of testing this is getting some way to detect the bugs
without dif/dix. With btrfs I have patches to do set_memory_ro on
pages once I've don the crc, hopefully we can generalize that idea or
some up with something smarter.
Right now I'm faking it with modprobe scsi_debug ato=1 guard=1 dif=3 dix=199.
Hm, would you mind sharing those patches? I've been working on a second patch
to do the wait-on-writeback per everyone's suggestions, but I still see the
occasional corruption error as soon as I enable the mmap write case and covet
some more debugging tools. It does seem to be working for the pure pwrite()
case. :)
Here's an ext4 version of the debugging patch. It's a few years old but
it'll give you the idea. This only covers metadata pages.
Looks like I hacked the btrfs version up and didn't keep the original,
I'll have to rework it, I was trying to use it for the big corruption I
fixed recently and made a bunch of changes.
For data if mmap is giving you trouble you need to wait on writeback in
page_mkwrite, with the page locked. fs/btrfs/inode.c has our
page_mkwrite, which uses wait_on_page_writeback() and also the btrfs
ordered write code. But for the other filesystems, waiting on writeback
should be enough.
Ok, here's what I have so far. I took everyone's suggestions of where to add
calls to wait_on_page_writeback, which seems to handle the multiple-write case
adequately. Unfortunately, it is still possible to generate checksum errors by
scribbling furiously on a mmap'd region, even after adding the writeback wait
in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
removing its wait_for_page_writeback call, so I suspect there's a bit more
going on in btrfs than I've been able to figure out.
I wonder, is it possible for this to happen:

1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes,
but there's no ongoing writeback, so it returns without delay.
2. Thread A starts writing furiously to the page.
3. Thread B runs fsync() or something that results in the page being
checksummed and scheduled for writeout.
4. Thread A continues to write furiously(!) on that same page before the
controller finishes the DMA transfer.
5. Disk gets the page, which now doesn't match its checksum, and *boom*

After letting the stress tool run for a few days, I can say fairly confidently
that the write() case doesn't seem to fail regardless of the O_DIRECT setting.
However, with writes to mmap regions, failures happen about once every 20-40
minutes, even with O_DIRECT set. To me this suggests some sort of race
condition that we seem to win except once every 20 minutes.

I then thought, if page_mkwrite contains a wait_on_page_writeback, then perhaps
there's something that I could do just prior to calculating the DIF checksum
that would cause any subsequent write attempts to be shuffled back into
page_mkwrite. I tried the set_memory_ro thing again, though that led to some
recursive lock errors and I noticed that those functions only seem to exist in
arch/x86/. Next I tried directly mucking with PTEs, in addition to feeling
messy, only seemed to corrupt memory. :)

Is there a "correct" way to take a writeable page and make it so that any
process trying to write to it ends up hitting the page fault handler where we
can then wait for writeback? Or perhaps I am simply barking up the wrong tree?

(Just FYI I took the old copy-everything-to-bounce-buffers patch that few
people liked for a second spin, and the errors did not surface regardless of
what combination of write/mmap and directio/bufferedio I told it to use.)

--D
Post by Dave Chinner
Post by Darrick J. Wong
The set_memory_ro debugging trick didn't ferret out any write paths that I
didn't catch... though it did have the effect of causing occasional fsync()
deadlocks. I suppose I could sprinkle in a few more of those write calls to
see what happens.
Either way, I'm emailing to ask everyone's advice since I've run out of ideas.
Or: Did I miss something?
Thanks all for the feedback so far!
--
fs: Wait for page writeback when rewrite detected
---
fs/buffer.c | 4 +++-
fs/ext4/inode.c | 3 +++
mm/filemap.c | 15 +++++++++++++--
3 files changed, 19 insertions(+), 3 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 2219a76..39e934c 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2379,8 +2379,10 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
ret = VM_FAULT_OOM;
else /* -ENOSPC, -EIO, etc */
ret = VM_FAULT_SIGBUS;
- } else
+ } else {
+ wait_on_page_writeback(page);
ret = VM_FAULT_LOCKED;
+ }
I think this needs to wait before the __block_write_begin() call,
not after it. i.e. wait before the page is mapped, not afterwards.
....
Post by Darrick J. Wong
diff --git a/mm/filemap.c b/mm/filemap.c
index 83a45d3..f201d80 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2217,8 +2217,8 @@ EXPORT_SYMBOL(generic_file_direct_write);
* Find or create a page at the given pagecache position. Return the locked
* page. This function is specifically for buffered writes.
*/
-struct page *grab_cache_page_write_begin(struct address_space *mapping,
- pgoff_t index, unsigned flags)
+struct page *__grab_cache_page_write_begin(struct address_space *mapping,
+ pgoff_t index, unsigned flags)
{
int status;
struct page *page;
}
return page;
}
+struct page *grab_cache_page_write_begin(struct address_space *mapping,
+ pgoff_t index, unsigned flags)
+{
+ struct page *p;
+
+ p = __grab_cache_page_write_begin(mapping, index, flags);
+ if (p)
+ wait_on_page_writeback(p);
+
+ return p;
+}
EXPORT_SYMBOL(grab_cache_page_write_begin);
Not much point in add in a wrapper when nothing else calls
__grab_cache_page_write_begin(), which should also be static....
Cheers,
Dave.
--
Dave Chinner
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Andreas Dilger
2011-03-19 02:28:26 UTC
Permalink
Post by Darrick J. Wong
Post by Darrick J. Wong
Ok, here's what I have so far. I took everyone's suggestions of where to add
calls to wait_on_page_writeback, which seems to handle the multiple-write case
adequately. Unfortunately, it is still possible to generate checksum errors by
scribbling furiously on a mmap'd region, even after adding the writeback wait
in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
removing its wait_for_page_writeback call, so I suspect there's a bit more
going on in btrfs than I've been able to figure out.
1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes,
but there's no ongoing writeback, so it returns without delay.
2. Thread A starts writing furiously to the page.
3. Thread B runs fsync() or something that results in the page being
checksummed and scheduled for writeout.
4. Thread A continues to write furiously(!) on that same page before the
controller finishes the DMA transfer.
Right, page_mkwrite() is only called for the ro->rw transition.
Post by Darrick J. Wong
5. Disk gets the page, which now doesn't match its checksum, and *boom*
After letting the stress tool run for a few days, I can say fairly confidently
that the write() case doesn't seem to fail regardless of the O_DIRECT setting.
However, with writes to mmap regions, failures happen about once every 20-40
minutes, even with O_DIRECT set. To me this suggests some sort of race
condition that we seem to win except once every 20 minutes.
I then thought, if page_mkwrite contains a wait_on_page_writeback, then perhaps
there's something that I could do just prior to calculating the DIF checksum
that would cause any subsequent write attempts to be shuffled back into
page_mkwrite. I tried the set_memory_ro thing again, though that led to some
recursive lock errors and I noticed that those functions only seem to exist in
arch/x86/. Next I tried directly mucking with PTEs, in addition to feeling
messy, only seemed to corrupt memory. :)
This seems like the best solution, IMHO, to ensure that mmap is blocked in page_mkwrite() before it has any chance to dirty the page undergoing checksum. The trick is that you need to set_page_writeback() before setting the page read-only, otherwise the race still exists.
Post by Darrick J. Wong
Is there a "correct" way to take a writeable page and make it so that any
process trying to write to it ends up hitting the page fault handler where we
can then wait for writeback? Or perhaps I am simply barking up the wrong tree?
(Just FYI I took the old copy-everything-to-bounce-buffers patch that few
people liked for a second spin, and the errors did not surface regardless of
what combination of write/mmap and directio/bufferedio I told it to use.)
I wouldn't be so much against memcpy() for mmap pages, but it does seem kind of gross that mmap is forcing data copies when a major reason to use mmap is to AVOID data copies.
Post by Darrick J. Wong
Post by Darrick J. Wong
Post by Darrick J. Wong
The set_memory_ro debugging trick didn't ferret out any write paths that I
didn't catch... though it did have the effect of causing occasional fsync()
deadlocks. I suppose I could sprinkle in a few more of those write calls to
see what happens.
Either way, I'm emailing to ask everyone's advice since I've run out of ideas.
Or: Did I miss something?
Thanks all for the feedback so far!
--
fs: Wait for page writeback when rewrite detected
---
fs/buffer.c | 4 +++-
fs/ext4/inode.c | 3 +++
mm/filemap.c | 15 +++++++++++++--
3 files changed, 19 insertions(+), 3 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 2219a76..39e934c 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2379,8 +2379,10 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
ret = VM_FAULT_OOM;
else /* -ENOSPC, -EIO, etc */
ret = VM_FAULT_SIGBUS;
- } else
+ } else {
+ wait_on_page_writeback(page);
ret = VM_FAULT_LOCKED;
+ }
I think this needs to wait before the __block_write_begin() call,
not after it. i.e. wait before the page is mapped, not afterwards.
....
Post by Darrick J. Wong
diff --git a/mm/filemap.c b/mm/filemap.c
index 83a45d3..f201d80 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2217,8 +2217,8 @@ EXPORT_SYMBOL(generic_file_direct_write);
* Find or create a page at the given pagecache position. Return the locked
* page. This function is specifically for buffered writes.
*/
-struct page *grab_cache_page_write_begin(struct address_space *mapping,
- pgoff_t index, unsigned flags)
+struct page *__grab_cache_page_write_begin(struct address_space *mapping,
+ pgoff_t index, unsigned flags)
{
int status;
struct page *page;
}
return page;
}
+struct page *grab_cache_page_write_begin(struct address_space *mapping,
+ pgoff_t index, unsigned flags)
+{
+ struct page *p;
+
+ p = __grab_cache_page_write_begin(mapping, index, flags);
+ if (p)
+ wait_on_page_writeback(p);
+
+ return p;
+}
EXPORT_SYMBOL(grab_cache_page_write_begin);
Not much point in add in a wrapper when nothing else calls
__grab_cache_page_write_begin(), which should also be static....
Cheers,
Dave.
--
Dave Chinner
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Cheers, Andreas





--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Darrick J. Wong
2011-03-22 19:23:01 UTC
Permalink
Post by Andreas Dilger
Post by Darrick J. Wong
Post by Darrick J. Wong
Ok, here's what I have so far. I took everyone's suggestions of where to add
calls to wait_on_page_writeback, which seems to handle the multiple-write case
adequately. Unfortunately, it is still possible to generate checksum errors by
scribbling furiously on a mmap'd region, even after adding the writeback wait
in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
removing its wait_for_page_writeback call, so I suspect there's a bit more
going on in btrfs than I've been able to figure out.
1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes,
but there's no ongoing writeback, so it returns without delay.
2. Thread A starts writing furiously to the page.
3. Thread B runs fsync() or something that results in the page being
checksummed and scheduled for writeout.
4. Thread A continues to write furiously(!) on that same page before the
controller finishes the DMA transfer.
Right, page_mkwrite() is only called for the ro->rw transition.
Post by Darrick J. Wong
5. Disk gets the page, which now doesn't match its checksum, and *boom*
After letting the stress tool run for a few days, I can say fairly confidently
that the write() case doesn't seem to fail regardless of the O_DIRECT setting.
However, with writes to mmap regions, failures happen about once every 20-40
minutes, even with O_DIRECT set. To me this suggests some sort of race
condition that we seem to win except once every 20 minutes.
I then thought, if page_mkwrite contains a wait_on_page_writeback, then perhaps
there's something that I could do just prior to calculating the DIF checksum
that would cause any subsequent write attempts to be shuffled back into
page_mkwrite. I tried the set_memory_ro thing again, though that led to some
recursive lock errors and I noticed that those functions only seem to exist in
arch/x86/. Next I tried directly mucking with PTEs, in addition to feeling
messy, only seemed to corrupt memory. :)
This seems like the best solution, IMHO, to ensure that mmap is blocked in
page_mkwrite() before it has any chance to dirty the page undergoing
checksum. The trick is that you need to set_page_writeback() before setting
the page read-only, otherwise the race still exists.
I figured out that the recursive locking errors only happened in the
set_memory_rw half of the ro/rw memory pair, and that I could make them go away
(for now) by do set_memory_rw in the kintegrityd workqueue. Then I added a
call to set_page_writeback just prior to the set_memory_ro call, though that
resulted in a lot of complaints about invalid page states and the like. It
would seem that the memory pages that arrive in bio_integrity_prep from jbd2
don't have the writeback flag set, and setting it causes problems for it. The
writeback flag is set on all the pages that are associated with a checksum
failure, I noticed.

As for changing pte's around... does that set_memory_ro change the pte flags
for all running processes? I'm not so sure it does for anything other than the
current process. I think I saw a flush_tlb call in there... though I don't
think it helps me much.

If I /don't/ set the flag, the frequency of the errors decreases further to
about once an hour, but I still see the occasional error. :/ Currently I'm
trying to figure out how one might distinguish dirty pages that shouldn't have
writeback set vs. pages that ought to have it but don't.

I suppose I could pull out the 're-checksum and resubmit' patch I made a while
back, though it seems like a bandaid.
Post by Andreas Dilger
Post by Darrick J. Wong
Is there a "correct" way to take a writeable page and make it so that any
process trying to write to it ends up hitting the page fault handler where we
can then wait for writeback? Or perhaps I am simply barking up the wrong tree?
(Just FYI I took the old copy-everything-to-bounce-buffers patch that few
people liked for a second spin, and the errors did not surface regardless of
what combination of write/mmap and directio/bufferedio I told it to use.)
I wouldn't be so much against memcpy() for mmap pages, but it does seem kind
of gross that mmap is forcing data copies when a major reason to use mmap is
to AVOID data copies.
Yeah. We probably want to avoid having to find extra pages too. :(

--D
Jan Kara
2011-03-22 21:54:30 UTC
Permalink
Post by Darrick J. Wong
Post by Andreas Dilger
This seems like the best solution, IMHO, to ensure that mmap is blocked in
page_mkwrite() before it has any chance to dirty the page undergoing
checksum. The trick is that you need to set_page_writeback() before setting
the page read-only, otherwise the race still exists.
I figured out that the recursive locking errors only happened in the
set_memory_rw half of the ro/rw memory pair, and that I could make them go away
(for now) by do set_memory_rw in the kintegrityd workqueue. Then I added a
call to set_page_writeback just prior to the set_memory_ro call, though that
resulted in a lot of complaints about invalid page states and the like. It
would seem that the memory pages that arrive in bio_integrity_prep from jbd2
don't have the writeback flag set, and setting it causes problems for it. The
writeback flag is set on all the pages that are associated with a checksum
failure, I noticed.
Yeah, pages submitted by jbd2 don't have writeback flag set because they
are metadata blocks written directly via buffer heads. But as you noted,
these are protected in a different way by the journalling layer so
shouldn't need to worry.
Post by Darrick J. Wong
As for changing pte's around... does that set_memory_ro change the pte flags
for all running processes? I'm not so sure it does for anything other than the
current process. I think I saw a flush_tlb call in there... though I don't
think it helps me much.
If I /don't/ set the flag, the frequency of the errors decreases further to
about once an hour, but I still see the occasional error. :/ Currently I'm
trying to figure out how one might distinguish dirty pages that shouldn't have
writeback set vs. pages that ought to have it but don't.
It's difficult at the block layer... If page->mapping->host is not a
device inode, the page should have PageWriteback set. If it is a device
inode, you don't know - JBD2 will submit pages without PageWriteback set,
flusher thread will submit pages with PageWriteback set. And both is OK
since we use buffer state for synchronization.

Honza
--
Jan Kara <***@suse.cz>
SUSE Labs, CR
Jan Kara
2011-03-21 14:04:51 UTC
Permalink
Post by Darrick J. Wong
Post by Darrick J. Wong
Ok, here's what I have so far. I took everyone's suggestions of where to add
calls to wait_on_page_writeback, which seems to handle the multiple-write case
adequately. Unfortunately, it is still possible to generate checksum errors by
scribbling furiously on a mmap'd region, even after adding the writeback wait
in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
removing its wait_for_page_writeback call, so I suspect there's a bit more
going on in btrfs than I've been able to figure out.
1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes,
but there's no ongoing writeback, so it returns without delay.
2. Thread A starts writing furiously to the page.
3. Thread B runs fsync() or something that results in the page being
checksummed and scheduled for writeout.
4. Thread A continues to write furiously(!) on that same page before the
controller finishes the DMA transfer.
5. Disk gets the page, which now doesn't match its checksum, and *boom*
What happens on writepage (see mm/page-writeback.c:write_cache_pages())
is:
lock_page(page)
...
clear_page_dirty_for_io() - removes PageDirty, marks page as read-only in
PTE
...
set_page_writeback() (happens e.g. in __block_write_full_page() called
from filesystem's writepage implementation).
unlock_page(page)

So if you compute the checksum after set_page_writeback() is done in the
writepage() implementation (you cannot use __block_write_full_page() in
that case) and you call wait_on_page_writeback() in ext4_page_mkwrite()
under page lock, you should be safe. If you do all this and still see
errors, something is broken I'd say...

Honza
--
Jan Kara <***@suse.cz>
SUSE Labs, CR
Chris Mason
2011-03-21 14:24:41 UTC
Permalink
Post by Jan Kara
Post by Darrick J. Wong
Post by Darrick J. Wong
Ok, here's what I have so far. I took everyone's suggestions of where to add
calls to wait_on_page_writeback, which seems to handle the multiple-write case
adequately. Unfortunately, it is still possible to generate checksum errors by
scribbling furiously on a mmap'd region, even after adding the writeback wait
in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
removing its wait_for_page_writeback call, so I suspect there's a bit more
going on in btrfs than I've been able to figure out.
1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes,
but there's no ongoing writeback, so it returns without delay.
2. Thread A starts writing furiously to the page.
3. Thread B runs fsync() or something that results in the page being
checksummed and scheduled for writeout.
4. Thread A continues to write furiously(!) on that same page before the
controller finishes the DMA transfer.
5. Disk gets the page, which now doesn't match its checksum, and *boom*
What happens on writepage (see mm/page-writeback.c:write_cache_pages())
lock_page(page)
...
clear_page_dirty_for_io() - removes PageDirty, marks page as read-only in
PTE
...
set_page_writeback() (happens e.g. in __block_write_full_page() called
from filesystem's writepage implementation).
unlock_page(page)
So if you compute the checksum after set_page_writeback() is done in the
writepage() implementation (you cannot use __block_write_full_page() in
that case) and you call wait_on_page_writeback() in ext4_page_mkwrite()
under page lock, you should be safe. If you do all this and still see
errors, something is broken I'd say...
Looking at the ext4_page_mkwrite, it does this:

lock the page
check for holes
unlock the page
if (no_holes)
return;

write_begin/write_end
return

So, to have page_mkwrite work, you need to wait for writeback with the
page locked in both the no holes case and after the
write_begin/write_end. write_begin will dirty the page, so someone can
wander in and start the IO while we are still in page_mkwrite.

This is untested and uncompiled, but it should
do the trick.

Jan, did you get rid of all the buffer head based writeback for
data=ordered in ext4? That's my only other idea, that someone is doing
writeback directly without taking the page lock.

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9f7f9e4..8a75e12 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5880,6 +5880,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (page_has_buffers(page)) {
if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
ext4_bh_unmapped)) {
+ wait_on_page_writeback(page);
unlock_page(page);
goto out_unlock;
}
@@ -5901,6 +5902,16 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (ret < 0)
goto out_unlock;
ret = 0;
+
+ /*
+ * write_begin/end might have created a dirty page and someone
+ * could wander in and start the IO. Make sure that hasn't
+ * happened
+ */
+ lock_page(page);
+ wait_on_page_writeback(page);
+ unlock_page(page);
+
out_unlock:
if (ret)
ret = VM_FAULT_SIGBUS;
Jan Kara
2011-03-21 16:43:05 UTC
Permalink
Post by Jan Kara
Post by Jan Kara
Post by Darrick J. Wong
Post by Darrick J. Wong
Ok, here's what I have so far. I took everyone's suggestions of where to add
calls to wait_on_page_writeback, which seems to handle the multiple-write case
adequately. Unfortunately, it is still possible to generate checksum errors by
scribbling furiously on a mmap'd region, even after adding the writeback wait
in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
removing its wait_for_page_writeback call, so I suspect there's a bit more
going on in btrfs than I've been able to figure out.
1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes,
but there's no ongoing writeback, so it returns without delay.
2. Thread A starts writing furiously to the page.
3. Thread B runs fsync() or something that results in the page being
checksummed and scheduled for writeout.
4. Thread A continues to write furiously(!) on that same page before the
controller finishes the DMA transfer.
5. Disk gets the page, which now doesn't match its checksum, and *boom*
What happens on writepage (see mm/page-writeback.c:write_cache_pages())
lock_page(page)
...
clear_page_dirty_for_io() - removes PageDirty, marks page as read-only in
PTE
...
set_page_writeback() (happens e.g. in __block_write_full_page() called
from filesystem's writepage implementation).
unlock_page(page)
So if you compute the checksum after set_page_writeback() is done in the
writepage() implementation (you cannot use __block_write_full_page() in
that case)
I should add that if you are computing the checksum in the block layer
once the bio is submitted, you obviously are computing it after the page is
marked as writeback. So that should be fine...
Post by Jan Kara
Post by Jan Kara
and you call wait_on_page_writeback() in ext4_page_mkwrite()
under page lock, you should be safe. If you do all this and still see
errors, something is broken I'd say...
lock the page
check for holes
unlock the page
if (no_holes)
return;
write_begin/write_end
return
So, to have page_mkwrite work, you need to wait for writeback with the
page locked in both the no holes case and after the
write_begin/write_end. write_begin will dirty the page, so someone can
wander in and start the IO while we are still in page_mkwrite.
Oh right, that's a good point.
Post by Jan Kara
This is untested and uncompiled, but it should
do the trick.
Jan, did you get rid of all the buffer head based writeback for
data=ordered in ext4? That's my only other idea, that someone is doing
writeback directly without taking the page lock.
Yes, ext4 shouldn't do any buffer based writeback.
Post by Jan Kara
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9f7f9e4..8a75e12 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5880,6 +5880,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (page_has_buffers(page)) {
if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
ext4_bh_unmapped)) {
+ wait_on_page_writeback(page);
unlock_page(page);
goto out_unlock;
}
@@ -5901,6 +5902,16 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (ret < 0)
goto out_unlock;
ret = 0;
+
+ /*
+ * write_begin/end might have created a dirty page and someone
+ * could wander in and start the IO. Make sure that hasn't
+ * happened
+ */
+ lock_page(page);
+ wait_on_page_writeback(page);
+ unlock_page(page);
+
if (ret)
ret = VM_FAULT_SIGBUS;
This looks good AFAICT.

Honza
--
Jan Kara <***@suse.cz>
SUSE Labs, CR
Darrick J. Wong
2011-04-06 23:29:38 UTC
Permalink
Post by Jan Kara
Post by Jan Kara
Post by Jan Kara
Post by Darrick J. Wong
Post by Darrick J. Wong
Ok, here's what I have so far. I took everyone's suggestions of where to add
calls to wait_on_page_writeback, which seems to handle the multiple-write case
adequately. Unfortunately, it is still possible to generate checksum errors by
scribbling furiously on a mmap'd region, even after adding the writeback wait
in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
removing its wait_for_page_writeback call, so I suspect there's a bit more
going on in btrfs than I've been able to figure out.
1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes,
but there's no ongoing writeback, so it returns without delay.
2. Thread A starts writing furiously to the page.
3. Thread B runs fsync() or something that results in the page being
checksummed and scheduled for writeout.
4. Thread A continues to write furiously(!) on that same page before the
controller finishes the DMA transfer.
5. Disk gets the page, which now doesn't match its checksum, and *boom*
What happens on writepage (see mm/page-writeback.c:write_cache_pages())
lock_page(page)
...
clear_page_dirty_for_io() - removes PageDirty, marks page as read-only in
PTE
...
set_page_writeback() (happens e.g. in __block_write_full_page() called
from filesystem's writepage implementation).
unlock_page(page)
So if you compute the checksum after set_page_writeback() is done in the
writepage() implementation (you cannot use __block_write_full_page() in
that case)
I should add that if you are computing the checksum in the block layer
once the bio is submitted, you obviously are computing it after the page is
marked as writeback. So that should be fine...
Post by Jan Kara
Post by Jan Kara
and you call wait_on_page_writeback() in ext4_page_mkwrite()
under page lock, you should be safe. If you do all this and still see
errors, something is broken I'd say...
lock the page
check for holes
unlock the page
if (no_holes)
return;
write_begin/write_end
return
So, to have page_mkwrite work, you need to wait for writeback with the
page locked in both the no holes case and after the
write_begin/write_end. write_begin will dirty the page, so someone can
wander in and start the IO while we are still in page_mkwrite.
Oh right, that's a good point.
Post by Jan Kara
This is untested and uncompiled, but it should
do the trick.
Jan, did you get rid of all the buffer head based writeback for
data=ordered in ext4? That's my only other idea, that someone is doing
writeback directly without taking the page lock.
Yes, ext4 shouldn't do any buffer based writeback.
Post by Jan Kara
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9f7f9e4..8a75e12 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5880,6 +5880,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (page_has_buffers(page)) {
if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
ext4_bh_unmapped)) {
+ wait_on_page_writeback(page);
unlock_page(page);
goto out_unlock;
}
@@ -5901,6 +5902,16 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (ret < 0)
goto out_unlock;
ret = 0;
+
+ /*
+ * write_begin/end might have created a dirty page and someone
+ * could wander in and start the IO. Make sure that hasn't
+ * happened
+ */
+ lock_page(page);
+ wait_on_page_writeback(page);
+ unlock_page(page);
+
if (ret)
ret = VM_FAULT_SIGBUS;
This looks good AFAICT.
I gave this a spin a couple of weeks ago (and accidentally left the test
machines running for a full week!) From what I can tell, with all the various
wait_for_page_writeback stuff-ins, we've cut the frequency of writeback errors
down to about 7-8 per day. Not bad, but not fixed.

On the odd chance that jbd2 really can provide stable pages during writeback, I
am now rerunning the test with no patches and data=journal, while noting that
(a) DIO mode doesn't work with data=journal and (b) the first write failure
will probably cause the journal to abort == game over. When that's done I'll
give the wait-for-writeback patches a whirl with 2.6.39-rc.

Enjoy the warm SF weather!

--D
Darrick J. Wong
2011-04-07 16:44:06 UTC
Permalink
Post by Darrick J. Wong
Post by Jan Kara
Post by Jan Kara
Post by Jan Kara
Post by Darrick J. Wong
Post by Darrick J. Wong
Ok, here's what I have so far. I took everyone's suggestions of where to add
calls to wait_on_page_writeback, which seems to handle the multiple-write case
adequately. Unfortunately, it is still possible to generate checksum errors by
scribbling furiously on a mmap'd region, even after adding the writeback wait
in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
removing its wait_for_page_writeback call, so I suspect there's a bit more
going on in btrfs than I've been able to figure out.
1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes,
but there's no ongoing writeback, so it returns without delay.
2. Thread A starts writing furiously to the page.
3. Thread B runs fsync() or something that results in the page being
checksummed and scheduled for writeout.
4. Thread A continues to write furiously(!) on that same page before the
controller finishes the DMA transfer.
5. Disk gets the page, which now doesn't match its checksum, and *boom*
What happens on writepage (see mm/page-writeback.c:write_cache_pages())
lock_page(page)
...
clear_page_dirty_for_io() - removes PageDirty, marks page as read-only in
PTE
...
set_page_writeback() (happens e.g. in __block_write_full_page() called
from filesystem's writepage implementation).
unlock_page(page)
So if you compute the checksum after set_page_writeback() is done in the
writepage() implementation (you cannot use __block_write_full_page() in
that case)
I should add that if you are computing the checksum in the block layer
once the bio is submitted, you obviously are computing it after the page is
marked as writeback. So that should be fine...
Post by Jan Kara
Post by Jan Kara
and you call wait_on_page_writeback() in ext4_page_mkwrite()
under page lock, you should be safe. If you do all this and still see
errors, something is broken I'd say...
lock the page
check for holes
unlock the page
if (no_holes)
return;
write_begin/write_end
return
So, to have page_mkwrite work, you need to wait for writeback with the
page locked in both the no holes case and after the
write_begin/write_end. write_begin will dirty the page, so someone can
wander in and start the IO while we are still in page_mkwrite.
Oh right, that's a good point.
Post by Jan Kara
This is untested and uncompiled, but it should
do the trick.
Jan, did you get rid of all the buffer head based writeback for
data=ordered in ext4? That's my only other idea, that someone is doing
writeback directly without taking the page lock.
Yes, ext4 shouldn't do any buffer based writeback.
Post by Jan Kara
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9f7f9e4..8a75e12 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5880,6 +5880,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (page_has_buffers(page)) {
if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
ext4_bh_unmapped)) {
+ wait_on_page_writeback(page);
unlock_page(page);
goto out_unlock;
}
@@ -5901,6 +5902,16 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (ret < 0)
goto out_unlock;
ret = 0;
+
+ /*
+ * write_begin/end might have created a dirty page and someone
+ * could wander in and start the IO. Make sure that hasn't
+ * happened
+ */
+ lock_page(page);
+ wait_on_page_writeback(page);
+ unlock_page(page);
+
if (ret)
ret = VM_FAULT_SIGBUS;
This looks good AFAICT.
I gave this a spin a couple of weeks ago (and accidentally left the test
machines running for a full week!) From what I can tell, with all the various
wait_for_page_writeback stuff-ins, we've cut the frequency of writeback errors
down to about 7-8 per day. Not bad, but not fixed.
On the odd chance that jbd2 really can provide stable pages during writeback, I
am now rerunning the test with no patches and data=journal, while noting that
(a) DIO mode doesn't work with data=journal and (b) the first write failure
will probably cause the journal to abort == game over. When that's done I'll
Heh, nope, even with data=journal I still see checksum errors. So much for
that hare-brained theory. :(

--D
Jan Kara
2011-04-07 16:57:00 UTC
Permalink
Post by Darrick J. Wong
Post by Jan Kara
Post by Jan Kara
Post by Jan Kara
Post by Darrick J. Wong
Post by Darrick J. Wong
Ok, here's what I have so far. I took everyone's suggestions of where to add
calls to wait_on_page_writeback, which seems to handle the multiple-write case
adequately. Unfortunately, it is still possible to generate checksum errors by
scribbling furiously on a mmap'd region, even after adding the writeback wait
in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
removing its wait_for_page_writeback call, so I suspect there's a bit more
going on in btrfs than I've been able to figure out.
1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes,
but there's no ongoing writeback, so it returns without delay.
2. Thread A starts writing furiously to the page.
3. Thread B runs fsync() or something that results in the page being
checksummed and scheduled for writeout.
4. Thread A continues to write furiously(!) on that same page before the
controller finishes the DMA transfer.
5. Disk gets the page, which now doesn't match its checksum, and *boom*
What happens on writepage (see mm/page-writeback.c:write_cache_pages())
lock_page(page)
...
clear_page_dirty_for_io() - removes PageDirty, marks page as read-only in
PTE
...
set_page_writeback() (happens e.g. in __block_write_full_page() called
from filesystem's writepage implementation).
unlock_page(page)
So if you compute the checksum after set_page_writeback() is done in the
writepage() implementation (you cannot use __block_write_full_page() in
that case)
I should add that if you are computing the checksum in the block layer
once the bio is submitted, you obviously are computing it after the page is
marked as writeback. So that should be fine...
Post by Jan Kara
Post by Jan Kara
and you call wait_on_page_writeback() in ext4_page_mkwrite()
under page lock, you should be safe. If you do all this and still see
errors, something is broken I'd say...
lock the page
check for holes
unlock the page
if (no_holes)
return;
write_begin/write_end
return
So, to have page_mkwrite work, you need to wait for writeback with the
page locked in both the no holes case and after the
write_begin/write_end. write_begin will dirty the page, so someone can
wander in and start the IO while we are still in page_mkwrite.
Oh right, that's a good point.
Post by Jan Kara
This is untested and uncompiled, but it should
do the trick.
Jan, did you get rid of all the buffer head based writeback for
data=ordered in ext4? That's my only other idea, that someone is doing
writeback directly without taking the page lock.
Yes, ext4 shouldn't do any buffer based writeback.
Post by Jan Kara
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9f7f9e4..8a75e12 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5880,6 +5880,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (page_has_buffers(page)) {
if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
ext4_bh_unmapped)) {
+ wait_on_page_writeback(page);
unlock_page(page);
goto out_unlock;
}
@@ -5901,6 +5902,16 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (ret < 0)
goto out_unlock;
ret = 0;
+
+ /*
+ * write_begin/end might have created a dirty page and someone
+ * could wander in and start the IO. Make sure that hasn't
+ * happened
+ */
+ lock_page(page);
+ wait_on_page_writeback(page);
+ unlock_page(page);
+
if (ret)
ret = VM_FAULT_SIGBUS;
This looks good AFAICT.
I gave this a spin a couple of weeks ago (and accidentally left the test
machines running for a full week!) From what I can tell, with all the various
wait_for_page_writeback stuff-ins, we've cut the frequency of writeback errors
down to about 7-8 per day. Not bad, but not fixed.
Ugh, strange. Can you post the full patch you are currently using? I've
already lost track of all the proposed changes... Thanks.

Honza
--
Jan Kara <***@suse.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Darrick J. Wong
2011-04-08 20:31:35 UTC
Permalink
Post by Jan Kara
Post by Darrick J. Wong
Post by Jan Kara
Post by Jan Kara
Post by Jan Kara
Post by Darrick J. Wong
Post by Darrick J. Wong
Ok, here's what I have so far. I took everyone's suggestions of where to add
calls to wait_on_page_writeback, which seems to handle the multiple-write case
adequately. Unfortunately, it is still possible to generate checksum errors by
scribbling furiously on a mmap'd region, even after adding the writeback wait
in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
removing its wait_for_page_writeback call, so I suspect there's a bit more
going on in btrfs than I've been able to figure out.
1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes,
but there's no ongoing writeback, so it returns without delay.
2. Thread A starts writing furiously to the page.
3. Thread B runs fsync() or something that results in the page being
checksummed and scheduled for writeout.
4. Thread A continues to write furiously(!) on that same page before the
controller finishes the DMA transfer.
5. Disk gets the page, which now doesn't match its checksum, and *boom*
What happens on writepage (see mm/page-writeback.c:write_cache_pages())
lock_page(page)
...
clear_page_dirty_for_io() - removes PageDirty, marks page as read-only in
PTE
...
set_page_writeback() (happens e.g. in __block_write_full_page() called
from filesystem's writepage implementation).
unlock_page(page)
So if you compute the checksum after set_page_writeback() is done in the
writepage() implementation (you cannot use __block_write_full_page() in
that case)
I should add that if you are computing the checksum in the block layer
once the bio is submitted, you obviously are computing it after the page is
marked as writeback. So that should be fine...
Post by Jan Kara
Post by Jan Kara
and you call wait_on_page_writeback() in ext4_page_mkwrite()
under page lock, you should be safe. If you do all this and still see
errors, something is broken I'd say...
lock the page
check for holes
unlock the page
if (no_holes)
return;
write_begin/write_end
return
So, to have page_mkwrite work, you need to wait for writeback with the
page locked in both the no holes case and after the
write_begin/write_end. write_begin will dirty the page, so someone can
wander in and start the IO while we are still in page_mkwrite.
Oh right, that's a good point.
Post by Jan Kara
This is untested and uncompiled, but it should
do the trick.
Jan, did you get rid of all the buffer head based writeback for
data=ordered in ext4? That's my only other idea, that someone is doing
writeback directly without taking the page lock.
Yes, ext4 shouldn't do any buffer based writeback.
Post by Jan Kara
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9f7f9e4..8a75e12 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5880,6 +5880,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (page_has_buffers(page)) {
if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
ext4_bh_unmapped)) {
+ wait_on_page_writeback(page);
unlock_page(page);
goto out_unlock;
}
@@ -5901,6 +5902,16 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (ret < 0)
goto out_unlock;
ret = 0;
+
+ /*
+ * write_begin/end might have created a dirty page and someone
+ * could wander in and start the IO. Make sure that hasn't
+ * happened
+ */
+ lock_page(page);
+ wait_on_page_writeback(page);
+ unlock_page(page);
+
if (ret)
ret = VM_FAULT_SIGBUS;
This looks good AFAICT.
I gave this a spin a couple of weeks ago (and accidentally left the test
machines running for a full week!) From what I can tell, with all the various
wait_for_page_writeback stuff-ins, we've cut the frequency of writeback errors
down to about 7-8 per day. Not bad, but not fixed.
Ugh, strange. Can you post the full patch you are currently using? I've
already lost track of all the proposed changes... Thanks.
Yes, me too. Attached is the giant patch I've been working on.

--D

fs: Wait for page writeback when rewrite detected, and mark pages ro during writeback

Signed-off-by: Darrick J. Wong <***@us.ibm.com>
---
diff --git a/fs/buffer.c b/fs/buffer.c
index a08bb8e..dd429fe 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2357,6 +2357,8 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
else
end = PAGE_CACHE_SIZE;

+ WARN_ON(!PageLocked(page));
+ wait_on_page_writeback(page);
ret = __block_write_begin(page, 0, end, get_block);
if (!ret)
ret = block_commit_write(page, 0, end);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1a86282..57cd028 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2666,6 +2666,8 @@ static int ext4_writepage(struct page *page,
*/
goto redirty_page;
}
+ WARN_ON(!PageLocked(page));
+ wait_on_page_writeback(page);
if (commit_write)
/* now mark the buffer_heads as dirty and uptodate */
block_commit_write(page, 0, len);
@@ -2778,7 +2780,8 @@ static int write_cache_pages_da(struct address_space *mapping,
}

lock_page(page);
-
+ if (PageWriteback(page))
+ wait_on_page_writeback(page);
/*
* If the page is no longer dirty, or its
* mapping no longer corresponds to inode we
@@ -5803,12 +5806,14 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (PageMappedToDisk(page))
goto out_unlock;

+ lock_page(page);
+ /* this one seems to handle mmap */
+ wait_on_page_writeback(page);
if (page->index == size >> PAGE_CACHE_SHIFT)
len = size & ~PAGE_CACHE_MASK;
else
len = PAGE_CACHE_SIZE;

- lock_page(page);
/*
* return if we have all the buffers mapped. This avoid
* the need to call write_begin/write_end which does a
@@ -5839,6 +5844,15 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (ret < 0)
goto out_unlock;
ret = 0;
+
+ /*
+ * write_begin/end might have created a dirty page and someone
+ * could wander in and start the IO. Make sure that hasn't
+ * happened.
+ */
+ lock_page(page);
+ wait_on_page_writeback(page);
+ unlock_page(page);
out_unlock:
if (ret)
ret = VM_FAULT_SIGBUS;
diff --git a/mm/filemap.c b/mm/filemap.c
index c641edf..3ed13a3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2277,8 +2277,9 @@ EXPORT_SYMBOL(generic_file_direct_write);
* Find or create a page at the given pagecache position. Return the locked
* page. This function is specifically for buffered writes.
*/
-struct page *grab_cache_page_write_begin(struct address_space *mapping,
- pgoff_t index, unsigned flags)
+static struct page *
+__grab_cache_page_write_begin(struct address_space *mapping, pgoff_t index,
+ unsigned flags)
{
int status;
struct page *page;
@@ -2303,6 +2304,20 @@ repeat:
}
return page;
}
+
+struct page *grab_cache_page_write_begin(struct address_space *mapping,
+ pgoff_t index, unsigned flags)
+{
+ struct page *p;
+
+ p = __grab_cache_page_write_begin(mapping, index, flags);
+ if (p) {
+ WARN_ON(!PageLocked(p));
+ wait_on_page_writeback(p);
+ }
+
+ return p;
+}
EXPORT_SYMBOL(grab_cache_page_write_begin);

static ssize_t generic_perform_write(struct file *file,
diff --git a/mm/memory.c b/mm/memory.c
index 9da8cab..17fb560 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3146,6 +3146,9 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
} else
VM_BUG_ON(!PageLocked(page));
page_mkwrite = 1;
+ } else {
+ WARN_ON(!PageLocked(page));
+ wait_on_page_writeback(page);
}
}

diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 9c5e6b2..7d2c57d 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -25,6 +25,7 @@
#include <linux/bio.h>
#include <linux/workqueue.h>
#include <linux/slab.h>
+#include <asm/tlbflush.h>

struct integrity_slab {
struct kmem_cache *slab;
@@ -382,6 +383,52 @@ static inline unsigned short blk_integrity_tuple_size(struct blk_integrity *bi)
return 0;
}

+static int set_page_ro(struct page *page)
+{
+ int set_memory_ro(unsigned long addr, int numpages);
+ unsigned long addr = (unsigned long)page_address(page);
+ if (PageHighMem(page))
+ return 0;
+ return set_memory_ro(addr, 1);
+}
+
+static int set_page_rw(struct page *page)
+{
+ int set_memory_rw(unsigned long addr, int numpages);
+ unsigned long addr = (unsigned long)page_address(page);
+ if (PageHighMem(page))
+ return 0;
+ return set_memory_rw(addr, 1);
+}
+
+static void bio_integrity_write_fn(struct work_struct *work)
+{
+ struct bio_integrity_payload *bip =
+ container_of(work, struct bio_integrity_payload, bip_work);
+ struct bio *bio = bip->bip_bio;
+ struct bio_vec *from;
+ int i;
+
+ __bio_for_each_segment(from, bio, i, 0) {
+ set_page_rw(from->bv_page);
+ }
+
+ /* Restore original bio completion handler */
+ bio->bi_end_io = bip->bip_end_io;
+ bio_endio(bio, bip->bip_error);
+}
+
+static void bio_integrity_endio_write(struct bio *bio, int error)
+{
+ struct bio_integrity_payload *bip = bio->bi_integrity;
+
+ BUG_ON(bip->bip_bio != bio);
+
+ bip->bip_error = error;
+ INIT_WORK(&bip->bip_work, bio_integrity_write_fn);
+ queue_work(kintegrityd_wq, &bip->bip_work);
+}
+
/**
* bio_integrity_prep - Prepare bio for integrity I/O
* @bio: bio to prepare
@@ -468,8 +515,30 @@ int bio_integrity_prep(struct bio *bio)
}

/* Auto-generate integrity metadata if this is a write */
- if (bio_data_dir(bio) == WRITE)
+ if (bio_data_dir(bio) == WRITE) {
+ struct bio_vec *from;
+ int i;
+
+ bip->bip_end_io = bio->bi_end_io;
+ bio->bi_end_io = bio_integrity_endio_write;
+ __bio_for_each_segment(from, bio, i, 0) {
+ set_page_writeback(from->bv_page);
+#if 0
+ if (!PagePrivate(from->bv_page) &&
+ !PageWriteback(from->bv_page) &&
+ from->bv_page->mapping &&
+ from->bv_page->mapping->host &&
+ !from->bv_page->mapping->host->i_bdev)
+ {
+ printk(KERN_ERR "%s: writebacking file(?) page=%p flags=%lx mode=%x...\n", __FUNCTION__, from->bv_page, from->bv_page->flags, from->bv_page->mapping->host->i_mode);
+ set_page_writeback(from->bv_page);
+ }
+#endif
+ set_page_ro(from->bv_page);
+ flush_tlb_all();
+ }
bio_integrity_generate(bio);
+ }

return 0;
}
diff --git a/fs/buffer.c b/fs/buffer.c
index dd429fe..02156c5 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -376,8 +376,10 @@ void end_buffer_async_write(struct buffer_head *bh, int uptodate)
if (!quiet_error(bh)) {
buffer_io_error(bh);
printk(KERN_WARNING "lost page write due to "
- "I/O error on %s\n",
- bdevname(bh->b_bdev, b));
+ "I/O error on %s, flags=%lx\n",
+ bdevname(bh->b_bdev, b), page->flags);
+if (page->mapping && page->mapping->host)
+printk(KERN_ERR "mode=%x inode=%d dev=%d\n", page->mapping->host->i_mode, page->mapping->host->i_ino, (page->mapping->host->i_rdev));
}
set_bit(AS_EIO, &page->mapping->flags);
set_buffer_write_io_error(bh);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index ce33e68..ea36e89 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -180,6 +180,7 @@ struct bio_integrity_payload {
unsigned short bip_vcnt; /* # of integrity bio_vecs */
unsigned short bip_idx; /* current bip_vec index */

+ int bip_error; /* bio completion status? */
struct work_struct bip_work; /* I/O completion */
struct bio_vec bip_vec[0]; /* embedded bvec array */
};
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Mingming Cao
2011-04-14 00:48:48 UTC
Permalink
Oh, right. Currently ext4_page_mkwrite drops the page lock before
calling it's dirty the page (by write_begin() and write_end(). I
suspect regrab the lock() after write_end() (with your proposed change)
and returning with locked still leave the dirty by ext4_page_mkwrite
unlocked. We probably should to keep the page locked the page during
the entire ext4_page_mkwrite() call. Any reason to drop the page lock()
before calling aops->write_begin()?
write_begin takes the page lock by itself. That's one of the reasons why
block_page_mkwrite doesn't use plain ->write_begin / write_end, the
other beeing that we already get a page passed to use, so there's no
need to do the pagecache lookup or allocation done by
grab_cache_page_write_begin.
The best thing would be to completely drop ext4's current version
of page_mkwrite and start out with a copy of block_page_mkwrite which
has the journalling calls added back into it.
The problem is the locking order, we can't hold page lock then start the
journal lock. Kjournald will need to hold the journal lock first, then
commit, commit may need to callback writepages, which need to hold the
page lock.


I looked at ext3, in that case, ext3 even don't have ext3_page_mkwrite()
to do the stable page yet. It requires some block reservation/delayed
allocation for filling holes in mmaped IO case. Jan tried that before I
don't think the proposal get far.


Now looking back ext4_page_mkwrite(), it calls write_begin(), as long as
we do wait in write_begin() things would be fine, right? It seems
Darrick already did that wait per Dave Chinner's suggestion when grab
the page and lock that page. But still a puzzle to me why that's not
sufficient.


Mingming
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jan Kara
2011-04-22 20:34:34 UTC
Permalink
Hi everyone,
I've finally managed to get together a patch that seems to provide stable pages
during writeback, or at least gets us to the point that after several days of
running tests I don't see DIF checksum errors anymore. :)
The last two pieces to go into this puzzle were (a) bio_integrity_prep needs to
walk the process tree to find all userland ptes that map to a particular memory
page and revoke write access, and
Hmm, did you need the bio_integrity_prep change for all the filesystems?
This should be happening already as part of using page_mkwrite.
Or more precisely page_mkclean() should do what you try to do in
bio_integrity_prep()... It would certainly be interesting (bug) if you
could write to the page after calling page_mkclean() without page_mkwrite()
being called.

Honza
--
Jan Kara <***@suse.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Darrick J. Wong
2011-04-26 00:37:38 UTC
Permalink
Post by Jan Kara
Hi everyone,
I've finally managed to get together a patch that seems to provide stable pages
during writeback, or at least gets us to the point that after several days of
running tests I don't see DIF checksum errors anymore. :)
The last two pieces to go into this puzzle were (a) bio_integrity_prep needs to
walk the process tree to find all userland ptes that map to a particular memory
page and revoke write access, and
Hmm, did you need the bio_integrity_prep change for all the filesystems?
This should be happening already as part of using page_mkwrite.
Or more precisely page_mkclean() should do what you try to do in
bio_integrity_prep()... It would certainly be interesting (bug) if you
could write to the page after calling page_mkclean() without page_mkwrite()
being called.
Hm... in mpage_da_submit_io I see the following sequence of calls:

1. clear_page_dirty_for_io
2. possibly one of: ext4_bio_write_page or block_write_full_page.
If ext4_bio_write_page,
2a. kmem_cache_alloc
2b. set_page_writeback

Before and after #1, the page is locked but writeback is not set.

Before #2, the page must be locked and writeback must not be set, because both
of those two functions want to set the writeback bit themselves. However,
ext4_bio_write_page tries to allocate memory with GFP_NOFS, which means it can
sleep (I think).

Unfortunately, ext4_page_mkwrite will check for page locked, wait for page
writeback, and then return the page. I think it is theoretically possible for
#1 to trigger a page_mkwrite which completes before #2b, right? In which case
the thread that called mkwrite will think that the page isn't being written
out, and happily scribble on it during writeback. I could be wrong, but it
seems to me that one has to write-protect the page after setting the writeback
bit.

I guess we could call page_mkclean from bio_integrity_prep, though.

--D
Chris Mason
2011-04-26 11:33:18 UTC
Permalink
Post by Darrick J. Wong
Post by Jan Kara
Hi everyone,
I've finally managed to get together a patch that seems to provide stable pages
during writeback, or at least gets us to the point that after several days of
running tests I don't see DIF checksum errors anymore. :)
The last two pieces to go into this puzzle were (a) bio_integrity_prep needs to
walk the process tree to find all userland ptes that map to a particular memory
page and revoke write access, and
Hmm, did you need the bio_integrity_prep change for all the filesystems?
This should be happening already as part of using page_mkwrite.
Or more precisely page_mkclean() should do what you try to do in
bio_integrity_prep()... It would certainly be interesting (bug) if you
could write to the page after calling page_mkclean() without page_mkwrite()
being called.
1. clear_page_dirty_for_io
2. possibly one of: ext4_bio_write_page or block_write_full_page.
If ext4_bio_write_page,
2a. kmem_cache_alloc
2b. set_page_writeback
Before and after #1, the page is locked but writeback is not set.
Before #2, the page must be locked and writeback must not be set, because both
of those two functions want to set the writeback bit themselves. However,
ext4_bio_write_page tries to allocate memory with GFP_NOFS, which means it can
sleep (I think).
Sleeping isn't the problem as long as you sleep with the page locked.
The idea is that writepage will:

1) lock the page
2) clear_page_dirty_for_io (which calls page_mkclean)
3) set_page_writeback()
4) unlock the page
5) start the IO

page_mkwrite will:

1) lock the page
2) wait on page writeback
3) do other stuff

So if ext is calling set_page_writeback() on an unlocked page, that's a
problem. Otherwise it should be working.

-chris
Darrick J. Wong
2011-05-03 01:59:31 UTC
Permalink
Post by Chris Mason
Post by Darrick J. Wong
Post by Jan Kara
Hi everyone,
I've finally managed to get together a patch that seems to provide stable pages
during writeback, or at least gets us to the point that after several days of
running tests I don't see DIF checksum errors anymore. :)
The last two pieces to go into this puzzle were (a) bio_integrity_prep needs to
walk the process tree to find all userland ptes that map to a particular memory
page and revoke write access, and
Hmm, did you need the bio_integrity_prep change for all the filesystems?
This should be happening already as part of using page_mkwrite.
Or more precisely page_mkclean() should do what you try to do in
bio_integrity_prep()... It would certainly be interesting (bug) if you
could write to the page after calling page_mkclean() without page_mkwrite()
being called.
1. clear_page_dirty_for_io
2. possibly one of: ext4_bio_write_page or block_write_full_page.
If ext4_bio_write_page,
2a. kmem_cache_alloc
2b. set_page_writeback
Before and after #1, the page is locked but writeback is not set.
Before #2, the page must be locked and writeback must not be set, because both
of those two functions want to set the writeback bit themselves. However,
ext4_bio_write_page tries to allocate memory with GFP_NOFS, which means it can
sleep (I think).
Sleeping isn't the problem as long as you sleep with the page locked.
1) lock the page
2) clear_page_dirty_for_io (which calls page_mkclean)
3) set_page_writeback()
4) unlock the page
5) start the IO
1) lock the page
2) wait on page writeback
3) do other stuff
So if ext is calling set_page_writeback() on an unlocked page, that's a
problem. Otherwise it should be working.
You're right, at this point in time writepage and page_mkwrite in ext4 both
behave as you describe. I began backing out parts of my patches to
bio-integrity.c and discovered that with the current kernel (2.6.39-rc5) the
only part that seems useful is the set_memory_ro/rw pair from that old
debugging patch. Unfortunately, those two functions only seem to exist on x86;
I suppose I could port them to others. If that's even a sane idea.

--D
Post by Chris Mason
-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Darrick J. Wong
2011-05-04 01:26:21 UTC
Permalink
Post by Darrick J. Wong
Post by Chris Mason
Post by Darrick J. Wong
Post by Jan Kara
Hi everyone,
I've finally managed to get together a patch that seems to provide stable pages
during writeback, or at least gets us to the point that after several days of
running tests I don't see DIF checksum errors anymore. :)
The last two pieces to go into this puzzle were (a) bio_integrity_prep needs to
walk the process tree to find all userland ptes that map to a particular memory
page and revoke write access, and
Hmm, did you need the bio_integrity_prep change for all the filesystems?
This should be happening already as part of using page_mkwrite.
Or more precisely page_mkclean() should do what you try to do in
bio_integrity_prep()... It would certainly be interesting (bug) if you
could write to the page after calling page_mkclean() without page_mkwrite()
being called.
1. clear_page_dirty_for_io
2. possibly one of: ext4_bio_write_page or block_write_full_page.
If ext4_bio_write_page,
2a. kmem_cache_alloc
2b. set_page_writeback
Before and after #1, the page is locked but writeback is not set.
Before #2, the page must be locked and writeback must not be set, because both
of those two functions want to set the writeback bit themselves. However,
ext4_bio_write_page tries to allocate memory with GFP_NOFS, which means it can
sleep (I think).
Sleeping isn't the problem as long as you sleep with the page locked.
1) lock the page
2) clear_page_dirty_for_io (which calls page_mkclean)
3) set_page_writeback()
4) unlock the page
5) start the IO
1) lock the page
2) wait on page writeback
3) do other stuff
So if ext is calling set_page_writeback() on an unlocked page, that's a
problem. Otherwise it should be working.
You're right, at this point in time writepage and page_mkwrite in ext4 both
behave as you describe. I began backing out parts of my patches to
bio-integrity.c and discovered that with the current kernel (2.6.39-rc5) the
only part that seems useful is the set_memory_ro/rw pair from that old
debugging patch. Unfortunately, those two functions only seem to exist on x86;
I suppose I could port them to others. If that's even a sane idea.
Never mind, I looked around for anything that would result in the kernel trying
to map a page for the purpose of writing, and I think adding a
wait_on_writeback to get_cache_page_for_write will solve that last hole without
having to port set_memory_* to other arches. With that, a whole lot of code
falls out of the patches, which I will post shortly.

--D
Post by Darrick J. Wong
--D
Post by Chris Mason
-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jan Kara
2011-04-26 11:37:32 UTC
Permalink
Post by Darrick J. Wong
Post by Jan Kara
Hi everyone,
I've finally managed to get together a patch that seems to provide stable pages
during writeback, or at least gets us to the point that after several days of
running tests I don't see DIF checksum errors anymore. :)
The last two pieces to go into this puzzle were (a) bio_integrity_prep needs to
walk the process tree to find all userland ptes that map to a particular memory
page and revoke write access, and
Hmm, did you need the bio_integrity_prep change for all the filesystems?
This should be happening already as part of using page_mkwrite.
Or more precisely page_mkclean() should do what you try to do in
bio_integrity_prep()... It would certainly be interesting (bug) if you
could write to the page after calling page_mkclean() without page_mkwrite()
being called.
1. clear_page_dirty_for_io
2. possibly one of: ext4_bio_write_page or block_write_full_page.
If ext4_bio_write_page,
2a. kmem_cache_alloc
2b. set_page_writeback
Before and after #1, the page is locked but writeback is not set.
Before #2, the page must be locked and writeback must not be set, because both
of those two functions want to set the writeback bit themselves. However,
ext4_bio_write_page tries to allocate memory with GFP_NOFS, which means it can
sleep (I think).
Yes, it can sleep. But the page remains locked until we set page as
writeback in both cases.
Post by Darrick J. Wong
Unfortunately, ext4_page_mkwrite will check for page locked, wait for
page writeback, and then return the page. I think it is theoretically
possible for #1 to trigger a page_mkwrite which completes before #2b,
right?
I'm not sure I understand but once the page is locked by ext4_writepages()
before #1, ext4_page_mkwrite() will block until it can get the page lock -
which can happen only after set_page_writeback() in #2 is done. So then
ext4_page_mkwrite() will block waiting for PageWriteback which gets cleared
after the IO is finished... Or did you mean something else?
Post by Darrick J. Wong
In which case the thread that called mkwrite will think that the
page isn't being written out, and happily scribble on it during
writeback. I could be wrong, but it seems to me that one has to
write-protect the page after setting the writeback bit.
Honza
--
Jan Kara <***@suse.cz>
SUSE Labs, CR
Darrick J. Wong
2011-05-04 17:37:04 UTC
Permalink
Hi all,

This is v3 of the stable-page-writes patchset for ext4. A lot of code has been
cut out since v2 of this patch set. For v3, the large hairy function to walk
the page tables of every process is gone since Chris Mason pointed out that
page_mkclean does what I need. The set_memory_* hack is also gone, since (I
think) the only time the kernel maps a file data blocks for writing is in the
buffered IO case. That leaves us with some surgery to ext4_page_mkwrite to
return locked pages and to be careful about re-checking the writeback status
after dropping and re-grabbing the page lock; and a slight modification to the
mm code to wait for page writeback when grabbing pages for buffered writes.
There are also some cleanups for wait_on_page_writeback use in ext4.

I ran my write-after-checksum ("wac") reproducer program to try to create the
DIF checksum errors by madly rewriting the same memory pages. In fact, I tried
the following combinations:

a. 64 write() threads + sync_file_range
b. 64 mmap write threads + msync
c. 32 write() threads + sync_file_range + 32 mmap write threads + msync
d. Same as C, but with all threads in directio mode
e. Same as A, but with all threads in directio mode
f. Same as B, but with all threads in directio mode

After some 44 hours of safety testing across 4 machines, I saw zero errors.
Before the patchset, I could run any of A-F for 10 seconds or less and have a
screen full of errors.

To assess the performance impact of stable page writes, I moved to a disk that
doesn't have DIF support so that I could measure just the impact of waiting for
writeback. I first ran wac with 64 threads madly scribbling on a 64k file and
saw about a 12% performance decrease. I then reran the wac program with 64
threads and a 64MB file and saw about the same performance numbers. I will of
course be testing a wider range of hardware now that I have a functioning patch
set, though as I suspected the patchset only seems to impact workloads that
rewrite the same memory page frequently.

As always, questions and comments are welcome; and thank you to all the
previous reviewers of this patchset!

--D

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Christoph Hellwig
2011-05-04 18:46:44 UTC
Permalink
This seems to miss out on a lot of the generic functionality like
write_cache_pages and block_page_mkwrite and just patch it into
the ext4 copy & paste variants. Please make sure your patches also
work for filesystem that use more of the generic functionality like
xfs or ext2 (the latter one might be fun for the mmap case).

Also what's the status of btrfs? I remembered there was one or two
bits missing despite doing the right thing in most areas.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Chris Mason
2011-05-04 19:21:55 UTC
Permalink
Post by Christoph Hellwig
This seems to miss out on a lot of the generic functionality like
write_cache_pages and block_page_mkwrite and just patch it into
the ext4 copy & paste variants. Please make sure your patches also
work for filesystem that use more of the generic functionality like
xfs or ext2 (the latter one might be fun for the mmap case).
Probably after the block_commit_write in block_page_mkwrite()
Another question is, do we want to introduce a wait_on_stable_page_writeback()?

This would allow us to add a check against the bdi requesting stable
pages.
Post by Christoph Hellwig
Also what's the status of btrfs? I remembered there was one or two
bits missing despite doing the right thing in most areas.
As far as I know btrfs is getting it right. The only bit missing is the
one Nick Piggin pointed out where it is possible to change mmap'd O_DIRECT
memory in flight while a DIO is in progress. Josef has a test case that
demonstrates this.

Nick had a plan to fix it, but it involved redoing the get_user_pages
api.

-chris

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Darrick J. Wong
2011-05-04 20:00:31 UTC
Permalink
Post by Chris Mason
Post by Christoph Hellwig
This seems to miss out on a lot of the generic functionality like
write_cache_pages and block_page_mkwrite and just patch it into
the ext4 copy & paste variants. Please make sure your patches also
work for filesystem that use more of the generic functionality like
xfs or ext2 (the latter one might be fun for the mmap case).
Probably after the block_commit_write in block_page_mkwrite()
Yes, I'm working on providing more generic fixes for ext3 & friends too, but
they're not really working yet, so I was posting the parts that fix ext4, since
they seem usable.
Post by Chris Mason
Another question is, do we want to introduce a wait_on_stable_page_writeback()?
This would allow us to add a check against the bdi requesting stable
pages.
Sounds like a good idea.
Post by Chris Mason
Post by Christoph Hellwig
Also what's the status of btrfs? I remembered there was one or two
bits missing despite doing the right thing in most areas.
As far as I know btrfs is getting it right. The only bit missing is the
one Nick Piggin pointed out where it is possible to change mmap'd O_DIRECT
memory in flight while a DIO is in progress. Josef has a test case that
demonstrates this.
Nick had a plan to fix it, but it involved redoing the get_user_pages
api.
I ran the same six tests A-F on btrfs and it reported -ENOSPC with 1% of the
space used, though until it did that I didn't see any checksum errors.

--D

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Darrick J. Wong
2011-05-04 23:57:06 UTC
Permalink
Post by Chris Mason
Post by Christoph Hellwig
This seems to miss out on a lot of the generic functionality like
write_cache_pages and block_page_mkwrite and just patch it into
the ext4 copy & paste variants. Please make sure your patches also
work for filesystem that use more of the generic functionality like
xfs or ext2 (the latter one might be fun for the mmap case).
Probably after the block_commit_write in block_page_mkwrite()
Another question is, do we want to introduce a wait_on_stable_page_writeback()?
Something like this here? It fixes block_page_mkwrite users and sticks in a
simple page_mkwrite for fses that don't provide one at all. From a quick wac
run it seems to make xfs work. ext2 seems to have some issues with modifying a
buffer_head's bh_data without locking the bh during the update, so I guess it
needs some review.

--D

--

fs: Modify/provide generic writepage/page_mkwrite functions to wait for writeback

Modify the generic writepage function, and add an empty page_mkwrite function,
to wait for page writeback to finish before allowing writes. This is so that
simple filesystems have stable pages during write operations.

Signed-off-by: Darrick J. Wong <***@us.ibm.com>
---

fs/buffer.c | 1 +
mm/filemap.c | 10 ++++++++++
2 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index a08bb8e..cf9a795 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2361,6 +2361,7 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
if (!ret)
ret = block_commit_write(page, 0, end);

+ wait_on_page_writeback(page);
if (unlikely(ret)) {
unlock_page(page);
if (ret == -ENOMEM)
diff --git a/mm/filemap.c b/mm/filemap.c
index c22675f..9cb4e51 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1713,8 +1713,18 @@ page_not_uptodate:
}
EXPORT_SYMBOL(filemap_fault);

+static int empty_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ struct page *page = vmf->page;
+
+ lock_page(page);
+ wait_on_page_writeback(page);
+ return VM_FAULT_LOCKED;
+}
+
const struct vm_operations_struct generic_file_vm_ops = {
.fault = filemap_fault,
+ .page_mkwrite = empty_page_mkwrite,
};

/* This is used for a general mmap of a disk file */
--

Here's the beginning of a patch against ext2 also. It fixes most of the
checksum errors when metadata writes are in progress, though I suspect there
are more spots that I haven't caught yet.

--

ext2: Lock buffer_heads during metadata update

ext2 does not protect buffer head that represent metadata against modifications
during write operations. This is a problem if the memory buffer needs to be
stable during writes; I think there are a few more spots in ext2 that need this
treatment. :)

Signed-off-by: Darrick J. Wong <***@us.ibm.com>
---

fs/ext2/inode.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 788e09a..5314b0b 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1426,6 +1426,7 @@ static int __ext2_write_inode(struct inode *inode, int do_sync)
if (IS_ERR(raw_inode))
return -EIO;

+ lock_buffer(bh);
/* For fields not not tracking in the in-memory inode,
* initialise them to zero for new inodes. */
if (ei->i_state & EXT2_STATE_NEW)
@@ -1502,6 +1503,8 @@ static int __ext2_write_inode(struct inode *inode, int do_sync)
}
} else for (n = 0; n < EXT2_N_BLOCKS; n++)
raw_inode->i_block[n] = ei->i_data[n];
+ unlock_buffer(bh);
+
mark_buffer_dirty(bh);
if (do_sync) {
sync_dirty_buffer(bh);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Jan Kara
2011-05-05 15:26:01 UTC
Permalink
Hello,
Post by Darrick J. Wong
Post by Chris Mason
Post by Christoph Hellwig
This seems to miss out on a lot of the generic functionality like
write_cache_pages and block_page_mkwrite and just patch it into
the ext4 copy & paste variants. Please make sure your patches also
work for filesystem that use more of the generic functionality like
xfs or ext2 (the latter one might be fun for the mmap case).
Probably after the block_commit_write in block_page_mkwrite()
Another question is, do we want to introduce a wait_on_stable_page_writeback()?
Something like this here? It fixes block_page_mkwrite users and sticks in a
simple page_mkwrite for fses that don't provide one at all. From a quick wac
run it seems to make xfs work. ext2 seems to have some issues with modifying a
buffer_head's bh_data without locking the bh during the update, so I guess it
needs some review.
Yes, ext2 is rather difficult because of all the metadata updates to
buffers happening. That would need a serious work I suspect.
Post by Darrick J. Wong
fs: Modify/provide generic writepage/page_mkwrite functions to wait for writeback
Modify the generic writepage function, and add an empty page_mkwrite function,
to wait for page writeback to finish before allowing writes. This is so that
simple filesystems have stable pages during write operations.
---
fs/buffer.c | 1 +
mm/filemap.c | 10 ++++++++++
2 files changed, 11 insertions(+), 0 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index a08bb8e..cf9a795 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2361,6 +2361,7 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
if (!ret)
ret = block_commit_write(page, 0, end);
+ wait_on_page_writeback(page);
if (unlikely(ret)) {
unlock_page(page);
if (ret == -ENOMEM)
diff --git a/mm/filemap.c b/mm/filemap.c
index c22675f..9cb4e51 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
}
EXPORT_SYMBOL(filemap_fault);
+static int empty_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ struct page *page = vmf->page;
+
+ lock_page(page);
+ wait_on_page_writeback(page);
+ return VM_FAULT_LOCKED;
+}
+
I guess you miss the whether the page has been truncated here (in which
case you should return VM_FAULT_NOPAGE).
Post by Darrick J. Wong
const struct vm_operations_struct generic_file_vm_ops = {
.fault = filemap_fault,
+ .page_mkwrite = empty_page_mkwrite,
};
/* This is used for a general mmap of a disk file */
Honza
--
Jan Kara <***@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Darrick J. Wong
2011-05-04 17:39:17 UTC
Permalink
wait_on_page_writeback already checks the writeback bit, so callers of it
needn't do that test.

Signed-off-by: Darrick J. Wong <***@us.ibm.com>
---

fs/ext4/inode.c | 4 +---
fs/ext4/move_extent.c | 3 +--
2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f2fa5e8..3db34b2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2796,9 +2796,7 @@ static int write_cache_pages_da(struct address_space *mapping,
continue;
}

- if (PageWriteback(page))
- wait_on_page_writeback(page);
-
+ wait_on_page_writeback(page);
BUG_ON(PageWriteback(page));

if (mpd->next_page != page->index)
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index d5c5783..d1548b1 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -876,8 +876,7 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
* It needs to call wait_on_page_writeback() to wait for the
* writeback of the page.
*/
- if (PageWriteback(page))
- wait_on_page_writeback(page);
+ wait_on_page_writeback(page);

/* Release old bh and drop refs */
try_to_release_page(page, 0);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Darrick J. Wong
2011-05-04 17:41:36 UTC
Permalink
In order to stabilize pages during disk writes, ext4_page_mkwrite must wait for
writeback operations to complete before making a page writable. Furthermore,
the function must return locked pages, and recheck the writeback status if the
page lock is ever dropped.

Signed-off-by: Darrick J. Wong <***@us.ibm.com>
---

fs/ext4/inode.c | 24 +++++++++++++++++++-----
1 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3db34b2..1d162a2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5809,15 +5809,19 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
goto out_unlock;
}
ret = 0;
- if (PageMappedToDisk(page))
- goto out_unlock;
+
+ lock_page(page);
+ wait_on_page_writeback(page);
+ if (PageMappedToDisk(page)) {
+ up_read(&inode->i_alloc_sem);
+ return VM_FAULT_LOCKED;
+ }

if (page->index == size >> PAGE_CACHE_SHIFT)
len = size & ~PAGE_CACHE_MASK;
else
len = PAGE_CACHE_SIZE;

- lock_page(page);
/*
* return if we have all the buffers mapped. This avoid
* the need to call write_begin/write_end which does a
@@ -5827,8 +5831,8 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (page_has_buffers(page)) {
if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
ext4_bh_unmapped)) {
- unlock_page(page);
- goto out_unlock;
+ up_read(&inode->i_alloc_sem);
+ return VM_FAULT_LOCKED;
}
}
unlock_page(page);
@@ -5848,6 +5852,16 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (ret < 0)
goto out_unlock;
ret = 0;
+
+ /*
+ * write_begin/end might have created a dirty page and someone
+ * could wander in and start the IO. Make sure that hasn't
+ * happened.
+ */
+ lock_page(page);
+ wait_on_page_writeback(page);
+ up_read(&inode->i_alloc_sem);
+ return VM_FAULT_LOCKED;
out_unlock:
if (ret)
ret = VM_FAULT_SIGBUS;
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Darrick J. Wong
2011-05-04 17:42:27 UTC
Permalink
When grabbing a page for a buffered IO write, the mm should wait for writeback
on the page to complete so that the page does not become writable during the IO
operation.

Signed-off-by: Darrick J. Wong <***@us.ibm.com>
---

mm/filemap.c | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index c641edf..c22675f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2287,8 +2287,10 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
gfp_notmask = __GFP_FS;
repeat:
page = find_lock_page(mapping, index);
- if (page)
+ if (page) {
+ wait_on_page_writeback(page);
return page;
+ }

page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~gfp_notmask);
if (!page)
@@ -2301,6 +2303,7 @@ repeat:
goto repeat;
return NULL;
}
+ wait_on_page_writeback(page);
return page;
}
EXPORT_SYMBOL(grab_cache_page_write_begin);
Christoph Hellwig
2011-05-04 18:48:03 UTC
Permalink
Post by Darrick J. Wong
When grabbing a page for a buffered IO write, the mm should wait for writeback
on the page to complete so that the page does not become writable during the IO
operation.
---
mm/filemap.c | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index c641edf..c22675f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2287,8 +2287,10 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
gfp_notmask = __GFP_FS;
page = find_lock_page(mapping, index);
- if (page)
+ if (page) {
+ wait_on_page_writeback(page);
return page;
+ }
goto found;
Post by Darrick J. Wong
page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~gfp_notmask);
if (!page)
goto repeat;
return NULL;
}
+ wait_on_page_writeback(page);
return page;
}
EXPORT_SYMBOL(grab_cache_page_write_begin);
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>
Loading...