Discussion:
[RFC] qed: Add QEMU Enhanced Disk format
Stefan Hajnoczi
2010-09-06 10:04:38 UTC
QEMU Enhanced Disk format is a disk image format that forgoes features
found in qcow2 in favor of better performance and data integrity. Its
simpler on-disk layout makes it possible to perform metadata updates both
safely and more efficiently.

Installations, suspend-to-disk, and other allocation-heavy I/O workloads
will see increased performance due to fewer I/Os and syncs. Workloads
that do not cause new clusters to be allocated will perform similarly to
raw images thanks to in-memory metadata caching.

The format supports sparse disk images. It does not rely on the host
filesystem's support for holes, making it a good choice for sparse disk
images that need to be transferred over channels where holes are not
supported.

Backing files are supported, so only the deltas against a base image need
to be stored.

The file format is extensible so that additional features can be added
later with graceful compatibility handling.

Internal snapshots are not supported. This eliminates the need for
additional metadata to track copy-on-write clusters.

Compression and encryption are not supported. They add complexity and
can be implemented at other layers in the stack (e.g. inside the guest
or on the host).

The format is currently functional with the following features missing:
* Resizing the disk image. The capability has been designed in but the
code has not been written yet.
* Resetting the image after backing file commit completes.
* Changing the backing filename.
* Consistency check (fsck). This is straightforward to implement thanks to
the simple on-disk layout.

Signed-off-by: Anthony Liguori <***@us.ibm.com>
Signed-off-by: Stefan Hajnoczi <***@linux.vnet.ibm.com>
---
This code is also available from git (for development and testing, the
tracing and blkverify features are pulled in there, whereas this single
squashed patch applies to mainline qemu.git):

http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/qed

Numbers for a RHEL6 install with a cache=none disk image on ext3. This is an
interactive install on my laptop, so not a proper benchmark, but I want to
show that there is a real difference today:
* raw: 4m4s
* qed: 4m21s (107%)
* qcow2: 4m46s (117%)
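
As a back-of-the-envelope illustration (not part of the patch, it just
mirrors qed_max_image_size() below): with the default 64 KB clusters and
table_size=4, each table holds 32768 offsets, so one L2 table maps 2 GB and
the maximum image size works out to 64 TB.

    /* Rough sketch only; assumes the default header values from this patch */
    #include <inttypes.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t cluster_size = 64 * 1024;   /* QED_DEFAULT_CLUSTER_SIZE */
        uint32_t table_size = 4;             /* QED_DEFAULT_TABLE_SIZE, in clusters */
        uint64_t table_entries = (uint64_t)table_size * cluster_size / 8;
        uint64_t l2_size = table_entries * cluster_size;  /* bytes mapped per L2 table */
        uint64_t max_image = l2_size * table_entries;     /* 70368744177664 = 64 TB */

        printf("max image size: %" PRIu64 " bytes\n", max_image);
        return 0;
    }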

Makefile.objs | 1 +
block/qcow2.c | 22 -
block/qed-cluster.c | 136 +++++++
block/qed-gencb.c | 32 ++
block/qed-l2-cache.c | 131 ++++++
block/qed-table.c | 242 +++++++++++
block/qed.c | 1103 ++++++++++++++++++++++++++++++++++++++++++++++++++
block/qed.h | 212 ++++++++++
cutils.c | 53 +++
qemu-common.h | 3 +
10 files changed, 1913 insertions(+), 22 deletions(-)
create mode 100644 block/qed-cluster.c
create mode 100644 block/qed-gencb.c
create mode 100644 block/qed-l2-cache.c
create mode 100644 block/qed-table.c
create mode 100644 block/qed.c
create mode 100644 block/qed.h

diff --git a/Makefile.objs b/Makefile.objs
index 4a1eaa1..a5acb32 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -14,6 +14,7 @@ block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o

block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o
block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o
+block-nested-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o
block-nested-$(CONFIG_WIN32) += raw-win32.o
block-nested-$(CONFIG_POSIX) += raw-posix.o
diff --git a/block/qcow2.c b/block/qcow2.c
index a53014d..72c923a 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -767,28 +767,6 @@ static int qcow2_change_backing_file(BlockDriverState *bs,
return qcow2_update_ext_header(bs, backing_file, backing_fmt);
}

-static int get_bits_from_size(size_t size)
-{
- int res = 0;
-
- if (size == 0) {
- return -1;
- }
-
- while (size != 1) {
- /* Not a power of two */
- if (size & 1) {
- return -1;
- }
-
- size >>= 1;
- res++;
- }
-
- return res;
-}
-
-
static int preallocate(BlockDriverState *bs)
{
uint64_t nb_sectors;
diff --git a/block/qed-cluster.c b/block/qed-cluster.c
new file mode 100644
index 0000000..6deea27
--- /dev/null
+++ b/block/qed-cluster.c
@@ -0,0 +1,136 @@
+/*
+ * QEMU Enhanced Disk Format Cluster functions
+ *
+ * Copyright IBM, Corp. 2010
+ *
+ * Authors:
+ * Stefan Hajnoczi <***@linux.vnet.ibm.com>
+ * Anthony Liguori <***@us.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qed.h"
+
+/**
+ * Count the number of contiguous data clusters
+ *
+ * @s: QED state
+ * @table: L2 table
+ * @index: First cluster index
+ * @n: Maximum number of clusters
+ * @offset: Set to first cluster offset
+ *
+ * This function scans tables for contiguous allocated or free clusters.
+ */
+static unsigned int qed_count_contiguous_clusters(BDRVQEDState *s,
+ QEDTable *table,
+ unsigned int index,
+ unsigned int n,
+ uint64_t *offset)
+{
+ unsigned int end = MIN(index + n, s->table_nelems);
+ uint64_t last = table->offsets[index];
+ unsigned int i;
+
+ *offset = last;
+
+ for (i = index + 1; i < end; i++) {
+ if (last == 0) {
+ /* Counting free clusters */
+ if (table->offsets[i] != 0) {
+ break;
+ }
+ } else {
+ /* Counting allocated clusters */
+ if (table->offsets[i] != last + s->header.cluster_size) {
+ break;
+ }
+ last = table->offsets[i];
+ }
+ }
+ return i - index;
+}
+
+typedef struct {
+ BDRVQEDState *s;
+ uint64_t pos;
+ size_t len;
+
+ QEDRequest *request;
+
+ /* User callback */
+ QEDFindClusterFunc *cb;
+ void *opaque;
+} QEDFindClusterCB;
+
+static void qed_find_cluster_cb(void *opaque, int ret)
+{
+ QEDFindClusterCB *find_cluster_cb = opaque;
+ BDRVQEDState *s = find_cluster_cb->s;
+ QEDRequest *request = find_cluster_cb->request;
+ uint64_t offset = 0;
+ size_t len = 0;
+ unsigned int index;
+ unsigned int n;
+
+ if (ret) {
+ ret = QED_CLUSTER_ERROR;
+ goto out;
+ }
+
+ index = qed_l2_index(s, find_cluster_cb->pos);
+ n = qed_bytes_to_clusters(s,
+ qed_offset_into_cluster(s, find_cluster_cb->pos) +
+ find_cluster_cb->len);
+ n = qed_count_contiguous_clusters(s, request->l2_table->table,
+ index, n, &offset);
+
+ ret = offset ? QED_CLUSTER_FOUND : QED_CLUSTER_L2;
+ len = MIN(find_cluster_cb->len, n * s->header.cluster_size -
+ qed_offset_into_cluster(s, find_cluster_cb->pos));
+
+out:
+ find_cluster_cb->cb(find_cluster_cb->opaque, ret, offset, len);
+ qemu_free(find_cluster_cb);
+}
+
+/**
+ * Find the offset of a data cluster
+ *
+ * @s: QED state
+ * @pos: Byte position in device
+ * @len: Number of bytes
+ * @cb: Completion function
+ * @opaque: User data for completion function
+ */
+void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
+ size_t len, QEDFindClusterFunc *cb, void *opaque)
+{
+ QEDFindClusterCB *find_cluster_cb;
+ uint64_t l2_offset;
+
+ /* Limit length to L2 boundary. Requests are broken up at the L2 boundary
+ * so that a request acts on one L2 table at a time.
+ */
+ len = MIN(len, (((pos >> s->l1_shift) + 1) << s->l1_shift) - pos);
+
+ l2_offset = s->l1_table->offsets[qed_l1_index(s, pos)];
+ if (!l2_offset) {
+ cb(opaque, QED_CLUSTER_L1, 0, len);
+ return;
+ }
+
+ find_cluster_cb = qemu_malloc(sizeof(*find_cluster_cb));
+ find_cluster_cb->s = s;
+ find_cluster_cb->pos = pos;
+ find_cluster_cb->len = len;
+ find_cluster_cb->cb = cb;
+ find_cluster_cb->opaque = opaque;
+ find_cluster_cb->request = request;
+
+ qed_read_l2_table(s, request, l2_offset,
+ qed_find_cluster_cb, find_cluster_cb);
+}
diff --git a/block/qed-gencb.c b/block/qed-gencb.c
new file mode 100644
index 0000000..d389e12
--- /dev/null
+++ b/block/qed-gencb.c
@@ -0,0 +1,32 @@
+/*
+ * QEMU Enhanced Disk Format
+ *
+ * Copyright IBM, Corp. 2010
+ *
+ * Authors:
+ * Stefan Hajnoczi <***@linux.vnet.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qed.h"
+
+void *gencb_alloc(size_t len, BlockDriverCompletionFunc *cb, void *opaque)
+{
+ GenericCB *gencb = qemu_malloc(len);
+ gencb->cb = cb;
+ gencb->opaque = opaque;
+ return gencb;
+}
+
+void gencb_complete(void *opaque, int ret)
+{
+ GenericCB *gencb = opaque;
+ BlockDriverCompletionFunc *cb = gencb->cb;
+ void *user_opaque = gencb->opaque;
+
+ qemu_free(gencb);
+ cb(user_opaque, ret);
+}
diff --git a/block/qed-l2-cache.c b/block/qed-l2-cache.c
new file mode 100644
index 0000000..747a629
--- /dev/null
+++ b/block/qed-l2-cache.c
@@ -0,0 +1,131 @@
+/*
+ * QEMU Enhanced Disk Format L2 Cache
+ *
+ * Copyright IBM, Corp. 2010
+ *
+ * Authors:
+ * Anthony Liguori <***@us.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qed.h"
+
+/* Each L2 holds 2GB so this lets us fully cache a 100GB disk */
+#define MAX_L2_CACHE_SIZE 50
+
+/**
+ * Initialize the L2 cache
+ */
+void qed_init_l2_cache(L2TableCache *l2_cache,
+ L2TableAllocFunc *alloc_l2_table,
+ void *alloc_l2_table_opaque)
+{
+ QTAILQ_INIT(&l2_cache->entries);
+ l2_cache->n_entries = 0;
+ l2_cache->alloc_l2_table = alloc_l2_table;
+ l2_cache->alloc_l2_table_opaque = alloc_l2_table_opaque;
+}
+
+/**
+ * Free the L2 cache
+ */
+void qed_free_l2_cache(L2TableCache *l2_cache)
+{
+ CachedL2Table *entry, *next_entry;
+
+ QTAILQ_FOREACH_SAFE(entry, &l2_cache->entries, node, next_entry) {
+ qemu_free(entry->table);
+ qemu_free(entry);
+ }
+}
+
+/**
+ * Allocate an uninitialized entry from the cache
+ *
+ * The returned entry has a reference count of 1 and is owned by the caller.
+ */
+CachedL2Table *qed_alloc_l2_cache_entry(L2TableCache *l2_cache)
+{
+ CachedL2Table *entry;
+
+ entry = qemu_mallocz(sizeof(*entry));
+ entry->table = l2_cache->alloc_l2_table(l2_cache->alloc_l2_table_opaque);
+ entry->ref++;
+
+ return entry;
+}
+
+/**
+ * Decrease an entry's reference count and free if necessary when the reference
+ * count drops to zero.
+ */
+void qed_unref_l2_cache_entry(L2TableCache *l2_cache, CachedL2Table *entry)
+{
+ if (!entry) {
+ return;
+ }
+
+ entry->ref--;
+ if (entry->ref == 0) {
+ qemu_free(entry->table);
+ qemu_free(entry);
+ }
+}
+
+/**
+ * Find an entry in the L2 cache. This may return NULL and it's up to the
+ * caller to satisfy the cache miss.
+ *
+ * For a cached entry, this function increases the reference count and returns
+ * the entry.
+ */
+CachedL2Table *qed_find_l2_cache_entry(L2TableCache *l2_cache, uint64_t offset)
+{
+ CachedL2Table *entry;
+
+ QTAILQ_FOREACH(entry, &l2_cache->entries, node) {
+ if (entry->offset == offset) {
+ entry->ref++;
+ return entry;
+ }
+ }
+ return NULL;
+}
+
+/**
+ * Commit an L2 cache entry into the cache. This is meant to be used as part of
+ * the process to satisfy a cache miss. A caller would allocate an entry which
+ * is not actually in the L2 cache and then once the entry was valid and
+ * present on disk, the entry can be committed into the cache.
+ *
+ * Since the cache is write-through, it's important that this function is not
+ * called until the entry is present on disk and the L1 has been updated to
+ * point to the entry.
+ *
+ * This function will take a reference to the entry so the caller is still
+ * responsible for unreferencing the entry.
+ */
+void qed_commit_l2_cache_entry(L2TableCache *l2_cache, CachedL2Table *l2_table)
+{
+ CachedL2Table *entry;
+
+ entry = qed_find_l2_cache_entry(l2_cache, l2_table->offset);
+ if (entry) {
+ qed_unref_l2_cache_entry(l2_cache, entry);
+ return;
+ }
+
+ if (l2_cache->n_entries >= MAX_L2_CACHE_SIZE) {
+ entry = QTAILQ_FIRST(&l2_cache->entries);
+ QTAILQ_REMOVE(&l2_cache->entries, entry, node);
+ l2_cache->n_entries--;
+ qed_unref_l2_cache_entry(l2_cache, entry);
+ }
+
+ l2_table->ref++;
+ l2_cache->n_entries++;
+ QTAILQ_INSERT_TAIL(&l2_cache->entries, l2_table, node);
+}
diff --git a/block/qed-table.c b/block/qed-table.c
new file mode 100644
index 0000000..9a72582
--- /dev/null
+++ b/block/qed-table.c
@@ -0,0 +1,242 @@
+/*
+ * QEMU Enhanced Disk Format Table I/O
+ *
+ * Copyright IBM, Corp. 2010
+ *
+ * Authors:
+ * Stefan Hajnoczi <***@linux.vnet.ibm.com>
+ * Anthony Liguori <***@us.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qed.h"
+
+typedef struct {
+ GenericCB gencb;
+ BDRVQEDState *s;
+ QEDTable *table;
+
+ struct iovec iov;
+ QEMUIOVector qiov;
+} QEDReadTableCB;
+
+static void qed_read_table_cb(void *opaque, int ret)
+{
+ QEDReadTableCB *read_table_cb = opaque;
+ QEDTable *table = read_table_cb->table;
+ int noffsets = read_table_cb->iov.iov_len / sizeof(uint64_t);
+ int i;
+
+ /* Handle I/O error */
+ if (ret) {
+ goto out;
+ }
+
+ /* Byteswap offsets */
+ for (i = 0; i < noffsets; i++) {
+ table->offsets[i] = le64_to_cpu(table->offsets[i]);
+ }
+
+out:
+ /* Completion */
+ gencb_complete(&read_table_cb->gencb, ret);
+}
+
+static void qed_read_table(BDRVQEDState *s, uint64_t offset, QEDTable *table,
+ BlockDriverCompletionFunc *cb, void *opaque)
+{
+ QEDReadTableCB *read_table_cb = gencb_alloc(sizeof(*read_table_cb),
+ cb, opaque);
+ QEMUIOVector *qiov = &read_table_cb->qiov;
+ BlockDriverAIOCB *aiocb;
+
+ read_table_cb->s = s;
+ read_table_cb->table = table;
+ read_table_cb->iov.iov_base = table->offsets;
+ read_table_cb->iov.iov_len = s->header.cluster_size * s->header.table_size;
+
+ qemu_iovec_init_external(qiov, &read_table_cb->iov, 1);
+ aiocb = bdrv_aio_readv(s->bs->file, offset / BDRV_SECTOR_SIZE, qiov,
+ read_table_cb->iov.iov_len / BDRV_SECTOR_SIZE,
+ qed_read_table_cb, read_table_cb);
+ if (!aiocb) {
+ qed_read_table_cb(read_table_cb, -EIO);
+ }
+}
+
+typedef struct {
+ GenericCB gencb;
+ BDRVQEDState *s;
+ QEDTable *orig_table;
+ bool flush; /* flush after write? */
+
+ struct iovec iov;
+ QEMUIOVector qiov;
+
+ QEDTable table;
+} QEDWriteTableCB;
+
+static void qed_write_table_cb(void *opaque, int ret)
+{
+ QEDWriteTableCB *write_table_cb = opaque;
+
+ if (ret) {
+ goto out;
+ }
+
+ if (write_table_cb->flush) {
+ /* We still need to flush first */
+ write_table_cb->flush = false;
+ bdrv_aio_flush(write_table_cb->s->bs, qed_write_table_cb,
+ write_table_cb);
+ return;
+ }
+
+out:
+ gencb_complete(&write_table_cb->gencb, ret);
+ return;
+}
+
+/**
+ * Write out an updated part or all of a table
+ *
+ * @s: QED state
+ * @offset: Offset of table in image file, in bytes
+ * @table: Table
+ * @index: Index of first element
+ * @n: Number of elements
+ * @flush: Whether or not to sync to disk
+ * @cb: Completion function
+ * @opaque: Argument for completion function
+ */
+static void qed_write_table(BDRVQEDState *s, uint64_t offset, QEDTable *table,
+ unsigned int index, unsigned int n, bool flush,
+ BlockDriverCompletionFunc *cb, void *opaque)
+{
+ QEDWriteTableCB *write_table_cb;
+ BlockDriverAIOCB *aiocb;
+ unsigned int sector_mask = BDRV_SECTOR_SIZE / sizeof(uint64_t) - 1;
+ unsigned int start, end, i;
+ size_t len_bytes;
+
+ /* Calculate indices of the first and one after last elements */
+ start = index & ~sector_mask;
+ end = (index + n + sector_mask) & ~sector_mask;
+
+ len_bytes = (end - start) * sizeof(uint64_t);
+
+ write_table_cb = gencb_alloc(sizeof(*write_table_cb) + len_bytes,
+ cb, opaque);
+ write_table_cb->s = s;
+ write_table_cb->orig_table = table;
+ write_table_cb->flush = flush;
+ write_table_cb->iov.iov_base = write_table_cb->table.offsets;
+ write_table_cb->iov.iov_len = len_bytes;
+ qemu_iovec_init_external(&write_table_cb->qiov, &write_table_cb->iov, 1);
+
+ /* Byteswap table */
+ for (i = start; i < end; i++) {
+ write_table_cb->table.offsets[i - start] = cpu_to_le64(table->offsets[i]);
+ }
+
+ /* Adjust for offset into table */
+ offset += start * sizeof(uint64_t);
+
+ aiocb = bdrv_aio_writev(s->bs->file, offset / BDRV_SECTOR_SIZE,
+ &write_table_cb->qiov,
+ write_table_cb->iov.iov_len / BDRV_SECTOR_SIZE,
+ qed_write_table_cb, write_table_cb);
+ if (!aiocb) {
+ qed_write_table_cb(write_table_cb, -EIO);
+ }
+}
+
+static void qed_read_l1_table_cb(void *opaque, int ret)
+{
+ *(int *)opaque = ret;
+}
+
+/**
+ * Read the L1 table synchronously
+ */
+int qed_read_l1_table(BDRVQEDState *s)
+{
+ int ret = -EINPROGRESS;
+
+ /* TODO push/pop async context? */
+
+ qed_read_table(s, s->header.l1_table_offset,
+ s->l1_table, qed_read_l1_table_cb, &ret);
+ while (ret == -EINPROGRESS) {
+ qemu_aio_wait();
+ }
+ return ret;
+}
+
+void qed_write_l1_table(BDRVQEDState *s, unsigned int index, unsigned int n,
+ BlockDriverCompletionFunc *cb, void *opaque)
+{
+ qed_write_table(s, s->header.l1_table_offset,
+ s->l1_table, index, n, false, cb, opaque);
+}
+
+typedef struct {
+ GenericCB gencb;
+ BDRVQEDState *s;
+ uint64_t l2_offset;
+ QEDRequest *request;
+} QEDReadL2TableCB;
+
+static void qed_read_l2_table_cb(void *opaque, int ret)
+{
+ QEDReadL2TableCB *read_l2_table_cb = opaque;
+ QEDRequest *request = read_l2_table_cb->request;
+ BDRVQEDState *s = read_l2_table_cb->s;
+
+ if (ret) {
+ /* can't trust loaded L2 table anymore */
+ qed_unref_l2_cache_entry(&s->l2_cache, request->l2_table);
+ request->l2_table = NULL;
+ } else {
+ request->l2_table->offset = read_l2_table_cb->l2_offset;
+ qed_commit_l2_cache_entry(&s->l2_cache, request->l2_table);
+ }
+
+ gencb_complete(&read_l2_table_cb->gencb, ret);
+}
+
+void qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset,
+ BlockDriverCompletionFunc *cb, void *opaque)
+{
+ QEDReadL2TableCB *read_l2_table_cb;
+
+ qed_unref_l2_cache_entry(&s->l2_cache, request->l2_table);
+
+ /* Check for cached L2 entry */
+ request->l2_table = qed_find_l2_cache_entry(&s->l2_cache, offset);
+ if (request->l2_table) {
+ cb(opaque, 0);
+ return;
+ }
+
+ request->l2_table = qed_alloc_l2_cache_entry(&s->l2_cache);
+
+ read_l2_table_cb = gencb_alloc(sizeof(*read_l2_table_cb), cb, opaque);
+ read_l2_table_cb->s = s;
+ read_l2_table_cb->l2_offset = offset;
+ read_l2_table_cb->request = request;
+
+ qed_read_table(s, offset, request->l2_table->table,
+ qed_read_l2_table_cb, read_l2_table_cb);
+}
+
+void qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
+ unsigned int index, unsigned int n, bool flush,
+ BlockDriverCompletionFunc *cb, void *opaque)
+{
+ qed_write_table(s, request->l2_table->offset,
+ request->l2_table->table, index, n, flush, cb, opaque);
+}
diff --git a/block/qed.c b/block/qed.c
new file mode 100644
index 0000000..cf64418
--- /dev/null
+++ b/block/qed.c
@@ -0,0 +1,1103 @@
+/*
+ * QEMU Enhanced Disk Format
+ *
+ * Copyright IBM, Corp. 2010
+ *
+ * Authors:
+ * Stefan Hajnoczi <***@linux.vnet.ibm.com>
+ * Anthony Liguori <***@us.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qed.h"
+
+/* TODO blkdebug support */
+/* TODO BlockDriverState::buffer_alignment */
+/* TODO check L2 table sizes before accessing them? */
+/* TODO skip zero prefill since the filesystem should zero the sectors anyway */
+/* TODO if a table element's offset is invalid then the image is broken. If
+ * there was a power failure and the table update reached storage but the data
+ * being pointed to did not, forget about the lost data by clearing the offset.
+ * However, need to be careful to detect invalid offsets for tables that are
+ * read *after* more clusters have been allocated. */
+
+enum {
+ QED_MAGIC = 'Q' | 'E' << 8 | 'D' << 16 | '\0' << 24,
+
+ /* The image supports a backing file */
+ QED_F_BACKING_FILE = 0x01,
+
+ /* The image has the backing file format */
+ QED_CF_BACKING_FORMAT = 0x01,
+
+ /* Feature bits must be used when the on-disk format changes */
+ QED_FEATURE_MASK = QED_F_BACKING_FILE, /* supported feature bits */
+ QED_COMPAT_FEATURE_MASK = QED_CF_BACKING_FORMAT, /* supported compat feature bits */
+
+ /* Data is stored in groups of sectors called clusters. Cluster size must
+ * be large to avoid keeping too much metadata. I/O requests that have
+ * sub-cluster size will require read-modify-write.
+ */
+ QED_MIN_CLUSTER_SIZE = 4 * 1024, /* in bytes */
+ QED_MAX_CLUSTER_SIZE = 64 * 1024 * 1024,
+ QED_DEFAULT_CLUSTER_SIZE = 64 * 1024,
+
+ /* Allocated clusters are tracked using a 2-level pagetable. Table size is
+ * a multiple of clusters so large maximum image sizes can be supported
+ * without jacking up the cluster size too much.
+ */
+ QED_MIN_TABLE_SIZE = 1, /* in clusters */
+ QED_MAX_TABLE_SIZE = 16,
+ QED_DEFAULT_TABLE_SIZE = 4,
+};
+
+static void qed_aio_cancel(BlockDriverAIOCB *acb)
+{
+ qemu_aio_release(acb);
+}
+
+static AIOPool qed_aio_pool = {
+ .aiocb_size = sizeof(QEDAIOCB),
+ .cancel = qed_aio_cancel,
+};
+
+/**
+ * Allocate memory that satisfies image file and backing file alignment requirements
+ *
+ * TODO make this common and consider propagating max buffer_alignment to the root image
+ */
+static void *qed_memalign(BDRVQEDState *s, size_t len)
+{
+ size_t align = s->bs->file->buffer_alignment;
+ BlockDriverState *backing_hd = s->bs->backing_hd;
+
+ if (backing_hd && backing_hd->buffer_alignment > align) {
+ align = backing_hd->buffer_alignment;
+ }
+
+ return qemu_memalign(align, len);
+}
+
+static int bdrv_qed_probe(const uint8_t *buf, int buf_size,
+ const char *filename)
+{
+ const QEDHeader *header = (const void *)buf;
+
+ if (buf_size < sizeof(*header)) {
+ return 0;
+ }
+ if (le32_to_cpu(header->magic) != QED_MAGIC) {
+ return 0;
+ }
+ return 100;
+}
+
+static void qed_header_le_to_cpu(const QEDHeader *le, QEDHeader *cpu)
+{
+ cpu->magic = le32_to_cpu(le->magic);
+ cpu->cluster_size = le32_to_cpu(le->cluster_size);
+ cpu->table_size = le32_to_cpu(le->table_size);
+ cpu->first_cluster = le32_to_cpu(le->first_cluster);
+ cpu->features = le64_to_cpu(le->features);
+ cpu->compat_features = le64_to_cpu(le->compat_features);
+ cpu->l1_table_offset = le64_to_cpu(le->l1_table_offset);
+ cpu->image_size = le64_to_cpu(le->image_size);
+ cpu->backing_file_offset = le32_to_cpu(le->backing_file_offset);
+ cpu->backing_file_size = le32_to_cpu(le->backing_file_size);
+ cpu->backing_fmt_offset = le32_to_cpu(le->backing_fmt_offset);
+ cpu->backing_fmt_size = le32_to_cpu(le->backing_fmt_size);
+}
+
+static void qed_header_cpu_to_le(const QEDHeader *cpu, QEDHeader *le)
+{
+ le->magic = cpu_to_le32(cpu->magic);
+ le->cluster_size = cpu_to_le32(cpu->cluster_size);
+ le->table_size = cpu_to_le32(cpu->table_size);
+ le->first_cluster = cpu_to_le32(cpu->first_cluster);
+ le->features = cpu_to_le64(cpu->features);
+ le->compat_features = cpu_to_le64(cpu->compat_features);
+ le->l1_table_offset = cpu_to_le64(cpu->l1_table_offset);
+ le->image_size = cpu_to_le64(cpu->image_size);
+ le->backing_file_offset = cpu_to_le32(cpu->backing_file_offset);
+ le->backing_file_size = cpu_to_le32(cpu->backing_file_size);
+ le->backing_fmt_offset = cpu_to_le32(cpu->backing_fmt_offset);
+ le->backing_fmt_size = cpu_to_le32(cpu->backing_fmt_size);
+}
+
+static uint64_t qed_max_image_size(uint32_t cluster_size, uint32_t table_size)
+{
+ uint64_t table_entries;
+ uint64_t l2_size;
+
+ table_entries = (table_size * cluster_size) / 8;
+ l2_size = table_entries * cluster_size;
+
+ return l2_size * table_entries;
+}
+
+static bool qed_is_cluster_size_valid(uint32_t cluster_size)
+{
+ if (cluster_size < QED_MIN_CLUSTER_SIZE ||
+ cluster_size > QED_MAX_CLUSTER_SIZE) {
+ return false;
+ }
+ if (cluster_size & (cluster_size - 1)) {
+ return false; /* not power of 2 */
+ }
+ return true;
+}
+
+static bool qed_is_table_size_valid(uint32_t table_size)
+{
+ if (table_size < QED_MIN_TABLE_SIZE ||
+ table_size > QED_MAX_TABLE_SIZE) {
+ return false;
+ }
+ if (table_size & (table_size - 1)) {
+ return false; /* not power of 2 */
+ }
+ return true;
+}
+
+static bool qed_is_image_size_valid(uint64_t image_size, uint32_t cluster_size,
+ uint32_t table_size)
+{
+ if (image_size == 0) {
+ /* Supporting zero size images makes life harder because even the L1
+ * table is not needed. Make life simple and forbid zero size images.
+ */
+ return false;
+ }
+ if (image_size & (cluster_size - 1)) {
+ return false; /* not multiple of cluster size */
+ }
+ if (image_size > qed_max_image_size(cluster_size, table_size)) {
+ return false; /* image is too large */
+ }
+ return true;
+}
+
+/**
+ * Test if a byte offset is cluster aligned and within the image file
+ */
+static bool qed_check_byte_offset(BDRVQEDState *s, uint64_t offset)
+{
+ if (offset & (s->header.cluster_size - 1)) {
+ return false;
+ }
+ if (offset == 0) {
+ return false; /* first cluster contains the header and is not valid */
+ }
+ return offset < s->file_size;
+}
+
+/**
+ * Read a string of known length from the image file
+ *
+ * @file: Image file
+ * @offset: File offset to start of string, in bytes
+ * @n: String length in bytes
+ * @buf: Destination buffer
+ * @buflen: Destination buffer length in bytes
+ *
+ * The string is NUL-terminated.
+ */
+static int qed_read_string(BlockDriverState *file, uint64_t offset, size_t n,
+ char *buf, size_t buflen)
+{
+ int ret;
+ if (n >= buflen) {
+ return -EINVAL;
+ }
+ ret = bdrv_pread(file, offset, buf, n);
+ if (ret != n) {
+ return ret;
+ }
+ buf[n] = '\0';
+ return 0;
+}
+
+/**
+ * Allocate new clusters
+ *
+ * @s: QED state
+ * @n: Number of contiguous clusters to allocate
+ * @offset: Offset of first allocated cluster, filled in on success
+ */
+static int qed_alloc_clusters(BDRVQEDState *s, unsigned int n, uint64_t *offset)
+{
+ *offset = s->file_size;
+ s->file_size += n * s->header.cluster_size;
+ return 0;
+}
+
+static QEDTable *qed_alloc_table(void *opaque)
+{
+ BDRVQEDState *s = opaque;
+
+ /* Honor O_DIRECT memory alignment requirements */
+ return qed_memalign(s, s->header.cluster_size * s->header.table_size);
+}
+
+/**
+ * Allocate a new zeroed L2 table
+ */
+static CachedL2Table *qed_new_l2_table(BDRVQEDState *s)
+{
+ uint64_t offset;
+ int ret;
+ CachedL2Table *l2_table;
+
+ ret = qed_alloc_clusters(s, s->header.table_size, &offset);
+ if (ret) {
+ return NULL;
+ }
+
+ l2_table = qed_alloc_l2_cache_entry(&s->l2_cache);
+ l2_table->offset = offset;
+
+ memset(l2_table->table->offsets, 0,
+ s->header.cluster_size * s->header.table_size);
+ return l2_table;
+}
+
+static int bdrv_qed_open(BlockDriverState *bs, int flags)
+{
+ BDRVQEDState *s = bs->opaque;
+ QEDHeader le_header;
+ int64_t file_size;
+ int ret;
+
+ s->bs = bs;
+ QSIMPLEQ_INIT(&s->allocating_write_reqs);
+
+ ret = bdrv_pread(bs->file, 0, &le_header, sizeof(le_header));
+ if (ret != sizeof(le_header)) {
+ return ret;
+ }
+ qed_header_le_to_cpu(&le_header, &s->header);
+
+ if (s->header.magic != QED_MAGIC) {
+ return -ENOENT;
+ }
+ if (s->header.features & ~QED_FEATURE_MASK) {
+ return -ENOTSUP; /* image uses unsupported feature bits */
+ }
+ if (!qed_is_cluster_size_valid(s->header.cluster_size)) {
+ return -EINVAL;
+ }
+
+ /* Round up file size to the next cluster */
+ file_size = bdrv_getlength(bs->file);
+ if (file_size < 0) {
+ return file_size;
+ }
+ s->file_size = qed_start_of_cluster(s, file_size + s->header.cluster_size - 1);
+
+ if (!qed_is_table_size_valid(s->header.table_size)) {
+ return -EINVAL;
+ }
+ if (!qed_is_image_size_valid(s->header.image_size,
+ s->header.cluster_size,
+ s->header.table_size)) {
+ return -EINVAL;
+ }
+ if (!qed_check_byte_offset(s, s->header.l1_table_offset)) {
+ return -EINVAL;
+ }
+
+ s->table_nelems = (s->header.cluster_size * s->header.table_size) /
+ sizeof(s->l1_table->offsets[0]);
+ s->l2_shift = get_bits_from_size(s->header.cluster_size);
+ s->l2_mask = s->table_nelems - 1;
+ s->l1_shift = s->l2_shift + get_bits_from_size(s->l2_mask + 1);
+
+ if ((s->header.features & QED_F_BACKING_FILE)) {
+ ret = qed_read_string(bs->file, s->header.backing_file_offset,
+ s->header.backing_file_size, bs->backing_file,
+ sizeof(bs->backing_file));
+ if (ret < 0) {
+ return ret;
+ }
+
+ if ((s->header.compat_features & QED_CF_BACKING_FORMAT)) {
+ ret = qed_read_string(bs->file, s->header.backing_fmt_offset,
+ s->header.backing_fmt_size,
+ bs->backing_format,
+ sizeof(bs->backing_format));
+ if (ret < 0) {
+ return ret;
+ }
+ }
+ }
+
+ s->l1_table = qed_alloc_table(s);
+ qed_init_l2_cache(&s->l2_cache, qed_alloc_table, s);
+
+ ret = qed_read_l1_table(s);
+ if (ret) {
+ qed_free_l2_cache(&s->l2_cache);
+ qemu_free(s->l1_table);
+ }
+ return ret;
+}
+
+static void bdrv_qed_close(BlockDriverState *bs)
+{
+ BDRVQEDState *s = bs->opaque;
+
+ qed_free_l2_cache(&s->l2_cache);
+ qemu_free(s->l1_table);
+}
+
+static void bdrv_qed_flush(BlockDriverState *bs)
+{
+ bdrv_flush(bs->file);
+}
+
+static int qed_create(const char *filename, uint32_t cluster_size,
+ uint64_t image_size, uint32_t table_size,
+ const char *backing_file, const char *backing_fmt)
+{
+ QEDHeader header = {
+ .magic = QED_MAGIC,
+ .cluster_size = cluster_size,
+ .table_size = table_size,
+ .first_cluster = 1,
+ .features = 0,
+ .compat_features = 0,
+ .l1_table_offset = cluster_size,
+ .image_size = image_size,
+ };
+ QEDHeader le_header;
+ uint8_t *l1_table = NULL;
+ size_t l1_size = header.cluster_size * header.table_size;
+ int ret = 0;
+ int fd;
+
+ fd = open(filename, O_WRONLY | O_CREAT | O_TRUNC | O_BINARY, 0644);
+ if (fd < 0) {
+ return -errno;
+ }
+
+ if (backing_file) {
+ header.features |= QED_F_BACKING_FILE;
+ header.backing_file_offset = sizeof(le_header);
+ header.backing_file_size = strlen(backing_file);
+ if (backing_fmt) {
+ header.compat_features |= QED_CF_BACKING_FORMAT;
+ header.backing_fmt_offset = header.backing_file_offset +
+ header.backing_file_size;
+ header.backing_fmt_size = strlen(backing_fmt);
+ }
+ }
+
+ qed_header_cpu_to_le(&header, &le_header);
+ if (qemu_write_full(fd, &le_header, sizeof(le_header)) != sizeof(le_header)) {
+ ret = -errno;
+ goto out;
+ }
+ if (qemu_write_full(fd, backing_file, header.backing_file_size) != header.backing_file_size) {
+ ret = -errno;
+ goto out;
+ }
+ if (qemu_write_full(fd, backing_fmt, header.backing_fmt_size) != header.backing_fmt_size) {
+ ret = -errno;
+ goto out;
+ }
+
+ l1_table = qemu_mallocz(l1_size);
+ lseek(fd, header.l1_table_offset, SEEK_SET);
+ if (qemu_write_full(fd, l1_table, l1_size) != l1_size) {
+ ret = -errno;
+ goto out;
+ }
+
+out:
+ qemu_free(l1_table);
+ close(fd);
+ return ret;
+}
+
+static int bdrv_qed_create(const char *filename, QEMUOptionParameter *options)
+{
+ uint64_t image_size = 0;
+ uint32_t cluster_size = QED_DEFAULT_CLUSTER_SIZE;
+ uint32_t table_size = QED_DEFAULT_TABLE_SIZE;
+ const char *backing_file = NULL;
+ const char *backing_fmt = NULL;
+
+ while (options && options->name) {
+ if (!strcmp(options->name, BLOCK_OPT_SIZE)) {
+ image_size = options->value.n;
+ } else if (!strcmp(options->name, BLOCK_OPT_BACKING_FILE)) {
+ backing_file = options->value.s;
+ } else if (!strcmp(options->name, BLOCK_OPT_BACKING_FMT)) {
+ backing_fmt = options->value.s;
+ } else if (!strcmp(options->name, BLOCK_OPT_CLUSTER_SIZE)) {
+ if (options->value.n) {
+ cluster_size = options->value.n;
+ }
+ } else if (!strcmp(options->name, "table_size")) {
+ if (options->value.n) {
+ table_size = options->value.n;
+ }
+ }
+ options++;
+ }
+
+ if (!qed_is_cluster_size_valid(cluster_size)) {
+ fprintf(stderr, "QED cluster size must be within range [%u, %u] and power of 2\n",
+ QED_MIN_CLUSTER_SIZE, QED_MAX_CLUSTER_SIZE);
+ return -EINVAL;
+ }
+ if (!qed_is_table_size_valid(table_size)) {
+ fprintf(stderr, "QED table size must be within range [%u, %u] and power of 2\n",
+ QED_MIN_TABLE_SIZE, QED_MAX_TABLE_SIZE);
+ return -EINVAL;
+ }
+ if (!qed_is_image_size_valid(image_size, cluster_size, table_size)) {
+ fprintf(stderr,
+ "QED image size must be a non-zero multiple of cluster size and less than %s\n",
+ bytes_to_str(qed_max_image_size(cluster_size, table_size)));
+ return -EINVAL;
+ }
+
+ return qed_create(filename, cluster_size, image_size, table_size,
+ backing_file, backing_fmt);
+}
+
+typedef struct {
+ int is_allocated;
+ int *pnum;
+} QEDIsAllocatedCB;
+
+static void qed_is_allocated_cb(void *opaque, int ret, uint64_t offset, size_t len)
+{
+ QEDIsAllocatedCB *cb = opaque;
+ *cb->pnum = len / BDRV_SECTOR_SIZE;
+ cb->is_allocated = ret == QED_CLUSTER_FOUND;
+}
+
+static int bdrv_qed_is_allocated(BlockDriverState *bs, int64_t sector_num,
+ int nb_sectors, int *pnum)
+{
+ BDRVQEDState *s = bs->opaque;
+ uint64_t pos = (uint64_t)sector_num * BDRV_SECTOR_SIZE;
+ size_t len = (size_t)nb_sectors * BDRV_SECTOR_SIZE;
+ QEDIsAllocatedCB cb = {
+ .is_allocated = -1,
+ .pnum = pnum,
+ };
+ QEDRequest request = { .l2_table = NULL };
+
+ /* TODO push/pop async context? */
+
+ qed_find_cluster(s, &request, pos, len, qed_is_allocated_cb, &cb);
+
+ while (cb.is_allocated == -1) {
+ qemu_aio_wait();
+ }
+
+ qed_unref_l2_cache_entry(&s->l2_cache, request.l2_table);
+
+ return cb.is_allocated;
+}
+
+static int bdrv_qed_make_empty(BlockDriverState *bs)
+{
+ return -ENOTSUP; /* TODO */
+}
+
+static BDRVQEDState *acb_to_s(QEDAIOCB *acb)
+{
+ return acb->common.bs->opaque;
+}
+
+typedef struct {
+ GenericCB gencb;
+ BDRVQEDState *s;
+ QEMUIOVector qiov;
+ struct iovec iov;
+ uint64_t offset;
+} CopyFromBackingFileCB;
+
+static void qed_copy_from_backing_file_cb(void *opaque, int ret)
+{
+ CopyFromBackingFileCB *copy_cb = opaque;
+ qemu_vfree(copy_cb->iov.iov_base);
+ gencb_complete(&copy_cb->gencb, ret);
+}
+
+static void qed_copy_from_backing_file_write(void *opaque, int ret)
+{
+ CopyFromBackingFileCB *copy_cb = opaque;
+ BDRVQEDState *s = copy_cb->s;
+ BlockDriverAIOCB *aiocb;
+
+ if (ret) {
+ qed_copy_from_backing_file_cb(copy_cb, ret);
+ return;
+ }
+
+ aiocb = bdrv_aio_writev(s->bs->file, copy_cb->offset / BDRV_SECTOR_SIZE,
+ &copy_cb->qiov,
+ copy_cb->qiov.size / BDRV_SECTOR_SIZE,
+ qed_copy_from_backing_file_cb, copy_cb);
+ if (!aiocb) {
+ qed_copy_from_backing_file_cb(copy_cb, -EIO);
+ }
+}
+
+/**
+ * Copy data from backing file into the image
+ *
+ * @s: QED state
+ * @pos: Byte position in device
+ * @len: Number of bytes
+ * @offset: Byte offset in image file
+ * @cb: Completion function
+ * @opaque: User data for completion function
+ */
+static void qed_copy_from_backing_file(BDRVQEDState *s, uint64_t pos,
+ uint64_t len, uint64_t offset,
+ BlockDriverCompletionFunc *cb,
+ void *opaque)
+{
+ CopyFromBackingFileCB *copy_cb;
+ BlockDriverAIOCB *aiocb;
+
+ /* Skip copy entirely if there is no work to do */
+ if (len == 0) {
+ cb(opaque, 0);
+ return;
+ }
+
+ copy_cb = gencb_alloc(sizeof(*copy_cb), cb, opaque);
+ copy_cb->s = s;
+ copy_cb->offset = offset;
+ copy_cb->iov.iov_base = qed_memalign(s, len);
+ copy_cb->iov.iov_len = len;
+ qemu_iovec_init_external(&copy_cb->qiov, &copy_cb->iov, 1);
+
+ /* Zero sectors if there is no backing file */
+ if (!s->bs->backing_hd) {
+ memset(copy_cb->iov.iov_base, 0, len);
+ qed_copy_from_backing_file_write(copy_cb, 0);
+ return;
+ }
+
+ aiocb = bdrv_aio_readv(s->bs->backing_hd, pos / BDRV_SECTOR_SIZE,
+ &copy_cb->qiov, len / BDRV_SECTOR_SIZE,
+ qed_copy_from_backing_file_write, copy_cb);
+ if (!aiocb) {
+ qed_copy_from_backing_file_cb(copy_cb, -EIO);
+ }
+}
+
+/**
+ * Link one or more contiguous clusters into a table
+ *
+ * @s: QED state
+ * @table: L2 table
+ * @index: First cluster index
+ * @n: Number of contiguous clusters
+ * @cluster: First cluster byte offset in image file
+ */
+static void qed_update_l2_table(BDRVQEDState *s, QEDTable *table, int index,
+ unsigned int n, uint64_t cluster)
+{
+ int i;
+ for (i = index; i < index + n; i++) {
+ table->offsets[i] = cluster;
+ cluster += s->header.cluster_size;
+ }
+}
+
+static void qed_aio_next_io(void *opaque, int ret);
+
+static void qed_aio_complete_bh(void *opaque)
+{
+ QEDAIOCB *acb = opaque;
+ BlockDriverCompletionFunc *cb = acb->common.cb;
+ void *user_opaque = acb->common.opaque;
+ int ret = acb->bh_ret;
+
+ qemu_bh_delete(acb->bh);
+ qemu_aio_release(acb);
+
+ /* Invoke callback */
+ cb(user_opaque, ret);
+}
+
+static void qed_aio_complete(QEDAIOCB *acb, int ret)
+{
+ BDRVQEDState *s = acb_to_s(acb);
+
+ /* Free resources */
+ qemu_iovec_destroy(&acb->cur_qiov);
+ qed_unref_l2_cache_entry(&s->l2_cache, acb->request.l2_table);
+
+ /* Arrange for a bh to invoke the completion function */
+ acb->bh_ret = ret;
+ acb->bh = qemu_bh_new(qed_aio_complete_bh, acb);
+ qemu_bh_schedule(acb->bh);
+
+ /* Start next allocating write request waiting behind this one. Note that
+ * requests enqueue themselves when they first hit an unallocated cluster
+ * but they wait until the entire request is finished before waking up the
+ * next request in the queue. This ensures that we don't cycle through
+ * requests multiple times but rather finish one at a time completely.
+ */
+ if (acb == QSIMPLEQ_FIRST(&s->allocating_write_reqs)) {
+ QSIMPLEQ_REMOVE_HEAD(&s->allocating_write_reqs, next);
+ acb = QSIMPLEQ_FIRST(&s->allocating_write_reqs);
+ if (acb) {
+ qed_aio_next_io(acb, 0);
+ }
+ }
+}
+
+/**
+ * Construct an iovec array for the current cluster
+ *
+ * @acb: I/O request
+ * @len: Maximum number of bytes
+ */
+static void qed_acb_build_qiov(QEDAIOCB *acb, size_t len)
+{
+ struct iovec *iov_end = &acb->qiov->iov[acb->qiov->niov];
+ size_t iov_offset = acb->cur_iov_offset;
+ struct iovec *iov = acb->cur_iov;
+
+ /* Fill in one cluster's worth of iovecs */
+ while (iov != iov_end && len > 0) {
+ size_t nbytes = MIN(iov->iov_len - iov_offset, len);
+
+ qemu_iovec_add(&acb->cur_qiov, iov->iov_base + iov_offset, nbytes);
+ iov_offset += nbytes;
+ len -= nbytes;
+
+ if (iov_offset >= iov->iov_len) {
+ iov_offset = 0;
+ iov++;
+ }
+ }
+
+ /* Stash state for next time */
+ acb->cur_iov = iov;
+ acb->cur_iov_offset = iov_offset;
+}
+
+/**
+ * Commit the current L2 table to the cache
+ */
+static void qed_commit_l2_update(void *opaque, int ret)
+{
+ QEDAIOCB *acb = opaque;
+ BDRVQEDState *s = acb_to_s(acb);
+
+ qed_commit_l2_cache_entry(&s->l2_cache, acb->request.l2_table);
+ qed_aio_next_io(opaque, ret);
+}
+
+/**
+ * Update L1 table with new L2 table offset and write it out
+ */
+static void qed_aio_write_l1_update(void *opaque, int ret)
+{
+ QEDAIOCB *acb = opaque;
+ BDRVQEDState *s = acb_to_s(acb);
+ int index;
+
+ if (ret) {
+ qed_aio_complete(acb, ret);
+ return;
+ }
+
+ index = qed_l1_index(s, acb->cur_pos);
+ s->l1_table->offsets[index] = acb->request.l2_table->offset;
+
+ qed_write_l1_table(s, index, 1, qed_commit_l2_update, acb);
+}
+
+/**
+ * Update L2 table with new cluster offsets and write them out
+ */
+static void qed_aio_write_l2_update(void *opaque, int ret)
+{
+ QEDAIOCB *acb = opaque;
+ BDRVQEDState *s = acb_to_s(acb);
+ bool need_alloc = acb->find_cluster_ret == QED_CLUSTER_L1;
+ int index;
+
+ if (ret) {
+ goto err;
+ }
+
+ if (need_alloc) {
+ qed_unref_l2_cache_entry(&s->l2_cache, acb->request.l2_table);
+ acb->request.l2_table = qed_new_l2_table(s);
+ if (!acb->request.l2_table) {
+ ret = -EIO;
+ goto err;
+ }
+ }
+
+ index = qed_l2_index(s, acb->cur_pos);
+ qed_update_l2_table(s, acb->request.l2_table->table, index, acb->cur_nclusters,
+ acb->cur_cluster);
+
+ if (need_alloc) {
+ /* Write out the whole new L2 table */
+ qed_write_l2_table(s, &acb->request, 0, s->table_nelems, true,
+ qed_aio_write_l1_update, acb);
+ } else {
+ /* Write out only the updated part of the L2 table */
+ qed_write_l2_table(s, &acb->request, index, acb->cur_nclusters, false,
+ qed_aio_next_io, acb);
+ }
+ return;
+
+err:
+ qed_aio_complete(acb, ret);
+}
+
+/**
+ * Write data to the image file
+ */
+static void qed_aio_write_main(void *opaque, int ret)
+{
+ QEDAIOCB *acb = opaque;
+ BDRVQEDState *s = acb_to_s(acb);
+ bool need_alloc = acb->find_cluster_ret != QED_CLUSTER_FOUND;
+ uint64_t offset = acb->cur_cluster;
+ BlockDriverAIOCB *file_acb;
+
+ if (ret) {
+ qed_aio_complete(acb, ret);
+ return;
+ }
+
+ offset += qed_offset_into_cluster(s, acb->cur_pos);
+ file_acb = bdrv_aio_writev(s->bs->file, offset / BDRV_SECTOR_SIZE,
+ &acb->cur_qiov,
+ acb->cur_qiov.size / BDRV_SECTOR_SIZE,
+ need_alloc ? qed_aio_write_l2_update :
+ qed_aio_next_io,
+ acb);
+ if (!file_acb) {
+ qed_aio_complete(acb, -EIO);
+ }
+}
+
+/**
+ * Populate the untouched region at the back of a new data cluster
+ */
+static void qed_aio_write_postfill(void *opaque, int ret)
+{
+ QEDAIOCB *acb = opaque;
+ BDRVQEDState *s = acb_to_s(acb);
+ uint64_t start = acb->cur_pos + acb->cur_qiov.size;
+ uint64_t len = qed_start_of_cluster(s, start + s->header.cluster_size - 1) - start;
+ uint64_t offset = acb->cur_cluster + qed_offset_into_cluster(s, acb->cur_pos) + acb->cur_qiov.size;
+
+ if (ret) {
+ qed_aio_complete(acb, ret);
+ return;
+ }
+
+ qed_copy_from_backing_file(s, start, len, offset,
+ qed_aio_write_main, acb);
+}
+
+/**
+ * Populate the untouched region at the front of a new data cluster
+ */
+static void qed_aio_write_prefill(QEDAIOCB *acb)
+{
+ BDRVQEDState *s = acb_to_s(acb);
+ uint64_t start = qed_start_of_cluster(s, acb->cur_pos);
+ uint64_t len = qed_offset_into_cluster(s, acb->cur_pos);
+
+ qed_copy_from_backing_file(s, start, len, acb->cur_cluster,
+ qed_aio_write_postfill, acb);
+}
+
+/**
+ * Write data cluster
+ *
+ * @opaque: Write request
+ * @ret: QED_CLUSTER_FOUND, QED_CLUSTER_L2, QED_CLUSTER_L1,
+ * or QED_CLUSTER_ERROR
+ * @offset: Cluster offset in bytes
+ * @len: Length in bytes
+ *
+ * Callback from qed_find_cluster().
+ */
+static void qed_aio_write_data(void *opaque, int ret,
+ uint64_t offset, size_t len)
+{
+ QEDAIOCB *acb = opaque;
+ BDRVQEDState *s = acb_to_s(acb);
+ bool need_alloc = ret != QED_CLUSTER_FOUND;
+
+ if (ret == QED_CLUSTER_ERROR) {
+ goto err;
+ }
+
+ /* Freeze this request if another allocating write is in progress */
+ if (need_alloc) {
+ if (acb != QSIMPLEQ_FIRST(&s->allocating_write_reqs)) {
+ QSIMPLEQ_INSERT_TAIL(&s->allocating_write_reqs, acb, next);
+ }
+ if (acb != QSIMPLEQ_FIRST(&s->allocating_write_reqs)) {
+ return; /* wait for existing request to finish */
+ }
+ }
+
+ acb->cur_nclusters = qed_bytes_to_clusters(s,
+ qed_offset_into_cluster(s, acb->cur_pos) + len);
+
+ if (need_alloc) {
+ if (qed_alloc_clusters(s, acb->cur_nclusters, &offset) != 0) {
+ goto err;
+ }
+ }
+
+ acb->find_cluster_ret = ret;
+ acb->cur_cluster = offset;
+ qed_acb_build_qiov(acb, len);
+
+ if (need_alloc) {
+ qed_aio_write_prefill(acb);
+ } else {
+ qed_aio_write_main(acb, 0);
+ }
+ return;
+
+err:
+ qed_aio_complete(acb, -EIO);
+}
+
+/**
+ * Read data cluster
+ *
+ * @opaque: Read request
+ * @ret: QED_CLUSTER_FOUND, QED_CLUSTER_L2, QED_CLUSTER_L1,
+ * or QED_CLUSTER_ERROR
+ * @offset: Cluster offset in bytes
+ * @len: Length in bytes
+ *
+ * Callback from qed_find_cluster().
+ */
+static void qed_aio_read_data(void *opaque, int ret,
+ uint64_t offset, size_t len)
+{
+ QEDAIOCB *acb = opaque;
+ BDRVQEDState *s = acb_to_s(acb);
+ BlockDriverState *bs = acb->common.bs;
+ BlockDriverState *file = bs->file;
+ BlockDriverAIOCB *file_acb;
+
+ if (ret == QED_CLUSTER_ERROR) {
+ goto err;
+ }
+
+ qed_acb_build_qiov(acb, len);
+
+ /* Adjust offset into cluster */
+ offset += qed_offset_into_cluster(s, acb->cur_pos);
+
+ /* Handle backing file and unallocated sparse hole reads */
+ if (ret != QED_CLUSTER_FOUND) {
+ if (!bs->backing_hd) {
+ qemu_iovec_zero(&acb->cur_qiov);
+ qed_aio_next_io(acb, 0);
+ return;
+ }
+
+ /* Pass through read to backing file */
+ offset = acb->cur_pos;
+ file = bs->backing_hd;
+ }
+
+ file_acb = bdrv_aio_readv(file, offset / BDRV_SECTOR_SIZE,
+ &acb->cur_qiov,
+ acb->cur_qiov.size / BDRV_SECTOR_SIZE,
+ qed_aio_next_io, acb);
+ if (!file_acb) {
+ goto err;
+ }
+ return;
+
+err:
+ qed_aio_complete(acb, -EIO);
+}
+
+/**
+ * Begin next I/O or complete the request
+ */
+static void qed_aio_next_io(void *opaque, int ret)
+{
+ QEDAIOCB *acb = opaque;
+ BDRVQEDState *s = acb_to_s(acb);
+ QEDFindClusterFunc *io_fn =
+ acb->is_write ? qed_aio_write_data : qed_aio_read_data;
+
+ /* Handle I/O error */
+ if (ret) {
+ qed_aio_complete(acb, ret);
+ return;
+ }
+
+ acb->cur_pos += acb->cur_qiov.size;
+ qemu_iovec_reset(&acb->cur_qiov);
+
+ /* Complete request */
+ if (acb->cur_pos >= acb->end_pos) {
+ qed_aio_complete(acb, 0);
+ return;
+ }
+
+ /* Find next cluster and start I/O */
+ qed_find_cluster(s, &acb->request,
+ acb->cur_pos, acb->end_pos - acb->cur_pos,
+ io_fn, acb);
+}
+
+static BlockDriverAIOCB *qed_aio_setup(BlockDriverState *bs,
+ int64_t sector_num,
+ QEMUIOVector *qiov, int nb_sectors,
+ BlockDriverCompletionFunc *cb,
+ void *opaque, bool is_write)
+{
+ QEDAIOCB *acb = qemu_aio_get(&qed_aio_pool, bs, cb, opaque);
+
+ acb->is_write = is_write;
+ acb->qiov = qiov;
+ acb->cur_iov = acb->qiov->iov;
+ acb->cur_iov_offset = 0;
+ acb->cur_pos = (uint64_t)sector_num * BDRV_SECTOR_SIZE;
+ acb->end_pos = acb->cur_pos + nb_sectors * BDRV_SECTOR_SIZE;
+ acb->request.l2_table = NULL;
+ qemu_iovec_init(&acb->cur_qiov, qiov->niov);
+
+ /* Start request */
+ qed_aio_next_io(acb, 0);
+ return &acb->common;
+}
+
+static BlockDriverAIOCB *bdrv_qed_aio_readv(BlockDriverState *bs,
+ int64_t sector_num,
+ QEMUIOVector *qiov, int nb_sectors,
+ BlockDriverCompletionFunc *cb,
+ void *opaque)
+{
+ return qed_aio_setup(bs, sector_num, qiov, nb_sectors, cb, opaque, false);
+}
+
+static BlockDriverAIOCB *bdrv_qed_aio_writev(BlockDriverState *bs,
+ int64_t sector_num,
+ QEMUIOVector *qiov, int nb_sectors,
+ BlockDriverCompletionFunc *cb,
+ void *opaque)
+{
+ return qed_aio_setup(bs, sector_num, qiov, nb_sectors, cb, opaque, true);
+}
+
+static BlockDriverAIOCB *bdrv_qed_aio_flush(BlockDriverState *bs,
+ BlockDriverCompletionFunc *cb,
+ void *opaque)
+{
+ return bdrv_aio_flush(bs->file, cb, opaque);
+}
+
+static int bdrv_qed_truncate(BlockDriverState *bs, int64_t offset)
+{
+ return -ENOTSUP; /* TODO */
+}
+
+static int64_t bdrv_qed_getlength(BlockDriverState *bs)
+{
+ BDRVQEDState *s = bs->opaque;
+ return s->header.image_size;
+}
+
+static int bdrv_qed_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
+{
+ BDRVQEDState *s = bs->opaque;
+
+ memset(bdi, 0, sizeof(*bdi));
+ bdi->cluster_size = s->header.cluster_size;
+ return 0;
+}
+
+static int bdrv_qed_change_backing_file(BlockDriverState *bs,
+ const char *backing_file,
+ const char *backing_fmt)
+{
+ return -ENOTSUP; /* TODO */
+}
+
+static int bdrv_qed_check(BlockDriverState* bs, BdrvCheckResult *result)
+{
+ return -ENOTSUP; /* TODO */
+}
+
+static QEMUOptionParameter qed_create_options[] = {
+ {
+ .name = BLOCK_OPT_SIZE,
+ .type = OPT_SIZE,
+ .help = "Virtual disk size (in bytes)"
+ }, {
+ .name = BLOCK_OPT_BACKING_FILE,
+ .type = OPT_STRING,
+ .help = "File name of a base image"
+ }, {
+ .name = BLOCK_OPT_BACKING_FMT,
+ .type = OPT_STRING,
+ .help = "Image format of the base image"
+ }, {
+ .name = BLOCK_OPT_CLUSTER_SIZE,
+ .type = OPT_SIZE,
+ .help = "Cluster size (in bytes)"
+ }, {
+ .name = "table_size",
+ .type = OPT_SIZE,
+ .help = "L1/L2 table size (in clusters)"
+ },
+ { /* end of list */ }
+};
+
+static BlockDriver bdrv_qed = {
+ .format_name = "qed",
+ .instance_size = sizeof(BDRVQEDState),
+ .create_options = qed_create_options,
+
+ .bdrv_probe = bdrv_qed_probe,
+ .bdrv_open = bdrv_qed_open,
+ .bdrv_close = bdrv_qed_close,
+ .bdrv_create = bdrv_qed_create,
+ .bdrv_flush = bdrv_qed_flush,
+ .bdrv_is_allocated = bdrv_qed_is_allocated,
+ .bdrv_make_empty = bdrv_qed_make_empty,
+ .bdrv_aio_readv = bdrv_qed_aio_readv,
+ .bdrv_aio_writev = bdrv_qed_aio_writev,
+ .bdrv_aio_flush = bdrv_qed_aio_flush,
+ .bdrv_truncate = bdrv_qed_truncate,
+ .bdrv_getlength = bdrv_qed_getlength,
+ .bdrv_get_info = bdrv_qed_get_info,
+ .bdrv_change_backing_file = bdrv_qed_change_backing_file,
+ .bdrv_check = bdrv_qed_check,
+};
+
+static void bdrv_qed_init(void)
+{
+ bdrv_register(&bdrv_qed);
+}
+
+block_init(bdrv_qed_init);
diff --git a/block/qed.h b/block/qed.h
new file mode 100644
index 0000000..4711fbd
--- /dev/null
+++ b/block/qed.h
@@ -0,0 +1,212 @@
+/*
+ * QEMU Enhanced Disk Format
+ *
+ * Copyright IBM, Corp. 2010
+ *
+ * Authors:
+ * Stefan Hajnoczi <***@linux.vnet.ibm.com>
+ * Anthony Liguori <***@us.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef BLOCK_QED_H
+#define BLOCK_QED_H
+
+#include "block_int.h"
+
+/* The layout of a QED file is as follows:
+ *
+ * +--------+----------+----------+----------+-----+
+ * | header | L1 table | cluster0 | cluster1 | ... |
+ * +--------+----------+----------+----------+-----+
+ *
+ * There is a 2-level pagetable for cluster allocation:
+ *
+ * +----------+
+ * | L1 table |
+ * +----------+
+ * ,------' | '------.
+ * +----------+ | +----------+
+ * | L2 table | ... | L2 table |
+ * +----------+ +----------+
+ * ,------' | '------.
+ * +----------+ | +----------+
+ * | Data | ... | Data |
+ * +----------+ +----------+
+ *
+ * The L1 table is fixed size and always present. L2 tables are allocated on
+ * demand. The L1 table size determines the maximum possible image size; it
+ * can be influenced using the cluster_size and table_size values.
+ *
+ * All fields are little-endian on disk.
+ */
+
+typedef struct {
+ uint32_t magic; /* QED */
+
+ uint32_t cluster_size; /* in bytes */
+ uint32_t table_size; /* table size, in clusters */
+ uint32_t first_cluster; /* first usable cluster */
+
+ uint64_t features; /* format feature bits */
+ uint64_t compat_features; /* compatible feature bits */
+ uint64_t l1_table_offset; /* L1 table offset, in bytes */
+ uint64_t image_size; /* total image size, in bytes */
+
+ uint32_t backing_file_offset; /* in bytes from start of header */
+ uint32_t backing_file_size; /* in bytes */
+ uint32_t backing_fmt_offset; /* in bytes from start of header */
+ uint32_t backing_fmt_size; /* in bytes */
+} QEDHeader;
+
+typedef struct {
+ uint64_t offsets[0]; /* in bytes */
+} QEDTable;
+
+/* The L2 cache is a simple write-through cache for L2 structures */
+typedef struct CachedL2Table {
+ QEDTable *table;
+ uint64_t offset; /* offset=0 indicates an invalid entry */
+ QTAILQ_ENTRY(CachedL2Table) node;
+ int ref;
+} CachedL2Table;
+
+/**
+ * Allocate an L2 table
+ *
+ * This callback is used by the L2 cache to allocate tables without knowing
+ * their size or alignment requirements.
+ */
+typedef QEDTable *L2TableAllocFunc(void *opaque);
+
+typedef struct {
+ QTAILQ_HEAD(, CachedL2Table) entries;
+ unsigned int n_entries;
+ L2TableAllocFunc *alloc_l2_table;
+ void *alloc_l2_table_opaque;
+} L2TableCache;
+
+typedef struct QEDRequest {
+ CachedL2Table *l2_table;
+} QEDRequest;
+
+typedef struct QEDAIOCB {
+ BlockDriverAIOCB common;
+ QEMUBH *bh;
+ int bh_ret; /* final return status for completion bh */
+ QSIMPLEQ_ENTRY(QEDAIOCB) next; /* next request */
+ bool is_write; /* false - read, true - write */
+
+ /* User scatter-gather list */
+ QEMUIOVector *qiov;
+ struct iovec *cur_iov; /* current iovec to process */
+ size_t cur_iov_offset; /* byte count already processed in iovec */
+
+ /* Current cluster scatter-gather list */
+ QEMUIOVector cur_qiov;
+ uint64_t cur_pos; /* position on block device, in bytes */
+ uint64_t end_pos;
+ uint64_t cur_cluster; /* cluster offset in image file */
+ unsigned int cur_nclusters; /* number of clusters being accessed */
+ int find_cluster_ret; /* used for L1/L2 update */
+
+ QEDRequest request;
+} QEDAIOCB;
+
+typedef struct {
+ BlockDriverState *bs; /* device */
+ uint64_t file_size; /* length of image file, in bytes */
+
+ QEDHeader header; /* always cpu-endian */
+ QEDTable *l1_table;
+ L2TableCache l2_cache; /* l2 table cache */
+ uint32_t table_nelems;
+ uint32_t l1_shift;
+ uint32_t l2_shift;
+ uint32_t l2_mask;
+
+ /* Allocating write request queue */
+ QSIMPLEQ_HEAD(, QEDAIOCB) allocating_write_reqs;
+} BDRVQEDState;
+
+enum {
+ QED_CLUSTER_FOUND, /* cluster found */
+ QED_CLUSTER_L2, /* cluster missing in L2 */
+ QED_CLUSTER_L1, /* cluster missing in L1 */
+ QED_CLUSTER_ERROR, /* error looking up cluster */
+};
+
+typedef void QEDFindClusterFunc(void *opaque, int ret, uint64_t offset, size_t len);
+
+/**
+ * Generic callback for chaining async callbacks
+ */
+typedef struct {
+ BlockDriverCompletionFunc *cb;
+ void *opaque;
+} GenericCB;
+
+void *gencb_alloc(size_t len, BlockDriverCompletionFunc *cb, void *opaque);
+void gencb_complete(void *opaque, int ret);
+
+/**
+ * L2 cache functions
+ */
+void qed_init_l2_cache(L2TableCache *l2_cache, L2TableAllocFunc *alloc_l2_table, void *alloc_l2_table_opaque);
+void qed_free_l2_cache(L2TableCache *l2_cache);
+CachedL2Table *qed_alloc_l2_cache_entry(L2TableCache *l2_cache);
+void qed_unref_l2_cache_entry(L2TableCache *l2_cache, CachedL2Table *entry);
+CachedL2Table *qed_find_l2_cache_entry(L2TableCache *l2_cache, uint64_t offset);
+void qed_commit_l2_cache_entry(L2TableCache *l2_cache, CachedL2Table *l2_table);
+
+/**
+ * Table I/O functions
+ */
+int qed_read_l1_table(BDRVQEDState *s);
+void qed_write_l1_table(BDRVQEDState *s, unsigned int index, unsigned int n,
+ BlockDriverCompletionFunc *cb, void *opaque);
+void qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset,
+ BlockDriverCompletionFunc *cb, void *opaque);
+void qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
+ unsigned int index, unsigned int n, bool flush,
+ BlockDriverCompletionFunc *cb, void *opaque);
+
+/**
+ * Cluster functions
+ */
+void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
+ size_t len, QEDFindClusterFunc *cb, void *opaque);
+
+/**
+ * Utility functions
+ */
+static inline uint64_t qed_start_of_cluster(BDRVQEDState *s, uint64_t offset)
+{
+ return offset & ~(uint64_t)(s->header.cluster_size - 1);
+}
+
+static inline uint64_t qed_offset_into_cluster(BDRVQEDState *s, uint64_t offset)
+{
+ return offset & (s->header.cluster_size - 1);
+}
+
+static inline unsigned int qed_bytes_to_clusters(BDRVQEDState *s, size_t bytes)
+{
+ return qed_start_of_cluster(s, bytes + (s->header.cluster_size - 1)) /
+ (s->header.cluster_size - 1);
+}
+
+static inline unsigned int qed_l1_index(BDRVQEDState *s, uint64_t pos)
+{
+ return pos >> s->l1_shift;
+}
+
+static inline unsigned int qed_l2_index(BDRVQEDState *s, uint64_t pos)
+{
+ return (pos >> s->l2_shift) & s->l2_mask;
+}
+
+#endif /* BLOCK_QED_H */
diff --git a/cutils.c b/cutils.c
index 036ae3c..e5b6fae 100644
--- a/cutils.c
+++ b/cutils.c
@@ -234,6 +234,14 @@ void qemu_iovec_from_buffer(QEMUIOVector *qiov, const void *buf, size_t count)
}
}

+void qemu_iovec_zero(QEMUIOVector *qiov)
+{
+ struct iovec *iov;
+ for (iov = qiov->iov; iov != &qiov->iov[qiov->niov]; iov++) {
+ memset(iov->iov_base, 0, iov->iov_len);
+ }
+}
+
#ifndef _WIN32
/* Sets a specific flag */
int fcntl_setfl(int fd, int flag)
@@ -251,3 +259,48 @@ int fcntl_setfl(int fd, int flag)
}
#endif

+/**
+ * Get the number of bits for a power of 2
+ *
+ * The following is true for powers of 2:
+ * n == 1 << get_bits_from_size(n)
+ */
+int get_bits_from_size(size_t size)
+{
+ int res = 0;
+
+ if (size == 0) {
+ return -1;
+ }
+
+ while (size != 1) {
+ /* Not a power of two */
+ if (size & 1) {
+ return -1;
+ }
+
+ size >>= 1;
+ res++;
+ }
+
+ return res;
+}
+
+const char *bytes_to_str(uint64_t size)
+{
+ static char buffer[64];
+
+ if (size < (1ULL << 10)) {
+ snprintf(buffer, sizeof(buffer), "%" PRIu64 " byte(s)", size);
+ } else if (size < (1ULL << 20)) {
+ snprintf(buffer, sizeof(buffer), "%" PRIu64 " KB(s)", size >> 10);
+ } else if (size < (1ULL << 30)) {
+ snprintf(buffer, sizeof(buffer), "%" PRIu64 " MB(s)", size >> 20);
+ } else if (size < (1ULL << 40)) {
+ snprintf(buffer, sizeof(buffer), "%" PRIu64 " GB(s)", size >> 30);
+ } else {
+ snprintf(buffer, sizeof(buffer), "%" PRIu64 " TB(s)", size >> 40);
+ }
+
+ return buffer;
+}
diff --git a/qemu-common.h b/qemu-common.h
index dfd3dc0..754b107 100644
--- a/qemu-common.h
+++ b/qemu-common.h
@@ -137,6 +137,8 @@ time_t mktimegm(struct tm *tm);
int qemu_fls(int i);
int qemu_fdatasync(int fd);
int fcntl_setfl(int fd, int flag);
+int get_bits_from_size(size_t size);
+const char *bytes_to_str(uint64_t size);

/* path.c */
void init_paths(const char *prefix);
@@ -283,6 +285,7 @@ void qemu_iovec_destroy(QEMUIOVector *qiov);
void qemu_iovec_reset(QEMUIOVector *qiov);
void qemu_iovec_to_buffer(QEMUIOVector *qiov, void *buf);
void qemu_iovec_from_buffer(QEMUIOVector *qiov, const void *buf, size_t count);
+void qemu_iovec_zero(QEMUIOVector *qiov);

struct Monitor;
typedef struct Monitor Monitor;
--
1.7.1
Alexander Graf
2010-09-06 10:25:13 UTC
Post by Stefan Hajnoczi
QEMU Enhanced Disk format is a disk image format that forgoes features
found in qcow2 in favor of better performance and data integrity. Its
simpler on-disk layout makes it possible to perform metadata updates both
safely and more efficiently.
Installations, suspend-to-disk, and other allocation-heavy I/O workloads
will see increased performance due to fewer I/Os and syncs. Workloads
that do not cause new clusters to be allocated will perform similarly to
raw images thanks to in-memory metadata caching.
The format supports sparse disk images. It does not rely on the host
filesystem's support for holes, making it a good choice for sparse disk
images that need to be transferred over channels where holes are not
supported.
Backing files are supported so only deltas against a base image can be
stored.
The file format is extensible so that additional features can be added
later with graceful compatibility handling.
Internal snapshots are not supported. This eliminates the need for
additional metadata to track copy-on-write clusters.
Compression and encryption are not supported. They add complexity and
can be implemented at other layers in the stack (i.e. inside the guest
or on the host).
* Resizing the disk image. The capability has been designed in but the
code has not been written yet.
* Resetting the image after backing file commit completes.
* Changing the backing filename.
* Consistency check (fsck). This is simple due to the on-disk layout.
Yippie - yet another disk format :). Let's hope this one survives.
Post by Stefan Hajnoczi
---
This code is also available from git (for development and testing the tracing
and blkverify features are pulled in, whereas this single squashed patch
http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/qed
just looked at it and stumbled over two simple nits.

[snip]
Post by Stefan Hajnoczi
+/**
+ * Get the number of bits for a power of 2
+ *
+ * n == 1 << get_bits_from_size(n)
+ */
+int get_bits_from_size(size_t size)
+{
+ int res = 0;
+
+ if (size == 0) {
+ return -1;
+ }
+
+ while (size != 1) {
+ /* Not a power of two */
+ if (size & 1) {
+ return -1;
+ }
+
+ size >>= 1;
+ res++;
+ }
+
+ return res;
+}
Should be an extra patch - it doesn't hurt to send an RFC patch set. This thing is so big that it's no fun to review :).
Post by Stefan Hajnoczi
+
+const char *bytes_to_str(uint64_t size)
+{
+ static char buffer[64];
+
+ if (size < (1ULL << 10)) {
+ snprintf(buffer, sizeof(buffer), "%" PRIu64 " byte(s)", size);
+ } else if (size < (1ULL << 20)) {
+ snprintf(buffer, sizeof(buffer), "%" PRIu64 " KB(s)", size >> 10);
+ } else if (size < (1ULL << 30)) {
+ snprintf(buffer, sizeof(buffer), "%" PRIu64 " MB(s)", size >> 20);
+ } else if (size < (1ULL << 40)) {
+ snprintf(buffer, sizeof(buffer), "%" PRIu64 " GB(s)", size >> 30);
+ } else {
+ snprintf(buffer, sizeof(buffer), "%" PRIu64 " TB(s)", size >> 40);
+ }
+
+ return buffer;
This returns a variable from the stack! Please make the target buffer caller defined.
Post by Stefan Hajnoczi
+}
diff --git a/qemu-common.h b/qemu-common.h
index dfd3dc0..754b107 100644
--- a/qemu-common.h
+++ b/qemu-common.h
@@ -137,6 +137,8 @@ time_t mktimegm(struct tm *tm);
int qemu_fls(int i);
int qemu_fdatasync(int fd);
int fcntl_setfl(int fd, int flag);
+int get_bits_from_size(size_t size);
+const char *bytes_to_str(uint64_t size);
/* path.c */
void init_paths(const char *prefix);
@@ -283,6 +285,7 @@ void qemu_iovec_destroy(QEMUIOVector *qiov);
void qemu_iovec_reset(QEMUIOVector *qiov);
void qemu_iovec_to_buffer(QEMUIOVector *qiov, void *buf);
void qemu_iovec_from_buffer(QEMUIOVector *qiov, const void *buf, size_t count);
+void qemu_iovec_zero(QEMUIOVector *qiov);
separate patch please.


Alex
Stefan Hajnoczi
2010-09-06 10:31:59 UTC
Permalink
Post by Alexander Graf
Should be an extra patch - it doesn't hurt to send an RFC patch set. This thing is so big that it's no fun to review :).
I'll start consolidating commits so the next round will be easier to review.

Stefan
Luca Tettamanti
2010-09-06 14:21:18 UTC
Permalink
Post by Alexander Graf
Post by Stefan Hajnoczi
+
+const char *bytes_to_str(uint64_t size)
+{
+    static char buffer[64];
+
+    if (size < (1ULL << 10)) {
+        snprintf(buffer, sizeof(buffer), "%" PRIu64 " byte(s)", size);
+    } else if (size < (1ULL << 20)) {
+        snprintf(buffer, sizeof(buffer), "%" PRIu64 " KB(s)", size >> 10);
+    } else if (size < (1ULL << 30)) {
+        snprintf(buffer, sizeof(buffer), "%" PRIu64 " MB(s)", size >> 20);
+    } else if (size < (1ULL << 40)) {
+        snprintf(buffer, sizeof(buffer), "%" PRIu64 " GB(s)", size >> 30);
+    } else {
+        snprintf(buffer, sizeof(buffer), "%" PRIu64 " TB(s)", size >> 40);
+    }
+
+    return buffer;
This returns a variable from the stack! Please make the target buffer caller defined.
It's static, so it's formally correct. But probably not a good idea :)

Luca
Alexander Graf
2010-09-06 14:24:20 UTC
Permalink
Post by Luca Tettamanti
Post by Alexander Graf
Post by Stefan Hajnoczi
+
+const char *bytes_to_str(uint64_t size)
+{
+ static char buffer[64];
+
+ if (size < (1ULL << 10)) {
+ snprintf(buffer, sizeof(buffer), "%" PRIu64 " byte(s)", size);
+ } else if (size < (1ULL << 20)) {
+ snprintf(buffer, sizeof(buffer), "%" PRIu64 " KB(s)", size >> 10);
+ } else if (size < (1ULL << 30)) {
+ snprintf(buffer, sizeof(buffer), "%" PRIu64 " MB(s)", size >> 20);
+ } else if (size < (1ULL << 40)) {
+ snprintf(buffer, sizeof(buffer), "%" PRIu64 " GB(s)", size >> 30);
+ } else {
+ snprintf(buffer, sizeof(buffer), "%" PRIu64 " TB(s)", size >> 40);
+ }
+
+ return buffer;
This returns a variable from the stack! Please make the target buffer caller defined.
It's static, so it's formally correct. But probably not a good idea :)
Oh - I missed the static there. Yeah, it's even worse. This is racy.

Alex
Anthony Liguori
2010-09-06 16:27:13 UTC
Permalink
Post by Alexander Graf
Oh - I missed the static there. Yeah, it's even worse. This is racy.
It's easy to refactor away so I'll just do that but it's not actually racy.

It's just not re-entrant and the lifetime of the returned result is only
until the next call.
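[Editor's note: purely to illustrate the reviewers' point, a reentrant, caller-supplied-buffer variant could look roughly like the sketch below; the name bytes_to_str_buf is hypothetical and this is not the actual follow-up change.]

    #include <inttypes.h>
    #include <stdio.h>

    /* Hypothetical reentrant variant: the caller owns the buffer. */
    static char *bytes_to_str_buf(uint64_t size, char *buf, size_t buflen)
    {
        static const char *suffixes[] = { "byte(s)", "KB(s)", "MB(s)", "GB(s)", "TB(s)" };
        int i = 0;

        while (size >= 1024 && i < 4) {
            size >>= 10;
            i++;
        }
        snprintf(buf, buflen, "%" PRIu64 " %s", size, suffixes[i]);
        return buf;
    }

    int main(void)
    {
        char buf[64];
        printf("%s\n", bytes_to_str_buf(3ULL << 30, buf, sizeof(buf)));  /* "3 GB(s)" */
        return 0;
    }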

Regards,

Anthony Liguori
Post by Alexander Graf
Alex
Kevin Wolf
2010-09-06 10:27:31 UTC
Permalink
Post by Stefan Hajnoczi
QEMU Enhanced Disk format is a disk image format that forgoes features
found in qcow2 in favor of better levels of performance and data
integrity. Due to its simpler on-disk layout, it is possible to safely
perform metadata updates more efficiently.
Installations, suspend-to-disk, and other allocation-heavy I/O workloads
will see increased performance due to fewer I/Os and syncs. Workloads
that do not cause new clusters to be allocated will perform similar to
raw images due to in-memory metadata caching.
The format supports sparse disk images. It does not rely on the host
filesystem holes feature, making it a good choice for sparse disk images
that need to be transferred over channels where holes are not supported.
Backing files are supported so only deltas against a base image can be
stored.
The file format is extensible so that additional features can be added
later with graceful compatibility handling.
Internal snapshots are not supported. This eliminates the need for
additional metadata to track copy-on-write clusters.
Compression and encryption are not supported. They add complexity and
can be implemented at other layers in the stack (i.e. inside the guest
or on the host).
* Resizing the disk image. The capability has been designed in but the
code has not been written yet.
* Resetting the image after backing file commit completes.
* Changing the backing filename.
* Consistency check (fsck). This is simple due to the on-disk layout.
Okay, so before I actually look at the patch longer than a couple of
seconds let me just ask the obvious question...

Before inventing yet another image format, you certainly have checked
the existing ones. Except for not implementing compression and
encryption this looks a lot like qcow1 to me. I see that you even
retained the two-level cluster tables.

So if we ignore the implementation for a moment and just compare the
formats, what's the crucial difference between qcow1 and qed that I'm
missing? And if it's not qcow1, why not improving our support for
another existing format like VHD?

Kevin
Stefan Hajnoczi
2010-09-06 12:40:07 UTC
Permalink
Post by Kevin Wolf
Post by Stefan Hajnoczi
QEMU Enhanced Disk format is a disk image format that forgoes features
found in qcow2 in favor of better levels of performance and data
integrity.  Due to its simpler on-disk layout, it is possible to safely
perform metadata updates more efficiently.
Installations, suspend-to-disk, and other allocation-heavy I/O workloads
will see increased performance due to fewer I/Os and syncs.  Workloads
that do not cause new clusters to be allocated will perform similar to
raw images due to in-memory metadata caching.
The format supports sparse disk images.  It does not rely on the host
filesystem holes feature, making it a good choice for sparse disk images
that need to be transferred over channels where holes are not supported.
Backing files are supported so only deltas against a base image can be
stored.
The file format is extensible so that additional features can be added
later with graceful compatibility handling.
Internal snapshots are not supported.  This eliminates the need for
additional metadata to track copy-on-write clusters.
Compression and encryption are not supported.  They add complexity and
can be implemented at other layers in the stack (i.e. inside the guest
or on the host).
 * Resizing the disk image.  The capability has been designed in but the
   code has not been written yet.
 * Resetting the image after backing file commit completes.
 * Changing the backing filename.
 * Consistency check (fsck).  This is simple due to the on-disk layout.
Okay, so before I actually look at the patch longer than a couple of
seconds let me just ask the obvious question...
Before inventing yet another image format, you certainly have checked
the existing ones. Except for not implementing compression and
encryption this looks a lot like qcow1 to me. I see that you even
retained the two-level cluster tables.
So if we ignore the implementation for a moment and just compare the
formats, what's the crucial difference between qcow1 and qed that I'm
missing? And if it's not qcow1, why not improving our support for
another existing format like VHD?
Is this a subset of existing on-disk formats? Yes. The motivation is
to have an image format that performs well and is safe, with backing
image support. Currently no image format in QEMU meets these
requirements.

Perhaps it is appropriate to use an existing on-disk format. I
actually considered in-place migration (compatibility) with qcow2 to
make life easier for users and avoid a new format. However, there is
baggage to doing this and the focus should be on building a solid
image format instead of fitting into a legacy format that qemu-img
convert can take care of.

Stefan
Anthony Liguori
2010-09-06 12:57:17 UTC
Permalink
Post by Stefan Hajnoczi
Post by Kevin Wolf
Post by Stefan Hajnoczi
QEMU Enhanced Disk format is a disk image format that forgoes features
found in qcow2 in favor of better levels of performance and data
integrity. Due to its simpler on-disk layout, it is possible to safely
perform metadata updates more efficiently.
Installations, suspend-to-disk, and other allocation-heavy I/O workloads
will see increased performance due to fewer I/Os and syncs. Workloads
that do not cause new clusters to be allocated will perform similar to
raw images due to in-memory metadata caching.
The format supports sparse disk images. It does not rely on the host
filesystem holes feature, making it a good choice for sparse disk images
that need to be transferred over channels where holes are not supported.
Backing files are supported so only deltas against a base image can be
stored.
The file format is extensible so that additional features can be added
later with graceful compatibility handling.
Internal snapshots are not supported. This eliminates the need for
additional metadata to track copy-on-write clusters.
Compression and encryption are not supported. They add complexity and
can be implemented at other layers in the stack (i.e. inside the guest
or on the host).
* Resizing the disk image. The capability has been designed in but the
code has not been written yet.
* Resetting the image after backing file commit completes.
* Changing the backing filename.
* Consistency check (fsck). This is simple due to the on-disk layout.
Okay, so before I actually look at the patch longer than a couple of
seconds let me just ask the obvious question...
Before inventing yet another image format, you certainly have checked
the existing ones. Except for not implementing compression and
encryption this looks a lot like qcow1 to me. I see that you even
retained the two-level cluster tables.
So if we ignore the implementation for a moment and just compare the
formats, what's the crucial difference between qcow1 and qed that I'm
missing? And if it's not qcow1, why not improving our support for
another existing format like VHD?
Is this a subset of existing on-disk formats? Yes. The motivation is
to have an image format that performs well and is safe, with backing
image support. Currently no image format in QEMU meets these
requirements.
Perhaps it is appropriate to use an existing on-disk format.
If you implement a subset of functionality for an existing on-disk
format, I think you damage users' expectations.

If we claim to support qcow images, then given any old qcow image I have
lying around from 5 years ago, I should be able to run it without qemu
throwing an error.

There's some really ugly stuff in qcow. Nothing is actually aligned.
This makes implementing things like O_DIRECT very challenging since you
basically have to handle bouncing any possible buffer. Since the L1
table occurs immediately after the header, there's really no room to
play any kind of tricks to add features.

Regards,

Anthony Liguori
Post by Stefan Hajnoczi
I
actually considered in-place migration (compatibility) with qcow2 to
make life easier for users and avoid a new format. However, there is
baggage to doing this and the focus should be on building a solid
image format instead of fitting into a legacy format that qemu-img
convert can take care of.
Stefan
Stefan Hajnoczi
2010-09-06 13:02:09 UTC
Permalink
Post by Kevin Wolf
Post by Stefan Hajnoczi
QEMU Enhanced Disk format is a disk image format that forgoes features
found in qcow2 in favor of better levels of performance and data
integrity.  Due to its simpler on-disk layout, it is possible to safely
perform metadata updates more efficiently.
Installations, suspend-to-disk, and other allocation-heavy I/O workloads
will see increased performance due to fewer I/Os and syncs.  Workloads
that do not cause new clusters to be allocated will perform similar to
raw images due to in-memory metadata caching.
The format supports sparse disk images.  It does not rely on the host
filesystem holes feature, making it a good choice for sparse disk images
that need to be transferred over channels where holes are not supported.
Backing files are supported so only deltas against a base image can be
stored.
The file format is extensible so that additional features can be added
later with graceful compatibility handling.
Internal snapshots are not supported.  This eliminates the need for
additional metadata to track copy-on-write clusters.
Compression and encryption are not supported.  They add complexity and
can be implemented at other layers in the stack (i.e. inside the guest
or on the host).
 * Resizing the disk image.  The capability has been designed in but the
   code has not been written yet.
 * Resetting the image after backing file commit completes.
 * Changing the backing filename.
 * Consistency check (fsck).  This is simple due to the on-disk layout.
Okay, so before I actually look at the patch longer than a couple of
seconds let me just ask the obvious question...
Before inventing yet another image format, you certainly have checked
the existing ones. Except for not implementing compression and
encryption this looks a lot like qcow1 to me. I see that you even
retained the two-level cluster tables.
So if we ignore the implementation for a moment and just compare the
formats, what's the crucial difference between qcow1 and qed that I'm
missing? And if it's not qcow1, why not improving our support for
another existing format like VHD?
Is this a subset of existing on-disk formats?  Yes.  The motivation is
to have an image format that performs well and is safe, with backing
image support.  Currently no image format in QEMU meets these
requirements.
Perhaps it is appropriate to use an existing on-disk format.
If you implement a subset of functionality for an existing on-disk format, I
think you damage user's expectations.
If we claim to support qcow images, then given any old qcow image I have
laying around for 5 years ago, I should be able to run it without qemu
throwing an error.
There's some really ugly stuff in qcow.  Nothing is actually aligned.  This
makes implementing things like O_DIRECT very challenging since you basically
have to handle bouncing any possible buffer.  Since the L1 table occurs
immediately after the header, there's really no room to play any kind of
tricks to add features.
These are the details that are baggage. Ultimately it may be hard to
deal with them without just bumping the qcow version number and
thereby having a new format anyway.

Stefan
Kevin Wolf
2010-09-06 14:10:52 UTC
Permalink
Post by Anthony Liguori
Post by Stefan Hajnoczi
Post by Kevin Wolf
Post by Stefan Hajnoczi
QEMU Enhanced Disk format is a disk image format that forgoes features
found in qcow2 in favor of better levels of performance and data
integrity. Due to its simpler on-disk layout, it is possible to safely
perform metadata updates more efficiently.
Installations, suspend-to-disk, and other allocation-heavy I/O workloads
will see increased performance due to fewer I/Os and syncs. Workloads
that do not cause new clusters to be allocated will perform similar to
raw images due to in-memory metadata caching.
The format supports sparse disk images. It does not rely on the host
filesystem holes feature, making it a good choice for sparse disk images
that need to be transferred over channels where holes are not supported.
Backing files are supported so only deltas against a base image can be
stored.
The file format is extensible so that additional features can be added
later with graceful compatibility handling.
Internal snapshots are not supported. This eliminates the need for
additional metadata to track copy-on-write clusters.
Compression and encryption are not supported. They add complexity and
can be implemented at other layers in the stack (i.e. inside the guest
or on the host).
* Resizing the disk image. The capability has been designed in but the
code has not been written yet.
* Resetting the image after backing file commit completes.
* Changing the backing filename.
* Consistency check (fsck). This is simple due to the on-disk layout.
Okay, so before I actually look at the patch longer than a couple of
seconds let me just ask the obvious question...
Before inventing yet another image format, you certainly have checked
the existing ones. Except for not implementing compression and
encryption this looks a lot like qcow1 to me. I see that you even
retained the two-level cluster tables.
So if we ignore the implementation for a moment and just compare the
formats, what's the crucial difference between qcow1 and qed that I'm
missing? And if it's not qcow1, why not improving our support for
another existing format like VHD?
Is this a subset of existing on-disk formats? Yes. The motivation is
to have an image format that performs well and is safe, with backing
image support. Currently no image format in QEMU meets these
requirements.
Perhaps it is appropriate to use an existing on-disk format.
If you implement a subset of functionality for an existing on-disk
format, I think you damage user's expectations.
I don't really buy that implementing compression/encryption wouldn't
have been possible if it was the only problem. Of course, if you don't
implement it, you can't use an on-disk format that supports them.
Post by Anthony Liguori
If we claim to support qcow images, then given any old qcow image I have
laying around for 5 years ago, I should be able to run it without qemu
throwing an error.
There's some really ugly stuff in qcow. Nothing is actually aligned.
This makes implementing things like O_DIRECT very challenging since you
basically have to handle bouncing any possible buffer. Since the L1
table occurs immediately after the header, there's really no room to
play any kind of tricks to add features.
That's a good point actually. I didn't remember that.

Kevin
Anthony Liguori
2010-09-06 16:45:27 UTC
Permalink
Post by Kevin Wolf
Post by Anthony Liguori
If you implement a subset of functionality for an existing on-disk
format, I think you damage user's expectations.
I don't really buy that implementing compression/encryption wouldn't
have been possible if it was the only problem. Of course, if you don't
implement it, you can't use an on-disk format that supports them.
The trouble with compression is that you no longer have fixed-size
clusters. In order to support writes, you either have to write
uncompressed data at EOF, leaking the compressed version, or write
compressed data and attempt to use a free list to avoid leaking
clusters. Since cluster size isn't fixed, the free list is of variable
size, which means you'd have to do something sophisticated like a buddy
algorithm to allocate from the free list.

It's just not worth it since there's no easy way to do it correctly.

Encryption is straightforward.

Lack of features is a killer though. The only thing you could really do
is the same type of trickery we did with qcow2, where we detect whether
there's room between the header and the L1. Of course, nothing in qcow
actually says "if the L1 doesn't start at sizeof(old_header), then you
have new_header", so this is not technically backwards compatible.

But even assuming it is, the new features introduced in new_header are
undiscoverable to older versions of QEMU. So if you do something that
makes the image unreadable to older QEMUs (like adding a new encryption
algorithm), instead of getting a nice error, you get silent corruption.
qcow has had more implementations than just QEMU's, so we're not the
only ones creating qcow images and we can't just rely on our historic
behavior.

IMHO, this alone justifies a new format.

Regards,

Anthony Liguori
Post by Kevin Wolf
Post by Anthony Liguori
If we claim to support qcow images, then given any old qcow image I have
laying around for 5 years ago, I should be able to run it without qemu
throwing an error.
There's some really ugly stuff in qcow. Nothing is actually aligned.
This makes implementing things like O_DIRECT very challenging since you
basically have to handle bouncing any possible buffer. Since the L1
table occurs immediately after the header, there's really no room to
play any kind of tricks to add features.
That's a good point actually. I didn't remember that.
Kevin
Anthony Liguori
2010-09-06 12:45:31 UTC
Permalink
Post by Kevin Wolf
Okay, so before I actually look at the patch longer than a couple of
seconds let me just ask the obvious question...
Before inventing yet another image format, you certainly have checked
the existing ones.
Obviously, yes.

Here are the issues:

cow.c: it's cow of an otherwise sparse file. An important reason for
implementing a format is the ability to copy (or scp) an image without
special tools.

qcow2.c: the refcounts, copy-on-write clusters, and compression make an
implementation that seeks both integrity and performance challenging.

vmdk.c: we feel it's important for qemu to have a block format with a
gpl friendly specification that we have a say in

vhd/vpc.c: same as vmdk with the addition that the OSP is known to not
be gpl friendly

vdi.c: uses a bitmap instead of a two level table. An advantage of a
two level table is that it allows image resize without much fuss.

qcow.c: it lacks extensibility, and compression means that there's no
guarantee that blocks are a fixed size. This makes it very difficult to
implement a high performance block format without having two separate
code paths.
Post by Kevin Wolf
Except for not implementing compression and
encryption this looks a lot like qcow1 to me. I see that you even
retained the two-level cluster tables.
So if we ignore the implementation for a moment and just compare the
formats, what's the crucial difference between qcow1 and qed that I'm
missing? And if it's not qcow1, why not improving our support for
another existing format like VHD?
Block formats are easy to get wrong. QED is an existence proof that,
given the right constraints, we can build a fully asynchronous,
high-performance image format with proper data integrity.

You could get to QED by incrementally improving qcow but you'd have to
break the format to make it extensible and disable support for
compression. But at that point, why not just make a new format since
you're breaking compatibility.

You would have to fully rewrite the code so what's the point of keeping
the format?

Regards,

Anthony Liguori
Post by Kevin Wolf
Kevin
Daniel P. Berrange
2010-09-06 11:18:21 UTC
Permalink
Post by Stefan Hajnoczi
QEMU Enhanced Disk format is a disk image format that forgoes features
found in qcow2 in favor of better levels of performance and data
integrity. Due to its simpler on-disk layout, it is possible to safely
perform metadata updates more efficiently.
Installations, suspend-to-disk, and other allocation-heavy I/O workloads
will see increased performance due to fewer I/Os and syncs. Workloads
that do not cause new clusters to be allocated will perform similar to
raw images due to in-memory metadata caching.
The format supports sparse disk images. It does not rely on the host
filesystem holes feature, making it a good choice for sparse disk images
that need to be transferred over channels where holes are not supported.
Backing files are supported so only deltas against a base image can be
stored.
The file format is extensible so that additional features can be added
later with graceful compatibility handling.
Internal snapshots are not supported. This eliminates the need for
additional metadata to track copy-on-write clusters.
Compression and encryption are not supported. They add complexity and
can be implemented at other layers in the stack (i.e. inside the guest
or on the host).
I agree with ditching compression, but encryption is an important
capability which cannot be satisfactorily added at other layers
in the stack. While block devices / local filesystems can layer
in dm-crypt in the host, this is not possible with network/cluster
filesystems which account for a non-trivial target audience. Adding
encryption inside the guest is sub-optimal because you cannot do
secure automation of guest startup. Either you require manaual
intervention to start every guest to enter the key, or if you
hardcode the key, then anyone who can access the guest disk image
can start the guest. The qcow2 encryption is the perfect solution
for this problem, guaranteeing the data security even when the
storage system / network transport offers no security, and allowing
for secure control over guest startup. Further, adding encryption
does not add any serious complexity to the on-disk format - just
1 extra header field, nor to the implementation - just pass the
data block through an encrypt/decrypt filter, with no extra I/O
paths.
Post by Stefan Hajnoczi
diff --git a/block/qed-cluster.c b/block/qed-cluster.c
new file mode 100644
index 0000000..6deea27
--- /dev/null
+++ b/block/qed-cluster.c
@@ -0,0 +1,136 @@
+/*
+ * QEMU Enhanced Disk Format Cluster functions
+ *
+ * Copyright IBM, Corp. 2010
+ *
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qed.h"
+
+/**
+ * Count the number of contiguous data clusters
+ *
+ *
+ * This function scans tables for contiguous allocated or free clusters.
+ */
+static unsigned int qed_count_contiguous_clusters(BDRVQEDState *s,
+ QEDTable *table,
+ unsigned int index,
+ unsigned int n,
+ uint64_t *offset)
+{
+ unsigned int end = MIN(index + n, s->table_nelems);
+ uint64_t last = table->offsets[index];
+ unsigned int i;
+
+ *offset = last;
+
+ for (i = index + 1; i < end; i++) {
+ if (last == 0) {
+ /* Counting free clusters */
+ if (table->offsets[i] != 0) {
+ break;
+ }
+ } else {
+ /* Counting allocated clusters */
+ if (table->offsets[i] != last + s->header.cluster_size) {
+ break;
+ }
+ last = table->offsets[i];
+ }
+ }
+ return i - index;
+}
+
+typedef struct {
+ BDRVQEDState *s;
+ uint64_t pos;
+ size_t len;
+
+ QEDRequest *request;
+
+ /* User callback */
+ QEDFindClusterFunc *cb;
+ void *opaque;
+} QEDFindClusterCB;
+
+static void qed_find_cluster_cb(void *opaque, int ret)
+{
+ QEDFindClusterCB *find_cluster_cb = opaque;
+ BDRVQEDState *s = find_cluster_cb->s;
+ QEDRequest *request = find_cluster_cb->request;
+ uint64_t offset = 0;
+ size_t len = 0;
+ unsigned int index;
+ unsigned int n;
+
+ if (ret) {
+ ret = QED_CLUSTER_ERROR;
+ goto out;
+ }
+
+ index = qed_l2_index(s, find_cluster_cb->pos);
+ n = qed_bytes_to_clusters(s,
+ qed_offset_into_cluster(s, find_cluster_cb->pos) +
+ find_cluster_cb->len);
+ n = qed_count_contiguous_clusters(s, request->l2_table->table,
+ index, n, &offset);
+
+ ret = offset ? QED_CLUSTER_FOUND : QED_CLUSTER_L2;
+ len = MIN(find_cluster_cb->len, n * s->header.cluster_size -
+ qed_offset_into_cluster(s, find_cluster_cb->pos));
+
+out:
+    find_cluster_cb->cb(find_cluster_cb->opaque, ret, offset, len);
+ qemu_free(find_cluster_cb);
+}
+
+/**
+ * Find the offset of a data cluster
+ *
+ */
+void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
+ size_t len, QEDFindClusterFunc *cb, void *opaque)
+{
+ QEDFindClusterCB *find_cluster_cb;
+ uint64_t l2_offset;
+
+ /* Limit length to L2 boundary. Requests are broken up at the L2 boundary
+ * so that a request acts on one L2 table at a time.
+ */
+ len = MIN(len, (((pos >> s->l1_shift) + 1) << s->l1_shift) - pos);
+
+ l2_offset = s->l1_table->offsets[qed_l1_index(s, pos)];
+ if (!l2_offset) {
+ cb(opaque, QED_CLUSTER_L1, 0, len);
+ return;
+ }
+
+ find_cluster_cb = qemu_malloc(sizeof(*find_cluster_cb));
+ find_cluster_cb->s = s;
+ find_cluster_cb->pos = pos;
+ find_cluster_cb->len = len;
+ find_cluster_cb->cb = cb;
+ find_cluster_cb->opaque = opaque;
+ find_cluster_cb->request = request;
+
+ qed_read_l2_table(s, request, l2_offset,
+ qed_find_cluster_cb, find_cluster_cb);
+}
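[Editor's note: a small worked example (not part of the patch) of the L2-boundary clamp in qed_find_cluster() above, assuming the default geometry where l1_shift is 31 and each L2 table covers 2 GB:]

    #include <inttypes.h>
    #include <stdio.h>

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    int main(void)
    {
        int l1_shift = 31;                   /* default geometry: 2 GB per L2 table */
        uint64_t pos = (2ULL << 30) - 4096;  /* 4 KB before the first L2 boundary */
        uint64_t len = 64 * 1024;            /* caller asks for 64 KB */

        /* Same clamp as in qed_find_cluster(): stop at the next L2 boundary */
        len = MIN(len, (((pos >> l1_shift) + 1) << l1_shift) - pos);

        printf("clamped len = %" PRIu64 "\n", len);  /* prints 4096 */
        return 0;
    }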
diff --git a/block/qed-gencb.c b/block/qed-gencb.c
new file mode 100644
index 0000000..d389e12
--- /dev/null
+++ b/block/qed-gencb.c
@@ -0,0 +1,32 @@
+/*
+ * QEMU Enhanced Disk Format
+ *
+ * Copyright IBM, Corp. 2010
+ *
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qed.h"
+
+void *gencb_alloc(size_t len, BlockDriverCompletionFunc *cb, void *opaque)
+{
+ GenericCB *gencb = qemu_malloc(len);
+ gencb->cb = cb;
+ gencb->opaque = opaque;
+ return gencb;
+}
+
+void gencb_complete(void *opaque, int ret)
+{
+ GenericCB *gencb = opaque;
+ BlockDriverCompletionFunc *cb = gencb->cb;
+ void *user_opaque = gencb->opaque;
+
+ qemu_free(gencb);
+ cb(user_opaque, ret);
+}
diff --git a/block/qed-l2-cache.c b/block/qed-l2-cache.c
new file mode 100644
index 0000000..747a629
--- /dev/null
+++ b/block/qed-l2-cache.c
@@ -0,0 +1,131 @@
+/*
+ * QEMU Enhanced Disk Format L2 Cache
+ *
+ * Copyright IBM, Corp. 2010
+ *
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qed.h"
+
+/* Each L2 holds 2GB so this lets us fully cache a 100GB disk */
+#define MAX_L2_CACHE_SIZE 50
+
+/**
+ * Initialize the L2 cache
+ */
+void qed_init_l2_cache(L2TableCache *l2_cache,
+ L2TableAllocFunc *alloc_l2_table,
+ void *alloc_l2_table_opaque)
+{
+ QTAILQ_INIT(&l2_cache->entries);
+ l2_cache->n_entries = 0;
+ l2_cache->alloc_l2_table = alloc_l2_table;
+ l2_cache->alloc_l2_table_opaque = alloc_l2_table_opaque;
+}
+
+/**
+ * Free the L2 cache
+ */
+void qed_free_l2_cache(L2TableCache *l2_cache)
+{
+ CachedL2Table *entry, *next_entry;
+
+ QTAILQ_FOREACH_SAFE(entry, &l2_cache->entries, node, next_entry) {
+ qemu_free(entry->table);
+ qemu_free(entry);
+ }
+}
+
+/**
+ * Allocate an uninitialized entry from the cache
+ *
+ * The returned entry has a reference count of 1 and is owned by the caller.
+ */
+CachedL2Table *qed_alloc_l2_cache_entry(L2TableCache *l2_cache)
+{
+ CachedL2Table *entry;
+
+ entry = qemu_mallocz(sizeof(*entry));
+ entry->table = l2_cache->alloc_l2_table(l2_cache->alloc_l2_table_opaque);
+ entry->ref++;
+
+ return entry;
+}
+
+/**
+ * Decrease an entry's reference count and free if necessary when the reference
+ * count drops to zero.
+ */
+void qed_unref_l2_cache_entry(L2TableCache *l2_cache, CachedL2Table *entry)
+{
+ if (!entry) {
+ return;
+ }
+
+ entry->ref--;
+ if (entry->ref == 0) {
+ qemu_free(entry->table);
+ qemu_free(entry);
+ }
+}
+
+/**
+ * Find an entry in the L2 cache. This may return NULL and it's up to the
+ * caller to satisfy the cache miss.
+ *
+ * For a cached entry, this function increases the reference count and returns
+ * the entry.
+ */
+CachedL2Table *qed_find_l2_cache_entry(L2TableCache *l2_cache, uint64_t offset)
+{
+ CachedL2Table *entry;
+
+ QTAILQ_FOREACH(entry, &l2_cache->entries, node) {
+ if (entry->offset == offset) {
+ entry->ref++;
+ return entry;
+ }
+ }
+ return NULL;
+}
+
+/**
+ * Commit an L2 cache entry into the cache. This is meant to be used as part of
+ * the process to satisfy a cache miss. A caller would allocate an entry which
+ * is not actually in the L2 cache and then once the entry was valid and
+ * present on disk, the entry can be committed into the cache.
+ *
+ * Since the cache is write-through, it's important that this function is not
+ * called until the entry is present on disk and the L1 has been updated to
+ * point to the entry.
+ *
+ * This function will take a reference to the entry so the caller is still
+ * responsible for unreferencing the entry.
+ */
+void qed_commit_l2_cache_entry(L2TableCache *l2_cache, CachedL2Table *l2_table)
+{
+ CachedL2Table *entry;
+
+ entry = qed_find_l2_cache_entry(l2_cache, l2_table->offset);
+ if (entry) {
+ qed_unref_l2_cache_entry(l2_cache, entry);
+ return;
+ }
+
+ if (l2_cache->n_entries >= MAX_L2_CACHE_SIZE) {
+ entry = QTAILQ_FIRST(&l2_cache->entries);
+ QTAILQ_REMOVE(&l2_cache->entries, entry, node);
+ l2_cache->n_entries--;
+ qed_unref_l2_cache_entry(l2_cache, entry);
+ }
+
+ l2_table->ref++;
+ l2_cache->n_entries++;
+ QTAILQ_INSERT_TAIL(&l2_cache->entries, l2_table, node);
+}
diff --git a/block/qed-table.c b/block/qed-table.c
new file mode 100644
index 0000000..9a72582
--- /dev/null
+++ b/block/qed-table.c
@@ -0,0 +1,242 @@
+/*
+ * QEMU Enhanced Disk Format Table I/O
+ *
+ * Copyright IBM, Corp. 2010
+ *
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qed.h"
+
+typedef struct {
+ GenericCB gencb;
+ BDRVQEDState *s;
+ QEDTable *table;
+
+ struct iovec iov;
+ QEMUIOVector qiov;
+} QEDReadTableCB;
+
+static void qed_read_table_cb(void *opaque, int ret)
+{
+ QEDReadTableCB *read_table_cb = opaque;
+ QEDTable *table = read_table_cb->table;
+ int noffsets = read_table_cb->iov.iov_len / sizeof(uint64_t);
+ int i;
+
+ /* Handle I/O error */
+ if (ret) {
+ goto out;
+ }
+
+ /* Byteswap and verify offsets */
+ for (i = 0; i < noffsets; i++) {
+ table->offsets[i] = le64_to_cpu(table->offsets[i]);
+ }
+
+out:
+    /* Completion */
+ gencb_complete(&read_table_cb->gencb, ret);
+}
+
+static void qed_read_table(BDRVQEDState *s, uint64_t offset, QEDTable *table,
+ BlockDriverCompletionFunc *cb, void *opaque)
+{
+ QEDReadTableCB *read_table_cb = gencb_alloc(sizeof(*read_table_cb),
+ cb, opaque);
+ QEMUIOVector *qiov = &read_table_cb->qiov;
+ BlockDriverAIOCB *aiocb;
+
+ read_table_cb->s = s;
+ read_table_cb->table = table;
+ read_table_cb->iov.iov_base = table->offsets,
+ read_table_cb->iov.iov_len = s->header.cluster_size * s->header.table_size,
+
+ qemu_iovec_init_external(qiov, &read_table_cb->iov, 1);
+ aiocb = bdrv_aio_readv(s->bs->file, offset / BDRV_SECTOR_SIZE, qiov,
+ read_table_cb->iov.iov_len / BDRV_SECTOR_SIZE,
+ qed_read_table_cb, read_table_cb);
+ if (!aiocb) {
+ qed_read_table_cb(read_table_cb, -EIO);
+ }
+}
+
+typedef struct {
+ GenericCB gencb;
+ BDRVQEDState *s;
+ QEDTable *orig_table;
+ bool flush; /* flush after write? */
+
+ struct iovec iov;
+ QEMUIOVector qiov;
+
+ QEDTable table;
+} QEDWriteTableCB;
+
+static void qed_write_table_cb(void *opaque, int ret)
+{
+ QEDWriteTableCB *write_table_cb = opaque;
+
+ if (ret) {
+ goto out;
+ }
+
+ if (write_table_cb->flush) {
+ /* We still need to flush first */
+ write_table_cb->flush = false;
+ bdrv_aio_flush(write_table_cb->s->bs, qed_write_table_cb,
+ write_table_cb);
+ return;
+ }
+
+out:
+    gencb_complete(&write_table_cb->gencb, ret);
+ return;
+}
+
+/**
+ * Write out an updated part or all of a table
+ *
+ */
+static void qed_write_table(BDRVQEDState *s, uint64_t offset, QEDTable *table,
+ unsigned int index, unsigned int n, bool flush,
+ BlockDriverCompletionFunc *cb, void *opaque)
+{
+ QEDWriteTableCB *write_table_cb;
+ BlockDriverAIOCB *aiocb;
+ unsigned int sector_mask = BDRV_SECTOR_SIZE / sizeof(uint64_t) - 1;
+ unsigned int start, end, i;
+ size_t len_bytes;
+
+ /* Calculate indices of the first and one after last elements */
+ start = index & ~sector_mask;
+ end = (index + n + sector_mask) & ~sector_mask;
+
+ len_bytes = (end - start) * sizeof(uint64_t);
+
+ write_table_cb = gencb_alloc(sizeof(*write_table_cb) + len_bytes,
+ cb, opaque);
+ write_table_cb->s = s;
+ write_table_cb->orig_table = table;
+ write_table_cb->flush = flush;
+ write_table_cb->iov.iov_base = write_table_cb->table.offsets;
+ write_table_cb->iov.iov_len = len_bytes;
+ qemu_iovec_init_external(&write_table_cb->qiov, &write_table_cb->iov, 1);
+
+ /* Byteswap table */
+ for (i = start; i < end; i++) {
+ write_table_cb->table.offsets[i - start] = cpu_to_le64(table->offsets[i]);
+ }
+
+ /* Adjust for offset into table */
+ offset += start * sizeof(uint64_t);
+
+ aiocb = bdrv_aio_writev(s->bs->file, offset / BDRV_SECTOR_SIZE,
+ &write_table_cb->qiov,
+ write_table_cb->iov.iov_len / BDRV_SECTOR_SIZE,
+ qed_write_table_cb, write_table_cb);
+ if (!aiocb) {
+ qed_write_table_cb(write_table_cb, -EIO);
+ }
+}
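[Editor's note: a worked example (not part of the patch) of the sector-alignment arithmetic in qed_write_table() above; with 512-byte sectors, 64 table entries of 8 bytes fill one sector, so updating entries 100..102 rewrites entries 64..127:]

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int sector_mask = 512 / sizeof(uint64_t) - 1;  /* 64 entries/sector - 1 = 63 */
        unsigned int index = 100, n = 3;                        /* update entries 100..102 */

        unsigned int start = index & ~sector_mask;                      /* 64 */
        unsigned int end   = (index + n + sector_mask) & ~sector_mask;  /* 128 */
        size_t len_bytes   = (end - start) * sizeof(uint64_t);          /* 512: one whole sector */

        printf("start=%u end=%u len=%zu\n", start, end, len_bytes);
        return 0;
    }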
+
+static void qed_read_l1_table_cb(void *opaque, int ret)
+{
+ *(int *)opaque = ret;
+}
+
+/**
+ * Read the L1 table synchronously
+ */
+int qed_read_l1_table(BDRVQEDState *s)
+{
+ int ret = -EINPROGRESS;
+
+ /* TODO push/pop async context? */
+
+ qed_read_table(s, s->header.l1_table_offset,
+ s->l1_table, qed_read_l1_table_cb, &ret);
+ while (ret == -EINPROGRESS) {
+ qemu_aio_wait();
+ }
+ return ret;
+}
+
+void qed_write_l1_table(BDRVQEDState *s, unsigned int index, unsigned int n,
+ BlockDriverCompletionFunc *cb, void *opaque)
+{
+ qed_write_table(s, s->header.l1_table_offset,
+ s->l1_table, index, n, false, cb, opaque);
+}
+
+typedef struct {
+ GenericCB gencb;
+ BDRVQEDState *s;
+ uint64_t l2_offset;
+ QEDRequest *request;
+} QEDReadL2TableCB;
+
+static void qed_read_l2_table_cb(void *opaque, int ret)
+{
+ QEDReadL2TableCB *read_l2_table_cb = opaque;
+ QEDRequest *request = read_l2_table_cb->request;
+ BDRVQEDState *s = read_l2_table_cb->s;
+
+ if (ret) {
+ /* can't trust loaded L2 table anymore */
+ qed_unref_l2_cache_entry(&s->l2_cache, request->l2_table);
+ request->l2_table = NULL;
+ } else {
+ request->l2_table->offset = read_l2_table_cb->l2_offset;
+ qed_commit_l2_cache_entry(&s->l2_cache, request->l2_table);
+ }
+
+ gencb_complete(&read_l2_table_cb->gencb, ret);
+}
+
+void qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset,
+ BlockDriverCompletionFunc *cb, void *opaque)
+{
+ QEDReadL2TableCB *read_l2_table_cb;
+
+ qed_unref_l2_cache_entry(&s->l2_cache, request->l2_table);
+
+ /* Check for cached L2 entry */
+ request->l2_table = qed_find_l2_cache_entry(&s->l2_cache, offset);
+ if (request->l2_table) {
+ cb(opaque, 0);
+ return;
+ }
+
+ request->l2_table = qed_alloc_l2_cache_entry(&s->l2_cache);
+
+ read_l2_table_cb = gencb_alloc(sizeof(*read_l2_table_cb), cb, opaque);
+ read_l2_table_cb->s = s;
+ read_l2_table_cb->l2_offset = offset;
+ read_l2_table_cb->request = request;
+
+ qed_read_table(s, offset, request->l2_table->table,
+ qed_read_l2_table_cb, read_l2_table_cb);
+}
+
+void qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
+ unsigned int index, unsigned int n, bool flush,
+ BlockDriverCompletionFunc *cb, void *opaque)
+{
+ qed_write_table(s, request->l2_table->offset,
+ request->l2_table->table, index, n, flush, cb, opaque);
+}
diff --git a/block/qed.c b/block/qed.c
new file mode 100644
index 0000000..cf64418
--- /dev/null
+++ b/block/qed.c
@@ -0,0 +1,1103 @@
+/*
+ * QEMU Enhanced Disk Format
+ *
+ * Copyright IBM, Corp. 2010
+ *
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qed.h"
+
+/* TODO blkdebug support */
+/* TODO BlockDriverState::buffer_alignment */
+/* TODO check L2 table sizes before accessing them? */
+/* TODO skip zero prefill since the filesystem should zero the sectors anyway */
+/* TODO if a table element's offset is invalid then the image is broken. If
+ * there was a power failure and the table update reached storage but the data
+ * being pointed to did not, forget about the lost data by clearing the offset.
+ * However, need to be careful to detect invalid offsets for tables that are
+ * read *after* more clusters have been allocated. */
+
+enum {
+ QED_MAGIC = 'Q' | 'E' << 8 | 'D' << 16 | '\0' << 24,
+
+ /* The image supports a backing file */
+ QED_F_BACKING_FILE = 0x01,
+
+ /* The image has the backing file format */
+ QED_CF_BACKING_FORMAT = 0x01,
+
+ /* Feature bits must be used when the on-disk format changes */
+ QED_FEATURE_MASK = QED_F_BACKING_FILE, /* supported feature bits */
+ QED_COMPAT_FEATURE_MASK = QED_CF_BACKING_FORMAT, /* supported compat feature bits */
+
+ /* Data is stored in groups of sectors called clusters. Cluster size must
+ * be large to avoid keeping too much metadata. I/O requests that have
+ * sub-cluster size will require read-modify-write.
+ */
+ QED_MIN_CLUSTER_SIZE = 4 * 1024, /* in bytes */
+ QED_MAX_CLUSTER_SIZE = 64 * 1024 * 1024,
+ QED_DEFAULT_CLUSTER_SIZE = 64 * 1024,
+
+ /* Allocated clusters are tracked using a 2-level pagetable. Table size is
+ * a multiple of clusters so large maximum image sizes can be supported
+ * without jacking up the cluster size too much.
+ */
+ QED_MIN_TABLE_SIZE = 1, /* in clusters */
+ QED_MAX_TABLE_SIZE = 16,
+ QED_DEFAULT_TABLE_SIZE = 4,
+};
+
+static void qed_aio_cancel(BlockDriverAIOCB *acb)
+{
+ qemu_aio_release(acb);
+}
+
+static AIOPool qed_aio_pool = {
+ .aiocb_size = sizeof(QEDAIOCB),
+ .cancel = qed_aio_cancel,
+};
+
+/**
+ * Allocate memory that satisfies image file and backing file alignment requirements
+ *
+ * TODO make this common and consider propagating max buffer_alignment to the root image
+ */
+static void *qed_memalign(BDRVQEDState *s, size_t len)
+{
+ size_t align = s->bs->file->buffer_alignment;
+ BlockDriverState *backing_hd = s->bs->backing_hd;
+
+ if (backing_hd && backing_hd->buffer_alignment > align) {
+ align = backing_hd->buffer_alignment;
+ }
+
+ return qemu_memalign(align, len);
+}
+
+static int bdrv_qed_probe(const uint8_t *buf, int buf_size,
+ const char *filename)
+{
+ const QEDHeader *header = (const void *)buf;
+
+ if (buf_size < sizeof(*header)) {
+ return 0;
+ }
+ if (le32_to_cpu(header->magic) != QED_MAGIC) {
+ return 0;
+ }
+ return 100;
+}
+
+static void qed_header_le_to_cpu(const QEDHeader *le, QEDHeader *cpu)
+{
+ cpu->magic = le32_to_cpu(le->magic);
+ cpu->cluster_size = le32_to_cpu(le->cluster_size);
+ cpu->table_size = le32_to_cpu(le->table_size);
+ cpu->first_cluster = le32_to_cpu(le->first_cluster);
+ cpu->features = le64_to_cpu(le->features);
+ cpu->compat_features = le64_to_cpu(le->compat_features);
+ cpu->l1_table_offset = le64_to_cpu(le->l1_table_offset);
+ cpu->image_size = le64_to_cpu(le->image_size);
+ cpu->backing_file_offset = le32_to_cpu(le->backing_file_offset);
+ cpu->backing_file_size = le32_to_cpu(le->backing_file_size);
+ cpu->backing_fmt_offset = le32_to_cpu(le->backing_fmt_offset);
+ cpu->backing_fmt_size = le32_to_cpu(le->backing_fmt_size);
+}
+
+static void qed_header_cpu_to_le(const QEDHeader *cpu, QEDHeader *le)
+{
+ le->magic = cpu_to_le32(cpu->magic);
+ le->cluster_size = cpu_to_le32(cpu->cluster_size);
+ le->table_size = cpu_to_le32(cpu->table_size);
+ le->first_cluster = cpu_to_le32(cpu->first_cluster);
+ le->features = cpu_to_le64(cpu->features);
+ le->compat_features = cpu_to_le64(cpu->compat_features);
+ le->l1_table_offset = cpu_to_le64(cpu->l1_table_offset);
+ le->image_size = cpu_to_le64(cpu->image_size);
+ le->backing_file_offset = cpu_to_le32(cpu->backing_file_offset);
+ le->backing_file_size = cpu_to_le32(cpu->backing_file_size);
+ le->backing_fmt_offset = cpu_to_le32(cpu->backing_fmt_offset);
+ le->backing_fmt_size = cpu_to_le32(cpu->backing_fmt_size);
+}
+
+static uint64_t qed_max_image_size(uint32_t cluster_size, uint32_t table_size)
+{
+ uint64_t table_entries;
+ uint64_t l2_size;
+
+ table_entries = (table_size * cluster_size) / 8;
+ l2_size = table_entries * cluster_size;
+
+ return l2_size * table_entries;
+}
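[Editor's note: plugging the default values into qed_max_image_size() above (not part of the patch) shows the limit these formulas imply: 32768 L2 entries, 2 GB mapped per L2 table, and a 64 TB maximum image size:]

    #include <inttypes.h>
    #include <stdio.h>

    int main(void)
    {
        /* Default geometry from the patch: 64 KB clusters, 4-cluster tables */
        uint64_t cluster_size  = 64 * 1024;
        uint64_t table_size    = 4;
        uint64_t table_entries = table_size * cluster_size / 8;  /* 32768 */
        uint64_t l2_size       = table_entries * cluster_size;   /* 2 GB mapped per L2 */

        printf("max image size = %" PRIu64 " TB\n",
               (l2_size * table_entries) >> 40);                 /* prints 64 */
        return 0;
    }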
+
+static bool qed_is_cluster_size_valid(uint32_t cluster_size)
+{
+ if (cluster_size < QED_MIN_CLUSTER_SIZE ||
+ cluster_size > QED_MAX_CLUSTER_SIZE) {
+ return false;
+ }
+ if (cluster_size & (cluster_size - 1)) {
+ return false; /* not power of 2 */
+ }
+ return true;
+}
+
+static bool qed_is_table_size_valid(uint32_t table_size)
+{
+ if (table_size < QED_MIN_TABLE_SIZE ||
+ table_size > QED_MAX_TABLE_SIZE) {
+ return false;
+ }
+ if (table_size & (table_size - 1)) {
+ return false; /* not power of 2 */
+ }
+ return true;
+}
+
+static bool qed_is_image_size_valid(uint64_t image_size, uint32_t cluster_size,
+ uint32_t table_size)
+{
+ if (image_size == 0) {
+ /* Supporting zero size images makes life harder because even the L1
+ * table is not needed. Make life simple and forbid zero size images.
+ */
+ return false;
+ }
+ if (image_size & (cluster_size - 1)) {
+ return false; /* not multiple of cluster size */
+ }
+ if (image_size > qed_max_image_size(cluster_size, table_size)) {
+ return false; /* image is too large */
+ }
+ return true;
+}
+
+/**
+ * Test if a byte offset is cluster aligned and within the image file
+ */
+static bool qed_check_byte_offset(BDRVQEDState *s, uint64_t offset)
+{
+ if (offset & (s->header.cluster_size - 1)) {
+ return false;
+ }
+ if (offset == 0) {
+ return false; /* first cluster contains the header and is not valid */
+ }
+ return offset < s->file_size;
+}
+
+/**
+ * Read a string of known length from the image file
+ *
+ *
+ * The string is NUL-terminated.
+ */
+static int qed_read_string(BlockDriverState *file, uint64_t offset, size_t n,
+ char *buf, size_t buflen)
+{
+ int ret;
+ if (n >= buflen) {
+ return -EINVAL;
+ }
+ ret = bdrv_pread(file, offset, buf, n);
+ if (ret != n) {
+ return ret;
+ }
+ buf[n] = '\0';
+ return 0;
+}
+
+/**
+ * Allocate new clusters
+ *
+ */
+static int qed_alloc_clusters(BDRVQEDState *s, unsigned int n, uint64_t *offset)
+{
+ *offset = s->file_size;
+ s->file_size += n * s->header.cluster_size;
+ return 0;
+}
+
+static QEDTable *qed_alloc_table(void *opaque)
+{
+ BDRVQEDState *s = opaque;
+
+ /* Honor O_DIRECT memory alignment requirements */
+ return qed_memalign(s, s->header.cluster_size * s->header.table_size);
+}
+
+/**
+ * Allocate a new zeroed L2 table
+ */
+static CachedL2Table *qed_new_l2_table(BDRVQEDState *s)
+{
+ uint64_t offset;
+ int ret;
+ CachedL2Table *l2_table;
+
+ ret = qed_alloc_clusters(s, s->header.table_size, &offset);
+ if (ret) {
+ return NULL;
+ }
+
+ l2_table = qed_alloc_l2_cache_entry(&s->l2_cache);
+ l2_table->offset = offset;
+
+ memset(l2_table->table->offsets, 0,
+ s->header.cluster_size * s->header.table_size);
+ return l2_table;
+}
+
+static int bdrv_qed_open(BlockDriverState *bs, int flags)
+{
+ BDRVQEDState *s = bs->opaque;
+ QEDHeader le_header;
+ int64_t file_size;
+ int ret;
+
+ s->bs = bs;
+ QSIMPLEQ_INIT(&s->allocating_write_reqs);
+
+ ret = bdrv_pread(bs->file, 0, &le_header, sizeof(le_header));
+ if (ret != sizeof(le_header)) {
+ return ret;
+ }
+ qed_header_le_to_cpu(&le_header, &s->header);
+
+ if (s->header.magic != QED_MAGIC) {
+ return -ENOENT;
+ }
+ if (s->header.features & ~QED_FEATURE_MASK) {
+ return -ENOTSUP; /* image uses unsupported feature bits */
+ }
+ if (!qed_is_cluster_size_valid(s->header.cluster_size)) {
+ return -EINVAL;
+ }
+
+ /* Round up file size to the next cluster */
+ file_size = bdrv_getlength(bs->file);
+ if (file_size < 0) {
+ return file_size;
+ }
+ s->file_size = qed_start_of_cluster(s, file_size + s->header.cluster_size - 1);
+
+ if (!qed_is_table_size_valid(s->header.table_size)) {
+ return -EINVAL;
+ }
+ if (!qed_is_image_size_valid(s->header.image_size,
+ s->header.cluster_size,
+ s->header.table_size)) {
+ return -EINVAL;
+ }
+ if (!qed_check_byte_offset(s, s->header.l1_table_offset)) {
+ return -EINVAL;
+ }
+
+ s->table_nelems = (s->header.cluster_size * s->header.table_size) /
+ sizeof(s->l1_table->offsets[0]);
+ s->l2_shift = get_bits_from_size(s->header.cluster_size);
+ s->l2_mask = s->table_nelems - 1;
+ s->l1_shift = s->l2_shift + get_bits_from_size(s->l2_mask + 1);
+
+ if ((s->header.features & QED_F_BACKING_FILE)) {
+ ret = qed_read_string(bs->file, s->header.backing_file_offset,
+ s->header.backing_file_size, bs->backing_file,
+ sizeof(bs->backing_file));
+ if (ret < 0) {
+ return ret;
+ }
+
+ if ((s->header.compat_features & QED_CF_BACKING_FORMAT)) {
+ ret = qed_read_string(bs->file, s->header.backing_fmt_offset,
+ s->header.backing_fmt_size,
+ bs->backing_format,
+ sizeof(bs->backing_format));
+ if (ret < 0) {
+ return ret;
+ }
+ }
IMHO we should make the backing format compulsory whenever a
backing file is used. The only time probing is required is when
initially creating the child image; thereafter there's no
benefit to probing again.

Regards,
Daniel
--
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
Anthony Liguori
2010-09-06 12:52:41 UTC
Permalink
Post by Daniel P. Berrange
I agree with ditching compression, but encryption is an important
capability which cannot be satisfactorily added at other layers
in the stack. While block devices / local filesystems can layer
in dm-crypt in the host, this is not possible with network/cluster
filesystems which account for a non-trivial target audience.
ecryptfs should work with NFS these days. If it still doesn't, it will
in the not too distant future.
Post by Daniel P. Berrange
Adding
encryption inside the guest is sub-optimal because you cannot do
secure automation of guest startup. Either you require manaual
intervention to start every guest to enter the key, or if you
hardcode the key, then anyone who can access the guest disk image
can start the guest.
I think this belongs in the VFS level but from a format perspective, an
encryption feature would be easy to add.
Post by Daniel P. Berrange
+
+ if ((s->header.compat_features& QED_CF_BACKING_FORMAT)) {
+ ret = qed_read_string(bs->file, s->header.backing_fmt_offset,
+ s->header.backing_fmt_size,
+ bs->backing_format,
+ sizeof(bs->backing_format));
+ if (ret< 0) {
+ return ret;
+ }
+ }
IMHO we should make the backing format compulsory with use of
the backing file. The only time probing is required is when
initially creating the child image, thereafter there's no
benefit to probing again.
Stefan originally made it mandatory but I asked to make it optional.

From a format specification perspective, backing_fmt introduces some
problems. What does a backing_fmt of 'vmdk' mean outside of qemu?

More importantly, humans don't create images by hand. Instead, they
use tools like qemu-img. If you think we should force the specification
of a backing file format, then qemu-img is the place we should do it.

Regards,

Anthony Liguori
Post by Daniel P. Berrange
Regards,
Daniel
Daniel P. Berrange
2010-09-06 13:35:45 UTC
Permalink
Post by Anthony Liguori
Post by Daniel P. Berrange
I agree with ditching compression, but encryption is an important
capability which cannot be satisfactorily added at other layers
in the stack. While block devices / local filesystems can layer
in dm-crypt in the host, this is not possible with network/cluster
filesystems which account for a non-trivial target audience.
ecryptfs should work with NFS these days. If it still doesn't, it will
in the not too distant future.
Assuming it does work with NFS, IIUC, that still requires the user to
have root privileges to set up ecryptfs for the NFS mount in question.
So it takes care of the use case where the host admin doesn't trust
the network/remote fs admin, but doesn't work for the case of local
unprivileged users with NFS home dirs & a host admin who doesn't help.
Post by Anthony Liguori
Post by Daniel P. Berrange
Adding
encryption inside the guest is sub-optimal because you cannot do
secure automation of guest startup. Either you require manaual
intervention to start every guest to enter the key, or if you
hardcode the key, then anyone who can access the guest disk image
can start the guest.
I think this belongs in the VFS level but from a format perspective, an
encryption feature would be easy to add.
Post by Daniel P. Berrange
+
+ if ((s->header.compat_features& QED_CF_BACKING_FORMAT)) {
+ ret = qed_read_string(bs->file, s->header.backing_fmt_offset,
+ s->header.backing_fmt_size,
+ bs->backing_format,
+ sizeof(bs->backing_format));
+ if (ret< 0) {
+ return ret;
+ }
+ }
IMHO we should make the backing format compulsory with use of
the backing file. The only time probing is required is when
initially creating the child image, thereafter there's no
benefit to probing again.
Stefan originally made it mandatory but I asked to make it optional.
From a format specification perspective, backing_fmt introduces some
problems. What does a backing_fmt of 'vmdk' mean outside of qemu?
As currently implemented the string refers to a QEMU block driver
which is perhaps not the best choice for a general purpose file
format, if we want this applicable to other non-QEMU apps. Perhaps
it would be better if we explicitly declared backing format as an
enumerated int that represents specific file formats, thus decoupling
it from a specific driver.

Another related idea is perhaps to specify that if backing_fmt is
omitted in the metadata, the backing file must be treated as a QED
format file, rather than probed. Arguably qemu's VMDK driver should
be treating all VMDK backing files as VMDK format rather than probing
since I'm pretty sure VMware has no idea of a backing file in qcow or any
other format.
Post by Anthony Liguori
More importantly, humans don't create images by hand. Instead, they
use tools like qemu-img. If you think we should force the specification
of a backing file format, then qemu-img is the place we should do it.
Certainly qemu-img can always add a format, even if the specification
declared it optional, but I think it's worth considering declaring it
compulsory in the spec, to take that variable out of the equation
for apps using the images.

Regards,
Daniel
--
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
Anthony Liguori
2010-09-06 16:38:52 UTC
Permalink
Post by Daniel P. Berrange
Post by Anthony Liguori
Post by Daniel P. Berrange
I agree with ditching compression, but encryption is an important
capability which cannot be satisfactorily added at other layers
in the stack. While block devices / local filesystems can layer
in dm-crypt in the host, this is not possible with network/cluster
filesystems which account for a non-trivial target audience.
ecryptfs should work with NFS these days. If it still doesn't, it will
in the not too distant future.
Assuming it does work with NFS, IIUC, that still requires the user to
have root privileges to setup ecryptfs for the NFS mount in question.
So it takes care of the use case where the host admin doesn't trust
the network/remote fs admin, but doesn't work for the case of local
unprivileged users with NFS home dirs& a host admin who doesnt help.
There's talk of moving ecryptfs from a stackable file system to a VFS
feature. Among other things, this would make it usable by
non-privileged users since there's really no reason for it to not be.

Let's take a step back though, as I'd like to point out two things. The
first is that the format has feature support, which means that if it's
just a matter of adding something to the header and encrypting blocks,
then it's super easy to add. Furthermore, you get graceful detection of failure when
using an encrypted image with a version of QEMU that doesn't support
encryption in QED. When creating new images that aren't encrypted with
the new QEMU, the images still work with old QEMUs.

So really, there's little rush to add encryption (or any feature) to
QED. The main focus ATM is making sure we achieve good performance and good
reliability.

But encryption is never simple. If you want anything more than a toy,
you really need to integrate into a key ring system, make use of a
crypto API to leverage cryptographic accelerators, etc. This is why
relying on a filesystem (or VFS feature) makes so much sense.
Post by Daniel P. Berrange
As currently implemented the string refers to a QEMU block driver
which is perhaps not the best choice for a general purpose file
format, if we want this applicable to other non-QEMU apps. Perhaps
it would be better if we explicitly declared backing format as an
enumerated int that represents specific file formats, thus decoupling
it from a specific driver.
That's one of the reasons I made this an optional feature. I think
we're going to have to revisit the backing format in the future to be
something more meaningful.

For the purposes of the spec, I was going to say that backing_fmt was a
suggestion to an implementation on how to interpret backing_file and
leave it at that.

In terms of making something that's strictly enforced, I would suggest
not specifying the format but rather having something like
is_backing_raw. IOW, a boolean that would be set if the backing file
was raw (and not probe-able). Otherwise, the backing format can be
safely probed.

I would then say that backing file cannot be raw unless that bit is set
or something like that.
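To make that concrete, here is a minimal sketch of what such a bit could
look like; the flag name, its value, and the helper are hypothetical and
not part of the current spec:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical feature bit: backing file is raw and must not be probed. */
    #define QED_F_BACKING_FILE_NO_PROBE  0x04

    /* Sketch only: pick how to open the backing file. */
    static const char *qed_backing_format(uint64_t features)
    {
        if (features & QED_F_BACKING_FILE_NO_PROBE) {
            return "raw";   /* never probe -- treat the backing file as raw */
        }
        return NULL;        /* NULL means the format may be safely probed */
    }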
Post by Daniel P. Berrange
Another related idea is perhaps to specify that if backing_fmt is
omitted in the metadata, the backing file must be treated as a QED
format file, rather than probed.
!raw would be a better way of specifying it but yeah, I think it's a
reasonable idea.

Regards,

Anthony Liguori
Post by Daniel P. Berrange
Post by Anthony Liguori
More importantly, humans don't create image files by hand. Instead, they
use tools like qemu-img. If you think we should force the specification
of a backing file format, then qemu-img is the place we should do it.
Certainly qemu-img can always add a format, even if the specification
declared it optional, but I think it's worth considering declaring it
compulsory in the spec, to take that variable out of the equation
for apps using the images.
Regards,
Daniel
Anthony Liguori
2010-09-06 13:06:00 UTC
Permalink
Post by Stefan Hajnoczi
QEMU Enhanced Disk format is a disk image format that forgoes features
found in qcow2 in favor of better levels of performance and data
integrity. Due to its simpler on-disk layout, it is possible to safely
perform metadata updates more efficiently.
Installations, suspend-to-disk, and other allocation-heavy I/O workloads
will see increased performance due to fewer I/Os and syncs. Workloads
that do not cause new clusters to be allocated will perform similar to
raw images due to in-memory metadata caching.
The format supports sparse disk images. It does not rely on the host
filesystem holes feature, making it a good choice for sparse disk images
that need to be transferred over channels where holes are not supported.
Backing files are supported so only deltas against a base image can be
stored.
The file format is extensible so that additional features can be added
later with graceful compatibility handling.
Internal snapshots are not supported. This eliminates the need for
additional metadata to track copy-on-write clusters.
Compression and encryption are not supported. They add complexity and
can be implemented at other layers in the stack (i.e. inside the guest
or on the host).
* Resizing the disk image. The capability has been designed in but the
code has not been written yet.
* Resetting the image after backing file commit completes.
* Changing the backing filename.
* Consistency check (fsck). This is simple due to the on-disk layout.
Another point worth mentioning is that our intention is to have a formal
specification of the format before merging. A start of that is located
at http://wiki.qemu.org/Features/QED

Regards,

Anthony Liguori
Avi Kivity
2010-09-07 14:51:51 UTC
Permalink
Post by Anthony Liguori
Another point worth mentioning is that our intention is to have a
formal specification of the format before merging. A start of that is
located at http://wiki.qemu.org/Features/QED
=Specification=
+---------+---------+---------+-----+
| extent0 | extent1 | extent2 | ... |
+---------+---------+---------+-----+
The first extent contains a header. The header contains information
about the first data extent. A data extent may be a data cluster, an
L2, or an L1 table. L1 and L2 tables are composed of one or more
contiguous extents.
==Header==
Header {
uint32_t magic; /* QED\0 */
Endianness?
Post by Anthony Liguori
uint32_t cluster_size; /* in bytes */
Does cluster == extent? If so, use the same terminology. If not, explain.

Usually extent is a variable size structure.
Post by Anthony Liguori
uint32_t table_size; /* table size, in clusters */
Presumably L1 table size? Or any table size?

Hm. It would be nicer not to require contiguous sectors anywhere. How
about a variable- or fixed-height tree?
Post by Anthony Liguori
uint32_t first_cluster; /* in clusters */
First cluster of what?
Post by Anthony Liguori
uint64_t features; /* format feature bits */
uint64_t compat_features; /* compat feature bits */
uint64_t l1_table_offset; /* L1 table offset, in clusters */
uint64_t image_size; /* total image size, in clusters */
Logical, yes?

Is the physical image size always derived from the host file metadata?
Is this always safe?
Post by Anthony Liguori
/* if (features & QED_F_BACKING_FILE) */
uint32_t backing_file_offset; /* in bytes from start of header */
uint32_t backing_file_size; /* in bytes */
It's really the filename size, not the file size. Also, make a note
that it is not zero terminated.
Post by Anthony Liguori
/* if (compat_features & QED_CF_BACKING_FORMAT) */
uint32_t backing_fmt_offset; /* in bytes from start of header */
uint32_t backing_fmt_size; /* in bytes */
Why not make it mandatory?
Post by Anthony Liguori
}
Need a checksum for the header.
Post by Anthony Liguori
==Extent table==
#define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
Table {
uint64_t offsets[TABLE_NOFFSETS];
}
It's fashionable to put checksums here.

Do we want a real extent-based format like modern filesystems? So after
defragmentation a full image has O(1) metadata?
Post by Anthony Liguori
+----------+
| L1 table |
+----------+
,------' | '------.
+----------+ | +----------+
| L2 table | ... | L2 table |
+----------+ +----------+
,------' | '------.
+----------+ | +----------+
| Data | ... | Data |
+----------+ +----------+
The table_size field allows tables to be multiples of the cluster
size. For example, cluster_size=64 KB and table_size=4 results in 256
KB tables.
=Operations=
==Read==
# If L2 table is not present in L1, read from backing image.
# If data cluster is not present in L2, read from backing image.
# Otherwise read data from cluster.
If not in backing image, provide zeros
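For illustration, the read path described above boils down to roughly the
following sketch (this is not the block/qed.c implementation; the globals
and helper functions are assumed, and the request is assumed not to cross
a cluster boundary):

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch of the two-level read lookup; names here are hypothetical. */
    extern uint64_t *l1_table;                  /* in-memory L1 table */
    extern uint64_t cluster_size, table_noffsets;
    extern uint64_t *load_l2(uint64_t offset);  /* reads/caches an L2 table */
    extern void read_backing_or_zero(uint64_t pos, void *buf, size_t len);
    extern void read_image(uint64_t file_offset, void *buf, size_t len);

    static void qed_read(uint64_t pos, void *buf, size_t len)
    {
        uint64_t l2_offset = l1_table[pos / (cluster_size * table_noffsets)];
        if (!l2_offset) {                       /* no L2 table in L1 */
            read_backing_or_zero(pos, buf, len);
            return;
        }
        uint64_t *l2 = load_l2(l2_offset);
        uint64_t data = l2[(pos / cluster_size) % table_noffsets];
        if (!data) {                            /* no data cluster in L2 */
            read_backing_or_zero(pos, buf, len);
            return;
        }
        read_image(data + pos % cluster_size, buf, len);
    }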
Post by Anthony Liguori
==Write==
# If L2 table is not present in L1, allocate new cluster and L2.
Perform L2 and L1 link after writing data.
# If data cluster is not present in L2, allocate new cluster. Perform
L2 link after writing data.
# Otherwise overwrite data cluster.
Detail copy-on-write from backing image.

On a partial write without a backing file, do we recommend zero-filling
the cluster (to avoid intra-cluster fragmentation)?
Post by Anthony Liguori
The L2 link '''should''' be made after the data is in place on
storage. However, when no ordering is enforced the worst case
scenario is an L2 link to an unwritten cluster.
Or it may cause corruption if the physical file size is not committed,
and L2 now points at a free cluster.
Post by Anthony Liguori
The L1 link '''must''' be made after the L2 cluster is in place on
storage. If the order is reversed then the L1 table may point to a
bogus L2 table. (Is this a problem since clusters are allocated at
the end of the file?)
==Grow==
# If table_size * TABLE_NOFFSETS < new_image_size, fail -EOVERFLOW.
The L1 table is not big enough.
With a variable-height tree, we allocate a new root, link its first
entry to the old root, and write the new header with updated root and
height.
Post by Anthony Liguori
# Write new image_size header field.
=Data integrity=
==Write==
Writes that complete before a flush must be stable when the flush
completes.
If storage is interrupted (e.g. power outage) then writes in progress
may be lost, stable, or partially completed. The storage must not be
otherwise corrupted or inaccessible after it is restarted.
We can remove this requirement by copying-on-write any metadata write,
and keeping two copies of the header (with version numbers and
checksums). Enterprise storage will not corrupt on writes, but
commodity storage may.
--
error compiling committee.c: too many arguments to function
Anthony Liguori
2010-09-07 15:40:46 UTC
Permalink
On 09/07/2010 09:51 AM, Avi Kivity wrote:

I'll let Stefan address most of this.
Post by Avi Kivity
Post by Anthony Liguori
uint32_t first_cluster; /* in clusters */
First cluster of what?
This should probably be header_size /* in clusters */ because that's
what it really means.
Post by Avi Kivity
Need a checksum for the header.
Is that not a bit overkill for what we're doing? What's the benefit?
Post by Avi Kivity
Post by Anthony Liguori
The L2 link '''should''' be made after the data is in place on
storage. However, when no ordering is enforced the worst case
scenario is an L2 link to an unwritten cluster.
Or it may cause corruption if the physical file size is not committed,
and L2 now points at a free cluster.
An fsync() will make sure the physical file size is committed. The
metadata does not carry any additional integrity guarantees over the
actual disk data except that in order to avoid internal corruption, we
have to order the L2 and L1 writes.

As part of the read process, it's important to validate that the L2
entries don't point to blocks beyond EOF. This is an indication of a
corrupted I/O operation and we need to treat that as an unallocated cluster.
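A sketch of that validation, assuming the image file length is known when
the table is read (the helper name is made up):

    #include <stdint.h>

    /* Treat an L2 entry pointing at or past the cluster-aligned end of the
     * image file as unallocated; it can only come from an interrupted write. */
    static int qed_cluster_offset_valid(uint64_t offset, uint64_t file_size,
                                        uint32_t cluster_size)
    {
        uint64_t end = file_size & ~((uint64_t)cluster_size - 1); /* round down */
        return offset != 0 && offset + cluster_size <= end;
    }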
Post by Avi Kivity
Post by Anthony Liguori
The L1 link '''must''' be made after the L2 cluster is in place on
storage. If the order is reversed then the L1 table may point to a
bogus L2 table. (Is this a problem since clusters are allocated at
the end of the file?)
==Grow==
# If table_size * TABLE_NOFFSETS < new_image_size, fail -EOVERFLOW.
The L1 table is not big enough.
With a variable-height tree, we allocate a new root, link its first
entry to the old root, and write the new header with updated root and
height.
Post by Anthony Liguori
# Write new image_size header field.
=Data integrity=
==Write==
Writes that complete before a flush must be stable when the flush
completes.
If storage is interrupted (e.g. power outage) then writes in progress
may be lost, stable, or partially completed. The storage must not be
otherwise corrupted or inaccessible after it is restarted.
We can remove this requirement by copying-on-write any metadata write,
and keeping two copies of the header (with version numbers and checksums).
QED has a property today that all metadata or cluster locations have a
single location on the disk format that is immutable. Defrag would
relax this but defrag can be slow.

Having an immutable on-disk location is a powerful property which
eliminates a lot of complexity with respect to reference counting and
dealing with free lists.

For the initial design I would avoid introducing something like this.
One of the nice things about features is that we can introduce
multi-level trees as a future feature if we really think it's the right
thing to do.

But we should start at a simple design with high confidence and high
performance, and then introduce features with the burden that we're
absolutely sure that we don't regress integrity or performance.

Regards,

Anthony Liguori
Post by Avi Kivity
Enterprise storage will not corrupt on writes, but commodity storage
may.
Avi Kivity
2010-09-07 16:09:40 UTC
Permalink
Post by Anthony Liguori
Post by Avi Kivity
Need a checksum for the header.
Is that not a bit overkill for what we're doing? What's the benefit?
Make sure we're not looking at a header write interrupted by a crash.
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
The L2 link '''should''' be made after the data is in place on
storage. However, when no ordering is enforced the worst case
scenario is an L2 link to an unwritten cluster.
Or it may cause corruption if the physical file size is not
committed, and L2 now points at a free cluster.
An fsync() will make sure the physical file size is committed. The
metadata does not carry an additional integrity guarantees over the
actual disk data except that in order to avoid internal corruption, we
have to order the L2 and L1 writes.
I was referring to "when no ordering is enforced, the worst case
scenario is an L2 link to an unwritten cluster". This isn't true -
worst case you point to an unallocated cluster which can then be claimed
by data or metadata.
Post by Anthony Liguori
As part of the read process, it's important to validate that the L2
entries don't point to blocks beyond EOF. This is an indication of a
corrupted I/O operation and we need to treat that as an unallocated cluster.
Right, but what if the first operation referring to that cluster is an
allocation?
Post by Anthony Liguori
Post by Avi Kivity
We can remove this requirement by copying-on-write any metadata
write, and keeping two copies of the header (with version numbers and
checksums).
QED has a property today that all metadata or cluster locations have a
single location on the disk format that is immutable. Defrag would
relax this but defrag can be slow.
Having an immutable on-disk location is a powerful property which
eliminates a lot of complexity with respect to reference counting and
dealing with free lists.
However, it exposes the format to "writes may corrupt overwritten data".
Post by Anthony Liguori
For the initial design I would avoid introducing something like this.
One of the nice things about features is that we can introduce
multi-level trees as a future feature if we really think it's the
right thing to do.
But we should start at a simple design with high confidence and high
performance, and then introduce features with the burden that we're
absolutely sure that we don't regress integrity or performance.
For most things, yes. Metadata checksums should be designed in though
(since we need to double the pointer size).

Variable height trees have the nice property that you don't need multi
cluster allocation. It's nice to avoid large L2s for very large disks.
--
error compiling committee.c: too many arguments to function
Anthony Liguori
2010-09-07 16:25:23 UTC
Permalink
Post by Avi Kivity
Post by Anthony Liguori
Post by Avi Kivity
Need a checksum for the header.
Is that not a bit overkill for what we're doing? What's the benefit?
Make sure we're not looking at a header write interrupted by a crash.
Couldn't hurt I guess. I don't think it's actually needed for L1/L2
tables FWIW.
Post by Avi Kivity
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
The L2 link '''should''' be made after the data is in place on
storage. However, when no ordering is enforced the worst case
scenario is an L2 link to an unwritten cluster.
Or it may cause corruption if the physical file size is not
committed, and L2 now points at a free cluster.
An fsync() will make sure the physical file size is committed. The
metadata does not carry an additional integrity guarantees over the
actual disk data except that in order to avoid internal corruption,
we have to order the L2 and L1 writes.
I was referring to "when no ordering is enforced, the worst case
scenario is an L2 link to an unwritten cluster". This isn't true -
worst case you point to an unallocated cluster which can then be
claimed by data or metadata.
Right, it's necessary to do an fsync to protect against this. To make
this user friendly, we could have a dirty bit in the header which gets
set on first metadata write and then cleared on clean shutdown.

Upon startup, if the dirty bit is set, we do an fsck.
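As a sketch of that scheme (the flag name, value, and helpers below are
invented for illustration and are not part of the current format):

    #include <stdint.h>

    #define QED_F_DIRTY  0x08            /* hypothetical "not cleanly closed" bit */

    extern void qed_write_header(void);  /* assumed helper: rewrite the header */
    extern void qed_fsync(void);         /* assumed helper: flush the image file */
    extern void qed_fsck(void);          /* assumed helper: scan/repair L1/L2 */

    /* On open: a set dirty bit means the tables may reference lost clusters. */
    static void qed_open_check(uint64_t *features)
    {
        if (*features & QED_F_DIRTY) {
            qed_fsck();
        }
    }

    /* Before the first metadata update: mark the image dirty, durably. */
    static void qed_mark_dirty(uint64_t *features)
    {
        *features |= QED_F_DIRTY;
        qed_write_header();
        qed_fsync();
    }

    /* On clean shutdown: flush, then clear the bit for the next open. */
    static void qed_mark_clean(uint64_t *features)
    {
        qed_fsync();
        *features &= ~(uint64_t)QED_F_DIRTY;
        qed_write_header();
    }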
Post by Avi Kivity
Post by Anthony Liguori
Post by Avi Kivity
We can remove this requirement by copying-on-write any metadata
write, and keeping two copies of the header (with version numbers
and checksums).
QED has a property today that all metadata or cluster locations have
a single location on the disk format that is immutable. Defrag would
relax this but defrag can be slow.
Having an immutable on-disk location is a powerful property which
eliminates a lot of complexity with respect to reference counting and
dealing with free lists.
However, it exposes the format to "writes may corrupt overwritten data".
No, you never write an L2 entry once it's been set. If an L2 entry
isn't set, the contents of the cluster are all zeros.

If you write data to allocate an L2 entry, until you do a flush(), the
data can either be what was written or all zeros.
Post by Avi Kivity
Post by Anthony Liguori
For the initial design I would avoid introducing something like
this. One of the nice things about features is that we can introduce
multi-level trees as a future feature if we really think it's the
right thing to do.
But we should start at a simple design with high confidence and high
performance, and then introduce features with the burden that we're
absolutely sure that we don't regress integrity or performance.
For most things, yes. Metadata checksums should be designed in though
(since we need to double the pointer size).
Variable height trees have the nice property that you don't need multi
cluster allocation. It's nice to avoid large L2s for very large disks.
FWIW, L2s are 256K at the moment and with a two level table, it can
support 5PB of data. If we changed the tables to 128K, we could support
1PB and with 64K tables we would support 256TB.

So we could definitely reduce the table sizes now to be a single cluster
and it would probably cover us for the foreseeable future.

Regards,

Anthony Liguori
Anthony Liguori
2010-09-07 22:27:55 UTC
Permalink
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
Post by Avi Kivity
Need a checksum for the header.
Is that not a bit overkill for what we're doing? What's the benefit?
Make sure we're not looking at a header write interrupted by a crash.
Couldn't hurt I guess. I don't think it's actually needed for L1/L2
tables FWIW.
Post by Avi Kivity
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
The L2 link '''should''' be made after the data is in place on
storage. However, when no ordering is enforced the worst case
scenario is an L2 link to an unwritten cluster.
Or it may cause corruption if the physical file size is not
committed, and L2 now points at a free cluster.
An fsync() will make sure the physical file size is committed. The
metadata does not carry an additional integrity guarantees over the
actual disk data except that in order to avoid internal corruption,
we have to order the L2 and L1 writes.
I was referring to "when no ordering is enforced, the worst case
scenario is an L2 link to an unwritten cluster". This isn't true -
worst case you point to an unallocated cluster which can then be
claimed by data or metadata.
Right, it's necessary to do an fsync to protect against this. To make
this user friendly, we could have a dirty bit in the header which gets
set on first metadata write and then cleared on clean shutdown.
Upon startup, if the dirty bit is set, we do an fsck.
Post by Avi Kivity
Post by Anthony Liguori
Post by Avi Kivity
We can remove this requirement by copying-on-write any metadata
write, and keeping two copies of the header (with version numbers
and checksums).
QED has a property today that all metadata or cluster locations have
a single location on the disk format that is immutable. Defrag
would relax this but defrag can be slow.
Having an immutable on-disk location is a powerful property which
eliminates a lot of complexity with respect to reference counting
and dealing with free lists.
However, it exposes the format to "writes may corrupt overwritten data".
No, you never write an L2 entry once it's been set. If an L2 entry
isn't set, the contents of the cluster is all zeros.
If you write data to allocate an L2 entry, until you do a flush(), the
data can either be what was written or all zeros.
Post by Avi Kivity
Post by Anthony Liguori
For the initial design I would avoid introducing something like
this. One of the nice things about features is that we can
introduce multi-level trees as a future feature if we really think
it's the right thing to do.
But we should start at a simple design with high confidence and high
performance, and then introduce features with the burden that we're
absolutely sure that we don't regress integrity or performance.
For most things, yes. Metadata checksums should be designed in
though (since we need to double the pointer size).
Variable height trees have the nice property that you don't need
multi cluster allocation. It's nice to avoid large L2s for very
large disks.
FWIW, L2s are 256K at the moment and with a two level table, it can
support 5PB of data.
I clearly suck at basic math today. The image supports 64TB today.
Dropping to 128K tables would reduce it to 16TB and 64k tables would be 4TB.
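For reference, those numbers follow from max_size = (table_bytes / 8)^2 *
cluster_size with the default 64 KB clusters:

    256 KB tables: 32768^2 * 64 KB = 64 TB
    128 KB tables: 16384^2 * 64 KB = 16 TB
     64 KB tables:  8192^2 * 64 KB =  4 TB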

BTW, I don't think your checksumming idea is sound. If you store a
64-bit checksum alongside each pointer, it becomes necessary to update
the parent pointer every time the table changes. This introduces an
ordering requirement which means you need to sync() the file every time
you update an L2 entry.

Today, we only need to sync() when we first allocate an L2 entry
(because their locations never change). From a performance perspective,
it's the difference between an fsync() every 64k vs. every 2GB.

Plus, doesn't btrfs do block level checksumming? IOW, if you run a
workload where you care about this level of data integrity validation,
then btrfs + qed would be fine.

Since the majority of file systems don't do metadata checksumming, it's
not obvious to me that we should be. I think one of the critical flaws
in qcow2 was trying to invent a better filesystem within qemu instead of
just sticking to a very simple and obviously correct format and letting
the FS folks do the really fancy stuff.

Regards,

Anthony Liguori
Avi Kivity
2010-09-08 08:23:46 UTC
Permalink
Post by Anthony Liguori
Post by Anthony Liguori
FWIW, L2s are 256K at the moment and with a two level table, it can
support 5PB of data.
I clearly suck at basic math today. The image supports 64TB today.
Dropping to 128K tables would reduce it to 16TB and 64k tables would be 4TB.
Maybe we should do three levels then. Some users are bound to complain
about 64TB.
Post by Anthony Liguori
BTW, I don't think your checksumming idea is sound. If you store a
64-bit checksum along side each point, it becomes necessary to update
the parent pointer every time the table changes. This introduces an
ordering requirement which means you need to sync() the file every
time you update and L2 entry.
Even worse, if the crash happens between an L2 update and an L1 checksum
update, the entire cluster goes away. You really want allocate-on-write
for this.
Post by Anthony Liguori
Today, we only need to sync() when we first allocate an L2 entry
(because their locations never change). From a performance
perspective, it's the difference between an fsync() every 64k vs.
every 2GB.
Yup. From a correctness perspective, it's the difference between a
corrupted filesystem on almost every crash and a corrupted filesystem in
some very rare cases.
Post by Anthony Liguori
Plus, doesn't btrfs do block level checksumming? IOW, if you run a
workload where you care about this level of data integrity validation,
if you did btrfs + qed, you would be fine.
Or just btrfs by itself (use btrfs for snapshots and base images, use
qemu-img convert for shipping).
Post by Anthony Liguori
Since the majority of file systems don't do metadata checksumming,
it's not obvious to me that we should be.
The logic is that as data sizes increase, the probability of error increases.
Post by Anthony Liguori
I think one of the critical flaws in qcow2 was trying to invent a
better filesystem within qemu instead of just sticking to a very
simple and obviously correct format and letting the FS folks do the
really fancy stuff.
Well, if we introduce a minimal format, we need to make sure it isn't
too minimal.

I'm still not sold on the idea. What we're doing now is pushing the
qcow2 complexity to users. We don't have to worry about refcounts now,
but users have to worry whether they're the machine they're copying the
image to supports qed or not.

The performance problems with qcow2 are solvable. If we preallocate
clusters, the performance characteristics become essentially the same as
qed.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Alexander Graf
2010-09-08 08:41:53 UTC
Permalink
FWIW, L2s are 256K at the moment and with a two level table, it can support 5PB of data.
I clearly suck at basic math today. The image supports 64TB today. Dropping to 128K tables would reduce it to 16TB and 64k tables would be 4TB.
Maybe we should do three levels then. Some users are bound to complain about 64TB.
Why 3 levels? Can't the L2 size be dynamic? Then big images get a big L2 map while small images get a smaller one.


Alex
Avi Kivity
2010-09-08 08:53:54 UTC
Permalink
Post by Alexander Graf
FWIW, L2s are 256K at the moment and with a two level table, it can support 5PB of data.
I clearly suck at basic math today. The image supports 64TB today. Dropping to 128K tables would reduce it to 16TB and 64k tables would be 4TB.
Maybe we should do three levels then. Some users are bound to complain about 64TB.
Why 3 levels? Can't the L2 size be dynamic? Then big images get a big L2 map while small images get a smaller one.
Dunno, just seems more regular to me. Image resize doesn't need to
relocate the L2 table in case it overflows.

The overhead from three levels is an extra table, which is negligible.
With 64K tables, the maximum image size is 32PiB, which is 14 bits away
from a 2TB disk, giving us about 30 years.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Stefan Hajnoczi
2010-09-08 11:15:13 UTC
Permalink
Here is a summary of how qed images can be accessed safely after a
crash or power loss.

First off, we only need to consider write operations since read
operations do not change the state of the image file and cannot lead
to metadata corruption.

There are two types of writes: allocating writes, which are necessary
when no cluster has been allocated for this logical block, and in-place
writes, when a cluster has previously been allocated.

In-place writes overwrite old data in the image file. They do not
allocate new clusters or update any metadata. This is why write
performance is comparable to raw in the long run. Once you've done
the hard work of allocating a cluster you can write and re-write its
sectors because the cluster stays put. The failure scenario here is
the same as for a raw image: power loss means that data may or may not
be written to disk and perhaps not all sectors were written. It is up
to the guest to handle recovery and the qed metadata has not been
corrupted.

Allocating writes fall into two cases:
1. There is no existing L2 table to link the data cluster into.
Allocate and write the data cluster, allocate an L2 table, link up the
data cluster in the L2 table, fsync(), and link up the L2 table in the
L1 table. Notice the fsync() between the L2 update and L1 update
ensures that the L1 table always points to a complete L2 table.

2. There is an existing L2 table to link the data cluster into.
Allocate and write the data cluster, link up the data cluster in the
L2 table. Notice that there is no flush operation between writing the
data and updating the metadata.
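For illustration, the two allocating-write paths above amount to roughly
the following (a sketch, not the actual code; all helpers are assumed):

    #include <stdint.h>

    extern uint64_t qed_alloc_cluster(void);      /* allocates at end of file */
    extern void qed_write_cluster(uint64_t off, const void *buf);
    extern int  qed_l2_present(uint64_t pos);     /* is the L1 entry set? */
    extern void qed_link_l2(uint64_t l2_off, uint64_t pos, uint64_t data_off);
    extern void qed_link_l1(uint64_t pos, uint64_t l2_off);
    extern uint64_t qed_existing_l2(uint64_t pos);
    extern void qed_fsync(void);

    static void qed_allocating_write(uint64_t pos, const void *buf)
    {
        uint64_t data = qed_alloc_cluster();
        qed_write_cluster(data, buf);

        if (!qed_l2_present(pos)) {
            /* Case 1: new L2 table.  The fsync() guarantees the L1 table
             * never points at an incomplete L2 table. */
            uint64_t l2 = qed_alloc_cluster();
            qed_link_l2(l2, pos, data);
            qed_fsync();
            qed_link_l1(pos, l2);
        } else {
            /* Case 2: existing L2 table -- no flush between data and metadata. */
            qed_link_l2(qed_existing_l2(pos), pos, data);
        }
    }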

Since there is no ordering imposed between the data write and metadata
update, the following scenarios may occur on crash:
1. Neither data write nor metadata update reach the disk. This is
fine, qed metadata has not been corrupted.

2. Data reaches disk but metadata update does not. We have leaked a
cluster but not corrupted metadata. Leaked clusters can be detected
with qemu-img check. Note that if file size is not a multiple of
cluster size, then the file size is rounded down by cluster size.
That means the next cluster allocation will claim the partial write at
the end of the file.

3. Metadata update reaches disk but data does not. The interesting
case! The L2 table now points to a cluster which is beyond the last
cluster in the image file. Remember that the file size is rounded down
to a multiple of the cluster size, so partial data writes are discarded
and this case
applies.

Now we're in trouble. The image cannot be accessed without some
sanity checking because not only do table entries point to invalid
clusters, but new allocating writes might make previously invalid
cluster offsets valid again (then there would be two or more table
entries pointing to the same cluster)!

Anthony's suggestion is to use a "mounted" or "dirty" bit in the qed
header to detect a crashed image when opening the image file. If no
crash has occurred, then the mounted bit is unset and normal operation
is safe. If the mounted bit is set, then a check of the L1/L2 tables
must be performed and any invalid cluster offsets must be cleared to
zero. When an invalid cluster is cleared to zero, we arrive back at
case 1 above: neither data write nor metadata update reached the disk,
and we are in a safe state.

4. Both data and metadata reach disk. No problem.

Have I missed anything?

Stefan
Christoph Hellwig
2010-09-08 15:38:57 UTC
Permalink
Post by Stefan Hajnoczi
In-place writes overwrite old data in the image file. They do not
allocate new clusters or update any metadata. This is why write
performance is comparable to raw in the long run.
Only if qed doesn't cause additional fragmentation. Which it currently
does.
Anthony Liguori
2010-09-08 16:30:10 UTC
Permalink
Post by Christoph Hellwig
Post by Stefan Hajnoczi
In-place writes overwrite old data in the image file. They do not
allocate new clusters or update any metadata. This is why write
performance is comparable to raw in the long run.
Only if qed doesn't cause additional fragmentation. Which it currently
does.
http://wiki.qemu.org/Features/QED/OnlineDefrag

Is a spec for a very simple approach to online defrag that I hope we can
implement in the near future. I think that once we have the mechanisms
to freeze clusters and to swap clusters, implementing much more
sophisticated defragmentation algorithms will become easy.

Regards,

Anthony Liguori
Christoph Hellwig
2010-09-08 20:23:36 UTC
Permalink
Post by Anthony Liguori
http://wiki.qemu.org/Features/QED/OnlineDefrag
Is a spec for a very simple approach to online defrag that I hope we can
implement in the near future. I think that once we have the mechanisms
to freeze clusters and to swap clusters, implementing much more
sophisticated defragmentation algorithms will become easy.
This image defragmentation might in fact cause even more fragmentation
at the filesystem layer.
Anthony Liguori
2010-09-08 20:28:50 UTC
Permalink
Post by Christoph Hellwig
Post by Anthony Liguori
http://wiki.qemu.org/Features/QED/OnlineDefrag
Is a spec for a very simple approach to online defrag that I hope we can
implement in the near future. I think that once we have the mechanisms
to freeze clusters and to swap clusters, implementing much more
sophisticated defragmentation algorithms will become easy.
This image defragmentation might in fact cause even more fragmentation
at the filesystem layer.
That's a good point. Is there a reasonable way to do this cooperatively
with the underlying filesystem?

BTW, the same problem would occur for sparse file system images, no?

Regards,

Anthony Liguori
Christoph Hellwig
2010-09-09 02:35:50 UTC
Permalink
Post by Anthony Liguori
That's a good point. Is there a reasonable way to do this cooperatively
with the underlying filesystem?
The only thing we can do easily is to try to use extents that are as
large as possible in the allocation. Once we're at a couple of megabytes
the fragmentation doesn't matter too much.
Post by Anthony Liguori
BTW, the same problem would occur for sparse file system images, no?
Sparse filesystem images are relatively prone to fragmentation, too.
Some filesystems like ext4 have heuristics that try to relate physical
locality to logical locality, but that only helps if the filesystem is
relatively empty. On XFS you can set a minimum extent size which forces
the filesystem to allocate more data than necessary and thus reduce
fragmentation. That's equivalent to the suggestion above to use larger
extents in the image format.
Avi Kivity
2010-09-09 06:24:26 UTC
Permalink
Post by Christoph Hellwig
Post by Anthony Liguori
That's a good point. Is there a reasonable way to do this cooperatively
with the underlying filesystem?
The only thing we can do easily is to try to use as large as possible
extents in the allocation. Once we're at a cuple Megabytes the
fragmentation doesn't matter too much.
That only works if the initial write writes the entire extent
(zero-filling a shorter write). But that both slows down that write,
and quickly grows the image to its full logical size.

The other thing we can do is defragment the logical image, then
defragment the underlying file (if the filesystem supports it, issue the
appropriate ioctl, otherwise defragment to a new file which you write
linearly).
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Christoph Hellwig
2010-09-09 21:01:37 UTC
Permalink
Post by Avi Kivity
The other thing we can do is defragment the logical image, then
defragment the underlying file (if the filesystem supports it, issue the
appropriate ioctl, otherwise defragment to a new file which you write
linearly).
That's what the defragmentation code does in a slightly optimized
fashion anyway - so if you want to do it from qemu just do it that
way. Don't even bother calling the filesystem ioctls directly given
that they just implement low-level helpers and the actual logic is
in the userspace side of the defragmentation tools.
Avi Kivity
2010-09-10 11:15:59 UTC
Permalink
Post by Christoph Hellwig
Post by Avi Kivity
The other thing we can do is defragment the logical image, then
defragment the underlying file (if the filesystem supports it, issue the
appropriate ioctl, otherwise defragment to a new file which you write
linearly).
What's what the defragmentation code does in a slightly optimized
fashion anyway - so if you want to do it from qemu just do it that
way. Don't even bother calling the filesystem ioctls directly given
that they just implementa low-level helpers and the actual logic is
in the userspace side of the defragmentation tools.
Well, if we ask the kernel to do it, we gain any future optimizations as
well. For example, if parts of the file are already defragmented, the
kernel can avoid moving that data.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Avi Kivity
2010-09-09 06:53:05 UTC
Permalink
Post by Stefan Hajnoczi
3. Metadata update reaches disk but data does not. The interesting
case! The L2 table now points to a cluster which is beyond the last
cluster in the image file. Remember that file size is rounded down by
cluster size, so partial data writes are discarded and this case
applies.
Now we're in trouble. The image cannot be accessed without some
sanity checking because not only do table entries point to invalid
clusters, but new allocating writes might make previously invalid
cluster offsets valid again (then there would be two or more table
entries pointing to the same cluster)!
Anthony's suggestion is to use a "mounted" or "dirty" bit in the qed
header to detect a crashed image when opening the image file. If no
crash has occurred, then the mounted bit is unset and normal operation
is safe. If the mounted bit is set, then an check of the L1/L2 tables
must be performed and any invalid cluster offsets must be cleared to
zero. When an invalid cluster is cleared to zero, we arrive back at
case 1 above: neither data write nor metadata update reached the disk,
and we are in a safe state.
While fsck has a lovely ext2 retro feel, there's a reason it's shunned -
it can take quite a while to run. A fully loaded L1 with 32K entries
will require 32K random I/Os, which can take over 5 minutes on a disk
that provides 100 IOPS. On a large shared disk, you'll have a lot more
IOPS, but likely much fewer IOPS per guest, so if you have a power loss,
fsck time per guest will likely be longer (irrespective of guest size).
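(For the arithmetic: 32,768 L2 table reads at 100 IOPS come to roughly
328 seconds, i.e. about five and a half minutes.)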

Preallocation, on the other hand, is amortized, or you can piggy-back
its fsync on a guest flush. Note it's equally applicable to qcow2 and qed.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Anthony Liguori
2010-09-08 12:55:18 UTC
Permalink
Post by Avi Kivity
Post by Alexander Graf
Post by Anthony Liguori
Post by Anthony Liguori
FWIW, L2s are 256K at the moment and with a two level table, it
can support 5PB of data.
I clearly suck at basic math today. The image supports 64TB
today. Dropping to 128K tables would reduce it to 16TB and 64k
tables would be 4TB.
Maybe we should do three levels then. Some users are bound to complain about 64TB.
Why 3 levels? Can't the L2 size be dynamic? Then big images get a big
L2 map while small images get a smaller one.
Dunno, just seems more regular to me. Image resize doesn't need to
relocate the L2 table in case it overflows.
The overhead from three levels is an extra table, which is negligible.
It means an extra I/O request in the degenerate case whereas increasing
the table size only impacts the size of the metadata.

A 10GB image currently has 1.2MB of metadata in QED today. A 1TB image
uses 128MB of metadata. The ratio of metadata is about 0.01%.
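For reference, with the default 64 KB clusters each 256 KB L2 table maps
32768 x 64 KB = 2 GB of data, so a 10 GB image needs five L2 tables
(about 1.25 MB, plus one L1 table) and a 1 TB image needs 512 of them
(128 MB), which is where the roughly 0.01% ratio comes from.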

A three level table adds an additional I/O request in order to reduce
metadata. But the metadata is small enough today that I don't see the
point.

Regards,

Anthony Liguori
Post by Avi Kivity
With 64K tables, the maximum image size is 32PiB, which is 14 bits
away from a 2TB disk, giving us about 30 years.
Avi Kivity
2010-09-09 06:30:51 UTC
Permalink
On 09/08/2010 03:55 PM, Anthony Liguori wrote:

(3 levels)
Post by Anthony Liguori
Post by Avi Kivity
Dunno, just seems more regular to me. Image resize doesn't need to
relocate the L2 table in case it overflows.
The overhead from three levels is an extra table, which is negligible.
It means an extra I/O request in the degenerate case
For small images, it means a single extra read per boot (and a single
extra write for the lifetime of the image). Larger images
increase this, but it will always be a constant number of extra reads
per boot and extra writes per image lifetime, proportional to logical
image size.
Post by Anthony Liguori
whereas increasing the table size only impacts the size of the metadata.
Larger L2 tables mean reduced L2 cache efficiency and longer delays
while they are loaded. At 100 MB/s, a 256KB L2 takes 2.5ms compared to 0.6
ms for 64KB, perhaps not so traumatic.
Post by Anthony Liguori
A 10GB image currently has 1.2MB of metadata in QED today. A 1TB
image uses 128MB of metadata. The ratio of metadata is about 0.01%.
A three level table adds an additional I/O request in order to reduce
metadata. But the metadata is small enough today that I don't see the
point.
The point is to allow really large images.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Anthony Liguori
2010-09-08 12:48:10 UTC
Permalink
Post by Avi Kivity
Post by Anthony Liguori
Post by Anthony Liguori
FWIW, L2s are 256K at the moment and with a two level table, it can
support 5PB of data.
I clearly suck at basic math today. The image supports 64TB today.
Dropping to 128K tables would reduce it to 16TB and 64k tables would be 4TB.
Maybe we should do three levels then. Some users are bound to
complain about 64TB.
That's just the default size. The table size and cluster sizes are
configurable. Without changing the cluster size, the image can support
up to 1PB.
Post by Avi Kivity
Post by Anthony Liguori
BTW, I don't think your checksumming idea is sound. If you store a
64-bit checksum along side each point, it becomes necessary to update
the parent pointer every time the table changes. This introduces an
ordering requirement which means you need to sync() the file every
time you update and L2 entry.
Even worse, if the crash happens between an L2 update and an L1
checksum update, the entire cluster goes away. You really want
allocate-on-write for this.
Post by Anthony Liguori
Today, we only need to sync() when we first allocate an L2 entry
(because their locations never change). From a performance
perspective, it's the difference between an fsync() every 64k vs.
every 2GB.
Yup. From a correctness perspective, it's the difference between a
corrupted filesystem on almost every crash and a corrupted filesystem
in some very rare cases.
I'm not sure I understand your corruption comment. Are you claiming
that without checksumming, you'll often get corruption or are you
claiming that without checksums, if you don't sync metadata updates
you'll get corruption?

qed is very careful about ensuring that we don't need to do syncs and we
don't get corruption because of data loss. I don't necessarily buy your
checksumming argument.
Post by Avi Kivity
Post by Anthony Liguori
Plus, doesn't btrfs do block level checksumming? IOW, if you run a
workload where you care about this level of data integrity
validation, if you did btrfs + qed, you would be fine.
Or just btrfs by itself (use btrfs for snapshots and base images, use
qemu-img convert for shipping).
Post by Anthony Liguori
Since the majority of file systems don't do metadata checksumming,
it's not obvious to me that we should be.
The logic is that as data sizes increase, the probablity of error increases.
Post by Anthony Liguori
I think one of the critical flaws in qcow2 was trying to invent a
better filesystem within qemu instead of just sticking to a very
simple and obviously correct format and letting the FS folks do the
really fancy stuff.
Well, if we introduce a minimal format, we need to make sure it isn't
too minimal.
I'm still not sold on the idea. What we're doing now is pushing the
qcow2 complexity to users. We don't have to worry about refcounts
now, but users have to worry whether they're the machine they're
copying the image to supports qed or not.
The performance problems with qcow2 are solvable. If we preallocate
clusters, the performance characteristics become essentially the same
as qed.
By creating two code paths within qcow2. It's not just the reference
counts, it's the lack of guaranteed alignment, compression, and some of
the other poor decisions in the format.

If you have two code paths in qcow2, you have non-deterministic
performance because users that do reasonable things with their images
will end up getting catastrophically bad performance.

A new format doesn't introduce much additional complexity. We provide
an image conversion tool and we can almost certainly provide an in-place
conversion tool that makes the process very fast.

Regards,

Anthony Liguori
Kevin Wolf
2010-09-08 13:20:25 UTC
Permalink
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
Post by Anthony Liguori
FWIW, L2s are 256K at the moment and with a two level table, it can
support 5PB of data.
I clearly suck at basic math today. The image supports 64TB today.
Dropping to 128K tables would reduce it to 16TB and 64k tables would be 4TB.
Maybe we should do three levels then. Some users are bound to
complain about 64TB.
That's just the default size. The table size and cluster sizes are
configurable. Without changing the cluster size, the image can support
up to 1PB.
Post by Avi Kivity
Post by Anthony Liguori
BTW, I don't think your checksumming idea is sound. If you store a
64-bit checksum along side each point, it becomes necessary to update
the parent pointer every time the table changes. This introduces an
ordering requirement which means you need to sync() the file every
time you update and L2 entry.
Even worse, if the crash happens between an L2 update and an L1
checksum update, the entire cluster goes away. You really want
allocate-on-write for this.
Post by Anthony Liguori
Today, we only need to sync() when we first allocate an L2 entry
(because their locations never change). From a performance
perspective, it's the difference between an fsync() every 64k vs.
every 2GB.
Yup. From a correctness perspective, it's the difference between a
corrupted filesystem on almost every crash and a corrupted filesystem
in some very rare cases.
I'm not sure I understand you're corruption comment. Are you claiming
that without checksumming, you'll often get corruption or are you
claiming that without checksums, if you don't sync metadata updates
you'll get corruption?
qed is very careful about ensuring that we don't need to do syncs and we
don't get corruption because of data loss. I don't necessarily buy your
checksumming argument.
Post by Avi Kivity
Post by Anthony Liguori
Plus, doesn't btrfs do block level checksumming? IOW, if you run a
workload where you care about this level of data integrity
validation, if you did btrfs + qed, you would be fine.
Or just btrfs by itself (use btrfs for snapshots and base images, use
qemu-img convert for shipping).
Post by Anthony Liguori
Since the majority of file systems don't do metadata checksumming,
it's not obvious to me that we should be.
The logic is that as data sizes increase, the probablity of error increases.
Post by Anthony Liguori
I think one of the critical flaws in qcow2 was trying to invent a
better filesystem within qemu instead of just sticking to a very
simple and obviously correct format and letting the FS folks do the
really fancy stuff.
Well, if we introduce a minimal format, we need to make sure it isn't
too minimal.
I'm still not sold on the idea. What we're doing now is pushing the
qcow2 complexity to users. We don't have to worry about refcounts
now, but users have to worry whether they're the machine they're
copying the image to supports qed or not.
The performance problems with qcow2 are solvable. If we preallocate
clusters, the performance characteristics become essentially the same
as qed.
By creating two code paths within qcow2. It's not just the reference
counts, it's the lack of guaranteed alignment, compression, and some of
the other poor decisions in the format.
I'm not aware of any unaligned data in qcow2. Compression can leave some
sectors sparse, but that's something the FS has to deal with, not qcow2.
Post by Anthony Liguori
If you have two code paths in qcow2, you have non-deterministic
performance because users that do reasonable things with their images
will end up getting catastrophically bad performance.
Compression and encryption lead to bad performance, yes. These are very
clear criteria and something very easy to understand for users. I've
never heard any user complain about this "non-deterministic" behaviour.
Post by Anthony Liguori
A new format doesn't introduce much additional complexity. We provide
image conversion tool and we can almost certainly provide an in-place
conversion tool that makes the process very fast.
I'm not convinced that in-place conversion is worth the trouble.

Kevin
Anthony Liguori
2010-09-08 13:26:00 UTC
Permalink
Post by Kevin Wolf
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
Post by Anthony Liguori
FWIW, L2s are 256K at the moment and with a two level table, it can
support 5PB of data.
I clearly suck at basic math today. The image supports 64TB today.
Dropping to 128K tables would reduce it to 16TB and 64k tables would be 4TB.
Maybe we should do three levels then. Some users are bound to
complain about 64TB.
That's just the default size. The table size and cluster sizes are
configurable. Without changing the cluster size, the image can support
up to 1PB.
Post by Avi Kivity
Post by Anthony Liguori
BTW, I don't think your checksumming idea is sound. If you store a
64-bit checksum along side each point, it becomes necessary to update
the parent pointer every time the table changes. This introduces an
ordering requirement which means you need to sync() the file every
time you update and L2 entry.
Even worse, if the crash happens between an L2 update and an L1
checksum update, the entire cluster goes away. You really want
allocate-on-write for this.
Post by Anthony Liguori
Today, we only need to sync() when we first allocate an L2 entry
(because their locations never change). From a performance
perspective, it's the difference between an fsync() every 64k vs.
every 2GB.
Yup. From a correctness perspective, it's the difference between a
corrupted filesystem on almost every crash and a corrupted filesystem
in some very rare cases.
I'm not sure I understand you're corruption comment. Are you claiming
that without checksumming, you'll often get corruption or are you
claiming that without checksums, if you don't sync metadata updates
you'll get corruption?
qed is very careful about ensuring that we don't need to do syncs and we
don't get corruption because of data loss. I don't necessarily buy your
checksumming argument.
Post by Avi Kivity
Post by Anthony Liguori
Plus, doesn't btrfs do block level checksumming? IOW, if you run a
workload where you care about this level of data integrity
validation, if you did btrfs + qed, you would be fine.
Or just btrfs by itself (use btrfs for snapshots and base images, use
qemu-img convert for shipping).
Post by Anthony Liguori
Since the majority of file systems don't do metadata checksumming,
it's not obvious to me that we should be.
The logic is that as data sizes increase, the probablity of error increases.
Post by Anthony Liguori
I think one of the critical flaws in qcow2 was trying to invent a
better filesystem within qemu instead of just sticking to a very
simple and obviously correct format and letting the FS folks do the
really fancy stuff.
Well, if we introduce a minimal format, we need to make sure it isn't
too minimal.
I'm still not sold on the idea. What we're doing now is pushing the
qcow2 complexity to users. We don't have to worry about refcounts
now, but users have to worry whether they're the machine they're
copying the image to supports qed or not.
The performance problems with qcow2 are solvable. If we preallocate
clusters, the performance characteristics become essentially the same
as qed.
By creating two code paths within qcow2. It's not just the reference
counts, it's the lack of guaranteed alignment, compression, and some of
the other poor decisions in the format.
I'm not aware of any unaligned data in qcow2. Compression can leave some
sectors sparse, but that's something the FS has to deal with, not qcow2.
If my memory serves, you changed qcow2 some time ago to make sure that
metadata is aligned but historically, we didn't always do that and the
qcow2 format doesn't enforce that metadata is aligned.

This means that if you did try to make a version of qcow2 that was
totally async or just really fast, you'd have to make sure you dealt
with unaligned accesses and bounce buffers accordingly.
Post by Kevin Wolf
Post by Anthony Liguori
If you have two code paths in qcow2, you have non-deterministic
performance because users that do reasonable things with their images
will end up getting catastrophically bad performance.
Compression and encryption lead to bad performance, yes. These are very
clear criteria and something very easy to understand for users. I've
never heard any user complain about this "non-deterministic" behaviour.
That's because qcow2 has always been limited in its performance so it's
quite deterministic :-)

Don't get me wrong, you and others have done amazing things making qcow2
better than it was and it's pretty reasonable when dealing with IDE and
a single backing spindle, but when dealing with virtio and a large
storage array, it simply doesn't even come close to raw. FWIW, we'll
post numbers later this week with a detailed comparison.

Regards,

Anthony Liguori
Post by Kevin Wolf
Post by Anthony Liguori
A new format doesn't introduce much additional complexity. We provide
image conversion tool and we can almost certainly provide an in-place
conversion tool that makes the process very fast.
I'm not convinced that in-place conversion is worth the trouble.
Kevin
Kevin Wolf
2010-09-08 13:46:08 UTC
Permalink
Post by Anthony Liguori
Post by Kevin Wolf
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
I think one of the critical flaws in qcow2 was trying to invent a
better filesystem within qemu instead of just sticking to a very
simple and obviously correct format and letting the FS folks do the
really fancy stuff.
Well, if we introduce a minimal format, we need to make sure it isn't
too minimal.
I'm still not sold on the idea. What we're doing now is pushing the
qcow2 complexity to users. We don't have to worry about refcounts
now, but users have to worry whether they're the machine they're
copying the image to supports qed or not.
The performance problems with qcow2 are solvable. If we preallocate
clusters, the performance characteristics become essentially the same
as qed.
By creating two code paths within qcow2. It's not just the reference
counts, it's the lack of guaranteed alignment, compression, and some of
the other poor decisions in the format.
I'm not aware of any unaligned data in qcow2. Compression can leave some
sectors sparse, but that's something the FS has to deal with, not qcow2.
If my memory serves, you changed qcow2 some time ago to make sure that
metadata is aligned but historically, we didn't always do that and the
qcow2 doesn't enforce that metadata is aligned.
I can't remember any such change, but the problem might well be on my
side. In any case, if it was as you say, we would still have to accept
unaligned data or we would have broken compatibility.

Maybe you mean that historically the qcow2 driver was accessing single
table entries instead of the whole table, and that was an unaligned
access? That was only a bad implementation, though.
Post by Anthony Liguori
This means that if you did try to make a version of qcow2 that was
totally async or really just was fast, you'd have to make sure you dealt
with unaligned accesses and bounced buffers accordingly.
Right. Though even if some obscure data was unaligned, what really
matters are L1/L2 tables and refcount tables/blocks. And these are
definitely cluster aligned.
Post by Anthony Liguori
Post by Kevin Wolf
Post by Anthony Liguori
If you have two code paths in qcow2, you have non-deterministic
performance because users that do reasonable things with their images
will end up getting catastrophically bad performance.
Compression and encryption lead to bad performance, yes. These are very
clear criteria and something very easy to understand for users. I've
never heard any user complain about this "non-deterministic" behaviour.
That's because qcow2 has always been limited in it's performance so it's
quite deterministic :-)
Run an installation on an encrypted qcow2 and one on a "normal" qcow2
image. Last time I tried there was a bit of a difference...

Kevin
Avi Kivity
2010-09-09 06:45:27 UTC
Permalink
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
Post by Anthony Liguori
FWIW, L2s are 256K at the moment and with a two level table, it can
support 5PB of data.
I clearly suck at basic math today. The image supports 64TB today.
Dropping to 128K tables would reduce it to 16TB and 64k tables would be 4TB.
Maybe we should do three levels then. Some users are bound to
complain about 64TB.
That's just the default size. The table size and cluster sizes are
configurable. Without changing the cluster size, the image can
support up to 1PB.
Loading very large L2 tables on demand will result in very long
latencies. Increasing cluster size will result in very long first write
latencies. Adding an extra level results in an extra random write every
4TB.
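
(As a rough cross-check of those figures, assuming 8-byte table
entries, the 256 KB tables mentioned above and the default 64 KB
clusters:

  entries per table  =  256 KB / 8 B             =  32768
  data per L2 table  =  32768 entries * 64 KB    =  2 GB
  image capacity     =  32768 L1 entries * 2 GB  =  64 TB

Halving the table size to 128 KB quarters the capacity to 16 TB, 64 KB
tables give 4 TB, and 1 MB tables would reach the 1 PB figure without
changing the cluster size. The 2 GB covered by one L2 table is also why
a new L2 only needs to be allocated, and synced, once per 2 GB of fresh
data.)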
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
Today, we only need to sync() when we first allocate an L2 entry
(because their locations never change). From a performance
perspective, it's the difference between an fsync() every 64k vs.
every 2GB.
Yup. From a correctness perspective, it's the difference between a
corrupted filesystem on almost every crash and a corrupted filesystem
in some very rare cases.
I'm not sure I understand your corruption comment. Are you claiming
that without checksumming, you'll often get corruption or are you
claiming that without checksums, if you don't sync metadata updates
you'll get corruption?
No, I'm claiming that with checksums but without allocate-on-write you
will have frequent (detected) data loss after power failures. Checksums
need to go hand-in-hand with allocate-on-write (which happens to be the
principle underlying zfs and btrfs).
Post by Anthony Liguori
qed is very careful about ensuring that we don't need to do syncs and
we don't get corruption because of data loss. I don't necessarily buy
your checksumming argument.
The requirement for checksumming comes from a different place. For
decades we've enjoyed very low undetected bit error rates. However the
actual amount of data is increasing to the point that it makes an
undetectable bit error likely, just by throwing a huge amount of bits at
storage. Write ordering doesn't address this issue.

Virtualization is one of the uses where you have a huge number of bits.
btrfs addresses this, but if you have (working) btrfs you don't need
qed. Another problem is nfs; TCP and UDP checksums are incredibly weak
and it is easy for a failure to bypass them. Ethernet CRCs are better,
but they only work if the error is introduced after the CRC is taken and
before it is verified.
Post by Anthony Liguori
Post by Avi Kivity
Well, if we introduce a minimal format, we need to make sure it isn't
too minimal.
I'm still not sold on the idea. What we're doing now is pushing the
qcow2 complexity to users. We don't have to worry about refcounts
now, but users have to worry about whether the machine they're copying
the image to supports qed or not.
The performance problems with qcow2 are solvable. If we preallocate
clusters, the performance characteristics become essentially the same
as qed.
By creating two code paths within qcow2.
You're creating two code paths for users.
Post by Anthony Liguori
It's not just the reference counts, it's the lack of guaranteed
alignment, compression, and some of the other poor decisions in the
format.
If you have two code paths in qcow2, you have non-deterministic
performance because users that do reasonable things with their images
will end up getting catastrophically bad performance.
We can address that in the tools. "By enabling compression, you may
reduce performance for multithreaded workloads. Abort/Retry/Ignore?"
Post by Anthony Liguori
A new format doesn't introduce much additional complexity. We provide
image conversion tool and we can almost certainly provide an in-place
conversion tool that makes the process very fast.
It requires users to make a decision. By the time qed is ready for mass
deployment, 1-2 years will have passed. How many qcow2 images will be
in the wild then? How much scheduled downtime will be needed? How much
user confusion will be caused?

Virtualization is about compatibility. In-guest compatibility first,
but keeping the external environment stable is also important. We
really need to exhaust the possibilities with qcow2 before giving up on it.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Avi Kivity
2010-09-09 06:48:08 UTC
Permalink
Post by Avi Kivity
Post by Anthony Liguori
A new format doesn't introduce much additional complexity. We
provide image conversion tool and we can almost certainly provide an
in-place conversion tool that makes the process very fast.
It requires users to make a decision. By the time qed is ready for
mass deployment, 1-2 years will have passed. How many qcow2 images
will be in the wild then? How much scheduled downtime will be
needed? How much user confusion will be caused?
Virtualization is about compatibility. In-guest compatibility first,
but keeping the external environment stable is also important. We
really need to exhaust the possibilities with qcow2 before giving up on it.
btw, if we were starting from scratch, I'd definitely pick qed over
qcow2. But we aren't starting from scratch (if we did, we wouldn't be
doing x86 either).
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Anthony Liguori
2010-09-09 12:49:16 UTC
Permalink
Post by Avi Kivity
Loading very large L2 tables on demand will result in very long
latencies. Increasing cluster size will result in very long first
write latencies. Adding an extra level results in an extra random
write every 4TB.
It would be trivially easy to add another level of tables as a feature
bit so let's delay the decision.
Post by Avi Kivity
Post by Anthony Liguori
qed is very careful about ensuring that we don't need to do syncs and
we don't get corruption because of data loss. I don't necessarily
buy your checksumming argument.
The requirement for checksumming comes from a different place. For
decades we've enjoyed very low undetected bit error rates. However
the actual amount of data is increasing to the point that it makes an
undetectable bit error likely, just by throwing a huge amount of bits
at storage. Write ordering doesn't address this issue.
I don't think we should optimize an image format for cheap disks and an
old file system.

We should optimize for the future. That means a btrfs file system
and/or enterprise storage.

The point of an image format is not to recreate btrfs in software. It's
to provide a mechanism to allow users to move images around reasonably,
but once an image is present on a reasonable filesystem, we should more
or less get the heck out of the way.
Post by Avi Kivity
Post by Anthony Liguori
By creating two code paths within qcow2.
You're creating two code paths for users.
No, I'm creating a single path: QED.

There are already two code paths: raw and qcow2. qcow2 has had such a
bad history that for a lot of users, it's not even a choice.

Today, users have to choose between performance and reliability or
features. QED offers an opportunity to be able to tell users to just
always use QED as an image format and forget about raw/qcow2/everything
else.

You can say, let's just make qcow2 better, but we've been trying that
for years, and we have an existence proof that we can do it in a
straightforward fashion with QED. A new format doesn't introduce much
additional complexity. We provide an image conversion tool and we can
almost certainly provide an in-place conversion tool that makes the
process very fast.
Post by Avi Kivity
It requires users to make a decision. By the time qed is ready for
mass deployment, 1-2 years will have passed. How many qcow2 images
will be in the wild then? How much scheduled downtime will be needed?
Zero if we're smart. You can do QED stream + live migration to do a
live conversion from raw to QED.
Post by Avi Kivity
How much user confusion will be caused?
User confusion is reduced if we can make strong, clear statements: all
users should use QED even if they care about performance. Today,
there's mass confusion because of the poor state of qcow2.
Post by Avi Kivity
Virtualization is about compatibility. In-guest compatibility first,
but keeping the external environment stable is also important. We
really need to exhaust the possibilities with qcow2 before giving up on it.
IMHO, we're long past exhausting the possibilities with qcow2. We still
haven't decided what we're going to do for 0.13.0. Are we going to ship
qcow2 with awful performance (a 15 minute operation taking hours) or
with compromised data integrity?

It's been this way for every release since qcow2 existed. Let's not let
sunk cost cloud our judgement here.

qcow2 is not a properly designed image format. It was a weekend hacking
session from Fabrice that he dropped in the code base and never really
finished doing what he originally intended. The improvements that have
been made to it are almost at the heroic level but we're only hurting
our users by not moving on to something better.

Regards,

Anthony Liguori
Paolo Bonzini
2010-09-09 16:48:17 UTC
Permalink
We should optimize for the future. That means a btrfs file system and/or
enterprise storage.
So we should just implement a copy-on-read wrapper that generates a
sparse raw image and uses FIEMAP (or whatever it is called these days)
to test for the presence of extents. Then you let btrfs handle
everything else...
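
For illustration, a minimal sketch of such an extent probe, assuming a
Linux host (a hypothetical standalone tool, not code from this patch):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

/* Print the allocated extents of a file, so a copy-on-read wrapper
 * could copy only those ranges and leave the rest as holes.  A real
 * tool would loop, advancing fm_start, for files with many extents. */
int main(int argc, char **argv)
{
    const uint32_t max_extents = 32;
    struct fiemap *fm;
    int fd;

    if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0) {
        fprintf(stderr, "usage: %s <raw-image>\n", argv[0]);
        return 1;
    }

    fm = calloc(1, sizeof(*fm) + max_extents * sizeof(struct fiemap_extent));
    if (!fm) {
        return 1;
    }
    fm->fm_length = ~0ULL;                 /* map the whole file */
    fm->fm_extent_count = max_extents;

    if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
        perror("FS_IOC_FIEMAP");
        return 1;
    }
    for (uint32_t i = 0; i < fm->fm_mapped_extents; i++) {
        printf("allocated: %llu +%llu\n",
               (unsigned long long)fm->fm_extents[i].fe_logical,
               (unsigned long long)fm->fm_extents[i].fe_length);
    }
    free(fm);
    close(fd);
    return 0;
}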

Paolo
Anthony Liguori
2010-09-09 17:02:26 UTC
Permalink
Post by Paolo Bonzini
We should optimize for the future. That means a btrfs file system and/or
enterprise storage.
So we should just implement a copy-on-read wrapper that generates a
sparse raw image and uses FIEMAP (or whatever it is called these days)
to test for the presence of extents. Then you let btrfs handle
everything else...
My position is that we'll need a sparse image format well into the
future because while btrfs may be ubiquitous as a file system, IRL,
people transfer images around all of the time through dumb transports
like HTTP and fat-formatted USB keys. A 100GB image with 1GB allocated
cannot explode to 100GB just because HTTP is a dumb transport.

Where we should do copy-on-read is a different topic. Really, I should
have waited to share that feature to avoid confusing the current discussion.

Regards,

Anthony Liguori
Post by Paolo Bonzini
Paolo
Christoph Hellwig
2010-09-09 20:56:15 UTC
Permalink
Post by Anthony Liguori
My position is that we'll need a sparse image format well into the
future because while btrfs may be ubiquitous as a file system, IRL,
people transfer images around all of the time through dumb transports
like HTTP and fat-formatted USB keys. A 100GB image with 1GB allocated
cannot explode to 100GB just because HTTP is a dumb transport.
Where we should do copy-on-read is a different topic. Really, I should
have waited to share that feature to avoid confusing the current discussion.
Yes, we will need an image format forever. However I'd be a much
happier camper if typical production setups wouldn't use them.

Either way, the qed image format is something that to me looks much
better than qcow2, primarily due to its simplicity. I haven't managed
to fully review it yet, so I might change my opinion again.
Avi Kivity
2010-09-10 10:53:56 UTC
Permalink
Post by Anthony Liguori
Post by Paolo Bonzini
We should optimize for the future. That means a btrfs file system and/or
enterprise storage.
So we should just implement a copy-on-read wrapper that generates a
sparse raw image and uses FIEMAP (or whatever it is called these
days) to test for the presence of extents. Then you let btrfs handle
everything else...
My position is that we'll need a sparse image format well into the
future because while btrfs may be ubiquitous as a file system, IRL,
people transfer images around all of the time through dumb transports
like HTTP and fat-formatted USB keys. A 100GB image with 1GB
allocated cannot explode to 100GB just because HTTP is a dumb transport.
'Export' and 'Upload' buttons would do the job. For command line
users, compressing the image will remove the unallocated extents, as
will 'qemu-img convert -O qcow2'.

It's not as nice as having a sparse format, but on the other hand,
performance and data integrity will be better, as well as the excellent
snapshot support.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Avi Kivity
2010-09-10 11:14:25 UTC
Permalink
Post by Anthony Liguori
Post by Avi Kivity
Loading very large L2 tables on demand will result in very long
latencies. Increasing cluster size will result in very long first
write latencies. Adding an extra level results in an extra random
write every 4TB.
It would be trivially easy to add another level of tables as a feature
bit so let's delay the decision.
It means that you'll need to upgrade qemu to read certain images, but okay.
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
qed is very careful about ensuring that we don't need to do syncs
and we don't get corruption because of data loss. I don't
necessarily buy your checksumming argument.
The requirement for checksumming comes from a different place. For
decades we've enjoyed very low undetected bit error rates. However
the actual amount of data is increasing to the point that it makes an
undetectable bit error likely, just by throwing a huge amount of bits
at storage. Write ordering doesn't address this issue.
I don't think we should optimize an image format for cheap disks and
an old file system.
We should optimize for the future. That means a btrfs file system
I wouldn't use an image format at all with btrfs.
Post by Anthony Liguori
and/or enterprise storage.
That doesn't eliminate undiscovered errors (they can still come from the
transport).
Post by Anthony Liguori
The point of an image format is not to recreate btrfs in software.
It's to provide a mechanism to allow users to move images around
reasonable but once an image is present on a reasonable filesystem, we
should more or less get the heck out of the way.
You can achieve exactly the same thing with qcow2. Yes, it's more work,
but it's also less disruptive to users.
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
By creating two code paths within qcow2.
You're creating two code paths for users.
No, I'm creating a single path: QED.
There are already two code paths: raw and qcow2. qcow2 has had such a
bad history that for a lot of users, it's not even a choice.
qcow2 exists, people use it, and by the time qed is offered on distros
(even more on enterprise distros), there will be a lot more qcow2
images. Not everyone runs qemu.git HEAD.

What will you tell those people? Upgrade your image? They may still
want to share it with older installations. What if they use features
not present in qed? Bad luck?

qcow2 is going to live forever no matter what we do.
Post by Anthony Liguori
Today, users have to choose between performance and reliability or
features. QED offers an opportunity to be able to tell users to just
always use QED as an image format and forget about
raw/qcow2/everything else.
raw will always be needed for direct volume access and shared storage.
qcow2 will always be needed for old images.
Post by Anthony Liguori
You can say, let's just make qcow2 better, but we've been trying that
for years and we have an existence proof that we can do it in a
straight forward fashion with QED.
When you don't use the extra qcow2 features, it has the same performance
characteristics as qed. You need to batch allocation and freeing, but
that's fairly straightforward.

Yes, qcow2 has a long and tortured history and qed is perfect. Starting
from scratch is always easier and more fun. Except for the users.
Post by Anthony Liguori
A new format doesn't introduce much additional complexity. We provide
image conversion tool and we can almost certainly provide an in-place
conversion tool that makes the process very fast.
It introduces a lot of complexity for the users who aren't qed experts.
They need to make a decision. What's the impact of the change? Are the
features that we lose important to us? Do we know what they are? Is
there any risk? Can we make the change online or do we have to schedule
downtime? Do all our hosts support qed?

Improving qcow2 will be very complicated for Kevin who already looks
older beyond his years [1] but very simple for users.
Post by Anthony Liguori
Post by Avi Kivity
It requires users to make a decision. By the time qed is ready for
mass deployment, 1-2 years will have passed. How many qcow2 images
will be in the wild then? How much scheduled downtime will be needed?
Zero if we're smart. You can do QED stream + live migration to do a
live conversion from raw to QED.
Not all installations use live migration (say, desktop users).
Post by Anthony Liguori
Post by Avi Kivity
How much user confusion will be caused?
User confusion is reduced if we can make strong, clear statements: all
users should use QED even if they care about performance. Today,
there's mass confusion because of the poor state of qcow2.
If we improve qcow2 and make the same strong, clear statement we'll have
the same results.
Post by Anthony Liguori
Post by Avi Kivity
Virtualization is about compatibility. In-guest compatibility first,
but keeping the external environment stable is also important. We
really need to exhaust the possibilities with qcow2 before giving up on it.
IMHO, we're long past exhausting the possibilities with qcow2. We
still haven't decided what we're going to do for 0.13.0.
Sorry, I disagree 100%. How can you say that, when no one has yet
tried, for example, batching allocations and frees? Or properly
threaded it?

What we've done is make qcow2 safe and more parallel than it was. But
"exhaust all possibilities"? not even close.
Post by Anthony Liguori
Are we going to ship qcow2 with awful performance (a 15 minute
operation taking hours) or with compromised data integrity?
We're going to fix it.
Post by Anthony Liguori
It's been this way for every release since qcow2 existed. Let's not
let sunk cost cloud our judgement here.
Yes, new and shiny is always better.
Post by Anthony Liguori
qcow2 is not a properly designed image format. It was a weekend
hacking session from Fabrice that he dropped in the code base and
never really finished doing what he originally intended. The
improvements that have been made to it are almost at the heroic level
but we're only hurting our users by not moving on to something better.
I don't like qcow2 either. But from a performance perspective, it can
be made equivalent to qed with some effort. It is worthwhile to expend
that effort rather than push the burden to users.
Post by Anthony Liguori
Regards,
Anthony Liguori
[1] okay, maybe not.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Avi Kivity
2010-09-10 11:25:56 UTC
Permalink
Post by Avi Kivity
Post by Anthony Liguori
qcow2 is not a properly designed image format. It was a weekend
hacking session from Fabrice that he dropped in the code base and
never really finished doing what he originally intended. The
improvements that have been made to it are almost at the heroic level
but we're only hurting our users by not moving on to something better.
I don't like qcow2 either. But from a performance perspective, it can
be made equivalent to qed with some effort. It is worthwhile to
expend that effort rather than push the burden to users.
btw, despite being not properly designed, qcow2 is able to support
TRIM. qed isn't able to, except by leaking clusters on shutdown. TRIM
support is required unless you're okay with the image growing until it
is no longer sparse (the lack of TRIM support in guests make sparse
image formats somewhat of a joke, but nobody seems to notice).
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Stefan Hajnoczi
2010-09-10 11:33:09 UTC
Permalink
qcow2 is not a properly designed image format.  It was a weekend hacking
session from Fabrice that he dropped in the code base and never really
finished doing what he originally intended.  The improvements that have been
made to it are almost at the heroic level but we're only hurting our users
by not moving on to something better.
I don't like qcow2 either.  But from a performance perspective, it can be
made equivalent to qed with some effort.  It is worthwhile to expend that
effort rather than push the burden to users.
btw, despite being not properly designed, qcow2 is able to support TRIM.
 qed isn't able to, except by leaking clusters on shutdown.  TRIM support is
required unless you're okay with the image growing until it is no longer
sparse (the lack of TRIM support in guests make sparse image formats
somewhat of a joke, but nobody seems to notice).
Anthony has started writing up notes on trim for qed:
http://wiki.qemu.org/Features/QED/Trim

I need to look at the actual ATA and SCSI specs for how this will
work. The issue I am concerned with is sub-cluster trim operations.
If the trim region is less than a cluster, then both qed and qcow2
don't really have a way to handle it. Perhaps we could punch a hole
in the file, given a userspace interface to do this, but that isn't
ideal because we're losing sparseness again.

Stefan
Avi Kivity
2010-09-10 11:43:06 UTC
Permalink
Post by Stefan Hajnoczi
btw, despite being not properly designed, qcow2 is able to support TRIM.
qed isn't able to, except by leaking clusters on shutdown. TRIM support is
required unless you're okay with the image growing until it is no longer
sparse (the lack of TRIM support in guests make sparse image formats
somewhat of a joke, but nobody seems to notice).
http://wiki.qemu.org/Features/QED/Trim
Looks like it depends on fsck, which is not a good idea for large images.
Post by Stefan Hajnoczi
I need to look at the actual ATA and SCSI specs for how this will
work. The issue I am concerned with is sub-cluster trim operations.
If the trim region is less than a cluster, then both qed and qcow2
don't really have a way to handle it. Perhaps we could punch a hole
in the file, given a userspace interface to do this, but that isn't
ideal because we're losing sparseness again.
To deal with a sub-cluster TRIM, look at the surrounding sectors. If
they're zero, free the cluster. If not, write zeros or use sys_punch()
to the range specified by TRIM.
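
A minimal sketch of that logic, assuming hypothetical helpers
(cluster_is_zero(), free_cluster() and write_zeroes() are assumptions,
not functions from this patch):

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

struct image;                                  /* opaque image state   */
bool cluster_is_zero(struct image *s, uint64_t cluster,
                     uint64_t skip_off, size_t skip_len);
int free_cluster(struct image *s, uint64_t cluster);  /* clear L2 entry */
int write_zeroes(struct image *s, uint64_t off, size_t len);

/* Discard a range that covers only part of a cluster: free the cluster
 * if everything outside the trimmed range already reads as zeroes,
 * otherwise keep the allocation and just zero the trimmed range. */
int subcluster_trim(struct image *s, uint64_t cluster_size,
                    uint64_t off, size_t len)
{
    uint64_t cluster = off & ~(cluster_size - 1);

    if (cluster_is_zero(s, cluster, off, len)) {
        return free_cluster(s, cluster);
    }
    return write_zeroes(s, off, len);
}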
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Anthony Liguori
2010-09-10 13:22:14 UTC
Permalink
Post by Avi Kivity
Post by Stefan Hajnoczi
btw, despite being not properly designed, qcow2 is able to support TRIM.
qed isn't able to, except by leaking clusters on shutdown. TRIM support is
required unless you're okay with the image growing until it is no longer
sparse (the lack of TRIM support in guests make sparse image formats
somewhat of a joke, but nobody seems to notice).
http://wiki.qemu.org/Features/QED/Trim
Looks like it depends on fsck, which is not a good idea for large images.
fsck will always be fast on qed because the metadata is small. For a
1PB image, there's 128MB worth of L2s if it's fully allocated (keeping
in mind that once you're fully allocated, you'll never fsck again). If
you've got 1PB worth of storage, I'm fairly sure you're going to be able
to do 128MB of reads in a short period of time. Even if it's a few
seconds, it only occurs on power failure so it's pretty reasonable.
Post by Avi Kivity
Post by Stefan Hajnoczi
I need to look at the actual ATA and SCSI specs for how this will
work. The issue I am concerned with is sub-cluster trim operations.
If the trim region is less than a cluster, then both qed and qcow2
don't really have a way to handle it. Perhaps we could punch a hole
in the file, given a userspace interface to do this, but that isn't
ideal because we're losing sparseness again.
To deal with a sub-cluster TRIM, look at the surrounding sectors. If
they're zero, free the cluster. If not, write zeros or use
sys_punch() to the range specified by TRIM.
Better yet, if you can't trim a full cluster, just write out zeros and
have a separate background process that punches out zero clusters.

That approach is a bit more generic and will help compact images
independently of guest trims.
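
A rough sketch of such a background pass, again with assumed helpers
rather than the real block-layer interfaces:

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

struct image;                                      /* opaque image state */
uint64_t image_size(struct image *s);
int read_cluster(struct image *s, uint64_t off, uint8_t *buf, size_t len);
int free_cluster(struct image *s, uint64_t off);   /* drop the L2 mapping */

/* Walk the image and unmap clusters that contain only zeroes.  Because
 * it runs in the background, guests that never issue TRIM still get
 * their images compacted over time. */
int compact_zero_clusters(struct image *s, uint64_t cluster_size,
                          uint8_t *buf)
{
    for (uint64_t off = 0; off < image_size(s); off += cluster_size) {
        if (read_cluster(s, off, buf, cluster_size) < 0) {
            return -1;
        }
        bool all_zero = true;
        for (uint64_t i = 0; i < cluster_size; i++) {
            if (buf[i] != 0) {
                all_zero = false;
                break;
            }
        }
        if (all_zero && free_cluster(s, off) < 0) {
            return -1;
        }
    }
    return 0;
}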

Regards,

Anthony Liguori
Christoph Hellwig
2010-09-10 13:48:59 UTC
Permalink
Post by Anthony Liguori
fsck will always be fast on qed because the metadata is small. For a
1PB image, there's 128MB worth of L2s if it's fully allocated (keeping
in mind, that once you're fully allocated, you'll never fsck again). If
you've got 1PB worth of storage, I'm fairly sure you're going to be able
to do 128MB of reads in a short period of time. Even if it's a few
seconds, it only occurs on power failure so it's pretty reasonable.
I don't think it is. Even if the metadata is small it can still be
spread all over the disks and seek latencies might kill you. I think
if we want to make qed future proof it needs to provide transactional
integrity for metadata updates, just like a journaling filesystem.
Given the small amount of metadata and the fewer kinds of it, it will
still be a lot simpler than a full filesystem, of course.
Anthony Liguori
2010-09-10 15:02:35 UTC
Permalink
Post by Christoph Hellwig
Post by Anthony Liguori
fsck will always be fast on qed because the metadata is small. For a
1PB image, there's 128MB worth of L2s if it's fully allocated (keeping
in mind, that once you're fully allocated, you'll never fsck again). If
you've got 1PB worth of storage, I'm fairly sure you're going to be able
to do 128MB of reads in a short period of time. Even if it's a few
seconds, it only occurs on power failure so it's pretty reasonable.
I don't think it is. Even if the metadata is small it can still be
spread all over the disks and seek latencies might kill you. I think
if we want to make qed future proof it needs to provide transactional
integrity for metadata updates, just like a journaling filesystem.
I think the biggest challenge with an image format is finding the
balance between host FS features and image format features and deciding
where to solve problems.

Down the road, fsync() might not actually suck on file systems and
recovery in the face of failure might be trivial because we can just
fsync() after every metadata write. So going to great lengths to deal
with metadata transactions may be a lot of work for little gain.

What makes us future-proof is having good feature support. qcow2
doesn't have this. We have a good way of making purely informational
changes and also of making changes that break the format. Those features
are independent so they can be backported in a compatible way too.
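
As a rough illustration of that split (the mask names are assumptions,
not necessarily the fields in this patch), the open-time gate can be as
simple as:

#include <stdint.h>
#include <errno.h>

#define FEATURES_KNOWN 0x0ULL  /* incompatible feature bits this binary implements */

/* Purely informational (compatible) bits can be ignored by older code;
 * unknown incompatible bits must cause the open to fail. */
int check_image_features(uint64_t incompat, uint64_t compat)
{
    (void)compat;              /* safe to ignore unknown compatible bits */
    if (incompat & ~FEATURES_KNOWN) {
        return -ENOTSUP;
    }
    return 0;
}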

Regards,

Anthony Liguori
Post by Christoph Hellwig
Given that small amount of metadata and less different kinds it will
still be a lot simpler than a full filesystem of course.
Kevin Wolf
2010-09-10 15:18:57 UTC
Permalink
Post by Anthony Liguori
What makes us future-proof is having good feature support. qcow2
doesn't have this. We have a good way of making purely informational
changes and also of making changes that break the format. Those features
are independent so they can be backported in a compatible way too.
I might have agreed that it's useful to be able to backport them
independently if we had had lots of such features added in the past. But
we haven't.

The qcow2 mechanism for compatible changes is header extensions (used
exactly once, for the backing file format) and for incompatible changes
increasing the version number (never used so far, if you consider qcow1
and qcow2 completely independent formats, which I think they are).

Kevin
Avi Kivity
2010-09-10 14:02:48 UTC
Permalink
Post by Anthony Liguori
Post by Avi Kivity
Looks like it depends on fsck, which is not a good idea for large images.
fsck will always be fast on qed because the metadata is small. For a
1PB image, there's 128MB worth of L2s if it's fully allocated
It's 32,000 seeks.
Post by Anthony Liguori
(keeping in mind, that once you're fully allocated, you'll never fsck
again).
Why? Fully populated L1 (so all L2s are allocated) doesn't mean a fully
allocated image. You're still allocating and linking into L2s.
Post by Anthony Liguori
If you've got 1PB worth of storage, I'm fairly sure you're going to
be able to do 128MB of reads in a short period of time. Even if it's
a few seconds, it only occurs on power failure so it's pretty reasonable.
Consider a cloud recovering from power loss: even if you're fscking
thousands of 100GB images, you'll create a horrible seek storm on your
storage (to be followed by a seek storm from all the guests booting).

No, fsck is not a good idea.
Post by Anthony Liguori
Post by Avi Kivity
Post by Stefan Hajnoczi
I need to look at the actual ATA and SCSI specs for how this will
work. The issue I am concerned with is sub-cluster trim operations.
If the trim region is less than a cluster, then both qed and qcow2
don't really have a way to handle it. Perhaps we could punch a hole
in the file, given a userspace interface to do this, but that isn't
ideal because we're losing sparseness again.
To deal with a sub-cluster TRIM, look at the surrounding sectors. If
they're zero, free the cluster. If not, write zeros or use
sys_punch() to the range specified by TRIM.
Better yet, if you can't trim a full cluster, just write out zeros and
have a separate background process that punches out zero clusters.
That can work as well, or a combination perhaps.
Post by Anthony Liguori
That approach is a bit more generic and will help compact images
independently of guest trims.
You still need a freelist.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Christoph Hellwig
2010-09-10 13:47:01 UTC
Permalink
Post by Stefan Hajnoczi
btw, despite being not properly designed, qcow2 is able to support TRIM.
qed isn't able to, except by leaking clusters on shutdown. TRIM support is
required unless you're okay with the image growing until it is no longer
sparse (the lack of TRIM support in guests make sparse image formats
somewhat of a joke, but nobody seems to notice).
http://wiki.qemu.org/Features/QED/Trim
I need to look at the actual ATA and SCSI specs for how this will
work. The issue I am concerned with is sub-cluster trim operations.
If the trim region is less than a cluster, then both qed and qcow2
don't really have a way to handle it. Perhaps we could punch a hole
in the file, given a userspace interface to do this, but that isn't
ideal because we're losing sparseness again.
ATA TRIM doesn't have a granularity; it's always sector sized. SCSI
WRITE SAME with the unmap bit or UNMAP, as well as my virtio_blk support
for discarding blocks, export topology information about the required
minimum discard request size. I export it from qemu the same way as
we export other topology information, and at least Linux hosts can use
it.

Note that ATA allows simply ignoring TRIM requests that we can't handle,
and if we don't set the bit that guarantees TRIMed regions to be zeroed
we don't even have to zero out the regions.
Avi Kivity
2010-09-10 14:05:16 UTC
Permalink
Post by Christoph Hellwig
Post by Stefan Hajnoczi
btw, despite being not properly designed, qcow2 is able to support TRIM.
qed isn't able to, except by leaking clusters on shutdown. TRIM support is
required unless you're okay with the image growing until it is no longer
sparse (the lack of TRIM support in guests make sparse image formats
somewhat of a joke, but nobody seems to notice).
http://wiki.qemu.org/Features/QED/Trim
I need to look at the actual ATA and SCSI specs for how this will
work. The issue I am concerned with is sub-cluster trim operations.
If the trim region is less than a cluster, then both qed and qcow2
don't really have a way to handle it. Perhaps we could punch a hole
in the file, given a userspace interface to do this, but that isn't
ideal because we're losing sparseness again.
ATA TRIM doesn't have a granularity, it's always sector sized. SCSI
WRITE SAME with the unmap bit or UNMAP as well as my virtio_blk support
for discarding blocks export topology information about the required
minimum discard request size. I export it from qemu the same way as
we export other topology information and at least Linux hosts can use
it.
Ok, thanks for the correction.
Post by Christoph Hellwig
Note that ATA allows simply ignoring TRIM requests that we can't handle,
and if we don't set the bit that guarantees TRIMed regions to be zeroed
we don't even have to zero out the regions.
It would be nice to support it. TRIM is important to recover space;
otherwise images grow and grow and there's no point in using a sparse
format in the first place.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Christoph Hellwig
2010-09-10 14:12:42 UTC
Permalink
Post by Avi Kivity
Post by Christoph Hellwig
Note that ATA allows simply ignoring TRIM requests that we can't handle,
and if we don't set the bit that guarantees TRIMed regions to be zeroed
we don't even have to zero out the regions.
It would be nice to support it. TRIM is important to recover space,
otherwise images grow and grow and there's no point in using a sparse
format in the first place.
Sure. But supporting tiny TRIM requests doesn't make sense. That
is the same behaviour we see from real life SSDs, btw. If the request
is smaller than their erase block size or whatever internal structure
they use to track allocations it will not actually free space. On some
of the lesser quality consumer SSDs the sectors won't even be zeroed
even if they claim so in the ATA IDENTIFY response.
Avi Kivity
2010-09-10 14:24:15 UTC
Permalink
Post by Christoph Hellwig
Post by Avi Kivity
Post by Christoph Hellwig
Note that ATA allows simply ignoring TRIM requests that we can't handle,
and if we don't set the bit that guarantees TRIMed regions to be zeroed
we don't even have to zero out the regions.
It would be nice to support it. TRIM is important to recover space,
otherwise images grow and grow and there's no point in using a sparse
format in the first place.
Sure. But supporting tiny TRIM requests doesn't make sense. That
is the same behaviour we see from real life SSDs, btw. If the request
is smaller than their erase block size or whatever internal structure
they use to track allocations it will not actually free space. On some
of the lesser quality consumer SSDs the sectors won't even be zeroed
even if they claim so in the ATA IDENTIFY response.
Okay. Let's concentrate on those UNMAP requests that seem better designed.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Anthony Liguori
2010-09-10 13:16:54 UTC
Permalink
Post by Avi Kivity
Post by Avi Kivity
Post by Anthony Liguori
qcow2 is not a properly designed image format. It was a weekend
hacking session from Fabrice that he dropped in the code base and
never really finished doing what he originally intended. The
improvements that have been made to it are almost at the heroic
level but we're only hurting our users by not moving on to something
better.
I don't like qcow2 either. But from a performance perspective, it
can be made equivalent to qed with some effort. It is worthwhile to
expend that effort rather than push the burden to users.
btw, despite being not properly designed, qcow2 is able to support
TRIM. qed isn't able to, except by leaking clusters on shutdown.
TRIM support is required unless you're okay with the image growing
until it is no longer sparse (the lack of TRIM support in guests make
sparse image formats somewhat of a joke, but nobody seems to notice).
It's actually pretty easy in QED and it should perform very well.

http://wiki.qemu.org/Features/QED/Trim

Regards,

Anthony Liguori
Avi Kivity
2010-09-10 14:06:49 UTC
Permalink
Post by Anthony Liguori
Post by Avi Kivity
btw, despite being not properly designed, qcow2 is able to support
TRIM. qed isn't able to, except by leaking clusters on shutdown.
TRIM support is required unless you're okay with the image growing
until it is no longer sparse (the lack of TRIM support in guests make
sparse image formats somewhat of a joke, but nobody seems to notice).
It's actually pretty easy in QED and it should perform very well.
http://wiki.qemu.org/Features/QED/Trim
If you don't add a free list, this is a pretty bad implementation. If
you do, you're back to qcow2's problems.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Stefan Hajnoczi
2010-09-10 11:43:10 UTC
Permalink
Post by Avi Kivity
Post by Avi Kivity
Loading very large L2 tables on demand will result in very long
latencies.  Increasing cluster size will result in very long first write
latencies.  Adding an extra level results in an extra random write every
4TB.
It would be trivially easy to add another level of tables as a feature bit
so let's delay the decision.
It means that you'll need to upgrade qemu to read certain images, but okay.
Post by Avi Kivity
Post by Anthony Liguori
qed is very careful about ensuring that we don't need to do syncs and we
don't get corruption because of data loss.  I don't necessarily buy your
checksumming argument.
The requirement for checksumming comes from a different place.  For
decades we've enjoyed very low undetected bit error rates.  However the
actual amount of data is increasing to the point that it makes an
undetectable bit error likely, just by throwing a huge amount of bits at
storage.  Write ordering doesn't address this issue.
I don't think we should optimize an image format for cheap disks and an
old file system.
We should optimize for the future.  That means a btrfs file system
I wouldn't use an image format at all with btrfs.
and/or enterprise storage.
That doesn't eliminate undiscovered errors (they can still come from the
transport).
Eliminating silent data corruption is currently not a goal for any
disk image format I know of. For filesystems, I know that ZFS and
btrfs will try to detect corruption using data checksumming.

The guest filesystem, the disk image format, or the host filesystem
could do checksumming. The hypervisor should keep out of the way in
the interest of performance and emulation fidelity. Why does
checksumming need to be done in the image format? Isn't the choice
between host and guest filesystem checksumming already enough?
Post by Avi Kivity
The point of an image format is not to recreate btrfs in software.  It's
to provide a mechanism to allow users to move images around reasonable but
once an image is present on a reasonable filesystem, we should more or less
get the heck out of the way.
You can achieve exactly the same thing with qcow2.  Yes, it's more work, but
it's also less disruptive to users.
Post by Avi Kivity
Post by Anthony Liguori
By creating two code paths within qcow2.
You're creating two code paths for users.
No, I'm creating a single path: QED.
There are already two code paths: raw and qcow2.  qcow2 has had such a bad
history that for a lot of users, it's not even a choice.
qcow2 exists, people use it, and by the time qed is offered on distros (even
more on enterprise distros), there will be a lot more qcow2 images.  Not
everyone runs qemu.git HEAD.
What will you tell those people?  Upgrade your image?  They may still want
to share it with older installations.  What if they use features not present
in qed?  Bad luck?
qcow2 is going to live forever no matter what we do.
It should be possible to do (live) upgrades for supported images.
Post by Avi Kivity
Today, users have to choose between performance and reliability or
features.  QED offers an opportunity to be able to tell users to just always
use QED as an image format and forget about raw/qcow2/everything else.
raw will always be needed for direct volume access and shared storage.
 qcow2 will always be needed for old images.
You can say, let's just make qcow2 better, but we've been trying that for
years and we have an existence proof that we can do it in a straight forward
fashion with QED.
When you don't use the extra qcow2 features, it has the same performance
characteristics as qed.  You need to batch allocation and freeing, but
that's fairly straightforward.
Yes, qcow2 has a long and tortured history and qed is perfect.  Starting
from scratch is always easier and more fun.  Except for the users.
A new format doesn't introduce much additional complexity.  We provide
image conversion tool and we can almost certainly provide an in-place
conversion tool that makes the process very fast.
It introduces a lot of complexity for the users who aren't qed experts.
 They need to make a decision.  What's the impact of the change?  Are the
features that we lose important to us?  Do we know what they are?  Is there
any risk?  Can we make the change online or do we have to schedule downtime?
 Do all our hosts support qed?
Improving qcow2 will be very complicated for Kevin who already looks older
beyond his years [1] but very simple for users.
Post by Avi Kivity
It requires users to make a decision.  By the time qed is ready for mass
deployment, 1-2 years will have passed.  How many qcow2 images will be in
the wild then?  How much scheduled downtime will be needed?
Zero if we're smart.  You can do QED stream + live migration to do a live
conversion from raw to QED.
Not all installations use live migration (say, desktop users).
Post by Avi Kivity
 How much user confusion will be caused?
User confusion is reduced if we can make strong, clear statements: all
users should use QED even if they care about performance.  Today, there's
mass confusion because of the poor state of qcow2.
If we improve qcow2 and make the same strong, clear statement we'll have the
same results.
Post by Avi Kivity
Virtualization is about compatibility.  In-guest compatibility first, but
keeping the external environment stable is also important.  We really need
to exhaust the possibilities with qcow2 before giving up on it.
IMHO, we're long past exhausting the possibilities with qcow2.  We still
haven't decided what we're going to do for 0.13.0.
Sorry, I disagree 100%.  How can you say that, when no one has yet tried,
for example, batching allocations and frees?  Or properly threaded it?
What we've done is make qcow2 safe and a more parallel than it was.  But
"exhaust all possibilities"? not even close.
Are we going to ship qcow2 with awful performance (a 15 minute operation
taking hours) or with compromised data integrity?
We're going to fix it.
It's been this way for every release since qcow2 existed.  Let's not let
sunk cost cloud our judgement here.
Yes, new and shiny is always better.
qcow2 is not a properly designed image format.  It was a weekend hacking
session from Fabrice that he dropped in the code base and never really
finished doing what he originally intended.  The improvements that have been
made to it are almost at the heroic level but we're only hurting our users
by not moving on to something better.
I don't like qcow2 either.  But from a performance perspective, it can be
made equivalent to qed with some effort.  It is worthwhile to expend that
effort rather than push the burden to users.
Regards,
Anthony Liguori
[1] okay, maybe not.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Avi Kivity
2010-09-10 12:06:30 UTC
Permalink
Post by Stefan Hajnoczi
Post by Avi Kivity
Post by Anthony Liguori
and/or enterprise storage.
That doesn't eliminate undiscovered errors (they can still come from the
transport).
Eliminating silent data corruption is currently not a goal for any
disk image format I know of. For filesystems, I know that ZFS and
btrfs will try to detect corruption using data checksumming.
The guest filesystem, the disk image format, or the host filesystem
could do checksumming. The hypervisor should keep out of the way in
the interest of performance and emulation fidelity. Why does
checksumming need to be done in the image format? Isn't the choice
between host and guest filesystem checksumming already enough?
You're correct about the data. It's better to do it at the end-point in
any case.

The metadata is something else - an error in a cluster table is
magnified so it is likely to cause the loss of an entire image, and
there's nothing the guest can do about it. btrfs duplicates metadata to
avoid this (but if we have btrfs underneath, we can just use raw).
Post by Stefan Hajnoczi
Post by Avi Kivity
qcow2 exists, people use it, and by the time qed is offered on distros (even
more on enterprise distros), there will be a lot more qcow2 images. Not
everyone runs qemu.git HEAD.
What will you tell those people? Upgrade your image? They may still want
to share it with older installations. What if they use features not present
in qed? Bad luck?
qcow2 is going to live forever no matter what we do.
It should be possible to do (live) upgrades for supported images.
That only solves part of the problem.

Please TRIM below the last line of your message.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Anthony Liguori
2010-09-10 13:28:19 UTC
Permalink
Post by Avi Kivity
Post by Stefan Hajnoczi
Post by Avi Kivity
Post by Anthony Liguori
and/or enterprise storage.
That doesn't eliminate undiscovered errors (they can still come from the
transport).
Eliminating silent data corruption is currently not a goal for any
disk image format I know of. For filesystems, I know that ZFS and
btrfs will try to detect corruption using data checksumming.
The guest filesystem, the disk image format, or the host filesystem
could do checksumming. The hypervisor should keep out of the way in
the interest of performance and emulation fidelity. Why does
checksumming need to be done in the image format? Isn't the choice
between host and guest filesystem checksumming already enough?
You're correct about the data. It's better to do it at the end-point
in any case.
The metadata is something else - an error in a cluster table is
magnified so it is likely to cause the loss of an entire image, and
there's nothing the guest can do about it. btrfs duplicates metadata
to avoid this (but if we have btrfs underneath, we can just use raw).
What it really comes down to is that checksumming is a filesystem
feature that requires a sophisticated way of handling metadata which
puts it beyond the scope of what an image format should be.

The point of an image format is to be a filesystem from 10 years ago in
terms of sophistication, and to leave the cutting-edge file system
research to file system developers.

Regards,

Anthony Liguori
Kevin Wolf
2010-09-10 12:12:40 UTC
Permalink
Post by Stefan Hajnoczi
Post by Avi Kivity
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
By creating two code paths within qcow2.
You're creating two code paths for users.
No, I'm creating a single path: QED.
There are already two code paths: raw and qcow2. qcow2 has had such a bad
history that for a lot of users, it's not even a choice.
qcow2 exists, people use it, and by the time qed is offered on distros (even
more on enterprise distros), there will be a lot more qcow2 images. Not
everyone runs qemu.git HEAD.
What will you tell those people? Upgrade your image? They may still want
to share it with older installations. What if they use features not present
in qed? Bad luck?
qcow2 is going to live forever no matter what we do.
It should be possible to do (live) upgrades for supported images.
That still leaves those qcow2 images that use features not supported by
qed. Just a few of the features missing in qed are internal snapshots,
qcow2 on block devices, compression, and encryption. So qed can't be a
complete replacement for qcow2 (and that was the whole point of doing
qed). If anything, it can exist alongside qcow2.

Kevin
Stefan Hajnoczi
2010-09-10 12:35:05 UTC
Permalink
Post by Kevin Wolf
Post by Stefan Hajnoczi
Post by Avi Kivity
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
By creating two code paths within qcow2.
You're creating two code paths for users.
No, I'm creating a single path: QED.
There are already two code paths: raw and qcow2.  qcow2 has had such a bad
history that for a lot of users, it's not even a choice.
qcow2 exists, people use it, and by the time qed is offered on distros (even
more on enterprise distros), there will be a lot more qcow2 images.  Not
everyone runs qemu.git HEAD.
What will you tell those people?  Upgrade your image?  They may still want
to share it with older installations.  What if they use features not present
in qed?  Bad luck?
qcow2 is going to live forever no matter what we do.
It should be possible to do (live) upgrades for supported images.
That still leaves those qcow2 images that use features not supported by
qed. Just a few features missing in qed are internal snapshots, qcow2 on
block devices, compression, encryption. So qed can't be a complete
replacement for qcow2 (and that was the whole point of doing qed). If
anything, it can exist besides qcow2.
qcow2 is a feature-driven format. It sacrifices some of the core
qualities of an image format in exchange for advanced features. I
like to use qcow2 myself for desktop virtualization.

qed applies the 80/20 rule to disk image formats. Let's perfect the
basics for most users at a fraction of the {development,performance}
cost.

Then, with a clean base that takes on board the lessons of existing
formats it is much easier to innovate. Look at the image streaming,
defragmentation, and trim ideas that are playing out right now. I
think the reason we haven't seen them before is that the effort and
the baggage of doing them is too great. Sure, we maintain existing
formats, but I don't see active development that pushes virtualized
storage forward.

Do you think qcow2 is the right format for the future? The flagship
disk image format for KVM?

Stefan
Avi Kivity
2010-09-10 12:47:01 UTC
Permalink
Post by Stefan Hajnoczi
Post by Kevin Wolf
That still leaves those qcow2 images that use features not supported by
qed. Just a few features missing in qed are internal snapshots, qcow2 on
block devices, compression, encryption. So qed can't be a complete
replacement for qcow2 (and that was the whole point of doing qed). If
anything, it can exist besides qcow2.
qcow2 is a feature-driven format. It sacrifices some of the core
qualities of an image format in exchange for advanced features. I
like to use qcow2 myself for desktop virtualization.
qed applies the 80/20 rule to disk image formats. Let's perfect the
basics for most users at a fraction of the {development,performance}
cost.
Then, with a clean base that takes on board the lessons of existing
formats it is much easier to innovate. Look at the image streaming,
defragmentation, and trim ideas that are playing out right now. I
think the reason we haven't seen them before is because the effort and
the baggage of doing them is too great. Sure, we maintain existing
formats but I don't see active development pushing virtualized storage
happening.
The same could be said about much of qemu. It is an old code base that
wasn't designed for virtualization. Yet we maintain it and develop it
because compatibility is king.

(as an aside, qcow2 is better positioned for TRIM support than qed is)
Post by Stefan Hajnoczi
Do you think qcow2 is the right format for the future? The flagship
disk image format for KVM?
If we were starting from scratch, no. But we aren't starting from scratch.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Stefan Hajnoczi
2010-09-10 13:10:53 UTC
Permalink
Post by Kevin Wolf
That still leaves those qcow2 images that use features not supported by
qed. Just a few features missing in qed are internal snapshots, qcow2 on
block devices, compression, encryption. So qed can't be a complete
replacement for qcow2 (and that was the whole point of doing qed). If
anything, it can exist besides qcow2.
qcow2 is a feature-driven format.  It sacrifices some of the core
qualities of an image format in exchange for advanced features.  I
like to use qcow2 myself for desktop virtualization.
qed applies the 80/20 rule to disk image formats.  Let's perfect the
basics for most users at a fraction of the {development,performance}
cost.
Then, with a clean base that takes on board the lessons of existing
formats it is much easier to innovate.  Look at the image streaming,
defragmentation, and trim ideas that are playing out right now.  I
think the reason we haven't seen them before is because the effort and
the baggage of doing them is too great.  Sure, we maintain existing
formats but I don't see active development pushing virtualized storage
happening.
The same could be said about much of qemu.  It is an old code base that
wasn't designed for virtualization.  Yet we maintain it and develop it
because compatibility is king.
For compatibility? I figured the amount of effort to implement all
the device emulation and BIOS was not deemed worth starting from
scratch.

Stefan
Avi Kivity
2010-09-10 13:19:15 UTC
Permalink
Post by Stefan Hajnoczi
Post by Avi Kivity
Post by Stefan Hajnoczi
Post by Kevin Wolf
That still leaves those qcow2 images that use features not supported by
qed. Just a few features missing in qed are internal snapshots, qcow2 on
block devices, compression, encryption. So qed can't be a complete
replacement for qcow2 (and that was the whole point of doing qed). If
anything, it can exist besides qcow2.
qcow2 is a feature-driven format. It sacrifices some of the core
qualities of an image format in exchange for advanced features. I
like to use qcow2 myself for desktop virtualization.
qed applies the 80/20 rule to disk image formats. Let's perfect the
basics for most users at a fraction of the {development,performance}
cost.
Then, with a clean base that takes on board the lessons of existing
formats it is much easier to innovate. Look at the image streaming,
defragmentation, and trim ideas that are playing out right now. I
think the reason we haven't seen them before is because the effort and
the baggage of doing them is too great. Sure, we maintain existing
formats but I don't see active development pushing virtualized storage
happening.
The same could be said about much of qemu. It is an old code base that
wasn't designed for virtualization. Yet we maintain it and develop it
because compatibility is king.
For compatibility? I figured the amount of effort to implement all
the device emulation and BIOS was not deemed worth starting from
scratch.
You're right. Even if someone did suggest reimplementing it because it
sucks, we'd cry foul because of the risk to compatibility.

My chief complaint against vbus was compatibility, and while qed isn't
in exactly the same position (we're a lot more flexible on the host than
on the guest), it does put a burden on users.

I don't see how qed has any inherent performance advantage; it is
essentially the same as qcow2 minus refcounting, which is easily
batched. It's a lot easier to work with, both because it's a new code
base and because it's simpler, but both of these will erode in time.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Anthony Liguori
2010-09-10 13:39:21 UTC
Permalink
Post by Avi Kivity
Post by Stefan Hajnoczi
Then, with a clean base that takes on board the lessons of existing
formats it is much easier to innovate. Look at the image streaming,
defragmentation, and trim ideas that are playing out right now. I
think the reason we haven't seen them before is because the effort and
the baggage of doing them is too great. Sure, we maintain existing
formats but I don't see active development pushing virtualized storage
happening.
The same could be said about much of qemu. It is an old code base
that wasn't designed for virtualization. Yet we maintain it and
develop it because compatibility is king.
(as an aside, qcow2 is better positioned for TRIM support than qed is)
You're hand waving to a dangerous degree here :-)

TRIM in qcow2 would require the following sequence:

1) remove cluster from L2 table
2) sync()
3) reduce cluster reference count
4) sync()

TRIM needs to be fast so this is not going to be acceptable. How do you
solve it?

For QED, TRIM requires:

1) remove cluster from L2 table
2) sync()

In both cases, I'm assuming we lazily write the free list and have a way
to detect unclean mounts. Unclean mounts require an fsck() and both
qcow2 and qed require it.
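
Spelled out as code, the two sequences above look something like this
(sketch only; the helpers are assumptions and flush() stands in for an
fsync()/bdrv_flush()):

#include <stdint.h>

void clear_l2_entry(uint64_t cluster);     /* assumed helpers      */
void decrement_refcount(uint64_t cluster);
void flush(void);                          /* fsync()/bdrv_flush() */

void qcow2_trim_cluster(uint64_t cluster)
{
    clear_l2_entry(cluster);       /* 1) unmap the cluster               */
    flush();                       /* 2) L2 change must be stable ...    */
    decrement_refcount(cluster);   /* 3) ... before the cluster is freed */
    flush();                       /* 4) ... and before it can be reused */
}

void qed_trim_cluster(uint64_t cluster)
{
    clear_l2_entry(cluster);       /* 1) the L2 table is the only
                                    *    authoritative metadata          */
    flush();                       /* 2) free space is recovered lazily,
                                    *    e.g. by fsck after an unclean
                                    *    shutdown                        */
}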

You can drop the last sync() in both QED and qcow2 by delaying the sync()
until you reallocate the cluster. If you sync() for some other reason
before then, you can avoid it completely.

I don't think you can remove (2) from qcow2 TRIM.

This is the key feature of qed. Because there's only one piece of
metadata, you never have to worry about metadata ordering. You can
amortize the cost of metadata ordering in qcow2 by batching certain
operations but not all operations are easily batched.

Maybe you could batch trim operations and attempt to do them all at
once. But then you need to track future write requests in order to make
sure you don't trim over a new write.

When it comes to data integrity, increased complexity == increased
chance of screwing up.

Regards,

Anthony Liguori
Christoph Hellwig
2010-09-10 13:52:03 UTC
Permalink
Post by Anthony Liguori
You're hand waving to a dangerous degree here :-)
1) remove cluster from L2 table
2) sync()
3) reduce cluster reference count
4) sync()
TRIM needs to be fast so this is not going to be acceptable. How do you
solve it?
It's utterly slow in any real life SSD.
Post by Anthony Liguori
1) remove cluster from L2 table
2) sync()
In both cases, I'm assuming we lazily write the free list and have a way
to detect unclean mounts. Unclean mounts require an fsck() and both
qcow2 and qed require it.
If you do proper transactional metadata updates you can completely drop
the sync. TRIM / SCSI unmap are optimizations that can just be noops
without compromising data integrity.
Avi Kivity
2010-09-10 13:56:33 UTC
Permalink
Post by Anthony Liguori
Post by Avi Kivity
Post by Stefan Hajnoczi
Then, with a clean base that takes on board the lessons of existing
formats it is much easier to innovate. Look at the image streaming,
defragmentation, and trim ideas that are playing out right now. I
think the reason we haven't seen them before is because the effort and
the baggage of doing them is too great. Sure, we maintain existing
formats but I don't see active development pushing virtualized storage
happening.
The same could be said about much of qemu. It is an old code base
that wasn't designed for virtualization. Yet we maintain it and
develop it because compatibility is king.
(as an aside, qcow2 is better positioned for TRIM support than qed is)
You're hand waving to a dangerous degree here :-)
1) remove cluster from L2 table
2) sync()
3) reduce cluster reference count
4) sync()
TRIM needs to be fast so this is not going to be acceptable. How do
you solve it?
Batching. Of course, you don't reuse the cluster until you've synced.

Note the whole thing can happen in the background. You issue the sync,
but the waiting isn't exposed to the guest.

Freeing and allocation are both easy to batch since they're not
guest-visible operations.
Post by Anthony Liguori
1) remove cluster from L2 table
2) sync()
In both cases, I'm assuming we lazily write the free list and have a
way to detect unclean mounts.
You don't have a free list in qed.
Post by Anthony Liguori
Unclean mounts require an fsck() and both qcow2 and qed require it.
qcow2 does not require an fsck (and neither does qed if it properly
preallocates).
Post by Anthony Liguori
You can drop the last sync() in both QEDand qcow2 by delaying the
sync() until you reallocate the cluster. If you sync() for some other
reason before then, you can avoid it completely.
I don't think you can remove (2) from qcow2 TRIM.
Why not? If the guest writes to the same logical sector, you reallocate
that cluster and update L2. All you need to make sure is that the
refcount table is not updated and synced until L2 has been synced
(directly or as a side effect of a guest sync).
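For illustration, a rough sketch of the deferral Avi describes; the
structures and names are invented, not the actual qcow2 code. The idea
is that refcount changes sit in memory and are only written out after
something else has already made the corresponding L2 updates durable.

#include <stdint.h>

/* Invented bookkeeping: refcount changes queued in memory instead of
 * being written and synced immediately. */
struct refcount_update {
    uint64_t cluster;
    int      delta;       /* -1 for a trimmed cluster, +1 for a new one */
};

struct refcount_queue {
    struct refcount_update pending[256];
    int n;
};

/* TRIM path: the caller clears the L2 entry as usual but only queues
 * the refcount change; nothing extra is written or synced here. */
static int queue_refcount_update(struct refcount_queue *q,
                                 uint64_t cluster, int delta)
{
    if (q->n == 256)
        return -1;        /* queue full: caller flushes early */
    q->pending[q->n].cluster = cluster;
    q->pending[q->n].delta = delta;
    q->n++;
    return 0;
}

/* Called after a sync that is already known to have made the L2 updates
 * durable (a guest flush, or one issued in the background): only now is
 * the batch applied to the refcount table and synced, once for all of it. */
static void flush_refcount_queue(struct refcount_queue *q /* , image */)
{
    /* apply q->pending[0..q->n) to the on-disk refcount table, then sync */
    q->n = 0;
}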
Post by Anthony Liguori
This is the key feature of qed. Because there's only one piece of
metadata, you never have to worry about metadata ordering. You can
amortize the cost of metadata ordering in qcow2 by batching certain
operations but not all operations are easily batched.
Unless you introduce a freelist, in which case you have exactly the same
problems as qcow2 (perhaps with a better on-disk data structure). If
you don't introduce a freelist, you have unbounded leakage on power
failure. With a freelist you can always limit the amount of leakage.
Post by Anthony Liguori
Maybe you could batch trim operations and attempt to do them all at
once. But then you need to track future write requests in order to
make sure you don't trim over a new write.
Yes.
Post by Anthony Liguori
When it comes to data integrity, increased complexity == increased
chance of screwing up.
True.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Kevin Wolf
2010-09-10 13:48:30 UTC
Permalink
Post by Stefan Hajnoczi
Post by Kevin Wolf
Post by Stefan Hajnoczi
Post by Avi Kivity
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
By creating two code paths within qcow2.
You're creating two code paths for users.
No, I'm creating a single path: QED.
There are already two code paths: raw and qcow2. qcow2 has had such a bad
history that for a lot of users, it's not even a choice.
qcow2 exists, people use it, and by the time qed is offered on distros (even
more on enterprise distros), there will be a lot more qcow2 images. Not
everyone runs qemu.git HEAD.
What will you tell those people? Upgrade your image? They may still want
to share it with older installations. What if they use features not present
in qed? Bad luck?
qcow2 is going to live forever no matter what we do.
It should be possible to do (live) upgrades for supported images.
That still leaves those qcow2 images that use features not supported by
qed. Just a few features missing in qed are internal snapshots, qcow2 on
block devices, compression, encryption. So qed can't be a complete
replacement for qcow2 (and that was the whole point of doing qed). If
anything, it can exist besides qcow2.
qcow2 is a feature-driven format. It sacrifices some of the core
qualities of an image format in exchange for advanced features. I
like to use qcow2 myself for desktop virtualization.
qed applies the 80/20 rule to disk image formats. Let's perfect the
basics for most users at a fraction of the {development,performance}
cost.
So let's translate this into an answer to the question we're discussing
here: Yes, Avi is right, qcow2 is going to live forever.
Post by Stefan Hajnoczi
Then, with a clean base that takes on board the lessons of existing
formats it is much easier to innovate. Look at the image streaming,
defragmentation, and trim ideas that are playing out right now.
All of these are possible with qcow2 as well or even better than with
qed. For example, TRIM feels like a really hacky thing in qed, whereas
freeing a cluster is just natural in qcow2.

Kevin
Anthony Liguori
2010-09-10 13:14:40 UTC
Permalink
Post by Avi Kivity
Post by Anthony Liguori
The point of an image format is not to recreate btrfs in software.
It's to provide a mechanism to allow users to move images around
reasonable but once an image is present on a reasonable filesystem,
we should more or less get the heck out of the way.
You can achieve exactly the same thing with qcow2. Yes, it's more
work, but it's also less disruptive to users.
This is turning dangerously close into a vbus vs. virtio discussion :-)

Let me review the motivation for QED and why we've decided incremental
improvements to qcow2 were not viable.

1) qcow2 has awful performance characteristics

2) qcow2 has historically had data integrity issues. It's unclear
whether anyone is willing to say that they're 100% confident that there
are no remaining data integrity issues in the format.

3) The users I care most about are absolutely uncompromising about data
integrity. There is no room for uncertainty or trade offs when you're
building an enterprise product.

4) We have looked at trying to fix qcow2. It appears to be a monumental
amount of work that starts with a rewrite where it's unclear if we can
even keep supporting all of the special features. IOW, there is likely
to be a need for users to experience some type of image conversion or
optimization process.

5) A correct version of qcow2 has terrible performance. You need to do
a bunch of fancy tricks to recover that performance. Every fancy trick
needs to be carefully evaluated with respect to correctness. There's a
large surface area for potential data corruptors.

We're still collecting performance data, but here's an example of what
we're talking about.

FFSB Random Writes MB/s (Block Size=8KB)

             Native    Raw  QCow2    QED
 1 Thread      30.2   24.4   22.7   23.4
 8 Threads    145.1  119.9   10.6  112.9
16 Threads    177.1  139.0   10.1  120.9

The performance difference is an order of magnitude. qcow2 bounces all
requests, needs to issue synchronous metadata updates, and only supports
a single outstanding request at a time.

With good performance and high confidence in integrity, it's a no
brainer as far as I'm concerned. We have a format that is easy to
rationalize as correct and performs damn close to raw. On the other
hand, we have a format that no one is confident is correct, that is
even harder to rationalize as correct, and that is an order of
magnitude off raw in performance.

It's really a no brainer.

The impact to users is minimal. Upgrading images to a new format is not
a big deal. This isn't guest visible and we're not talking about
deleting qcow2 and removing support for it.
Post by Avi Kivity
Post by Anthony Liguori
Today, users have to choose between performance and reliability or
features. QED offers an opportunity to be able to tell users to just
always use QED as an image format and forget about
raw/qcow2/everything else.
raw will always be needed for direct volume access and shared
storage. qcow2 will always be needed for old images.
My point is that for the future, the majority of people no longer have
to think about "do I need performance more than I need sparse images?".

If they have some special use case, fine, but for most people we
simplify their choices.
Post by Avi Kivity
Post by Anthony Liguori
You can say, let's just make qcow2 better, but we've been trying that
for years and we have an existence proof that we can do it in a
straight forward fashion with QED.
When you don't use the extra qcow2 features, it has the same
performance characteristics as qed.
If you're willing to leak blocks on a scale that is still unknown. It's
not at all clear that making qcow2 have the same characteristics as qed
is an easy problem. qed is specifically designed to avoid synchronous
metadata updates. qcow2 cannot achieve that.

You can *potentially* batch metadata updates by preallocating clusters,
but what's the right amount to preallocate and is it really okay to leak
blocks at that scale? It's a weak story either way. There's a burden
of proof still required to establish that this would, indeed, address
the performance concerns.
Post by Avi Kivity
You need to batch allocation and freeing, but that's fairly
straightforward.
Yes, qcow2 has a long and tortured history and qed is perfect.
Starting from scratch is always easier and more fun. Except for the
users.
The fact that you're basing your argument on "think of the users" is
strange because you're advocating not doing something that is going to
be hugely beneficial for our users.

You're really arguing that we should continue only offering a format
with weak data integrity and even weaker performance.
Post by Avi Kivity
Post by Anthony Liguori
A new format doesn't introduce much additional complexity. We
provide image conversion tool and we can almost certainly provide an
in-place conversion tool that makes the process very fast.
It introduces a lot of complexity for the users who aren't qed
experts. They need to make a decision. What's the impact of the
change? Are the features that we lose important to us? Do we know
what they are? Is there any risk? Can we make the change online or
do we have to schedule downtime? Do all our hosts support qed?
It's very simple. Use qed, convert all existing images. Image
conversion is a part of virtualization. We have tools to do it. If
they want to stick with qcow2 and are happy with it, fine, no one is
advocating removing it.

We can solve all possible problems and have images that users can move
back to arbitrarily old versions of qemu with all of the same advantages
of the newer versions. It's not realistic.
Post by Avi Kivity
Improving qcow2 will be very complicated for Kevin who already looks
older beyond his years [1] but very simple for users.
I think we're all better off if we move past sunk costs and focus on
solving other problems. I'd rather we all focus on improving
performance and correctness even further than trying to make qcow2 be as
good as what every other hypervisor had 5 years ago.

qcow2 has been a failure. Let's own up to it and move on. Making
statements at each release that qcow2 has issues but we'll fix it soon
just makes us look like we don't know what we're doing.
Post by Avi Kivity
Post by Anthony Liguori
all users should use QED even if they care about performance. Today,
there's mass confusion because of the poor state of qcow2.
If we improve qcow2 and make the same strong, clear statement we'll
have the same results.
To be honest, the brand is tarnished. Once something gains a reputation
for having poor integrity, it's very hard to overcome that.

Even if you have Kevin spend the next 6 months rewriting qcow2 from
scratch, I'm going to have a hard time convincing customers to trust it.

All someone has to do is look at change logs to see that it has a bad
history. That's more than enough to make people very nervous.
Post by Avi Kivity
Post by Anthony Liguori
Post by Avi Kivity
Virtualization is about compatibility. In-guest compatibility
first, but keeping the external environment stable is also
important. We really need to exhaust the possibilities with qcow2
before giving up on it.
IMHO, we're long past exhausting the possibilities with qcow2. We
still haven't decided what we're going to do for 0.13.0.
Sorry, I disagree 100%. How can you say that, when no one has yet
tried, for example, batching allocations and frees? Or properly
threaded it?
We've spent years trying to address problems in qcow2. And Stefan
specifically has spent a good amount of time trying to fix qcow2. I
know you've spent time trying to thread it too. I don't think you
really grasp how difficult of a problem it is to fix qcow2. It's not
just that the code is bad, the format makes something that should be
simple more complicated than it needs to be.
Post by Avi Kivity
Post by Anthony Liguori
qcow2 is not a properly designed image format. It was a weekend
hacking session from Fabrice that he dropped in the code base and
never really finished doing what he originally intended. The
improvements that have been made to it are almost at the heroic level
but we're only hurting our users by not moving on to something better.
I don't like qcow2 either. But from a performance perspective, it can
be made equivalent to qed with some effort. It is worthwhile to
expend that effort rather than push the burden to users.
The choices we have 1) provide our users a format that has high
performance and good data integrity 2) continue to only offer a format
that has poor performance and bad data integrity and promise that we'll
eventually fix it.

We've been doing (2) for too long now. We need to offer a solution to
users today. It's not fair to our users to not offer them a good
solution just because we don't want to admit to previous mistakes.

If someone can fix qcow2 and make it competitive, by all means, please do.

Regards,

Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
Regards,
Anthony Liguori
[1] okay, maybe not.
Avi Kivity
2010-09-10 13:47:00 UTC
Permalink
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
The point of an image format is not to recreate btrfs in software.
It's to provide a mechanism to allow users to move images around
reasonable but once an image is present on a reasonable filesystem,
we should more or less get the heck out of the way.
You can achieve exactly the same thing with qcow2. Yes, it's more
work, but it's also less disruptive to users.
This is turning dangerously close into a vbus vs. virtio discussion :-)
Let me review the motivation for QED and why we've decided incremental
improvements to qcow2 were not viable.
1) qcow2 has awful performance characteristics
The current qcow2 implementation, yes. The qcow2 format, no.
Post by Anthony Liguori
2) qcow2 has historically had data integrity issues. It's unclear
anyone is willing to say that they're 100% confident that there are
still data integrity issues in the format.
Fast forward a few years, no one will be 100% confident there are no
data integrity issues in qed.
Post by Anthony Liguori
3) The users I care most about are absolutely uncompromising about
data integrity. There is no room for uncertainty or trade offs when
you're building an enterprise product.
100% in agreement here.
Post by Anthony Liguori
4) We have looked at trying to fix qcow2. It appears to be a
monumental amount of work that starts with a rewrite where it's
unclear if we can even keep supporting all of the special features.
IOW, there is likely to be a need for users to experience some type of
image conversion or optimization process.
I don't see why.
Post by Anthony Liguori
5) A correct version of qcow2 has terrible performance.
Not inherently.
Post by Anthony Liguori
You need to do a bunch of fancy tricks to recover that performance.
Every fancy trick needs to be carefully evaluated with respect to
correctness. There's a large surface area for potential data corruptors.
s/large/larger/. The only real difference is the refcount table, which
I agree sucks, but happens to be nice for TRIM support.
Post by Anthony Liguori
We're still collecting performance data, but here's an example of what
we're talking about.
FFSB Random Writes MB/s (Block Size=8KB)
             Native    Raw  QCow2    QED
 1 Thread      30.2   24.4   22.7   23.4
 8 Threads    145.1  119.9   10.6  112.9
16 Threads    177.1  139.0   10.1  120.9
The performance difference is an order of magnitude. qcow2 bounces
all requests, needs to issue synchronous metadata updates, and only
supports a single outstanding request at a time.
Those are properties of the implementation, not the format. The format
makes it harder to get it right but doesn't give us a free pass not to
do it.
Post by Anthony Liguori
With good performance and high confidence in integrity, it's a no
brainer as far as I'm concerned. We have a format that it easy to
rationalize as correct, performs damn close to raw. On the other
hand, we have a format that no one is confident that is correct that
is even harder to rationalize as correct, and is an order of magnitude
off raw in performance.
It's really a no brainer.
Sure, because you don't care about users. All of the complexity of
changing image formats (and deciding whether to do that or not) is
hidden away.
Post by Anthony Liguori
The impact to users is minimal. Upgrading images to a new format is
not a big deal. This isn't guest visible and we're not talking about
deleting qcow2 and removing support for it.
It's a big deal to them. Users are not experts in qemu image formats.
They will have to learn how to do it, whether they can do it (need to
upgrade all your qemus before you can do it, need to make sure you're
not using qcow2 features, need to be sure you're not planning to use
qcow2 features).

Sure, we'll support qcow2, but will we give it the same attention?
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
Today, users have to choose between performance and reliability or
features. QED offers an opportunity to be able to tell users to
just always use QED as an image format and forget about
raw/qcow2/everything else.
raw will always be needed for direct volume access and shared
storage. qcow2 will always be needed for old images.
My point is that for the future, the majority of people no longer have
to think about "do I need performance more than I need sparse images?".
That can be satisfied with qcow2 + preallocation.
Post by Anthony Liguori
If they have some special use case, fine, but for most people we
simplify their choices.
Post by Avi Kivity
Post by Anthony Liguori
You can say, let's just make qcow2 better, but we've been trying
that for years and we have an existence proof that we can do it in a
straight forward fashion with QED.
When you don't use the extra qcow2 features, it has the same
performance characteristics as qed.
If you're willing to leak blocks on a scale that is still unknown.
Who cares, those aren't real storage blocks.
Post by Anthony Liguori
It's not at all clear that making qcow2 have the same characteristics
as qed is an easy problem. qed is specifically designed to avoid
synchronous metadata updates. qcow2 cannot achieve that.
qcow2 and qed are equivalent if you disregard the refcount table (which
we address by preallocation). Exactly the same technique you use for
sync-free metadata updates in qed can be used for qcow2.
Post by Anthony Liguori
You can *potentially* batch metadata updates by preallocating
clusters, but what's the right amount to preallocate
You look at your write rate and adjust it dynamically so you never wait.
Post by Anthony Liguori
and is it really okay to leak blocks at that scale?
Again, those aren't real blocks. And we're talking power loss anyway.
It's certainly better than requiring fsck for correctness.
Post by Anthony Liguori
It's a weak story either way. There's a burden of proof still
required to establish that this would, indeed, address the performance
concerns.
I don't see why you doubt it so much. Amortization is a well-known
technique for reducing the cost of expensive operations.
Post by Anthony Liguori
Post by Avi Kivity
You need to batch allocation and freeing, but that's fairly
straightforward.
Yes, qcow2 has a long and tortured history and qed is perfect.
Starting from scratch is always easier and more fun. Except for the
users.
The fact that you're basing your argument on "think of the users" is
strange because you're advocating not doing something that is going to
be hugely beneficial for our users.
You misunderstand me. I'm not advocating dropping qed and stopping
qcow2 development. I'm advocating dropping qed and working on qcow2 to
provide the benefits that qed brings.
Post by Anthony Liguori
You're really arguing that we should continue only offering a format
with weak data integrity and even weaker performance.
Those are not properties of the format, only of the implementation.
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
A new format doesn't introduce much additional complexity. We
provide image conversion tool and we can almost certainly provide an
in-place conversion tool that makes the process very fast.
It introduces a lot of complexity for the users who aren't qed
experts. They need to make a decision. What's the impact of the
change? Are the features that we lose important to us? Do we know
what they are? Is there any risk? Can we make the change online or
do we have to schedule downtime? Do all our hosts support qed?
It's very simple. Use qed, convert all existing images. Image
conversion is a part of virtualization. We have tools to do it. If
they want to stick with qcow2 and are happy with it, fine, no one is
advocating removing it.
This simple formula doesn't work if some of your hosts don't support qed
yet. And it's still complicated for users because they have to
understand all of that. "trust me, use qed" is not going to work.

Image conversion is a part of virtualization, yes. A sucky part, we
should try to avoid it.
Post by Anthony Liguori
We can solve all possible problems and have images that users can move
back to arbitrarily old versions of qemu with all of the same
advantages of the newer versions. It's not realistic.
True, but we can do better than replacing the image format.
Post by Anthony Liguori
Post by Avi Kivity
Improving qcow2 will be very complicated for Kevin who already looks
older beyond his years [1] but very simple for users.
I think we're all better off if we move past sunk costs and focus on
solving other problems. I'd rather we all focus on improving
performance and correctness even further than trying to make qcow2 be
as good as what every other hypervisor had 5 years ago.
qcow2 has been a failure. Let's live up to it and move on. Making
statements at each release that qcow2 has issues but we'll fix it soon
just makes us look like we don't know what we're doing.
Switching file formats is a similar statement.
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
all users should use QED even if they care about performance.
Today, there's mass confusion because of the poor state of qcow2.
If we improve qcow2 and make the same strong, clear statement we'll
have the same results.
To be honest, the brand is tarnished. Once something gains a
reputation for having poor integrity, it's very hard to overcome that.
Even if you have Kevin spend the next 6 months rewriting qcow2 from
scratch, I'm going to have a hard time convincing customers trust it.
All someone has to do is look at change logs to see that it has a bad
history. That's more than enough to make people very nervous.
People will be nervous of something completely new (though I agree the
simplicity is a very strong point of qed).
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
IMHO, we're long past exhausting the possibilities with qcow2. We
still haven't decided what we're going to do for 0.13.0.
Sorry, I disagree 100%. How can you say that, when no one has yet
tried, for example, batching allocations and frees? Or properly
threaded it?
We've spent years trying to address problems in qcow2. And Stefan
specifically has spent a good amount of time trying to fix qcow2. I
know you've spent time trying to thread it too. I don't think you
really grasp how difficult of a problem it is to fix qcow2. It's not
just that the code is bad, the format makes something that should be
simple more complicated than it needs to be.
IMO, the real problem is the state machine implementation. Threading it
would make it much simpler. I wish I had the time to go back to do that.

What is specifically so bad about qcow2? The refcount table? It
happens to be necessary for TRIM. Copy-on-write? It's needed for
external snapshots.
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
qcow2 is not a properly designed image format. It was a weekend
hacking session from Fabrice that he dropped in the code base and
never really finished doing what he originally intended. The
improvements that have been made to it are almost at the heroic
level but we're only hurting our users by not moving on to something
better.
I don't like qcow2 either. But from a performance perspective, it
can be made equivalent to qed with some effort. It is worthwhile to
expend that effort rather than push the burden to users.
The choices we have 1) provide our users a format that has high
performance and good data integrity 2) continue to only offer a format
that has poor performance and bad data integrity and promise that
we'll eventually fix it.
We've been doing (2) for too long now. We need to offer a solution to
users today. It's not fair to our users to not offer them a good
solution just because we don't want to admit to previous mistakes.
If someone can fix qcow2 and make it competitive, by all means, please do.
We can have them side by side and choose later based on performance.
Though I fear if qed is merged qcow2 will see no further work.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Anthony Liguori
2010-09-10 14:56:07 UTC
Permalink
Post by Avi Kivity
The current qcow2 implementation, yes. The qcow2 format, no.
The qcow2 format has more writes because it maintains more meta data.
More writes == worse performance.

You claim that you can effectively batch those writes such that the
worse performance will be in the noise. That claim needs to be proven
though because it's purely conjecture right now.

There is a trade off to batching too as you leak address space. If you
have to preallocate 2GB worth of address space to get good performance,
then I'm very sceptical that qcow2 achieves the goals of a sparse file
format. If I do a qemu-img create -f qcow2 foo.img 10GB, and then do a
naive copy of the image file and end up with a 2GB image when there's
nothing in it, that's badness.

And what do you do when you shutdown and start up? You're setting a
reference count on blocks and keeping metadata in memory that those
blocks are really free. Do you need an atexit hook to decrement the
reference counts? Do you need to create a free list structure that gets
written out on close?

Just saying "we can do batching" is not solving the problem. If you
want to claim that the formats are equal, then at the very least you
have to give a very exact description of how this would work because
it's not entirely straightforward.
Post by Avi Kivity
Post by Anthony Liguori
2) qcow2 has historically had data integrity issues. It's unclear
anyone is willing to say that they're 100% confident that there are
still data integrity issues in the format.
Fast forward a few years, no one will be 100% confident there are no
data integrity issues in qed.
I don't think you have any grounds to make such a statement.
Post by Avi Kivity
Post by Anthony Liguori
3) The users I care most about are absolutely uncompromising about
data integrity. There is no room for uncertainty or trade offs when
you're building an enterprise product.
100% in agreement here.
Post by Anthony Liguori
4) We have looked at trying to fix qcow2. It appears to be a
monumental amount of work that starts with a rewrite where it's
unclear if we can even keep supporting all of the special features.
IOW, there is likely to be a need for users to experience some type
of image conversion or optimization process.
I don't see why.
Because you're oversimplifying what it takes to make qcow2 perform well.
Post by Avi Kivity
Post by Anthony Liguori
5) A correct version of qcow2 has terrible performance.
Not inherently.
A "naive" correct version of qcow2 does. Look at the above example. If
you introduce a free list, you change the format which means that you
couldn't support moving an image to an older version.

So just for your batching example, the only compatible approach is to
reduce the reference count on shutdown. But there's definitely a trade
off because a few unclean shut downs could result in a huge image.
Post by Avi Kivity
Post by Anthony Liguori
You need to do a bunch of fancy tricks to recover that performance.
Every fancy trick needs to be carefully evaluated with respect to
correctness. There's a large surface area for potential data
corruptors.
s/large/larger/. The only real difference is the refcount table,
which I agree sucks, but happens to be nice for TRIM support.
I don't see the advantage at all.
Post by Avi Kivity
Post by Anthony Liguori
We're still collecting performance data, but here's an example of
what we're talking about.
FFSB Random Writes MB/s (Block Size=8KB)
             Native    Raw  QCow2    QED
 1 Thread      30.2   24.4   22.7   23.4
 8 Threads    145.1  119.9   10.6  112.9
16 Threads    177.1  139.0   10.1  120.9
The performance difference is an order of magnitude. qcow2 bounces
all requests, needs to issue synchronous metadata updates, and only
supports a single outstanding request at a time.
Those are properties of the implementation, not the format. The
format makes it harder to get it right but doesn't give us a free pass
not to do it.
If the complexity doesn't buy us anything, then why pay the cost of it?

Let's review the purported downsides of QED.

1) It's a new image format. If users create QED images, they can't use
them with older QEMU's. However, if we add a new feature to qcow2, we
have the same problem.

2) If a user has an existing image qcow2 and wants to get the
performance/correctness advantages of QED, they have to convert their
images. That said, in place conversion can tremendously simplify this.

3) Another format adds choice, choice adds complexity. From my
perspective, QED can reduce choice long term because we can tell users
that unless they have a strong reason otherwise, use QED. We cannot do
that with qcow2 today. That may be an implementation detail of qcow2,
but it doesn't change the fact that there's complexity in choosing an
image format today.
Post by Avi Kivity
Post by Anthony Liguori
With good performance and high confidence in integrity, it's a no
brainer as far as I'm concerned. We have a format that it easy to
rationalize as correct, performs damn close to raw. On the other
hand, we have a format that no one is confident that is correct that
is even harder to rationalize as correct, and is an order of
magnitude off raw in performance.
It's really a no brainer.
Sure, because you don't care about users. All of the complexity of
changing image formats (and deciding whether to do that or not) is
hidden away.
Let's not turn this into a "I care more about users than you do"
argument. Changing image formats consists of running a single command.
The command is pretty slow today but we can make it pretty darn fast.
It seems like a relatively small price to pay for a relatively large gain.
Post by Avi Kivity
Post by Anthony Liguori
The impact to users is minimal. Upgrading images to a new format is
not a big deal. This isn't guest visible and we're not talking about
deleting qcow2 and removing support for it.
It's a big deal to them. Users are not experts in qemu image
formats. They will have to learn how to do it, whether they can do it
(need to upgrade all your qemus before you can do it, need to make
sure you're not using qcow2 features, need to be sure you're not
planning to use qcow2 features).
But we can't realistically support users that are using those extra
features today anyway. It's those "features" that are the fundamental
problem.
Post by Avi Kivity
Sure, we'll support qcow2, but will we give it the same attention?
We have a lot of block formats in QEMU today but only one block format
that actually performs well and has good data integrity.

We're not giving qcow2 the attention it would need today to promote it
to a Useful Format so I'm not sure that it really matters.
Post by Avi Kivity
Post by Anthony Liguori
If you're willing to leak blocks on a scale that is still unknown.
Who cares, those aren't real storage blocks.
They are once you move the image from one place to another. If that
doesn't concern you, it really should.
Post by Avi Kivity
Post by Anthony Liguori
It's not at all clear that making qcow2 have the same characteristics
as qed is an easy problem. qed is specifically designed to avoid
synchronous metadata updates. qcow2 cannot achieve that.
qcow2 and qed are equivalent if you disregard the refcount table
(which we address by preallocation). Exactly the same technique you
use for sync-free metadata updates in qed can be used for qcow2.
You cannot ignore the refcount table, that's the point of the discussion.
Post by Avi Kivity
Post by Anthony Liguori
You can *potentially* batch metadata updates by preallocating
clusters, but what's the right amount to preallocate
You look at your write rate and adjust it dynamically so you never wait.
It's never that simple. How long do you look at the write rate? Do you
lower the amount dynamically, if so, after how long? Predicting the
future is never easy.
Post by Avi Kivity
Post by Anthony Liguori
and is it really okay to leak blocks at that scale?
Again, those aren't real blocks. And we're talking power loss
anyway. It's certainly better than requiring fsck for correctness.
They are once you copy the image. And power loss is the same thing as
unexpected exit because you're not simply talking about delaying a
sync; you're talking about staging future I/O operations purely within
QEMU.
Post by Avi Kivity
Post by Anthony Liguori
It's a weak story either way. There's a burden of proof still
required to establish that this would, indeed, address the
performance concerns.
I don't see why you doubt it so much. Amortization is an well known
technique for reducing the cost of expensive operations.
Because there are always limits; otherwise, all expensive operations
would be cheap, and that's not reality.
Post by Avi Kivity
You misunderstand me. I'm not advocating dropping qed and stopping
qcow2 development. I'm advocating dropping qed and working on qcow2
to provide the benefits that qed brings.
If you think qcow2 is fixable, then either 1) fix qcow2 and prove me
wrong, or 2) detail at great length how you would fix qcow2, and prove
me wrong. Either way, the burden of proof is on establishing that qcow2
is fixable.

So far, the proposed fixes are not specific and/or have unacceptable
trade offs. Having a leaking image is not acceptable IMHO because it
potentially becomes something that is guest exploitable.

If a guest finds a SEGV that is not exploitable in any meaningful way
except crashing QEMU, then by leaking data in each crash, a guest can
now grow an image's physical size indefinitely.

This does have real costs in disk space as the underlying file system
does need to deal with metadata, but it's not unrealistic for management
tools to copy images around for various reasons (maybe offline backup).
A reasonable management tool might do planning based on maximum image
size, but now the tools have to cope with (virtually) infinitely large
images.
Post by Avi Kivity
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
A new format doesn't introduce much additional complexity. We
provide image conversion tool and we can almost certainly provide
an in-place conversion tool that makes the process very fast.
It introduces a lot of complexity for the users who aren't qed
experts. They need to make a decision. What's the impact of the
change? Are the features that we lose important to us? Do we know
what they are? Is there any risk? Can we make the change online or
do we have to schedule downtime? Do all our hosts support qed?
It's very simple. Use qed, convert all existing images. Image
conversion is a part of virtualization. We have tools to do it. If
they want to stick with qcow2 and are happy with it, fine, no one is
advocating removing it.
This simple formula doesn't work if some of your hosts don't support
qed yet. And it's still complicated for users because they have to
understand all of that. "trust me, use qed" is not going to work.
Versus what? "Trust me, this time, we've finally fixed qcow2's data
integrity issues" is going to work? That's an uphill battle no matter what.
Post by Avi Kivity
Post by Anthony Liguori
Post by Avi Kivity
Improving qcow2 will be very complicated for Kevin who already looks
older beyond his years [1] but very simple for users.
I think we're all better off if we move past sunk costs and focus on
solving other problems. I'd rather we all focus on improving
performance and correctness even further than trying to make qcow2 be
as good as what every other hypervisor had 5 years ago.
qcow2 has been a failure. Let's live up to it and move on. Making
statements at each release that qcow2 has issues but we'll fix it
soon just makes us look like we don't know what we're doing.
Switching file formats is a similar statement.
It's not an easy thing to do, I'll be the first to admit it. But we
have to do difficult things in the name of progress.

This discussion is an important one to have because we should not do
things of this significance lightly.

But that doesn't mean we should be afraid to make significant changes.
The lack of a useful image format in QEMU today is unacceptable. We
cannot remain satisfied with the status quo.

If you think we can fix qcow2, then fix qcow2. But it's not obvious to
me that it's fixable so if you think it is, you'll need to guide the way.

It's not enough to just wave your hands and say "amortize the
operations". It's not that easy to solve or else we would have solved
it ages ago.
Post by Avi Kivity
IMO, the real problem is the state machine implementation. Threading
it would make it much simpler. I wish I had the time to go back to do
that.
The hard parts of supporting multiple requests in qed had nothing to do
with threading vs. state machine. It was ensuring that all requests had
independent state that didn't depend on a global context. Since the
metadata cache has to be shared, you have to be very careful about
thinking through the semantics of evicting entries from the cache and
bringing entries into the cache.

The concurrency model really doesn't matter.
Post by Avi Kivity
What is specifically so bad about qcow2? The refcount table? It
happens to be necessary for TRIM. Copy-on-write? It's needed for
external snapshots.
The refcount table is not necessary for TRIM. For TRIM, all you need is
one bit of information: whether a block is allocated or not.

With one bit of information, the refcount table is redundant because you
have that same information in the L2 tables. It's harder to obtain but
the fact that it's obtainable means you can have weak semantics with
maintaining a refcount table (IOW, a free list) because it's only an
optimization.
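As a sketch of the "one bit of information" point, this is roughly what
an fsck-style scan could look like: walk every L2 table and set a bit
for each cluster that is referenced; everything else is free. The
structures and field names are invented for illustration and do not
match the real qcow2/QED layouts.

#include <stdint.h>
#include <stdlib.h>

/* Invented in-memory view of an image's lookup tables. */
struct image {
    uint64_t cluster_count;   /* clusters in the image file */
    uint64_t l1_entries;      /* number of L2 tables */
    uint64_t l2_entries;      /* entries per L2 table */
    uint64_t **l2_tables;     /* l2_tables[i][j] = host cluster, 0 = unallocated */
};

/* Rebuild "is this cluster in use?" purely from the L2 tables; anything
 * not marked is free space that can be reclaimed.  (A real scan would
 * also mark the header, L1, and L2 table clusters themselves.) */
static uint8_t *rebuild_allocation_bitmap(const struct image *img)
{
    uint8_t *bitmap = calloc((img->cluster_count + 7) / 8, 1);

    if (!bitmap)
        return NULL;
    for (uint64_t i = 0; i < img->l1_entries; i++) {
        if (!img->l2_tables[i])
            continue;                        /* L2 table not allocated */
        for (uint64_t j = 0; j < img->l2_entries; j++) {
            uint64_t cluster = img->l2_tables[i][j];
            if (cluster && cluster < img->cluster_count)
                bitmap[cluster / 8] |= 1u << (cluster % 8);
        }
    }
    return bitmap;
}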
Post by Avi Kivity
Post by Anthony Liguori
The choices we have 1) provide our users a format that has high
performance and good data integrity 2) continue to only offer a
format that has poor performance and bad data integrity and promise
that we'll eventually fix it.
We've been doing (2) for too long now. We need to offer a solution
to users today. It's not fair to our users to not offer them a good
solution just because we don't want to admit to previous mistakes.
If someone can fix qcow2 and make it competitive, by all means, please do.
We can have them side by side and choose later based on performance.
Though I fear if qed is merged qcow2 will see no further work.
I think that's a weak argument not to merge qed and it's a bad way to
grow a community. We shouldn't prevent useful code from being merged
because there was a previous half-baked implementation. Evolution is
sometimes destructive and that's not a bad thing. Otherwise, I'd still
be working on Xen :-)

We certainly should do our best to ease transition for users. For guest
facing things, we absolutely need to provide full compatibility and
avoid changing guests at all costs.

But upgrading on the host is a part of life. It's the same reason that
every few years, we go from ext2 -> ext3, ext3 -> ext4, ext4 -> btrfs.
It's never pretty but the earth still continues to orbit the sun and we
all seem to get by.

Regards,

Anthony Liguori
Avi Kivity
2010-09-10 15:49:37 UTC
Permalink
Post by Anthony Liguori
Post by Avi Kivity
The current qcow2 implementation, yes. The qcow2 format, no.
The qcow2 format has more writes because it maintains more meta data.
More writes == worse performance.
You claim that you can effectively batch those writes such that the
worse performance will be in the noise. That claim needs to be proven
though because it's purely conjecture right now.
It's based on experience. Why do you think batching allocations will
not improve performance?

In the common case (growing the physical file) allocating involves
writing a '(int64_t)1' to a refcount table. Allocating multiple
contiguous clusters means writing multiple such entries. That's trivial
to batch.
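As an illustration of what "trivial to batch" could look like (the entry
type, byte order, and offsets are placeholders, not the actual qcow2
layout): when N contiguous clusters are allocated at once, their
refcount entries are contiguous too, so they can go out in one write.

#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Placeholder entry type; the real format defines its own entry size,
 * byte order, and table location. */
typedef uint64_t refcount_t;

/* Set the refcounts of 'count' contiguous newly allocated clusters to 1
 * with a single pwrite() instead of one write per cluster.  The caller
 * decides when to sync, so many allocations share one metadata update. */
static int batch_refcount_alloc(int fd, off_t refcount_table_off,
                                uint64_t first_cluster, uint64_t count)
{
    refcount_t *entries = malloc(count * sizeof(*entries));
    ssize_t ret;

    if (!entries)
        return -1;
    for (uint64_t i = 0; i < count; i++)
        entries[i] = 1;
    ret = pwrite(fd, entries, count * sizeof(*entries),
                 refcount_table_off +
                 (off_t)(first_cluster * sizeof(refcount_t)));
    free(entries);
    return ret < 0 ? -1 : 0;
}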
Post by Anthony Liguori
There is a trade off to batching too as you leak address space. If
you have to preallocate 2GB worth of address space to get good
performance, then I'm very sceptical that qcow2 achieves the goals of
a sparse file format.
2GB is 20 seconds worth of writes at 100 MB/s. It's way beyond what's
needed. At a guess I'd say 100ms worth, and of course, only if actively
writing.
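To put numbers on "100ms worth", a toy sketch of sizing the
preallocation window from the observed write rate; the tunables and
names are made up for illustration, not an existing qcow2 or qed
interface.

#include <stdint.h>

/* Hypothetical tunables: keep roughly 100ms worth of clusters allocated
 * ahead of the write pointer, bounded so an unclean shutdown can only
 * leak a limited, known amount. */
#define TARGET_WINDOW_MS    100
#define MIN_PREALLOC_BYTES  (1ULL  * 1024 * 1024)
#define MAX_PREALLOC_BYTES  (64ULL * 1024 * 1024)

/* write_bytes_per_sec is a smoothed measurement kept by the caller; the
 * return value is how far ahead to preallocate right now. */
static uint64_t prealloc_window(uint64_t write_bytes_per_sec)
{
    uint64_t window = write_bytes_per_sec * TARGET_WINDOW_MS / 1000;

    if (window < MIN_PREALLOC_BYTES)
        window = MIN_PREALLOC_BYTES;
    if (window > MAX_PREALLOC_BYTES)
        window = MAX_PREALLOC_BYTES;
    return window;
}

At a sustained 100 MB/s of guest writes this works out to roughly 10 MB
of outstanding preallocation, nowhere near gigabytes.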
Post by Anthony Liguori
If I do a qemu-img create -f qcow2 foo.img 10GB, and then do a naive
copy of the image file and end up with a 2GB image when there's
nothing in it, that's badness.
Only if you crash in the middle. If not, you free the preallocation
during shutdown (or when running a guest, when it isn't actively writing
at 100 MB/s).
Post by Anthony Liguori
And what do you do when you shutdown and start up? You're setting a
reference count on blocks and keeping metadata in memory that those
blocks are really free. Do you need an atexit hook to decrement the
reference counts?
Not atexit, just when we close the image.
Post by Anthony Liguori
Do you need to create a free list structure that gets written out on
close?
Yes, the same freelist that we allocate from. It's an "allocated but
not yet referenced" list.
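A minimal sketch of that "allocated but not yet referenced" list, with
invented names: clusters are preallocated in bulk, handed out from an
in-memory run without touching metadata, and whatever is left over is
released again when the image is closed cleanly.

#include <stdint.h>

/* Invented structure: a run of clusters whose refcounts were bumped on
 * disk in one batch, but which no L2 entry points at yet. */
struct prealloc_run {
    uint64_t first_cluster;
    uint64_t remaining;       /* clusters not yet handed out */
};

/* Allocating a cluster just takes one from the run: no metadata write
 * and no sync on this path. */
static int64_t alloc_cluster(struct prealloc_run *run)
{
    if (run->remaining == 0)
        return -1;            /* caller preallocates the next batch */
    run->remaining--;
    return (int64_t)run->first_cluster++;
}

/* On clean close (or guest quiesce) the unused tail is released again,
 * e.g. by writing its refcounts back to zero and syncing.  Only an
 * unclean shutdown leaks it, and only up to the batch size. */
static void release_unused(struct prealloc_run *run /* , image handle */)
{
    /* decrement refcounts for [first_cluster, first_cluster + remaining) */
    run->remaining = 0;
}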
Post by Anthony Liguori
Just saying "we can do batching" is not solving the problem. If you
want to claim that the formats are equally, then in the very least,
you have to give a very exact description of how this would work
because it's not entirely straight forward.
I thought I did, but I realize it is spread over multiple email
messages. If you like, I can try to summarize it. It will be equally
useful for qed once you add a freelist for UNMAP support.

At least one filesystem I'm aware of does preallocation in this manner.
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
2) qcow2 has historically had data integrity issues. It's unclear
anyone is willing to say that they're 100% confident that there are
still data integrity issues in the format.
Fast forward a few years, no one will be 100% confident there are no
data integrity issues in qed.
I don't think you have any grounds to make such a statement.
No, it's a forward-looking statement. But you're already looking at
adding a freelist for UNMAP support and three levels for larger images.
So it's safe to say that qed will not remain as nice and simple as it is
now.
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
4) We have looked at trying to fix qcow2. It appears to be a
monumental amount of work that starts with a rewrite where it's
unclear if we can even keep supporting all of the special features.
IOW, there is likely to be a need for users to experience some type
of image conversion or optimization process.
I don't see why.
Because you're oversimplifying what it takes to make qcow2 perform well.
Maybe. With all its complexity, it's nowhere near as complex as even
the simplest filesystem. The biggest burden is the state machine design.
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
5) A correct version of qcow2 has terrible performance.
Not inherently.
A "naive" correct version of qcow2 does. Look at the above example.
If you introduce a free list, you change the format which means that
you couldn't support moving an image to an older version.
qcow2 already has a free list, it's the refcount table.
Post by Anthony Liguori
So just for your batching example, the only compatible approach is to
reduce the reference count on shutdown. But there's definitely a
trade off because a few unclean shut downs could result in a huge image.
Not just on shutdown, also on guest quiesce. And yes, many unclean
shutdowns will bloat the image size. Definitely a downside.

The qed solution is to not support UNMAP or qed-on-lvm, and to require
fsck instead. Or to introduce an on-disk freelist, at which point you
get the qcow2 problems back.
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
You need to do a bunch of fancy tricks to recover that performance.
Every fancy trick needs to be carefully evaluated with respect to
correctness. There's a large surface area for potential data corruptors.
s/large/larger/. The only real difference is the refcount table,
which I agree sucks, but happens to be nice for TRIM support.
I don't see the advantage at all.
I can't parse this. You don't see the advantage of TRIM (now UNMAP)?
You don't see the advantage of refcount tables? There isn't any, except
when compared to a format with no freelist which therefore can't support
UNMAP.
Post by Anthony Liguori
Post by Avi Kivity
Those are properties of the implementation, not the format. The
format makes it harder to get it right but doesn't give us a free
pass not to do it.
If the complexity doesn't buy us anything, than why pay the cost of it?
Because of compatibility. Starting from scratch, I'd pick qed, with
three levels and some way to support UNMAP.
Post by Anthony Liguori
Let's review the proported downsides of QED.
1) It's a new image format. If users create QED images, they can't
use them with older QEMU's. However, if we add a new feature to
qcow2, we have the same problem.
Depends. Some features don't need format changes (UNMAP). On the other
hand, qcow2 doesn't have a feature bitmap, which complicates things.
Post by Anthony Liguori
2) If a user has an existing image qcow2 and wants to get the
performance/correctness advantages of QED, they have to convert their
images. That said, in place conversion can tremendously simplify this.
Live conversion would be even better. It's still a user-visible hassle.
Post by Anthony Liguori
3) Another format adds choice, choice adds complexity. From my
perspective, QED can reduce choice long term because we can tell users
that unless they have a strong reason otherwise, use QED. We cannot
do that with qcow2 today. That may be an implementation detail of
qcow2, but it doesn't change the fact that there's complexity in
choosing an image format today.
True.

4) Requires fsck on unclean shutdown

5) No support for qed-on-lvm

6) limited image resize

7) No support for UNMAP

All are fixable, the latter with considerable changes to the format
(allocating from an on-disk freelist requires an intermediate sync step;
if the freelist is not on-disk, you can lose unbounded on-disk storage
on clean shutdown).
Post by Anthony Liguori
Post by Avi Kivity
Sure, because you don't care about users. All of the complexity of
changing image formats (and deciding whether to do that or not) is
hidden away.
Let's not turn this into a "I care more about users than you do"
argument. Changing image formats consists of running a single
command. The command is pretty slow today but we can make it pretty
darn fast. It seems like a relatively small price to pay for a
relatively large gain.
It's true for desktop users. It's not true for large installations.
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
The impact to users is minimal. Upgrading images to a new format is
not a big deal. This isn't guest visible and we're not talking
about deleting qcow2 and removing support for it.
It's a big deal to them. Users are not experts in qemu image
formats. They will have to learn how to do it, whether they can do
it (need to upgrade all your qemus before you can do it, need to make
sure you're not using qcow2 features, need to be sure you're not
planning to use qcow2 features).
But we can't realistically support users that are using those extra
features today anyway.
Why not?
Post by Anthony Liguori
It's those "features" that are the fundamental problem.
I agree some of them (compression, in-image snapshots) are misfeatures.
Post by Anthony Liguori
Post by Avi Kivity
Sure, we'll support qcow2, but will we give it the same attention?
We have a lot of block formats in QEMU today but only one block format
that actually performs well and has good data integrity.
We're not giving qcow2 the attention it would need today to promote it
to a Useful Format so I'm not sure that it really matters.
I don't think it's so useless. It's really only slow when allocating,
yes? Once you've allocated it is fully async IIRC.

So even today qcow2 is only slow at the start of the lifetime of the image.
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
If you're willing to leak blocks on a scale that is still unknown.
Who cares, those aren't real storage blocks.
They are once you move the image from one place to another. If that
doesn't concern you, it really should.
I don't see it as a huge problem, certainly less than fsck. If you think
fsck is a smaller hit, you can use it to recover the space.

Hm, you could have an 'unclean shutdown' bit in qcow2 and run a scrubber
in the background if you see it set and recover the space.
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
It's not at all clear that making qcow2 have the same
characteristics as qed is an easy problem. qed is specifically
designed to avoid synchronous metadata updates. qcow2 cannot
achieve that.
qcow2 and qed are equivalent if you disregard the refcount table
(which we address by preallocation). Exactly the same technique you
use for sync-free metadata updates in qed can be used for qcow2.
You cannot ignore the refcount table, that's the point of the discussion.
#include "I'm using preallocation to reduce its cost".
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
You can *potentially* batch metadata updates by preallocating
clusters, but what's the right amount to preallocate
You look at your write rate and adjust it dynamically so you never wait.
It's never that simple. How long do you look at the write rate? Do
you lower the amount dynamically, if so, after how long? Predicting
the future is never easy.
No, it's not easy. But you have to do it in qed as well, if you want to
avoid fsck.
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
and is it really okay to leak blocks at that scale?
Again, those aren't real blocks. And we're talking power loss
anyway. It's certainly better than requiring fsck for correctness.
They are once you copy the image. And power loss is the same thing as
unexpected exit because you're not simply talking about delaying a
sync, you're talking staging future I/O operations purely within QEMU.
qed is susceptible to the same problem. If you have a 100MB write and
qemu exits before it updates L2s, then those 100MB are leaked. You
could alleviate the problem by writing L2 at intermediate points, but
even then, a power loss can leak those 100MB.

qed trades off the freelist for the file size (anything beyond the file
size is free), it doesn't eliminate it completely. So you still have
some of its problems, but you don't get its benefits.
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
It's a weak story either way. There's a burden of proof still
required to establish that this would, indeed, address the
performance concerns.
I don't see why you doubt it so much. Amortization is an well known
technique for reducing the cost of expensive operations.
Because there are always limits, otherwise, all expensive operations
would be cheap, and that's not reality.
Well, I guess we won't get anywhere with a theoretical discussion here.
Post by Anthony Liguori
Post by Avi Kivity
You misunderstand me. I'm not advocating dropping qed and stopping
qcow2 development. I'm advocating dropping qed and working on qcow2
to provide the benefits that qed brings.
If you think qcow2 is fixable, than either 1) fix qcow2 and prove me
wrong 2) detail in great length how you would fix qcow2, and prove me
wrong. Either way, the burden of proof is on establishing that qcow2
is fixable.
I agree the burden of proof is on me (I'm just going to bounce it off to
Kevin). Mere words shouldn't be used to block off new work.
Post by Anthony Liguori
So far, the proposed fixes are not specific and/or have unacceptable
trade offs.
I thought they were quite specific. I'll try to summarize them in one
place so at least they're not lost.
Post by Anthony Liguori
Having a leaking image is not acceptable IMHO because it potentially
becomes something that is guest exploitable.
If a guest finds a SEGV that is not exploitable in any meaningful way
accept crashing QEMU, by leaking data in each crash, a guest can now
grow an image's virtual size indefinitely.
This does have real costs in disk space as the underlying file system
does need to deal with metadata, but it's not unrealistic for
management tools to copy images around for various reasons (maybe
offline backup). A reasonable management tool might do planning based
on maximum image size, but now the tools have to cope with (virtually)
infinitely large images.
The qed solution is fsck, which is a lot worse IMO.
Post by Anthony Liguori
Post by Avi Kivity
This simple formula doesn't work if some of your hosts don't support
qed yet. And it's still complicated for users because they have to
understand all of that. "trust me, use qed" is not going to work.
Verses what? "Trust me, this time, we've finally fixed qcow2's data
integrity issues" is going to work? That's an uphill battle no matter what.
We have to fix qcow2 anyway, since we can't ensure users do upgrade to qed.
Post by Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
qcow2 has been a failure. Let's live up to it and move on. Making
statements at each release that qcow2 has issues but we'll fix it
soon just makes us look like we don't know what we're doing.
Switching file formats is a similar statement.
It's not an easy thing to do, I'll be the first to admit it. But we
have to do difficult things in the name of progress.
This discussion is an important one to have because we should not do
things of this significance lightly.
But that doesn't mean we should be afraid to make significant
changes. The lack of a useful image format in QEMU today is
unacceptable. We cannot remain satisfied with the status quo.
If you think we can fix qcow2, then fix qcow2. But it's not obvious
to me that it's fixable so if you think it is, you'll need to guide
the way.
I'm willing to list the things I think should be done. But someone else
will have to actually do them and someone else will have to allocate the
time for this work, which is not going to be insignificant.
Post by Anthony Liguori
It's not enough to just wave your hands and say "amortize the
expensive operations". It's not that easy to solve or else we would
have solved it ages ago.
We were rightly focusing on data integrity first.
Post by Anthony Liguori
Post by Avi Kivity
IMO, the real problem is the state machine implementation. Threading
it would make it much simpler. I wish I had the time to go back to
do that.
The hard parts of supporting multiple requests in qed had nothing to do
with threading vs. state machine. It was ensuring that all requests
had independent state that didn't depend on a global context. Since
the metadata cache has to be shared, you have to be very
careful about thinking through the semantics of evicting entries from
the cache and bringing entries into the cache.
The concurrency model really doesn't matter.
I disagree. When you want to order dependent operations with threads,
you stick a mutex in the data structure that needs serialization. The
same problem with a state machine means collecting all the state in the
call stack, sticking it in a dependency chain, and scheduling a restart
when the first operation completes. It's a lot more code.
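To illustrate the contrast, here is a minimal sketch of the threaded
approach (the structure and field names below are made up for the
example, not taken from the qed or qcow2 code): the lock travels with
the shared data, and dependent updates simply serialize on it.

    #include <pthread.h>
    #include <stdint.h>

    /* Hypothetical shared L2 cache entry: the mutex lives in the data
     * structure that needs serialization. */
    struct l2_entry {
        pthread_mutex_t lock;
        uint64_t offsets[512];
    };

    static void set_mapping(struct l2_entry *e, int index, uint64_t cluster)
    {
        pthread_mutex_lock(&e->lock);
        e->offsets[index] = cluster;    /* ordered w.r.t. other updaters */
        pthread_mutex_unlock(&e->lock);
    }

The state-machine equivalent has to express the same ordering explicitly
as queued requests and completion callbacks, which is the extra code
referred to above.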
Post by Anthony Liguori
Post by Avi Kivity
What is specifically so bad about qcow2? The refcount table? It
happens to be necessary for TRIM. Copy-on-write? It's needed for
external snapshots.
The refcount table is not necessary for TRIM. For TRIM, all you need
is one bit of information, whether a block is allocated or not.
With one bit of information, the refcount table is redundant because
you have that same information in the L2 tables. It's harder to
obtain but the fact that it's obtainable means you can have weak
semantics with maintaining a refcount table (IOW, a free list) because
it's only an optimization.
Well, the refcount table is also redundant wrt qcow2's L2 tables. You
can always reconstruct it with an fsck.

You store 64 bits vs 1 bit (or less if you use an extent based format,
or only store allocated blocks) but essentially it has the same
requirements.
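To make the fsck/leak-scan idea concrete, a rough sketch (the read_l2()
helper and the metadata accounting are assumptions for illustration, not
the proposed qed code): walk L1, mark every cluster referenced from an
L2 table, and count allocated-but-unreferenced clusters as leaked.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* Count clusters that exist in the file but are referenced by no L2
     * entry.  L1 and L2 tables have the same number of entries in qed. */
    static size_t count_leaked(const uint64_t *l1, size_t table_entries,
                               uint64_t file_size, uint32_t cluster_size,
                               const uint64_t *(*read_l2)(uint64_t offset))
    {
        size_t nclusters = file_size / cluster_size;
        bool *referenced = calloc(nclusters, sizeof(*referenced));
        size_t leaked = 0;

        /* (Marking the header, L1, and L2 clusters themselves is omitted.) */
        for (size_t i = 0; i < table_entries; i++) {
            if (!l1[i]) {
                continue;                      /* L2 table not allocated */
            }
            const uint64_t *l2 = read_l2(l1[i]);
            for (size_t j = 0; j < table_entries; j++) {
                uint64_t c = l2[j] / cluster_size;
                if (l2[j] && c < nclusters) {
                    referenced[c] = true;      /* cluster is in use */
                }
            }
        }
        for (size_t c = 0; c < nclusters; c++) {
            leaked += !referenced[c];
        }
        free(referenced);
        return leaked;
    }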
Post by Anthony Liguori
Post by Avi Kivity
We can have them side by side and choose later based on performance.
Though I fear if qed is merged qcow2 will see no further work.
I think that's a weak argument not to merge qed and it's a bad way to
grow a community.
Certainly, it's open source and we should encourage new ideas. But I'm
worried that when qed grows for a while it will become gnarly, and we'll
lose some of the benefit, while creating user confusion.
Post by Anthony Liguori
We shouldn't prevent useful code from being merged because there was a
previous half-baked implementation. Evolution is sometimes
destructive and that's not a bad thing. Otherwise, I'd still be
working on Xen :-)
We certainly should do our best to ease transition for users. For
guest facing things, we absolutely need to provide full compatibility
and avoid changing guests at all costs.
But upgrading on the host is a part of life. It's the same reason
that every few years, we go from ext2 -> ext3, ext3 -> ext4, ext4 ->
btrfs. It's never pretty but the earth still continues to orbit the
sun and we all seem to get by.
ext[234] is more like qcow2 evolution. qcow2->qed is more similar to
ext4->btrfs, but compare the huge feature set difference between ext4
and btrfs, and qcow2 and qed.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Anthony Liguori
2010-09-07 16:12:15 UTC
Permalink
Post by Avi Kivity
Post by Anthony Liguori
/* if (features & QED_F_BACKING_FILE) */
uint32_t backing_file_offset; /* in bytes from start of header */
uint32_t backing_file_size; /* in bytes */
It's really the filename size, not the file size. Also, make a note
that it is not zero terminated.
Post by Anthony Liguori
/* if (compat_features & QED_CF_BACKING_FORMAT) */
uint32_t backing_fmt_offset; /* in bytes from start of header */
uint32_t backing_fmt_size; /* in bytes */
Why not make it mandatory?
You mean, why not make it:

/* if (features & QED_F_BACKING_FILE) */

As opposed to an independent compat feature. Mandatory features mean
that you cannot read an image format if you don't understand the
feature. In the context of backing_format, it means you have to have
all of the possible values fully defined.

IOW, what are valid values for backing_fmt? "raw" and "qed" are obvious
but what does it mean from a formal specification perspective to have
"vmdk"? Is that VMDK v3 or v4, what if there's a v5?

If we make backing_fmt a suggestion, it gives us flexibility to leave
this loosely defined, while the implementation can fall back to probing if
there's any doubt.

For the spec, I'd like to define "raw" and "qed". I'd like to modify
the qemu implementation to refuse to load an image as raw unless
backing_fmt is raw, and otherwise just probe.

For image creation, if an explicit backing format isn't specified by the
user, I'd like to insert backing_fmt=raw for probed raw images and
otherwise, not specify a backing_fmt.
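As a sketch of that policy (the function below is illustrative, not the
actual qemu code): treat backing_fmt as a hint, trust it when it names a
format defined by the spec, and fall back to probing otherwise.

    #include <stddef.h>
    #include <string.h>

    /* Returns the driver name to open the backing file with, or NULL to
     * mean "probe the image".  Probing must never yield raw, since raw
     * cannot be identified reliably. */
    static const char *backing_format_hint(const char *backing_fmt)
    {
        if (backing_fmt == NULL) {
            return NULL;                   /* no hint recorded: probe */
        }
        if (strcmp(backing_fmt, "raw") == 0 || strcmp(backing_fmt, "qed") == 0) {
            return backing_fmt;            /* formats defined by the spec */
        }
        return NULL;                       /* unknown hint: probe */
    }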

Regards,

Anthony Liguori
Post by Avi Kivity
Post by Anthony Liguori
}
Need a checksum for the header.
Post by Anthony Liguori
==Extent table==
#define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
Table {
uint64_t offsets[TABLE_NOFFSETS];
}
It's fashionable to put checksums here.
Do we want a real extent-based format like modern filesystems? So
after defragmentation a full image has O(1) metadata?
Post by Anthony Liguori
+----------+
| L1 table |
+----------+
,------' | '------.
+----------+ | +----------+
| L2 table | ... | L2 table |
+----------+ +----------+
,------' | '------.
+----------+ | +----------+
| Data | ... | Data |
+----------+ +----------+
The table_size field allows tables to be multiples of the cluster
size. For example, cluster_size=64 KB and table_size=4 results in
256 KB tables.
=Operations=
==Read==
# If L2 table is not present in L1, read from backing image.
# If data cluster is not present in L2, read from backing image.
# Otherwise read data from cluster.
If not in backing image, provide zeros
Post by Anthony Liguori
==Write==
# If L2 table is not present in L1, allocate new cluster and L2.
Perform L2 and L1 link after writing data.
# If data cluster is not present in L2, allocate new cluster.
Perform L1 link after writing data.
# Otherwise overwrite data cluster.
Detail copy-on-write from backing image.
On a partial write without a backing file, do we recommend
zero-filling the cluster (to avoid intra-cluster fragmentation)?
Post by Anthony Liguori
The L2 link '''should''' be made after the data is in place on
storage. However, when no ordering is enforced the worst case
scenario is an L2 link to an unwritten cluster.
Or it may cause corruption if the physical file size is not committed,
and L2 now points at a free cluster.
Post by Anthony Liguori
The L1 link '''must''' be made after the L2 cluster is in place on
storage. If the order is reversed then the L1 table may point to a
bogus L2 table. (Is this a problem since clusters are allocated at
the end of the file?)
==Grow==
# If table_size * TABLE_NOFFSETS < new_image_size, fail -EOVERFLOW.
The L1 table is not big enough.
With a variable-height tree, we allocate a new root, link its first
entry to the old root, and write the new header with updated root and
height.
Post by Anthony Liguori
# Write new image_size header field.
=Data integrity=
==Write==
Writes that complete before a flush must be stable when the flush
completes.
If storage is interrupted (e.g. power outage) then writes in progress
may be lost, stable, or partially completed. The storage must not be
otherwise corrupted or inaccessible after it is restarted.
We can remove this requirement by copying-on-write any metadata write,
and keeping two copies of the header (with version numbers and
checksums). Enterprise storage will not corrupt on writes, but
commodity storage may.
Christoph Hellwig
2010-09-07 21:35:24 UTC
Permalink
Post by Anthony Liguori
IOW, what are valid values for backing_fmt? "raw" and "qed" are obvious
but what does it mean from a formal specification perspective to have
"vmdk"? Is that VMDK v3 or v4, what if there's a v5?
It might be better to just use a uint16_t field for the backing format,
where each valid format gets a bit position assigned. For now just raw,
qed and qcow2 would be enough.
Anthony Liguori
2010-09-07 22:29:53 UTC
Permalink
Post by Christoph Hellwig
Post by Anthony Liguori
IOW, what are valid values for backing_fmt? "raw" and "qed" are obvious
but what does it mean from a formal specification perspective to have
"vmdk"? Is that VMDK v3 or v4, what if there's a v5?
It might be better to just use a uint16_t field for the backing format,
where each valid format gets a bit position assigned. For now just raw,
qed and qcow2 would be enough.
If it were just one bit for just raw or not raw, wouldn't that be enough?

Everything that isn't raw can be probed reliably so we really only need
to distinguish between things that are probe-able and things that are
not probe-able.

Regards,

Anthony Liguori
Christoph Hellwig
2010-09-07 22:40:06 UTC
Permalink
Post by Anthony Liguori
If it were just one bit for just raw or not raw, wouldn't that be enough?
Everything that isn't raw can be probed reliably so we really only need
to distinguish between things that are probe-able and things that are
not probe-able.
That might work as well. The important point is not to encode the
formats as strings, which are not a very useful portable encoding.
Stefan Hajnoczi
2010-09-08 15:07:46 UTC
Permalink
Post by Avi Kivity
Post by Anthony Liguori
Another point worth mentioning is that our intention is to have a formal
specification of the format before merging.  A start of that is located at
http://wiki.qemu.org/Features/QED
=Specification=
 +---------+---------+---------+-----+
 | extent0 | extent1 | extent2 | ... |
 +---------+---------+---------+-----+
The first extent contains a header.  The header contains information about
the first data extent.  A data extent may be a data cluster, an L2, or an L1
table.  L1 and L2 tables are composed of one or more contiguous extents.
==Header==
 Header {
    uint32_t magic;               /* QED\0 */
Endianness?
Little-endian for all metadata. Updated on wiki page.
Post by Avi Kivity
Post by Anthony Liguori
    uint32_t cluster_size;        /* in bytes */
Does cluster == extent?  If so, use the same terminology.  If not, explain.
Usually an extent is a variable-size structure.
QED does not use extents. It uses fixed size clusters, 64 KB by
default but configurable at image creation time. The wiki page has
been fleshed out more to describe the cluster-based layout.
Post by Avi Kivity
Post by Anthony Liguori
    uint32_t table_size;          /* table size, in clusters */
Presumably L1 table size?  Or any table size?
Hm.  It would be nicer not to require contiguous sectors anywhere.  How
about a variable- or fixed-height tree?
Both extents and fancier trees don't fit the philosophy, which is to
keep things straightforward and fast by doing less. With extents and
trees you've got something that looks much more like a full-blown
filesystem. Is there an essential feature or characteristic that QED
cannot provide in its current design?
Post by Avi Kivity
Post by Anthony Liguori
    uint32_t first_cluster;       /* in clusters */
First cluster of what?
Post by Anthony Liguori
    uint64_t features;            /* format feature bits */
    uint64_t compat_features;     /* compat feature bits */
    uint64_t l1_table_offset;     /* L1 table offset, in clusters */
    uint64_t image_size;          /* total image size, in clusters */
Logical, yes?
Yes. Wiki updated.
Post by Avi Kivity
Is the physical image size always derived from the host file metadata?  Is
this always safe?
In my email summarizing crash scenarios and recovery we cover the
bases and I think it is safe to rely on file size as physical image
size. The drawback is that you need a host filesystem and cannot
directly use a bare block device. I think that is acceptable for a
sparse format, otherwise we'd be using raw.
Post by Avi Kivity
Post by Anthony Liguori
    /* if (features & QED_F_BACKING_FILE) */
    uint32_t backing_file_offset; /* in bytes from start of header */
    uint32_t backing_file_size;   /* in bytes */
It's really the filename size, not the file size.  Also, make a note that it
is not zero terminated.
Fixed both on wiki.
Post by Avi Kivity
Post by Anthony Liguori
    /* if (compat_features & QED_CF_BACKING_FORMAT) */
    uint32_t backing_fmt_offset;  /* in bytes from start of header */
    uint32_t backing_fmt_size;    /* in bytes */
Why not make it mandatory?
Post by Anthony Liguori
 }
Need a checksum for the header.
Post by Anthony Liguori
==Extent table==
 #define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
 Table {
    uint64_t offsets[TABLE_NOFFSETS];
 }
It's fashionable to put checksums here.
Do we want a real extent-based format like modern filesystems?  So after
defragmentation a full image has O(1) metadata?
Post by Anthony Liguori
                   +----------+
                   | L1 table |
                   +----------+
              ,------'  |  '------.
         +----------+   |    +----------+
         | L2 table |  ...   | L2 table |
         +----------+        +----------+
     ,------'  |  '------.
 +----------+   |    +----------+
 |   Data   |  ...   |   Data   |
 +----------+        +----------+
The table_size field allows tables to be multiples of the cluster size.
 For example, cluster_size=64 KB and table_size=4 results in 256 KB tables.
=Operations=
==Read==
# If L2 table is not present in L1, read from backing image.
# If data cluster is not present in L2, read from backing image.
# Otherwise read data from cluster.
If not in backing image, provide zeros
Wiki updated.
Post by Avi Kivity
Post by Anthony Liguori
==Write==
# If L2 table is not present in L1, allocate new cluster and L2.  Perform
L2 and L1 link after writing data.
# If data cluster is not present in L2, allocate new cluster.  Perform L1
link after writing data.
# Otherwise overwrite data cluster.
Detail copy-on-write from backing image.
On a partial write without a backing file, do we recommend zero-filling the
cluster (to avoid intra-cluster fragmentation)?
Currently zeroes are written, and with the 64 KB cluster size that
hopefully isn't too painful.
Post by Avi Kivity
Post by Anthony Liguori
The L2 link '''should''' be made after the data is in place on storage.
 However, when no ordering is enforced the worst case scenario is an L2 link
to an unwritten cluster.
Or it may cause corruption if the physical file size is not committed, and
L2 now points at a free cluster.
Post by Anthony Liguori
The L1 link '''must''' be made after the L2 cluster is in place on
storage.  If the order is reversed then the L1 table may point to a bogus L2
table.  (Is this a problem since clusters are allocated at the end of the
file?)
==Grow==
# If table_size * TABLE_NOFFSETS < new_image_size, fail -EOVERFLOW.  The
L1 table is not big enough.
With a variable-height tree, we allocate a new root, link its first entry to
the old root, and write the new header with updated root and height.
Post by Anthony Liguori
# Write new image_size header field.
=Data integrity=
==Write==
Writes that complete before a flush must be stable when the flush
completes.
If storage is interrupted (e.g. power outage) then writes in progress may
be lost, stable, or partially completed.  The storage must not be otherwise
corrupted or inaccessible after it is restarted.
We can remove this requirement by copying-on-write any metadata write, and
keeping two copies of the header (with version numbers and checksums).
 Enterprise storage will not corrupt on writes, but commodity storage may.
--
error compiling committee.c: too many arguments to function
Avi Kivity
2010-09-09 06:59:07 UTC
Permalink
Post by Stefan Hajnoczi
Post by Avi Kivity
Post by Anthony Liguori
uint32_t table_size; /* table size, in clusters */
Presumably L1 table size? Or any table size?
Hm. It would be nicer not to require contiguous sectors anywhere. How
about a variable- or fixed-height tree?
Both extents and fancier trees don't fit the philosophy, which is to
keep things straightforward and fast by doing less. With extents and
trees you've got something that looks much more like a full-blown
filesystem. Is there an essential feature or characteristic that QED
cannot provide in its current design?
Not using extents means that random workloads on very large disks will
continuously need to page in L2s (which are quite large, 256KB is large
enough that you need to account for read time, not just seek time).
Keeping it to two levels means that the image size is limited, not very
good for an image format designed in 2010.
Post by Stefan Hajnoczi
Post by Avi Kivity
Is the physical image size always derived from the host file metadata? Is
this always safe?
In my email summarizing crash scenarios and recovery we cover the
bases and I think it is safe to rely on file size as physical image
size. The drawback is that you need a host filesystem and cannot
directly use a bare block device. I think that is acceptable for a
sparse format, otherwise we'd be using raw.
Hm, we do have a use case for qcow2-over-lvm. I can't say it's
something I like, but a point to consider.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Anthony Liguori
2010-09-09 17:43:28 UTC
Permalink
Post by Avi Kivity
Post by Stefan Hajnoczi
Post by Avi Kivity
Post by Anthony Liguori
uint32_t table_size; /* table size, in clusters */
Presumably L1 table size? Or any table size?
Hm. It would be nicer not to require contiguous sectors anywhere. How
about a variable- or fixed-height tree?
Both extents and fancier trees don't fit the philosophy, which is to
keep things straightforward and fast by doing less. With extents and
trees you've got something that looks much more like a full-blown
filesystem. Is there an essential feature or characteristic that QED
cannot provide in its current design?
Not using extents means that random workloads on very large disks will
continuously need to page in L2s (which are quite large, 256KB is
large enough that you need to account for read time, not just seek
time). Keeping it to two levels means that the image size is limited,
not very good for an image format designed in 2010.
Define "very large disks".

My target for VM images is 100GB-1TB. Practically speaking, that at
least covers us for the next 5 years.

Since QED has rich support for features, we can continue to evolve the
format over time in a backwards-compatible way. I'd rather delay
supporting massively huge disks until we better understand the
true nature of the problem.
Post by Avi Kivity
Post by Stefan Hajnoczi
Post by Avi Kivity
Is the physical image size always derived from the host file
metadata? Is
this always safe?
In my email summarizing crash scenarios and recovery we cover the
bases and I think it is safe to rely on file size as physical image
size. The drawback is that you need a host filesystem and cannot
directly use a bare block device. I think that is acceptable for a
sparse format, otherwise we'd be using raw.
Hm, we do have a use case for qcow2-over-lvm. I can't say it's
something I like, but a point to consider.
We specifically are not supporting that use-case in QED today. There's
a good reason for it. For cluster allocation, we achieve good
performance because for L2 cluster updates, we can avoid synchronous
metadata updates (except for L1 updates).

We achieve synchronous metadata updates by leveraging the underlying
filesystem's metadata. The underlying filesystems are much smarter
about their metadata updates. They'll keep a journal to delay
synchronous updates and other fancy things.

If we tried to represent the disk size in the header, we would have to
do an fsync() on every cluster allocation.

I can only imagine the use case for qcow2-over-lvm is performance. But
the performance of QED on a file system is so much better than qcow2
that you can safely just use a file system and avoid the complexity of
qcow2 over lvm.

Regards,

Anthony Liguori
Christoph Hellwig
2010-09-09 20:46:21 UTC
Permalink
Post by Anthony Liguori
Define "very large disks".
My target for VM images is 100GB-1TB. Practically speaking, that at
least covers us for the next 5 years.
We have 2TB SATA disks shipping already, and people tend to produce
more and more "data". I don't think adding such a limit these days
is a good idea at all. It's fine to limit the (tested) implementation
to around 100TB for now, but designing a new image format that doesn't
reach into the petabyte range today is extremely short-sighted.
Post by Anthony Liguori
I can only imagine the use case for qcow2-over-lvm is performance. But
the performance of QED on a file system is so much better than qcow2
that you can safely just use a file system and avoid the complexity of
qcow2 over lvm.
A volume manager has many advantages over an image format. For one it
allows much larger extent allocation sizes, giving you much less
fragmentation. There's also lots of infrastructure for dealing with it.

Last but not least, using clustered lvm is much simpler than a clustered
filesystem.
Avi Kivity
2010-09-10 11:22:30 UTC
Permalink
Post by Anthony Liguori
Post by Avi Kivity
Hm, we do have a use case for qcow2-over-lvm. I can't say it's
something I like, but a point to consider.
We specifically are not supporting that use-case in QED today.
There's a good reason for it. For cluster allocation, we achieve good
performance because for L2 cluster updates, we can avoid synchronous
metadata updates (except for L1 updates).
As I've mentioned several times, if you preallocate, then you amortize
that cost of keeping track of the physical image size.
Post by Anthony Liguori
We achieve synchronous metadata updates by leveraging the underlying
filesystem's metadata. The underlying filesystems are much smarter
about their metadata updates. They'll keep a journal to delay
synchronous updates and other fancy things.
They only guarantee that the filesystem is consistent. A write() that
extends a file may be reordered with the L2 write() that references the
new cluster. Requiring fsck on unclean shutdown is very backwards for
a 2010 format.
Post by Anthony Liguori
If we tried to represent the disk size in the header, we would have to
do an fsync() on every cluster allocation.
On every N cluster allocations.
Post by Anthony Liguori
I can only imagine the use case for qcow2-over-lvm is performance.
But the performance of QED on a file system is so much better than
qcow2 that you can safely just use a file system and avoid the
complexity of qcow2 over lvm.
qcow2 over lvm is typically used on clusters.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Stefan Hajnoczi
2010-09-10 11:29:05 UTC
Permalink
Hm, we do have a use case for qcow2-over-lvm.  I can't say it's something
I like, but a point to consider.
We specifically are not supporting that use-case in QED today.  There's a
good reason for it.  For cluster allocation, we achieve good performance
because for L2 cluster updates, we can avoid synchronous metadata updates
(except for L1 updates).
As I've mentioned several times, if you preallocate, then you amortize that
cost of keeping track of the physical image size.
We achieve synchronous metadata updates by leveraging the underlying
filesystem's metadata.  The underlying filesystems are much smarter about
their metadata updates.  They'll keep a journal to delay synchronous updates
and other fancy things.
They only guarantee that the filesystem is consistent.  A write() that
extends a file may be reordered with the L2 write() that references the new
cluster.  Requiring fsck on  unclean shutdown is very backwards for a 2010
format.
I'm interested in understanding how preallocation will work in a way
that does not introduce extra flushes in the common case or require
fsck.

It seems to me that you can either preallocate and then rely on an
fsck on startup to figure out which clusters are now really in use, or
you can keep an exact max_cluster but this requires an extra write
operation for each allocating write (and perhaps a flush?).

Can you go into more detail in how preallocation should work?
If we tried to represent the disk size in the header, we would have to do
an fsync() on every cluster allocation.
On every N cluster allocations.
I can only imagine the use case for qcow2-over-lvm is performance.  But
the performance of QED on a file system is so much better than qcow2 that
you can safely just use a file system and avoid the complexity of qcow2 over
lvm.
qcow2 over lvm is typically used on clusters.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Avi Kivity
2010-09-10 11:37:55 UTC
Permalink
Post by Stefan Hajnoczi
Post by Avi Kivity
They only guarantee that the filesystem is consistent. A write() that
extends a file may be reordered with the L2 write() that references the new
cluster. Requiring fsck on unclean shutdown is very backwards for a 2010
format.
I'm interested in understanding how preallocation will work in a way
that does not introduce extra flushes in the common case or require
fsck.
It seems to me that you can either preallocate and then rely on an
fsck on startup to figure out which clusters are now really in use, or
you can keep an exact max_cluster but this requires an extra write
operation for each allocating write (and perhaps a flush?).
Can you go into more detail in how preallocation should work?
You simply leak the preallocated clusters.

That's not as bad as it sounds - if you never write() the clusters they
don't occupy any space on disk, so you only leak address space, not
actual storage. If you copy the image then you actually do lose storage.

If you really wanted to recover the lost storage you could start a
thread in the background that looks for unallocated blocks. Unlike
fsck, you don't have to wait for it since data integrity does not depend
on it. I don't think it's worthwhile, though.

Another game you can play with preallocation is varying the window
with the workload: start with no preallocation, and as the guest starts
to allocate, increase the window. When the guest goes idle
again you can return the storage to the operating system and reduce the
window back to zero.
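Purely as an illustration of that window idea (nothing below is in the
qed patch; the header field and the write_header_limit() helper are
hypothetical): record a preallocated physical-size limit that runs ahead
of real allocations, so one header sync covers many cluster allocations,
and shrink the window when the guest goes idle.

    #include <stdint.h>

    struct alloc_state {
        uint64_t next_cluster;   /* next free cluster, in-memory only */
        uint64_t header_limit;   /* preallocated size recorded on disk */
        uint64_t window;         /* current preallocation window, clusters */
    };

    /* Assumed helper: rewrite the header's physical-size field and flush. */
    void write_header_limit(uint64_t limit);

    uint64_t alloc_cluster(struct alloc_state *s)
    {
        if (s->next_cluster >= s->header_limit) {
            /* Guest is allocating: widen the window so the next `window`
             * allocations need no further header update. */
            s->window = s->window ? s->window * 2 : 128;
            s->header_limit = s->next_cluster + s->window;
            write_header_limit(s->header_limit);   /* the amortized sync */
        }
        return s->next_cluster++;
    }

    void on_guest_idle(struct alloc_state *s)
    {
        /* Drop the recorded limit back to what is actually in use, so a
         * later crash leaks at most the clusters of one active window. */
        s->window = 0;
        s->header_limit = s->next_cluster;
        write_header_limit(s->header_limit);
    }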
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Avi Kivity
2010-09-07 13:58:37 UTC
Permalink
Post by Stefan Hajnoczi
QEMU Enhanced Disk format is a disk image format that forgoes features
found in qcow2 in favor of better levels of performance and data
integrity. Due to its simpler on-disk layout, it is possible to safely
perform metadata updates more efficiently.
Installations, suspend-to-disk, and other allocation-heavy I/O workloads
will see increased performance due to fewer I/Os and syncs. Workloads
that do not cause new clusters to be allocated will perform similar to
raw images due to in-memory metadata caching.
The format supports sparse disk images. It does not rely on the host
filesystem holes feature, making it a good choice for sparse disk images
that need to be transferred over channels where holes are not supported.
Backing files are supported so only deltas against a base image can be
stored.
The file format is extensible so that additional features can be added
later with graceful compatibility handling.
Internal snapshots are not supported. This eliminates the need for
additional metadata to track copy-on-write clusters.
Compression and encryption are not supported. They add complexity and
can be implemented at other layers in the stack (i.e. inside the guest
or on the host).
* Resizing the disk image. The capability has been designed in but the
code has not been written yet.
* Resetting the image after backing file commit completes.
* Changing the backing filename.
* Consistency check (fsck). This is simple due to the on-disk layout.
---
This code is also available from git (for development and testing the tracing
and blkverify features are pulled in, whereas this single squashed patch
http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/qed
Numbers for RHEL6 install, cache=none disk image on ext3. This is an
interactive install on my laptop, so not a proper benchmark but I want to show
* raw: 4m4s
* qed: 4m21s (107%)
* qcow2: 4m46s (117%)
Makefile.objs | 1 +
block/qcow2.c | 22 -
block/qed-cluster.c | 136 +++++++
block/qed-gencb.c | 32 ++
block/qed-l2-cache.c | 131 ++++++
block/qed-table.c | 242 +++++++++++
block/qed.c | 1103 ++++++++++++++++++++++++++++++++++++++++++++++++++
block/qed.h | 212 ++++++++++
docs/qed.txt ++++++++++++++++++++++++++++++++++++++++++++

?
--
error compiling committee.c: too many arguments to function
Blue Swirl
2010-09-07 19:25:54 UTC
Permalink
On Mon, Sep 6, 2010 at 10:04 AM, Stefan Hajnoczi
Post by Stefan Hajnoczi
QEMU Enhanced Disk format is a disk image format that forgoes features
found in qcow2 in favor of better levels of performance and data
integrity.  Due to its simpler on-disk layout, it is possible to safely
perform metadata updates more efficiently.
Installations, suspend-to-disk, and other allocation-heavy I/O workloads
will see increased performance due to fewer I/Os and syncs.  Workloads
that do not cause new clusters to be allocated will perform similar to
raw images due to in-memory metadata caching.
The format supports sparse disk images.  It does not rely on the host
filesystem holes feature, making it a good choice for sparse disk images
that need to be transferred over channels where holes are not supported.
Backing files are supported so only deltas against a base image can be
stored.
The file format is extensible so that additional features can be added
later with graceful compatibility handling.
Internal snapshots are not supported.  This eliminates the need for
additional metadata to track copy-on-write clusters.
It would be nice to support external snapshots, so another file
besides the disk images can store the snapshots. Then snapshotting
would be available even with raw or QED disk images. This is of course
not QED specific.
Post by Stefan Hajnoczi
+ *
+ * +--------+----------+----------+----------+-----+
+ * | header | L1 table | cluster0 | cluster1 | ... |
+ * +--------+----------+----------+----------+-----+
+ *
+ *
+ *                     +----------+
+ *                     | L1 table |
+ *                     +----------+
+ *                ,------'  |  '------.
+ *           +----------+   |    +----------+
+ *           | L2 table |  ...   | L2 table |
+ *           +----------+        +----------+
+ *       ,------'  |  '------.
+ *  +----------+   |    +----------+
+ *  |   Data   |  ...   |   Data   |
+ *  +----------+        +----------+
+ *
+ * The L1 table is fixed size and always present.  L2 tables are allocated on
+ * demand.  The L1 table size determines the maximum possible image size; it
+ * can be influenced using the cluster_size and table_size values.
The formula for calculating the maximum size would be nice. Is the
image_size the limit? How many clusters can there be? What happens if
the image_size is not equal to multiple of cluster size? Wouldn't
image_size be redundant if cluster_size and table_size determine the
image size?
Post by Stefan Hajnoczi
+ *
+ * All fields are little-endian on disk.
+ */
+
+typedef struct {
+    uint32_t magic;                 /* QED */
+
+    uint32_t cluster_size;          /* in bytes */
Doesn't cluster_size need to be a power of two?
Post by Stefan Hajnoczi
+    uint32_t table_size;            /* table size, in clusters */
+    uint32_t first_cluster;         /* first usable cluster */
This introduces some limits to the location of first cluster, with 4k
clusters it must reside within the first 16TB. I guess it doesn't
matter.
Post by Stefan Hajnoczi
+
+    uint64_t features;              /* format feature bits */
+    uint64_t compat_features;       /* compatible feature bits */
+    uint64_t l1_table_offset;       /* L1 table offset, in bytes */
+    uint64_t image_size;            /* total image size, in bytes */
+
+    uint32_t backing_file_offset;   /* in bytes from start of header */
+    uint32_t backing_file_size;     /* in bytes */
+    uint32_t backing_fmt_offset;    /* in bytes from start of header */
+    uint32_t backing_fmt_size;      /* in bytes */
+} QEDHeader;
+
+typedef struct {
+    uint64_t offsets[0];            /* in bytes */
+} QEDTable;
Is this for both L1 and L2 tables?
Anthony Liguori
2010-09-07 20:41:55 UTC
Permalink
Post by Blue Swirl
On Mon, Sep 6, 2010 at 10:04 AM, Stefan Hajnoczi
Post by Stefan Hajnoczi
QEMU Enhanced Disk format is a disk image format that forgoes features
found in qcow2 in favor of better levels of performance and data
integrity. Due to its simpler on-disk layout, it is possible to safely
perform metadata updates more efficiently.
Installations, suspend-to-disk, and other allocation-heavy I/O workloads
will see increased performance due to fewer I/Os and syncs. Workloads
that do not cause new clusters to be allocated will perform similar to
raw images due to in-memory metadata caching.
The format supports sparse disk images. It does not rely on the host
filesystem holes feature, making it a good choice for sparse disk images
that need to be transferred over channels where holes are not supported.
Backing files are supported so only deltas against a base image can be
stored.
The file format is extensible so that additional features can be added
later with graceful compatibility handling.
Internal snapshots are not supported. This eliminates the need for
additional metadata to track copy-on-write clusters.
It would be nice to support external snapshots, so another file
besides the disk images can store the snapshots. Then snapshotting
would be available even with raw or QED disk images. This is of course
not QED specific.
There are two types of snapshots that I think can cause confusion:
CPU/device state snapshots and block device snapshots.

qcow2 and qed both support block device snapshots. qed only supports
external snapshots (via backing_file) whereas qcow2 supports external
and internal snapshots. The internal snapshots are the source of an
incredible amount of complexity in the format.

qcow2 can also store CPU/device state snapshots and correlate them to
block device snapshots (within a single block device). It only supports
doing non-live CPU/device state snapshots.

OTOH, qemu can support live snapshotting via live migration. Today, it
can be used to snapshot CPU/device state to a file on the filesystem
with minimum downtime.

Combined with an external block snapshot and correlating data, this
could be used to implement a single "snapshot" command that would behave
like savevm but would not pause a guest's execution.

It's really just a matter of plumbing to expose an interface for this
today. We have all of the infrastructure we need.
Post by Blue Swirl
Post by Stefan Hajnoczi
+ *
+ * +--------+----------+----------+----------+-----+
+ * | header | L1 table | cluster0 | cluster1 | ... |
+ * +--------+----------+----------+----------+-----+
+ *
+ *
+ * +----------+
+ * | L1 table |
+ * +----------+
+ * ,------' | '------.
+ * +----------+ | +----------+
+ * | L2 table | ... | L2 table |
+ * +----------+ +----------+
+ * ,------' | '------.
+ * +----------+ | +----------+
+ * | Data | ... | Data |
+ * +----------+ +----------+
+ *
+ * The L1 table is fixed size and always present. L2 tables are allocated on
+ * demand. The L1 table size determines the maximum possible image size; it
+ * can be influenced using the cluster_size and table_size values.
The formula for calculating the maximum size would be nice.
table_entries = (table_size * cluster_size / 8)
max_size = (table_entries) * table_entries * cluster_size

it's a hell of a lot easier to do powers-of-two math though:

table_entries = 2^2 * 2^16 / 2^3 = 2^15
max_size = 2^15 * 2^15 * 2^16 = 2^46 = 64TB
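The same arithmetic as a trivial C program, just to check the numbers
(nothing qed-specific beyond the defaults quoted above):

    #include <inttypes.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t cluster_size  = 64 * 1024;   /* default: 64 KB */
        uint64_t table_size    = 4;           /* default: 4 clusters per table */
        uint64_t table_entries = table_size * cluster_size / sizeof(uint64_t);
        uint64_t max_size      = table_entries * table_entries * cluster_size;

        /* Prints table_entries = 32768, max_size = 70368744177664 (64 TB). */
        printf("table_entries = %" PRIu64 "\n", table_entries);
        printf("max_size      = %" PRIu64 " bytes\n", max_size);
        return 0;
    }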
Post by Blue Swirl
Is the
image_size the limit?
No.
Post by Blue Swirl
How many clusters can there be?
table_entries * table_entries
Post by Blue Swirl
What happens if
the image_size is not equal to multiple of cluster size?
The code checks this and fails at open() or create() time.
Post by Blue Swirl
Wouldn't
image_size be redundant if cluster_size and table_size determine the
image size?
In a two-level table, if you make table_size the determining factor, the
image has to be a multiple of the space spanned by one L2 table, which
in the default case for qed is table_entries * cluster_size = 2^15 * 2^16
bytes = 2GB.
Post by Blue Swirl
Post by Stefan Hajnoczi
+ *
+ * All fields are little-endian on disk.
+ */
+
+typedef struct {
+ uint32_t magic; /* QED */
+
+ uint32_t cluster_size; /* in bytes */
Doesn't cluster_size need to be a power of two?
Yes. It's enforced at open() and create() time but needs to be in the spec.
Post by Blue Swirl
Post by Stefan Hajnoczi
+ uint32_t table_size; /* table size, in clusters */
+ uint32_t first_cluster; /* first usable cluster */
This introduces some limits to the location of first cluster, with 4k
clusters it must reside within the first 16TB. I guess it doesn't
matter.
first_cluster is a bad name. It should be header_size and yeah, there
is a limit on header_size.
Post by Blue Swirl
Post by Stefan Hajnoczi
+
+ uint64_t features; /* format feature bits */
+ uint64_t compat_features; /* compatible feature bits */
+ uint64_t l1_table_offset; /* L1 table offset, in bytes */
+ uint64_t image_size; /* total image size, in bytes */
+
+ uint32_t backing_file_offset; /* in bytes from start of header */
+ uint32_t backing_file_size; /* in bytes */
+ uint32_t backing_fmt_offset; /* in bytes from start of header */
+ uint32_t backing_fmt_size; /* in bytes */
+} QEDHeader;
+
+typedef struct {
+ uint64_t offsets[0]; /* in bytes */
+} QEDTable;
Is this for both L1 and L2 tables?
Yes, which has the nice advantage of simplifying the code quite a bit.

Regards,

Anthony Liguori
Kevin Wolf
2010-09-08 07:48:13 UTC
Permalink
Post by Anthony Liguori
There's two types of snapshots that I think can cause confusion.
There's CPU/device state snapshots and then there's a block device snapshot.
qcow2 and qed both support block device snapshots. qed only supports
external snapshots (via backing_file) whereas qcow2 supports external
and internal snapshots. The internal snapshots are the source of an
incredible amount of complexity in the format.
qcow2 can also store CPU/device state snapshots and correlate them to
block device snapshots (within a single block device). It only supports
doing non-live CPU/device state snapshots.
Which is not a property of the format, but of the implementation. I
think it shouldn't be too hard to introduce live snapshots.
Post by Anthony Liguori
OTOH, qemu can support live snapshotting via live migration. Today, it
can be used to snapshot CPU/device state to a file on the filesystem
with minimum downtime.
Combined with an external block snapshot and correlating data, this
could be used to implement a single "snapshot" command that would behave
like savevm but would not pause a guest's execution.
We'd need fields for referencing a VM state file from a QED image, just
like it's already done for backing files.

Kevin
Stefan Hajnoczi
2010-09-08 15:37:51 UTC
Permalink
Post by Blue Swirl
On Mon, Sep 6, 2010 at 10:04 AM, Stefan Hajnoczi
Post by Stefan Hajnoczi
QEMU Enhanced Disk format is a disk image format that forgoes features
found in qcow2 in favor of better levels of performance and data
integrity.  Due to its simpler on-disk layout, it is possible to safely
perform metadata updates more efficiently.
Installations, suspend-to-disk, and other allocation-heavy I/O workloads
will see increased performance due to fewer I/Os and syncs.  Workloads
that do not cause new clusters to be allocated will perform similar to
raw images due to in-memory metadata caching.
The format supports sparse disk images.  It does not rely on the host
filesystem holes feature, making it a good choice for sparse disk images
that need to be transferred over channels where holes are not supported.
Backing files are supported so only deltas against a base image can be
stored.
The file format is extensible so that additional features can be added
later with graceful compatibility handling.
Internal snapshots are not supported.  This eliminates the need for
additional metadata to track copy-on-write clusters.
It would be nice to support external snapshots, so another file
besides the disk images can store the snapshots. Then snapshotting
would be available even with raw or QED disk images. This is of course
not QED specific.
Post by Stefan Hajnoczi
+ *
+ * +--------+----------+----------+----------+-----+
+ * | header | L1 table | cluster0 | cluster1 | ... |
+ * +--------+----------+----------+----------+-----+
+ *
+ *
+ *                     +----------+
+ *                     | L1 table |
+ *                     +----------+
+ *                ,------'  |  '------.
+ *           +----------+   |    +----------+
+ *           | L2 table |  ...   | L2 table |
+ *           +----------+        +----------+
+ *       ,------'  |  '------.
+ *  +----------+   |    +----------+
+ *  |   Data   |  ...   |   Data   |
+ *  +----------+        +----------+
+ *
+ * The L1 table is fixed size and always present.  L2 tables are allocated on
+ * demand.  The L1 table size determines the maximum possible image size; it
+ * can be influenced using the cluster_size and table_size values.
The formula for calculating the maximum size would be nice. Is the
image_size the limit? How many clusters can there be? What happens if
the image_size is not equal to multiple of cluster size? Wouldn't
image_size be redundant if cluster_size and table_size determine the
image size?
image_size is the logical image size, whereas TABLE_NELEMS *
TABLE_NELEMS * cluster_size is the maximum logical image size
(TABLE_NELEMS depends on table_size and cluster_size). I have updated
the wiki page with the constraint.

I don't think the specification needs to mention error behavior, that
would depend on the implementation. But the specification needs to
mention alignment constraints so I have added them.
Post by Blue Swirl
Post by Stefan Hajnoczi
+ *
+ * All fields are little-endian on disk.
+ */
+
+typedef struct {
+    uint32_t magic;                 /* QED */
+
+    uint32_t cluster_size;          /* in bytes */
Doesn't cluster_size need to be a power of two?
Post by Stefan Hajnoczi
+    uint32_t table_size;            /* table size, in clusters */
+    uint32_t first_cluster;         /* first usable cluster */
This introduces some limits to the location of first cluster, with 4k
clusters it must reside within the first 16TB. I guess it doesn't
matter.
It shouldn't matter since any header that is >=16 TB means something
mutated, escaped the lab, and is terrorizing the world as a qed
monster image.
Post by Blue Swirl
Post by Stefan Hajnoczi
+
+    uint64_t features;              /* format feature bits */
+    uint64_t compat_features;       /* compatible feature bits */
+    uint64_t l1_table_offset;       /* L1 table offset, in bytes */
+    uint64_t image_size;            /* total image size, in bytes */
+
+    uint32_t backing_file_offset;   /* in bytes from start of header */
+    uint32_t backing_file_size;     /* in bytes */
+    uint32_t backing_fmt_offset;    /* in bytes from start of header */
+    uint32_t backing_fmt_size;      /* in bytes */
+} QEDHeader;
+
+typedef struct {
+    uint64_t offsets[0];            /* in bytes */
+} QEDTable;
Is this for both L1 and L2 tables?
Yes, they both have the same size.

Stefan
Blue Swirl
2010-09-08 18:24:10 UTC
Permalink
Post by Stefan Hajnoczi
Post by Blue Swirl
On Mon, Sep 6, 2010 at 10:04 AM, Stefan Hajnoczi
Post by Stefan Hajnoczi
QEMU Enhanced Disk format is a disk image format that forgoes features
found in qcow2 in favor of better levels of performance and data
integrity.  Due to its simpler on-disk layout, it is possible to safely
perform metadata updates more efficiently.
Installations, suspend-to-disk, and other allocation-heavy I/O workloads
will see increased performance due to fewer I/Os and syncs.  Workloads
that do not cause new clusters to be allocated will perform similar to
raw images due to in-memory metadata caching.
The format supports sparse disk images.  It does not rely on the host
filesystem holes feature, making it a good choice for sparse disk images
that need to be transferred over channels where holes are not supported.
Backing files are supported so only deltas against a base image can be
stored.
The file format is extensible so that additional features can be added
later with graceful compatibility handling.
Internal snapshots are not supported.  This eliminates the need for
additional metadata to track copy-on-write clusters.
It would be nice to support external snapshots, so another file
besides the disk images can store the snapshots. Then snapshotting
would be available even with raw or QED disk images. This is of course
not QED specific.
Post by Stefan Hajnoczi
+ *
+ * +--------+----------+----------+----------+-----+
+ * | header | L1 table | cluster0 | cluster1 | ... |
+ * +--------+----------+----------+----------+-----+
+ *
+ *
+ *                     +----------+
+ *                     | L1 table |
+ *                     +----------+
+ *                ,------'  |  '------.
+ *           +----------+   |    +----------+
+ *           | L2 table |  ...   | L2 table |
+ *           +----------+        +----------+
+ *       ,------'  |  '------.
+ *  +----------+   |    +----------+
+ *  |   Data   |  ...   |   Data   |
+ *  +----------+        +----------+
+ *
+ * The L1 table is fixed size and always present.  L2 tables are allocated on
+ * demand.  The L1 table size determines the maximum possible image size; it
+ * can be influenced using the cluster_size and table_size values.
The formula for calculating the maximum size would be nice. Is the
image_size the limit? How many clusters can there be? What happens if
the image_size is not equal to multiple of cluster size? Wouldn't
image_size be redundant if cluster_size and table_size determine the
image size?
image_size is the logical image size, whereas TABLE_NELEMS *
TABLE_NELEMS * cluster_size is the maximum logical image size
(TABLE_NELEMS depends on table_size and cluster_size).  I have updated
the wiki page with the constraint.
Based on these:
#define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size,
the maximum image size equals table_size^2 * cluster_size^3 /
sizeof(uint64_t)^2. Is the squaring and cubing of the terms
beneficial? I mean, the size scales up fast to unusable numbers,
whereas with a more linear equation (for example, allow different L1
and L2 sizes), more values could actually be usable. Again, I'm not
sure if this matters at all.

I think the minimum size should be table_size = 1, cluster_size = 4
bytes, 1^2 * 4^3 / 8^2 = 2 bytes, or is the minimum bigger? What's
the minimum for cluster_size?
Post by Stefan Hajnoczi
I don't think the specification needs to mention error behavior, that
would depend on the implementation.  But the specification needs to
mention alignment constraints so I have added them.
Post by Blue Swirl
Post by Stefan Hajnoczi
+ *
+ * All fields are little-endian on disk.
+ */
+
+typedef struct {
+    uint32_t magic;                 /* QED */
+
+    uint32_t cluster_size;          /* in bytes */
Doesn't cluster_size need to be a power of two?
Post by Stefan Hajnoczi
+    uint32_t table_size;            /* table size, in clusters */
+    uint32_t first_cluster;         /* first usable cluster */
This introduces some limits to the location of first cluster, with 4k
clusters it must reside within the first 16TB. I guess it doesn't
matter.
It shouldn't matter since any header that is >=16 TB means something
mutated, escaped the lab, and is terrorizing the world as a qed
monster image.
In the Wiki version this has changed to header_size in clusters. With
2GB clusters, there will be some wasted bits.

By the way, perhaps cluster_size of 0 should mean 4GB? Or maybe all
sizes should be expressed as an exponent to 2, then 16 bits would
allow cluster sizes up to 2^64?
Post by Stefan Hajnoczi
Post by Blue Swirl
Post by Stefan Hajnoczi
+
+    uint64_t features;              /* format feature bits */
+    uint64_t compat_features;       /* compatible feature bits */
+    uint64_t l1_table_offset;       /* L1 table offset, in bytes */
+    uint64_t image_size;            /* total image size, in bytes */
+
+    uint32_t backing_file_offset;   /* in bytes from start of header */
+    uint32_t backing_file_size;     /* in bytes */
+    uint32_t backing_fmt_offset;    /* in bytes from start of header */
+    uint32_t backing_fmt_size;      /* in bytes */
+} QEDHeader;
+
+typedef struct {
+    uint64_t offsets[0];            /* in bytes */
+} QEDTable;
Is this for both L1 and L2 tables?
Yes, they both have the same size.
Stefan
Anthony Liguori
2010-09-08 18:35:25 UTC
Permalink
Post by Anthony Liguori
#define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
header.image_size<= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size,
the maximum image size equals to table_size^2 * cluster_size^3 /
sizeof(uint64_t)^2. Is the squaring and cubing of the terms
beneficial? I mean, the size scales up fast to unusable numbers,
whereas with a more linear equation (for example, allow different L1
and L2 sizes), more values could be actually usable. Again, I'm not
sure if this matters at all.
I think the minimum size should be table_size = 1, cluster_size = 4
bytes, 1^2 * 4^3 / 8^2 = 2 bytes, or is the minimum bigger? What's
the minimum for cluster_size?
4k.

The smallest image size is 1GB. There is no upper limit on image size
because clusters can be arbitrarily large.
Post by Anthony Liguori
It shouldn't matter since any header that is>=16 TB means something
mutated, escaped the lab, and is terrorizing the world as a qed
monster image.
In the Wiki version this has changed to header_size in clusters. With
2GB clusters, there will be some wasted bits.
2GB clusters would waste an awful lot of space regardless. I don't
think it's useful to have clusters that large.
Post by Anthony Liguori
By the way, perhaps cluster_size of 0 should mean 4GB? Or maybe all
sizes should be expressed as an exponent to 2, then 16 bits would
allow cluster sizes up to 2^64?
I don't think cluster sizes much greater than 64k actually make sense.
We don't need an image format that supports > 1PB disks.

Regards,

Anthony Liguori
Blue Swirl
2010-09-08 18:56:51 UTC
Permalink
Post by Anthony Liguori
#define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
header.image_size<= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size,
the maximum image size equals to table_size^2 * cluster_size^3 /
sizeof(uint64_t)^2. Is the squaring and cubing of the terms
beneficial? I mean, the size scales up fast to unusable numbers,
whereas with a more linear equation (for example, allow different L1
and L2 sizes), more values could be actually usable. Again, I'm not
sure if this matters at all.
I think the minimum size should be table_size = 1, cluster_size = 4
bytes,  1^2 * 4^3 / 8^2 = 2 bytes, or is the minimum bigger? What's
the minimum for cluster_size?
4k.
The smallest image size is 1GB.  There is no upper limit on image size
because clusters can be arbitrarily large.
That's a bit big, for example CD images are only 640M and there were
smaller disks. But I guess you mean the smallest maximum size limited
by the cluster_size etc, so the actual images may be even smaller.
Post by Anthony Liguori
It shouldn't matter since any header that is>=16 TB means something
mutated, escaped the lab, and is terrorizing the world as a qed
monster image.
In the Wiki version this has changed to header_size in clusters. With
2GB clusters, there will be some wasted bits.
2GB clusters would waste an awful lot of space regardless.  I don't think
it's useful to have clusters that large.
Post by Anthony Liguori
By the way, perhaps cluster_size of 0 should mean 4GB? Or maybe all
sizes should be expressed as an exponent to 2, then 16 bits would
allow cluster sizes up to 2^64?
I don't think cluster sizes much greater than 64k actually make sense.  We
don't need an image format that supports > 1PB disks.
File system developers could want to try images in exabyte ranges.
Isn't the purpose of an image format that you can create a virtual
disk that can appear to be bigger than the disk space needed?
Anthony Liguori
2010-09-08 19:19:50 UTC
Permalink
Post by Blue Swirl
That's a bit big, for example CD images are only 640M and there were
smaller disks. But I guess you mean the smallest maximum size limited
by the cluster_size etc, so the actual images may be even smaller.
Yes. The smallest image is one cluster. The smallest cluster is 4k so
the smallest image is 4k.
Post by Blue Swirl
I don't think cluster sizes much greater than 64k actually make sense. We
don't need an image format that supports> 1PB disks.
File system developers could want to try images in exabyte ranges.
Isn't the purpose of an image format that you can create a virtual
disk that can appear to be bigger than the disk space needed?
$ qemu-img create -f qed -o table_size=16,cluster_size=1M exabyte.qed
$((1024*1024))T
Formatting 'exabyte.qed', fmt=qed size=1152921504606846976
cluster_size=1048576 table_size=16 copy_on_read=off

I still contend it's insane to do, but it does work and only requires a
1M cluster size.

Generally speaking, max image size is:

(cluster_size * table_size / 8) * (cluster_size * table_size / 8) *
cluster_size

Or:

(2^x * 2^y / 2^3) * (2^x * 2^y / 2^3) * 2^x

valid values for y are [0...4]. Valid values for x are [12...31]

Solve for each range and you have 2^30...2^95, but you can't have an
image > ~2^64.

There's an awful lot of flexibility with just something as simple as a
two level table.

Regards,

Anthony Liguori