[This reply is lengthy and constitutes the first revision of a file
system semantics for MTA administrators mini-HOWTO. Feel free to
comment.]
Post by DBS
> An ext3 filesystem loaded in full journal mode stores not only
> metadata in the journal but also the actual file contents.
> You won't need to use mount -o sync in this case and worry about
> losing mail.
That's wrong and dangerous. The journal is filled asynchronously if you
go without -o sync or equivalents, even with data=journal. In plain
terms: even though you're using data=journal, whatever is written after
fsync(), namely the essential link(2), is only in RAM, i. e. it can get
lost in a crash or power outage.
The short Linux recommendation is: update util-linux, kernel and
e2fsprogs, place your queue on an ext2fs or ext3fs file system, shut down
your MTA (i. e. kill qmail-send, use svc -d), drop -o sync from fstab,
reboot, and use "chattr +D -R /var/qmail/queue" once. I'm not
recommending ReiserFS at this point, and I can't say if chattr +S or +D
or even -o sync have any effect there. My personal preference is clearly
ext3fs with chattr +D at this time.
Here is a hopefully comprehensive write-up about these features.
I reserve the full copyright; verbatim distribution via the qmail list
and its archives is permitted, though. Redistribution of data obtained
from the archives is NOT permitted at this point. You can always write
the qmail-***@list.cr.yp.to command or archive URL instead. I will place
this document under a more liberal license after it's been reviewed and
updated. Caching proxies are also allowed to redistribute my data, ask
for details if necessary. Redistributing cached data other than through
the regular operation of a default squid or apache install is NOT
permitted.
There are two requirements that are confused here. I'll clarify these
and correct and elaborate on my former "you need mount -o sync" claim:
mount -o sync is one way, the most portable one, to achieve the
semantics that qmail relies on. It is expensive, and there are cheaper
ways. Read on, this will not be the qmail-is-dumb flame fest that I'm
accused of so often. The two requirements are:
1. the on-disk consistency of the file system
2. the ordering of the ACTUAL PHYSICAL write operation and its REPORTED
COMPLETION. As Michael has pointed out, backing out the transaction
means losing the mail.
Some explanations of how things work.
Contents:
0. PREFACE
1. CONSISTENCY. What journalling or softupdates do.
2. PERSISTENCE, SYNCHRONIZATION AND ORDERING
3. IMPAIRMENT OF ORDERING
4. TUNING
5. OTHER SYSTEMS AND FUTURE RESEARCH
=== PREFACE ===
In this document, except in the TUNING section, -o sync means "any
synchronous mechanism"; it might mean chattr +S, -o dirsync or chattr +D
(careful, read the TUNING section below).
=== CONSISTENCY ===
The ext3fs journal records transactions on meta data (and possibly data
in data=journal mode, at the expense of write speed). ext3fs
data=ordered mode makes sure that ALL data modifications are written
before the meta data are updated. data=writeback merely journals the
meta data changes, but makes no guarantees about the order in which data
and meta data are written.
data=ordered and data=journal make sure that if a NEW file is written,
it's intact. With data=writeback, the file may be on disk, but the
contents may have been lost in a crash. (This is the same as for
ext2fs.) This can frequently be observed on ReiserFS or ext2 systems
when the computer crashes under heavy asynchronous write load.
The ReiserFS journal is metadata-only and corresponds to data=writeback
unless you use Chris Mason's patches and force data=journal or
data=ordered. (I'm unsure if a vanilla reiserfs accepts data=writeback.)
The BSD softupdates code effectively makes a file system "async", but
makes sure that no unordered writes corrupt the file system structure.
So, either of these mechanisms, logging/journaling or softupdates, makes
sure the file system comes up quickly and cleanly after an unclean
shutdown.
=== PERSISTENCE, SYNCHRONIZATION AND ORDERING ===
The other issue is -o sync. I'll elaborate on the tuning later; mount -o
sync is the big cannon that shoots at the bird and may be unacceptably
expensive.
When an application (say, qmail) tells the kernel "write me that data to
disk", then the change may either be a change to file data, file meta
data or to directory data. Let's subsume file data under file meta data
for now. File data are the actual file contents. File Meta data are file
size, creation date, and so on. Directory data are -- simplified -- the
file names. If an application creates a new file, writes to it, and closes
it, then there are file data and directory data.
"synchronous" write means: if an application uses a kernel function,
this function will only return to the application after the data have
been written to physical media to the best of the kernel's knowledge.
"asynchronous" write means: the kernel function may return to the
application before the data have been written to physical media, for
example while the data are still in a cache.
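To make the distinction concrete, here is a minimal Python sketch of three ways an application can write a file. The function names are made up for illustration, and O_SYNC is assumed to be available (it is on Linux and the BSDs). Note that all three concern the file itself only; none of them says anything about the directory entry.

```python
import os


def write_async(path, data):
    """Plain write: returns once the data are in the kernel's cache."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()  # hands the data to the kernel, NOT to the disk


def write_with_fsync(path, data):
    """Write, then block until the kernel reports the file as stable."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # returns only after the kernel has pushed the file out


def write_o_sync(path, data):
    """Open with O_SYNC so that every write(2) is itself synchronous."""
    # getattr() fallback: on a platform without O_SYNC this degrades to async.
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_SYNC", 0)
    fd = os.open(path, flags, 0o600)
    try:
        view = memoryview(data)
        while view:  # write(2) may accept fewer bytes than offered
            written = os.write(fd, view)
            view = view[written:]
    finally:
        os.close(fd)
```

"Stable" here still means "to the best of the kernel's knowledge": a drive with its write cache enabled may acknowledge data that have not reached the platters yet.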
Asynchronous data are written back later. Usually, these data are sorted
by disk block or something similar and collected into larger write
commands, both to increase the efficiency. IIRC, BSD softupdates claims
90 s, Linux trickles dirty writes every 30 s. Don't quote me on these
two figures though.
Qmail REQUIRES that -- among others -- the link(2) kernel function is
synchronous, i. e. the link(2) function MUST NOT return before the data
are physically on the disk. Reason: right after the completion, qmail
tells the SMTP client "250 Ok"; and the client will delete its queue
file immediately. qmail has taken over the responsibility for the mail.
Link(2) is a "directory write". Linux has always written directory data
asynchronously, unless -o sync was in place. BSD with async or
softupdates also writes directory data asynchronously. Consequence:
qmail takes over responsibility BEFORE the data are on disk physically.
This can cause mail loss, if the computer crashes or the power fails
before the data have been written from the RAM cache to physical media.
So the ordering requirement is clear: link(2) must first write the data
to physical media before returning control to the application (qmail).
mount -o sync and equivalent mechanisms force directory updates to be
synchronous, so that qmail can be reliable at all. Read the next section
to see why this is not always sufficient.
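The sequence described in this section can be sketched in a few lines of Python. This is an illustration only: the function name, the "tmp." prefix and the flat directory layout are invented here and are not qmail's actual queue layout.

```python
import os


def queue_message(queue_dir, name, body):
    """Write body to a temporary file, fsync it, then link it into place."""
    tmp_path = os.path.join(queue_dir, "tmp." + name)  # hypothetical naming
    final_path = os.path.join(queue_dir, name)

    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        view = memoryview(body)
        while view:  # write(2) may accept fewer bytes than offered
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)  # the file contents are reported stable from here on
    finally:
        os.close(fd)

    # Directory write. Without -o sync, -o dirsync, chattr +S or chattr +D,
    # this call may return while the new name exists only in RAM.
    os.link(tmp_path, final_path)

    # An MTA would answer "250 Ok" at this point; cleaning up tmp_path is
    # left out of this sketch.
    return final_path
```

The point of the sketch is the position of the link call: it comes after the fsync, so the fsync cannot cover it, and its durability depends entirely on how the directory is mounted or flagged.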
=== IMPAIRMENT OF ORDERING ===
Hard disk drives use caches to improve the write speed, and these may
reorder the blocks that are written to disk. Hard disk drives do NOT
guarantee that cached data will survive a power outage.
Hard disk drives (except for some broken models, reported on the
Linux-Kernel mailing list, some 2.5" drives IIRC) allow you to turn off
the write cache, to make sure the writes are ordered.
SCSI drives have also offered the "tagged command queueing" feature for
many years, which includes an "ordered tag" facility that makes sure that
all earlier writes complete before the write with the ordered tag, and
that all writes after the ordered tag are not started earlier than the
write that was associated with the ordered tag.
The recent ATA standard revisions also support this "dma queued"
feature, but it's not as widely deployed, and Linux does not support it
currently. Later versions may, there are some developer patches. FreeBSD
supports it on IBM DPTA, DTLA and IC35* drives. The only other ATA
drives known to me that offer queueing are the IBM DTTA (currently
unsupported by FreeBSD, would require workaround) and the IBM DJNA (as
per Søren Schmidt, their tagged implementation is so flawed that it's
unusable).
(IBM DTLA and IC35L...AVER drives are claimed to be unreliable. Four out
of my eight DTLA drives, bought in early 2001 from 3 different vendors,
failed on me within 18 months after purchase. Other people reported AVER
drives dying far too soon as well, go search Usenet.)
However, to make drives look good in benchmarks, most drives ship with
the write cache enabled, and guess what? This defeats the ordering
mentioned in the previous section. The link(2) may have made it to the
drive's cache, but not be on disk. If the power fails before the drive
had a chance to flush the cache, the mail is again lost.
Guess even more: Linux does not by default use ordered tags properly.
I'm not aware of the current status of the "write barrier" patches; last
time I looked, they were available for ATA and only for specific SCSI
systems, and not for all file systems, and were scheduled for Linux 2.6.
Chris Mason should know more on this topic.
So, to be really safe, you must for now switch the write cache off on
Linux.
I'm not sure how good other operating systems, including FreeBSD, are.
If in doubt, going with the write cache turned off is the safe way.
Tagged queueing compensates for some of the speed loss because it lets
the drive get rid of the lock-step approach (accept block of data, wait
for disk to rotate, write, acknowledge write, reiterate) that is
incurred without the write cache.
=== TUNING ===
There's not much you can do about the drive's write cache unless the
file system you are using knows how to make use of ordered tags and the
drive supports these.
There is something you can do about -o sync, though. On ext2fs or ext3fs,
it is possible to use chattr -R +S /var/qmail/queue and mount WITHOUT -o
sync. That way, other /var directories remain asynchronous (for example,
/var/lib/dhcp, /var/log and /var/spool/news).
With recent Linux kernel, util-linux and e2fsprogs versions (I checked
e2fsck 1.28, util-linux 2.11u and Linux 2.4.19, as shipped e. g. on SuSE
Linux 8.1), there is an additional option: -o dirsync, and chattr -R +D
/var/qmail/queue.
The original patches that Andrew Morton had were against Linux
2.4.18-pre9, e2fsprogs 1.26 and util-linux 2.11n, so versions AFTER but
not including these are candidates. Use "strings /bin/mount | grep
dirsync" to find out if your util-linux is current enough. Just update
e2fsprogs to get the latest e2fsck bug fixes and chattr/lsattr support.
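For machines where strings(1) is not at hand, the same crude check can be sketched in Python. The function name is made up, and like the strings/grep pipeline it only tells you whether the byte sequence occurs somewhere in the file, nothing more.

```python
def binary_mentions(path, needle=b"dirsync", chunk_size=1 << 16):
    """Return True if the byte string `needle` occurs anywhere in the file."""
    overlap = len(needle) - 1
    tail = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return False
            # Keep the end of the previous chunk so that a match straddling
            # a chunk boundary is still found.
            window = tail + chunk
            if needle in window:
                return True
            tail = window[-overlap:] if overlap else b""
```

Usage would be something like binary_mentions("/bin/mount"), with the path adjusted to wherever your mount binary lives.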
This -o dirsync (or chattr +D) makes only directory writes such as
link(2) synchronous, while leaving file writes asynchronous. BEWARE: on
very old systems, +D used to have a different meaning that was never in
use; this old meaning has been renamed to +Z.
So, instead of mount -o sync, you can use chattr -R +S on ext2fs or ext3fs
on any system. You can also go for mount -o dirsync or chattr -R +D
/var/qmail/queue on the state-of-the-art system.
Linux' -o dirsync on ext2fs and ext3fs corresponds to -o noasync on BSD
ffs without softupdates.
The chattr limits the impact to the directory it's applied to, while
mount -o [dir]sync applies to the whole partition; you'll have to decide
what you find appropriate.
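To check which of the mount-wide variants is in effect for a partition, one can look at the options column of the mount table. The sketch below parses text in /proc/mounts format (device, mount point, type, options, ...). It assumes the kernel lists "sync" and "dirsync" in the options field, as current Linux kernels do; the function name is made up. Per-directory chattr flags do not show up there, use lsattr for those.

```python
def sync_mode(mounts_text, mount_point):
    """Return 'sync', 'dirsync' or 'async' for mount_point, None if not listed."""
    result = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[1] != mount_point:
            continue
        options = fields[3].split(",")
        if "sync" in options:
            result = "sync"
        elif "dirsync" in options:
            result = "dirsync"
        else:
            result = "async"
    # If a mount point is listed more than once, the last entry wins,
    # because a later mount shadows an earlier one.
    return result
```

A typical call would be sync_mode(open("/proc/mounts").read(), "/var").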
chattr -R +D is not slower than chattr -R +S, but will usually be faster.
mount -o dirsync is not slower than mount -o sync, but will usually be faster.
I've made some benchmarks (only one run, so only look at the rough
relations) and found very strange results as to the ReiserFS behaviour
that I'll have to ask the ReiserFS team about, because -o sync does not
slow ReiserFS down considerably, and this is very suspicious.
Felix von Leitner uses ext3fs and will probably discourage you from
using reiserfs when you ask him.
I'm not sure if -o sync has any effect on BSD softupdates. If it does,
softupdates + -o sync will be safe; if not, use a file system with the
classical ffs, without softupdates.
Here are the results of a bonnie benchmark, conducted with Linux 2.4.19
on a Maxtor 4W060H4 60 GB 5400/min ATA drive with write cache switched
off, attached to a VIA VT52C686 IDE adapter. ext2a means default (async)
mount. ext2d means -o dirsync. ext2s means -o sync. The machine had 140
MB free RAM and had its swap turned off for the test. The software is
Russell Coker's bonnie++-1.02c.
Looking at these figures, I wonder if the "mount -o sync has always just
updated the directory data synchronously" claim made by some Linux
Kernel folks still holds. If it did, sequential output would have had to
be much faster.
Remember, it's been a single run on a loaded workstation, so these
figures are not too accurate, but should give an idea of what's
happening. I'll offer ext3 figures later when the whole set of 9
benchmarks has completed (that is, combine each of (writeback, ordered,
journal) with each of (defaults, dirsync, sync)).
Version 1.02c       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
ext2a          360M            5483   6  3187   4           28930  23 112.5   1
ext2d          360M            4974   4  3310   4           23403  17  97.2   1
ext2s          360M             191   0   637   1           30690  21 113.5   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
ext2a 10:10000:0/23 12141  97 +++++ +++  1266   4 12307  97 +++++ +++  1214   7
ext2d 10:10000:0/23  2073  15 +++++ +++  1923   7  2185  17 19844  93  1605   8
ext2s 10:10000:0/23    17   0  4577  28    21   0    17   0   351   2    19   0
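To put the file creation and deletion rates into proportion, here is a small Python calculation over the sequential create and sequential delete figures from the table above. The dictionary and function names are made up; the numbers are copied from the table and carry the same single-run caveat.

```python
# Files per second, sequential create and sequential delete columns.
create_per_sec = {"ext2a": 12141, "ext2d": 2073, "ext2s": 17}
delete_per_sec = {"ext2a": 1266, "ext2d": 1923, "ext2s": 21}


def slowdown(rates, baseline, other):
    """How many times slower `other` is than `baseline`."""
    return rates[baseline] / rates[other]


if __name__ == "__main__":
    for label, rates in (("create", create_per_sec), ("delete", delete_per_sec)):
        for baseline, other in (("ext2a", "ext2d"), ("ext2a", "ext2s"),
                                ("ext2d", "ext2s")):
            factor = slowdown(rates, baseline, other)
            print(f"{label}: {other} is {factor:.1f}x slower than {baseline}")
```

For file creation, this puts -o dirsync at roughly a factor of 6 behind the async mount, and -o sync at more than a factor of 100 behind -o dirsync.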
=== OTHER SYSTEMS AND FUTURE RESEARCH ===
5.1 OTHER SYSTEMS
5.2 FUTURE RESEARCH
--- OTHER SYSTEMS ---
I have been asked why Postfix's queue can go without -o sync on Linux.
The answer is simple: because it does chattr -R +S /var/spool/postfix on
start-up itself. Have a look at /etc/postfix/postfix-script.
For the mailboxes, it will require dirsync (or sync) semantics just like
qmail.
However, Postfix's queue can go without even the +S or +D on ext3fs and
on reiserfs as of Linux 2.4 (not Linux 2.2), and it can go on
softupdates file systems. The reason is that Postfix does not distribute
its queue status across three files, but keeps a single file with an
internal structure that comprises an end marker record -- if this is
missing, the mail is not delivered. Postfix's queue process ends with an
fsync(), not with a link(). fsync() has the feature of flushing all
pending transactions with ext3fs (not ext2fs) and reiserfs as of Linux
2.4 (won't work with 2.2), so all pending directory updates (such as
open, which is prior to fsync()) will be on permanent media once the
fsync() call has returned.
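The single-file-with-end-marker idea can be sketched as follows. This is NOT Postfix's real record format: the "R" and "E" line types, the function names and the assumption that records contain no newlines are all invented for illustration.

```python
import os

END_MARKER = b"E"  # hypothetical end-of-message record


def write_queue_file(path, records):
    """Write all records, then the end marker, then fsync() as the last step."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as f:
        for record in records:  # records are assumed to contain no newline
            f.write(b"R" + record + b"\n")
        f.write(END_MARKER + b"\n")
        f.flush()
        os.fsync(f.fileno())  # the sequence ends with fsync(), not link()


def is_complete(path):
    """A queue file counts as deliverable only if its last line is the marker."""
    with open(path, "rb") as f:
        data = f.read()
    return data.endswith(b"\n") and data.splitlines()[-1:] == [END_MARKER]
```

A file that was cut short by a crash lacks the marker line, so is_complete() returns False and the message would be ignored instead of being delivered half-written.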
--- FUTURE RESEARCH ---
* figure out what file systems and kernel versions know how to use
ordered tags properly
* compile a list of all drives that support tagged queueing
* figure out if reiserfs -o sync or -o dirsync are implemented and/or
working properly, and figure out the chattr status
* figure out if -o sync makes softupdates safe.
--
Matthias Andree