Discussion:
Trying to understand CCKD integrity
Tony Harminc
2008-08-26 02:09:03 UTC
Permalink
I'm trying to understand how a host power failure can affect CCKD
write processing. I have read the CCKD doc, and some of the code, but
I'm still not clear how the FSYNC, FREEPEND, and GCINT options
interact. In particular, if FSYNC is not specified, what is the limit
(if any) to how long a write from the guest system can remain
unwritten to the real disk, and thus subject to complete loss? With
default GCINT and FREEPEND values of 5 and -1 (= 2 with no FSYNC, if
I understand correctly), given a large enough host OS cache, could
this data just sit in Herc and/or host cache for weeks waiting for the
power to go off?

In the current case, a system was up for a couple of weeks, with
various write activity. The weekend saw an apparent power failure, and
after running a cckdcdsk -4, the guest OS claims that one of the
volumes is not labelled! I am completely sure that the volume label
was not written by the OS during the IPL, and further, has not changed
since the base file was created. However, looking at the shadow file
with cckddiag, Cyl 0 Track 0 is present but all zeros. :-( There's a
good looking one in the base file, but I don't know how to copy just
that one track from base to shadow.

Have I done this to myself by using -4 on cckdcdsk, in other words
does it try to produce a null track of some sort even if there's
nothing to base it on? Or perhaps it stumbled across some zeros that
looked like a plausible track 0 to recover...? Well, I'm speculating,
and of course I have backups from before I started fiddling with
cckdcdsk, but -3 doesn't seem to fix my original problem.

Thanks,

Tony H.
Tony Harminc
2008-09-02 23:34:40 UTC
Permalink
Replying to myself here...

I found out why cckdcdsk -4 is not recovering all (or sometimes most)
of the available tracks. For many track images, the uncompress()
routine returns a -5 (Z_BUF_ERROR), which is a complaint that the
supplied output buffer is not big enough to contain all the
uncompressed data. But cckdcdsk supplies a 65531 byte buffer, which is
enough to hold any possible track even on a 3390. Since cckdcdsk is
using uncompress() only as a track image validator, it takes this
return code to mean that the input buffer was not a valid track image,
and moves on to the next possibility.

This appears to be a 64-bit problem, I believe a mismatch between the
ZLIB library routine uncompress() and its caller. I am running 3.05 on
Fedora on I-64, and the problem can be resolved by changing the
declaration of bufl in routine cdsk_valid_trk() from int to size_t.
I'm not sure this is quite right - it provokes signed vs unsigned
warnings in the BZ2 code further down - but it does get my data back.
I think there is no compiler warning for the original mismatch because
the pointer to int is being cast to a void *.
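
For the record, here is a minimal standalone illustration of the calling
convention at issue (my own sketch, not the actual cckdutil.c code),
assuming an LP64 host where unsigned long is 8 bytes and int is 4:

/* zlib's uncompress() takes the output-buffer length through a uLongf*
 * (an unsigned long), so the variable passed by address must be a uLongf,
 * not an int.  Build with: cc zdemo.c -lz                                */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    Bytef  trk[256], comp[512], out[65536];
    uLongf complen = sizeof(comp);
    uLongf outlen  = sizeof(out);   /* were this an int, the upper half of
                                       the length would be whatever happens
                                       to follow it on a 64-bit stack     */

    memset(trk, 0xC1, sizeof(trk)); /* stand-in for an uncompressed track */
    if (compress(comp, &complen, trk, sizeof(trk)) != Z_OK)
        return 1;

    int rc = uncompress(out, &outlen, comp, complen);
    printf("rc=%d, uncompressed length=%lu\n", rc, (unsigned long)outlen);
    return 0;
}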

May I say in passing that I found this CCKD code to be much more
subtle and well thought out than I realized at first.

Tony H.
Post by Tony Harminc
I'm trying to understand how a host power failure can affect CCKD
write processing. I have read the CCKD doc, and some of the code, but
I'm still not clear how the FSYNC, FREEPEND, and GCINT options
interact. In particular, if FSYNC is not specified, what is the limit
(if any) to how long a write from the guest system can remain
unwritten to the real disk, and thus subject to complete loss? With
default GCINT and FREEPEND values of 5 and -1 (= 2 with no FSYNC, if
I understand correctly), given a large enough host OS cache, could
this data just sit in Herc and/or host cache for weeks waiting for the
power to go off?
In the current case, a system was up for a couple of weeks, with
various write activity. The weekend saw an apparent power failure, and
after running a cckdcdsk -4, the guest OS claims that one of the
volumes is not labelled! I am completely sure that the volume label
was not written by the OS during the IPL, and further, has not changed
since the base file was created. However, looking at the shadow file
with cckddiag, Cyl 0 Track 0 is present but all zeros. :-( There's a
good looking one in the base file, but I don't know how to copy just
that one track from base to shadow.
Have I done this to myself by using -4 on cckdcdsk, in other words
does it try to produce a null track of some sort even if there's
nothing to base it on? Or perhaps it stumbled across some zeros that
looked like a plausible track 0 to recover...? Well, I'm speculating,
and of course I have backups from before I started fiddling with
cckdcdsk, but -3 doesn't seem to fix my original problem.
Thanks,
Tony H.
Greg Smith
2008-09-03 01:43:57 UTC
Permalink
Post by Tony Harminc
Replying to myself here...
I found out why cckdcdsk -4 is not recovering all (or sometimes most)
of the available tracks. For many track images, the uncompress()
routine returns a -5 (Z_BUF_ERROR), which is a complaint that the
supplied output buffer is not big enough to contain all the
uncompressed data. But cckdcdsk supplies a 65531 byte buffer, which is
enough to hold any possible track even on a 3390. Since cckdcdsk is
using uncompress() only as a track image validator, it takes this
return code to mean that the input buffer was not a valid track image,
and moves on to the next possibility.
This appears to be a 64-bit problem, I believe a mismatch between the
ZLIB library routine uncompress() and its caller. I am running 3.05 on
Fedora on I-64, and the problem can be resolved by changing the
declaration of bufl in routine cdsk_valid_trk() from int to size_t.
I'm not sure this is quite right - it provokes signed vs unsigned
warnings in the BZ2 code further down - but it does get my data back.
I think there is no compiler warning for the original mismatch because
the pointer to int is being cast to a void *.
May I say in passing that I found this CCKD code to be much more
subtle and well thought out than I realized at first.
Tony H.
Post by Tony Harminc
I'm trying to understand how a host power failure can affect CCKD
write processing. I have read the CCKD doc, and some of the code, but
I'm still not clear how the FSYNC, FREEPEND, and GCINT options
interact. In particular, if FSYNC is not specified, what is the limit
(if any) to how long a write from the guest system can remain
unwritten to the real disk, and thus subject to complete loss? With
default GCINT and FREEPEND values of 5 and -1 (= 2 with no FSYNC, if
I understand correctly), given a large enough host OS cache, could
this data just sit in Herc and/or host cache for weeks waiting for the
power to go off?
In the current case, a system was up for a couple of weeks, with
various write activity. The weekend saw an apparent power failure, and
after running a cckdcdsk -4, the guest OS claims that one of the
volumes is not labelled! I am completely sure that the volume label
was not written by the OS during the IPL, and further, has not changed
since the base file was created. However, looking at the shadow file
with cckddiag, Cyl 0 Track 0 is present but all zeros. :-( There's a
good looking one in the base file, but I don't know how to copy just
that one track from base to shadow.
Have I done this to myself by using -4 on cckdcdsk, in other words
does it try to produce a null track of some sort even if there's
nothing to base it on? Or perhaps it stumbled across some zeros that
looked like a plausible track 0 to recover...? Well, I'm speculating,
and of course I have backups from before I started fiddling with
cckdcdsk, but -3 doesn't seem to fix my original problem.
Tony,

I apologize for not responding sooner. I saw your original message
after I returned from a vacation and was thinking I replied to it but
guess I just thought about it and decided that too much time had passed
or something. Let's call it summertime fever :)

An unexpected power-off can be devastating to a cckd file. I'm not a
hardware expert, but I understand that if power is lost while a write to
a host disk is in progress, garbage can be written to the disk, because
the drive may still be reading bits/bytes from the bus after the CPU is
down.

I've implemented a number of solutions to try to mitigate the problem.

One is lazy-write. All outstanding writes are `flushed' at each
`garbage collection' interval which, by default, is every 10 seconds.
So if you are running a disk monitor while running hercules you should
see activity spikes every ten seconds. However, if the cache fills up
before then, the flush occurs at that point (this should be an unusual
event). The idea is that cckd stages all its writes at one time, so if
a power-off event occurs, it probably won't be during that burst.

Two is that all writes for a track image occur to a free space. If a
power-off event occurs during the write for the new track image then
theoretically the old track image is still available to be recovered.

Three is pending free space. Once a new track image has been written
the space occupied by the previous track image is not immediately
released to the free space pool. Rather, it has to `age' over so many
garbage collection cycles (intervals of 10 seconds). This tries to
mitigate the effect of the host operating system caching (delaying)
updates to the physical disk.

Fourth is attempting to move all meta-data to the beginning of the file.
There are four meta-data structures: the ckd header, the cckd header,
the level 1 table (l1tab) and the level 2 tables (l2tabs). All but the
l2tabs must be at fixed points in the beginning of the file, but now the
code tries to move all l2tabs to the beginning of the file.

Fifth, there is a fifth meta-data structure and that is free space.
Free space is now read when the file is opened and written when the
file is closed. That means that if a free space was an older track
image, its first 8 bytes are not overwritten with the free space chain
data. Free space is easy to recover if the file is not closed properly
but the rest
of the file is intact: it's the space in the file not occupied by any
other spaces.

When a new track image is written it is going to be written into a free
space or at the end of the file. The new track image's offset into the
file may be before or after the old track image's offset.

The space occupied by the old track image will eventually be added to the
free space. The `garbage collector', which runs every 10 seconds, moves
some number of track images to free spaces that have a lower offset than
the track image. The space previously occupied by the track image is
considered pending free space exactly as if the track image had been
flushed per above. That is, the garbage collector moves active track
images towards the beginning of the file and (non-pending) free spaces
towards the end of the file. When the free space reaches the end of the
file, the file is truncated (ie reduced in size).
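
Putting the above into a rough sketch (all names here are my own
illustration, not the real cckddasd.c structures or routines):

/* One garbage-collection interval, as described above. */
typedef struct freespc {
    long long       pos;       /* offset of the freed space in the file  */
    int             len;       /* its length                             */
    int             pending;   /* gc cycles left before it can be reused */
    struct freespc *next;
} FREESPC;

void gc_interval(FREESPC *free_chain)
{
    /* 1. lazy-write: flush all updated track images from the cache; each
          goes to a non-pending free space or to the end of the file.    */

    /* 2. age the pending free space; only fully aged space is reusable. */
    for (FREESPC *f = free_chain; f; f = f->next)
        if (f->pending > 0)
            f->pending--;

    /* 3. garbage collect: move some active track images toward the front
          of the file so free space collects at the end, then truncate the
          file when the tail is nothing but free space.                  */
}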

What I'm leading to (I hope) is that when file recovery takes place,
there may be multiple valid track images for a given track. Somewhat
arbitrarily, the recovery routine (cckd_chkdsk) chooses the one closest
to the beginning of the file.

So much for the basics.

cckdcdsk -4 is interesting because it tries to recover track images
without any help from meta-data at all.

But you know all this already, and I am super impressed ;-) I am also
impressed by you going the extra mile.

If you can give me a `diff -u cckdutil.c cckdutil.c.fixed' I will try to
make the necessary code changes to properly get rid of all the warnings
etc and will give you full credit for the fix if you wish.

The `subtle' comment above bothers me a bit. If you can critique the
code a bit to make it less subtle, it would be appreciated. Don't worry
about hurting my feelings, I write so much code that code I wrote a year
or two ago is as foreign to me as code written by someone else.

Greg
somitcw
2008-09-03 02:14:09 UTC
Permalink
--- In hercules-390-***@public.gmane.org,
Greg Smith <***@...> wrote:
- - - snipped - - -
Post by Greg Smith
cckdcdsk -4 is interesting because it tries
to recover track images without any help from
meta-data at all.
- - - snipped - - -

Is switch -4 a secret?

cckdcdsk [-v] [-f] [-level] [-ro] file1 [file2 ...]

-v display version and exit

-f force check even if OPENED bit is on

level is a digit 0 - 3:
-0 -- minimal checking
-1 -- normal checking
-3 -- maximal checking

-ro open file readonly, no repairs
Greg Smith
2008-09-03 02:28:35 UTC
Permalink
Post by somitcw
- - - snipped - - -
Post by Greg Smith
cckdcdsk -4 is interesting because it tries
to recover track images without any help from
meta-data at all.
- - - snipped - - -
Is switch -4 a secret?
cckdcdsk [-v] [-f] [-level] [-ro] file1 [file2 ...]
-v display version and exit
-f force check even if OPENED bit is on
-0 -- minimal checking
-1 -- normal checking
-3 -- maximal checking
-ro open file readonly, no repairs
Yeah, it is ;-) Maybe because I couldn't always get it to work right...
maybe because of the bug that Tony discovered. So maybe it won't be
any more.

Greg
Greg Smith
2008-09-03 02:54:52 UTC
Permalink
Post by somitcw
Is switch -4 a secret?
Yeah, it is ;-) Maybe because ...
Hmmphhh. Upon reflection, it may be because I'm a lazy programmer. I
write a cool piece of code but can't be bothered to doc it... Wait a
minute, looks like I did update the html file but not the usage code.

SIGH

OK, thanks for that!

Greg
Tony Harminc
2008-09-03 16:29:53 UTC
Permalink
Post by Greg Smith
I apologize for not responding sooner. I saw your original message
after I returned from a vacation and was thinking I replied to it but
guess I just thought about it and decided that too much time had passed
or something. Let's call it summertime fever :)
The amount I pay you guys for support, and this is what I get... :-)
Post by Greg Smith
An unexpected power-off can be devastating to a cckd file. I'm not a
hardware expert, but I understand that if power is lost while a write to
a host disk is in progress, garbage can be written to the disk, because
the drive may still be reading bits/bytes from the bus after the CPU is
down.
That is/was actually architected that way on S/360 and 370, and I'm
sure there's room for similar trouble on any kind of drive, but I
would guess that pretty much any power supply will keep drive and CPU
going for a few milliseconds. But certainly it's not a highly
predictable state of affairs.
Post by Greg Smith
I've implemented a number of solutions to try to mitigate the problem.
When I first started debugging this problem I kept thinking "this CCKD
and shadow file stuff is really neat, but it obviously wasn't designed
by someone with a background in reliable computing". Then as I
discovered each of the mitigating schemes, I got more and more
impressed, and eventually I completely took back what I had thought at
the beginning.
Post by Greg Smith
One is lazy-write. All outstanding writes are `flushed' at each
`garbage collection' interval which, by default, is every 10 seconds.
I'm still not sure if FSYNC must be specified to make this happen.
Post by Greg Smith
Fifth, there is a fifth meta-data structure and that is free space.
This, I think, didn't make it into the html doc. The code talks about
"new format" free space.
Post by Greg Smith
cckdcdsk -4 is interesting because it tries to recover track images
without any help from meta-data at all.
I started to look hard at -4 because it became clear that my metadata
was unreliable. In particular, several pointers were to valid looking
track data, but for the wrong track. Then I realized that even -4
wasn't recovering those valid-looking track images, and I assumed they
were corrupt as a result of the power failure (as you suggested
above). So I modified the standalone cckddiag program to allow me to
specify an offset and length in the file to try to decompress and
display. (Actually this was trivial, because there is already an
offset/length option which will display the raw data, so it was just a
matter of adding an option to try to uncompress, and format the
results.) In every case I looked at, cckddiag uncompressed and
displayed good looking data. As you know, cckddiag is modeled on
cckdcdsk, so I had to try to figure out why one worked and the other
failed on the same data, and nearly identical calls. A mess of
debugging statements later, it was clear that the compressed data was
identical, and the lengths being passed to uncompress() were
identical, so how could there possibly be any difference? (Well,
actually, as a colleague pointed out, the length in cckddiag is
probably too big by CKDDASD_TRKHDR_SIZE, but cckddiag is the one that
was working!) It also turns out that ZLIB's uncompress() is pretty
forgiving about its input length; it generally uncompresses as best it
can, and doesn't complain if the input length is too high, and even if
it's too low, it still returns what it can, and complains with a
Z_DATA_ERROR rather than the Z_BUF_ERROR I was getting. I read some of
the ZLIB code, and there are some more subtle reasons for a
Z_BUF_ERROR that are not as simple as "your output buffer is too
small", but since the input data was identical, I couldn't see any of
those being the problem. So really cckddiag was the clue; it uses a
size_t for its output buffer length, vs the int that cckdcdsk uses.
That resolved the problem, and only later I realized that cckddasd.c
uses an unsigned long, which is probably correct.

I don't know anything about how library parameters and arguments are
resolved in a mixed mode environment like Linux on I-64. I suppose the
call by value ones pretty much look after themselves, and probably it
was just that (void *) that allowed this one to sneak by. The
uncompress() routine in ZLIB itself declares the parameter as "uLong",
which presumably is the same as "unsigned long".

Sigh - my c programming days are mostly far behind me...
Post by Greg Smith
But you know all this already, and I am super impressed ;-)
I just read the code; you designed and wrote it.
Post by Greg Smith
I am also impressed by you going the extra mile.
I really wanted that data back, and more important, to know *why* I
couldn't have it.
Post by Greg Smith
If you can give me a `diff -u cckdutil.c cckdutil.c.fixed' I will try to
make the necessary code changes to properly get rid of all the warnings
etc and will give you full credit for the fix if you wish.
It's a one-word change from "int" to, I guess, "unsigned long".
Probably the cast to void * should be removed to let the compiler warn
about similar problems in future. All the debugging prints I added are
just clutter, but I think it would be very nice to have a summary of
what was recovered and how, and what was tried and failed. Controlled
by yet another option, of course. :-) Maybe I will try to add that.

I will clean up and send you the changes to cckddiag; I think those
can be quite handy when working at this level of diagnosis. That
module is flagged "James M. Morrison 2003", which is a name I
remember, but can't quite place.
Post by Greg Smith
The `subtle' comment above bothers me a bit. If you can critique the
code a bit to make it less subtle, it would be appreciated. Don't worry
about hurting my feelings, I write so much code that code I wrote a year
or two ago is as foreign to me as code written by someone else.
I meant subtle in the best possible way! The thing that stumped me the
longest was where it scans forward for the *next* potential track
header, and then uses the difference between the current one and that
to try to decompress. Only after trying each possible difference does
it go back for one last-ditch try with every length starting at 13 and
working up to the track size. Since the uncompress() was failing, it
always fell through into trying all possible lengths, and that threw
me. I did add a few block comments at the front of this area as I
worked through it.
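
For anyone following along, the shape of that logic as I understood it
(my own paraphrase with made-up names; the two helpers stand in for the
real header and uncompress checks):

#include <stddef.h>

extern int looks_like_trkhdr(const unsigned char *p);       /* placeholder */
extern int valid_trk(const unsigned char *img, size_t len); /* placeholder */

size_t find_trk_len(const unsigned char *img, size_t avail, size_t trksz)
{
    /* pass 1: every plausible *next* track header gives a candidate
       length -- the distance from this image's start to that header.   */
    for (size_t i = 13; i < avail && i <= trksz; i++)
        if (looks_like_trkhdr(img + i) && valid_trk(img, i))
            return i;

    /* pass 2, last ditch: try every length from the smallest possible
       image (13 bytes) up to the full track size.                      */
    for (size_t len = 13; len <= trksz && len <= avail; len++)
        if (valid_trk(img, len))
            return len;

    return 0;    /* nothing recoverable at this offset */
}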

I believe this is a 64-bit problem only, and even that may depend on
the architecture and OS/library. But I'm also not sure that it won't
occur with any level of recovery; in other words, it is not necessary
to specify -4 to provoke the failure to recover data that is there. Why it
ever works in this environment is a bit of a mystery. I think it
depends on the content of the word following the int on the stack, and
I don't want to bother thinking through all the little-endian evilness
that might allow that to sometimes work.

Tony H.
Tony Harminc
2008-09-03 17:46:31 UTC
Permalink
Oh yeah - one other problem I encountered...

In some cases cckdcdsk "over-recovers", that is, it finds track images
that aren't really there. This is because it is willing to accept
images compressed with any supported scheme, and this includes none.
Finding a bogus image somewhere in the free space that translates into
a null track 0 is not too hard, and causes no end of problems.

The quick fix is to make sure that only track images with a
compression type that matches that in the cckd header are used:

Is there any legitimate case where there are images with mixed
compression types in a file? Stress writes just change the compression
parameters, don't they? Or can they change to no compression?

Tony H.
Tony Harminc
2008-09-03 23:42:30 UTC
Permalink
2008/9/3 Tony Harminc <tharminc-***@public.gmane.org>:

Talking to myself again...
Post by Tony Harminc
Oh yeah - one other problem I encountered...
In some cases cckdcdsk "over-recovers", that is, it finds track images
that aren't really there. This is because it is willing to accept
images compressed with any supported scheme, and this includes none.
Finding a bogus image somewhere in the free space that translates into
a null track 0 is not too hard, and causes no end of problems.
The quick fix is to make sure that only track images with a
Is there any legitimate case where there are images with mixed
compression types in a file? Stress writes just change the compression
parameters, don't they? Or can they change to no compression?
Well... one more iteration. Yes, it seems that all three compression
schemes can coexist in a single shadow file, so the above fix lost
more data than it corrected. I was surprised to find ZLIB and BZ2 in
the same file, but I think this comes about if the base and shadow
files have differing compression.

Anyway - the specific problem here is that recovery was finding a
bogus track 0 that was actually part of an L2 table (the
FFFFFFFFFFFFFFFF is both a null L2 entry and a plausible EOF record in
an uncompressed track 0.) A little Turing test there; *I* could see at
a glance that it wasn't a track image, because it was in the middle of
a table, but it's hard to specify a clear rule for this. So for now I
made an arbitrary change to ignore a track image during recovery if it
is uncompressed, and track 0, and has no user (i.e. > R0) records.
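
In code terms the change amounts to something like this (my paraphrase,
with made-up names, of the test I added to the recovery path):

/* Skip a recovered candidate if it is uncompressed, is track 0, and has
 * no user (beyond R0) records -- such an image is more likely free-space
 * debris than the real volume label track. */
static int reject_candidate(int comp_type, int trk, int user_records)
{
    return comp_type == 0      /* no compression */
        && trk == 0
        && user_records == 0;  /* nothing beyond R0 */
}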

Tony H.
Greg Smith
2008-09-04 02:29:24 UTC
Permalink
Post by Tony Harminc
Talking to myself again...
Post by Tony Harminc
Oh yeah - one other problem I encountered...
In some cases cckdcdsk "over-recovers", that is, it finds track images
that aren't really there. This is because it is willing to accept
images compressed with any supported scheme, and this includes none.
Finding a bogus image somewhere in the free space that translates into
a null track 0 is not too hard, and causes no end of problems.
Yeah, that's what I thought the original problem was about. I think I
need to add a `confidence level' for recovered tracks. For example, a
possible track 0 with 5 bytes of zeroes for the HA and 8 bytes of zero
for r0 and 8 bytes of 0xff's should have a lower confidence level than a
possible track 0 with VOL1 in r1.
Post by Tony Harminc
Post by Tony Harminc
The quick fix is to make sure that only track images with a
Is there any legitimate case where there are images with mixed
compression types in a file? Stress writes just change the compression
parameters, don't they? Or can they change to no compression?
Well... one more iteration. Yes, it seems that all three compression
schemes can coexist in a single shadow file, so the above fix lost
more data than it corrected. I was surprised to find ZLIB and BZ2 in
the same file, but I think this comes about if the base and shadow
files have differing compression.
Let's assume all three compression levels (none, zlib, bz2) are
supported ... let's face it, today they are. If your dasd file wasn't
init'ed with -bz2 then it won't have bz2 compressed tracks in it. Let's
suppose it was.

In this case, you can have all three types of compression in the file,
depending on the `stress'. I forget exactly what stress is but guess it
has something to do with the usage of the cache etc. Then tracks could
be compressed using zlib instead of bz2 (because in my subjective
opinion zlib compresses faster but not as well as bz2) or not compressed
at all depending on the size of the track to be written.

So, yeah, it's not unusual to have a cckd file with all three levels of
compression.
Post by Tony Harminc
Anyway - the specific problem here is that recovery was finding a
bogus track 0 that was actually part of an L2 table (the
FFFFFFFFFFFFFFFF is both a null L2 entry and a plausible EOF record in
an uncompressed track 0.) A little Turing test there; *I* could see at
a glance that it wasn't a track image, because it was in the middle of
a table, but it's hard to specify a clear rule for this. So for now I
made an arbitrary change to ignore a track image during recovery if it
is uncompressed, and track 0, and has no user (i.e. > R0) records.
I understand. I'll think about the confidence level thing. My thinking
is, the more complicated a possible track 0 is, the more probable it is
*the* track 0.

Greg
Tony Harminc
2008-09-04 16:55:52 UTC
Permalink
Greg Smith
2008-09-04 22:54:00 UTC
Permalink
Post by Tony Harminc
Post by Greg Smith
I understand. I'll think about the confidence level thing. My thinking
is, the more complicated a possible track 0 is, the more probable it is
*the* track 0.
If there *is* a track 0. And all this applies to other tracks too,
though the probability of finding a bogus one decreases as the CC and
HH go up.
Had to think on this one for a while (uh-oh). There's a solution to the
problem but a further problem is exposed for cckdcdsk -4.

cckd has this concept of `null tracks'. There are three flavors of null
tracks, type 0, 1 and 2.

Type 0 (the original null track) is
HA 5 bytes
std R0 16 bytes
null R1 8 bytes (count with 0 kl and 0 dl)
EOT 8 bytes (0xffffffffffffffff)
for 37 bytes. This is an empty track with an MVS style EOF
(end-of-file) marker.

Turns out that the dasd utility, dasdinit, which predates cckd, writes
null tracks (type 1) in the form
HA 5 bytes
std R0 16 bytes
EOT 8 bytes (0xffffffffffffffff)
for 29 bytes. When it was pointed out that dasdinit writes one type of
null track for ckd and another type for cckd, support was added for both
types of null tracks in cckd. The level 2 table len field identifies
the type of null track (if the len field is too small to be a real track
image, then its value indicates which type of null track it is).

Later a type 2 null track was added for linux support. If a dasdinit is
run with the -linux option then that dasd volume does not have to be
formatted for linux.

What this means is that cckd_write_trkimg does not write type 0 or type
1 track images. Rather, that is indicated in the meta-data (l2 len
field). In turn, any uncompressed track image must be greater than 37
bytes.
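
Put another way (a small sketch with illustrative names, not the actual
source):

/* A null track is never written as a track image; it lives only in the
 * level 2 entry.  If the l2 len is too small to be a real track image,
 * its value *is* the null track type. */
enum null_trk {
    NULLTRK_FMT0 = 0,   /* HA + std R0 + null R1 + EOT  = 37 bytes       */
    NULLTRK_FMT1 = 1,   /* HA + std R0 + EOT            = 29 bytes       */
    NULLTRK_FMT2 = 2,   /* 3390 only: 12 x 4K zero user records (-linux) */
    NULLTRK_NONE = -1   /* l2 entry points at a real track image         */
};

static enum null_trk l2_null_type(unsigned int l2_len)
{
    return l2_len <= 2 ? (enum null_trk)l2_len : NULLTRK_NONE;
}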

The good news is that this fixes your bogus track 0 problem.

The bad news is that there is no way to recover an MVS EOF if the EOF
record is the r1 record on the track ... that's indicated by meta-data
alone and cckdcdsk -4 ignores the meta-data (or tries to).

For a base file where type 0 is the default null trk type this isn't a
problem. For a shadow file, this will percolate up further and you may
have unwanted data at the end of your file.

Anyway, I have updated cckdutil.c with this change, and also with, I
hope, the length issues for 64bit. Should be available in tonight's
snapshot or currently available in cvs.

Greg
Tony Harminc
2008-09-08 15:34:07 UTC
Permalink
2008/9/4 Greg Smith <gsmith-***@public.gmane.org>:

Post by Greg Smith
[...]
There are three flavors of null tracks, type 0, 1 and 2.
[...]
Later a type 2 null track was added for linux support. If a dasdinit is
run with the -linux option then that dasd volume does not have to be
formatted for linux.
What does type 2 look like?
Post by Greg Smith
The bad news is that there is no way to recover an MVS EOF if the EOF
record is the r1 record on the track ... that's indicated by meta-data
alone and cckdcdsk -4 ignores the meta-data (or tries to).
Aha! I believe I have seen this in a small number of cases.
Post by Greg Smith
For a base file where type 0 is the default null trk type this isn't a
problem. For a shadow file, this will percolate up further and you may
have unwanted data at the end of your file.
So I suppose using cckdcdsk -4 really is a last resort. But in my case
the metadata seemed to be older than, or at least inconsistent with,
the track data.
Post by Greg Smith
Anyway, I have updated cckdutil.c with this change, and also with, I
hope, the length issues for 64bit. Should be available in tonight's
snapshot or currently available in cvs.
Thank you! I will give it a go.

Tony H.
Greg Smith
2008-09-08 21:32:20 UTC
Permalink
Post by Tony Harminc
[...]
There are three flavors of null tracks, type 0, 1 and 2.
[...]
Later a type 2 null track was added for linux support. If a dasdinit is
run with the -linux option then that dasd volume does not have to be
formatted for linux.
What does type 2 look like?
Type 2 is for 3390 only. It's 12 4K user records of zeroes and the CCHH
of the count fields matches the CCHH of r0 (meaning, I think, it doesn't
work for vm mini-disks).
Post by Tony Harminc
The bad news is that there is no way to recover an MVS EOF if the EOF
record is the r1 record on the track ... that's indicated by meta-data
alone and cckdcdsk -4 ignores the meta-data (or tries to).
Aha! I believe I have seen this in a small number of cases.
For a base file where type 0 is the default null trk type this isn't a
problem. For a shadow file, this will percolate up further and you may
have unwanted data at the end of your file.
So I suppose using cckdcdsk -4 really is a last resort. But in my case
the metadata seemed to be older than, or at least inconsistent with,
the track data.
Yes, it should be used as a last resort. I realized while recoding the
chkdsk utility that track images could be recovered in this way but, I
confess, I didn't think about the null track issue.

One possible explanation for the inconsistencies is that cckdcdsk -4
arbitrarily uses the first valid track image it finds for a track, that
is, the track image closest to the beginning of the file. This is
probably not correct for all tracks.
Post by Tony Harminc
Anyway, I have updated cckdutil.c with this change, and also with, I
hope, the length issues for 64bit. Should be available in tonight's
snapshot or currently available in cvs.
Thank you! I will give it a go.
Greg

BruceTSmith
2008-09-04 04:34:26 UTC
Permalink
Let's look at what happens on real iron when you have a power failure...

Modern tech, like the SHARK disk array (ESS), does have a battery
backup. It also has a huge cache. Everything to/from the channels goes
to cache first. When the hardware "has time" it does the actual read /
write to the drives. A side note relative to this thread, the SHARK
does do data compression. You always see uncompressed data on the
channels, but it is compressed on the drives. This is part of the
reason for such a big cache, the hardware needs time to do the
compression AND the physical IO...

When you have a power failure, the SHARK keeps the drives and cache
powered up long enough to finish any physical writes that were in
progress. This guarantees track integrity. It then powers down the
drives, but it KEEPS THE CACHE ON BACKUP POWER (for 4-5 days IIRC).
When power is restored, it finishes any pending writes.

This is pretty fancy, but it wasn't always the case. When you had a
power failure on earlier drives, you would end up with, well, junk. So
how did we get around this problem in the old days?

Let's look at an on-line system, like CICS. What if the power failure
wrote a track of crap in the middle of our 18 zillion record on-line
master file? We can't have that... So we do something that works like
this...

Anytime you do an "update" operation on a disk file, the system
records the update data in a work file. It guarantees that this data
will be physically written to the drive before any actual changes are
applied to the "master file". It then sets, and records, an "update in
progress" flag. If you have a power failure during the update, when
the system restarts, after power is restored, it checks the "in
progress" flag, and if true, fetches the new data from the work file,
and RE-RECORDS it, which again guarantees track integrity...
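
Roughly, in pseudo-C (a generic sketch of the scheme, not any particular
product's code, with error handling omitted):

#include <sys/types.h>
#include <unistd.h>

/* Force the new data and an "update in progress" flag to the work file
 * before touching the master file, so a restart can always re-apply the
 * update from the work file. */
static void durable_write(int fd, off_t off, const void *buf, size_t len)
{
    pwrite(fd, buf, len, off);
    fsync(fd);                   /* make sure it is really on the disk  */
}

void apply_update(int workfd, int masterfd, off_t rec_off,
                  const char *newdata, size_t len)
{
    char flag = 1;
    durable_write(workfd, sizeof(flag), newdata, len);  /* 1. log data   */
    durable_write(workfd, 0, &flag, sizeof(flag));      /* 2. in-progress*/
    durable_write(masterfd, rec_off, newdata, len);     /* 3. real update*/
    flag = 0;
    durable_write(workfd, 0, &flag, sizeof(flag));      /* 4. done       */
}
/* On restart: if the flag byte is still 1, re-read the logged data and
 * redo step 3 -- re-recording the track restores its integrity even if
 * power failed mid-write. */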

On a side note, with all due respect to those that have contributed to
CCKD, is it really worth it, considering today's PC tech? I mean,
terabyte drives are mainstream today. Yeah, they're still 200-300 bucks,
but in a few months, we'll be looking at $100. A terabyte is, what,
20 x 3390-54s? :)

It is a great idea, but the days of 40G drives are long gone. Is it
worth the overhead???

Go ahead, flame away, my asbestos overalls have oak leaf clusters...
:) :) :)

BTS...
Post by Tony Harminc
I'm trying to understand how a host power failure can affect CCKD
write processing...
Tony Harminc
2008-09-04 16:35:08 UTC
Permalink
Post by BruceTSmith
Let's look at an on-line system, like CICS. What if the power failure
wrote a track of crap in the middle of our 18 zillion record on-line
master file? We can't have that... So we do something that works like
this...
Anytime you do an "update" operation on a disk file, the system
records the update data in a work file. It guarantees that this data
will be physically written to the drive before any actual changes are
applied to the "master file". It then sets, and records, an "update in
progress" flag. If you have a power failure during the update, when
the system restarts, after power is restored, it checks the "in
progress" flag, and if true, fetches the new data from the work file,
and RE-RECORDS it, which again guarantees track integrity...
Been there, done that. The problem, of course, is that the work file
is just as subject to corruption as is the master file. Way back when,
IMS used to use tape for the log file, because the behaviour on power
failure is simple and well understood. And further, in the days of
real core memory, IMS would be able to recover and redo data from a
standalone dump. (Well for all I know it still can.)

The problem in the instant case of Hercules CCKD recovery is that the
layers below the CCKD file format are un- or at least ill-specified,
since there are multiple operating systems with multiple file systems
in use. So failure may occur in ways much more subtle than "we
corrupted the track being written when the power went out but all else
is OK". The OS's cacheing scheme introduces unpredictability, as do
the various possibilities at the drive level. The CCKD scheme actually
does a remarkably good job of keeping things consistent given all of
the above.
Post by BruceTSmith
On a side note, with all due respect to those that have contributed to
CCKD, is it really worth it, considering today's PC tech? I mean,
terabyte drives are mainstream today. Yeah, they're still 200-300 bucks,
but in a few months, we'll be looking at $100. A terabyte is, what,
20 x 3390-54s? :)
It is a great idea, but the days of 40G drives are long gone. Is it
worth the overhead???
In general, space saving has never been the only reason to compress
files. There is a tradeoff between CPU time used in compression, and
elapsed time saved in data transfer. This is a hard thing to evaluate,
and tends to swing back and forth as technology changes. Compressing
generally takes more CPU time than decompressing, but with cacheing,
the compression CPU time is not so time critical, so it may come down
to the speed of the decompressor vs the I/O bandwidth.

But in the case of CCKD, my primary reason for using it is the shadow
files. This is a brilliant invention that makes the Herc life flexible
and easy. Well why not just use CCKD without compression? It can be
argued that the compression acts as a very good checksum on the track
images. To be sure, there are weaknesses (as I've found out just
recently), but mostly it works very well. If CCKD had an explicit
checksum on track images, I might well not use compression. But even
calculating a checksum isn't free...
Post by BruceTSmith
Go ahead, flame away, my asbestos overalls have oak leaf clusters...
:) :) :)
No flames from me on this one.

Tony H.
Roger Bowler
2008-09-04 21:07:00 UTC
Permalink
Post by Tony Harminc
But in the case of CCKD, my primary reason for using it is the shadow
files. This is a brilliant invention that makes the Herc life flexible
So brilliant, indeed, that some *!§*µù of a *%ù£^^ has even tried to
patent the idea!

http://www.google.com/patents?id=_w-kAAAAEBAJ

You've got to admit, it takes some nerve to claim this as an original
invention. Hercules was using base and shadow files some 6 years before
this patent application was filed, and I don't suppose Hercules was the
first to have implemented the idea.
--
Cordialement,
Roger Bowler

roger.bowler-***@public.gmane.org
http://perso.wanadoo.fr/rbowler
Hercules "the people's mainframe"
Tony Harminc
2008-09-04 22:22:41 UTC
Permalink
Post by Roger Bowler
Post by Tony Harminc
But in the case of CCKD, my primary reason for using it is the shadow
files. This is a brilliant invention that makes the Herc life flexible
So brilliant, indeed, that some *!§*µù of a *%ù£^^ has even tried to
patent the idea!
http://www.google.com/patents?id=_w-kAAAAEBAJ
Wow! It's Jim Bergsten, developer of all kinds of well known VM stuff
from long, long ago. He later worked for Gene Amdahl's company Andor,
and various other storage places. No dummy, and an early open source
advocate. I'm surprised at the apparent lack of checking of the prior
art.
Post by Roger Bowler
You've got to admit, it takes some nerve to claim this as an original
invention. Hercules was using base and shadow files some 6 years before
this patent application was filed, and I don't suppose Hercules was the
first to have implemented the idea.
It'd be one thing if this was a strictly hardware vendor type of
storage guy with no contact with the more general software community.
But he's a mainframe and PC software guy from way back. Unless there
are two James R. Bergstens in Danville, California...

Tony H.
Greg Smith
2008-09-04 23:34:57 UTC
Permalink
Post by Tony Harminc
Post by BruceTSmith
It is a great idea, but the days of 40G drives are long gone. Is it
worth the overhead???
In general, space saving has never been the only reason to compress
files. There is a tradeoff between CPU time used in compression, and
elapsed time saved in data transfer. This is a hard thing to evaluate,
and tends to swing back and forth as technology changes. Compressing
generally takes more CPU time than decompressing, but with cacheing,
the compression CPU time is not so time critical, so it may come down
to the speed of the decompressor vs the I/O bandwidth.
But in the case of CCKD, my primary reason for using it is the shadow
files. This is a brilliant invention that makes the Herc life flexible
and easy. Well why not just use CCKD without compression? It can be
argued that the compression acts as a very good checksum on the track
images. To be sure, there are weaknesses (as I've found out just
recently), but mostly it works very well. If CCKD had an explicit
checksum on track images, I might well not use compression. But even
calculating a checksum isn't free...
I wish I could take credit for the idea but I can't. It was suggested
by Malcolm Beattie and I just sort of ran with it:

http://tech.groups.yahoo.com/group/hercules-390/message/6695

When I first started on cckd, 20G hard drives were large and expensive.
So the original motive was to reduce the size of those 3390-3 files
(which had to be broken into pieces since files > 2G weren't well
supported then).

The way things accumulate, I probably don't have enough room to make all
my disks uncompressed even though I have something like a half terabyte
available.

As Tony says, it is possible to get better throughput using compressed
dasd because less data has to be read/written. I generally keep my base
files on a usb attached drive so that helps there. Compression is
indeed more expensive than decompression so that's why updated tracks
are `flushed' every 10 seconds (ie lazy write).

I'm not sure that compression/decompression adds an integrity factor to
the validity of a CKD track (versus an FBA block-group) because a lot of
things have to go just right for a CKD track to be valid. That is, a
count field has to indicate an offset to another count field (and the
count field has to have some sanity to it) and so on until the
end-of-track marker is reached which should be the last 8 bytes.
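
Something like this walk (a simplified sketch of my own, skipping the
per-count-field sanity checks) has to succeed end to end:

#include <stddef.h>
#include <string.h>

/* Starting after the 5-byte track header, each 8-byte count field
 * (CC CC HH HH R KL DL DL) says how far away the next one is, and the
 * walk has to land exactly on the 8-byte end-of-track marker, which must
 * also be the last 8 bytes of the image. */
static int ckd_trk_walk_ok(const unsigned char *trk, size_t len)
{
    static const unsigned char eot[8] =
        { 0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF };
    size_t pos = 5;                          /* skip HA / track header   */

    while (pos + 8 <= len) {
        const unsigned char *cnt = trk + pos;
        if (memcmp(cnt, eot, sizeof(eot)) == 0)
            return pos + 8 == len;           /* EOT must end the image   */
        size_t kl = cnt[5];                  /* key length               */
        size_t dl = ((size_t)cnt[6] << 8) | cnt[7];  /* data length      */
        pos += 8 + kl + dl;                  /* on to the next count     */
    }
    return 0;                                /* fell off the end         */
}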

Greg