Post by Greg Smith
Don't get me wrong, but are you seriously arguing that having a cache
between disparate storage media is due to lazy programming ?? Why
do processors have a l1 and l2 cache ?? and disk control units a cache ??
No, that's not what I'm arguing. I think we are looking at this from
different angles.
It's a well-known system design principle that caching improves the
parallelism achievable in a system by helping to decouple
asynchronous subsystems with different price/performance tradeoffs.
You're talking about something that is fundamental to the design of
virtually all nontrivial computing systems.
The system gets probabilistic efficiency gains due to the fact that
references tend to be localized, and also due to the fact that
hardware accesses often involve multiple physical actions that must
be performed, the inertia of the actual mechanisms involved, and so
forth. These are all system design considerations. In fact, the
mainframe seems to be about the only system left that still tries to
optimize the physical hardware work (by reducing actuator motions
that must be carried out, etc.). Most other systems have "evolved" to
see the hardware as an abstract concept.
But I digress. My points are:
1. Caching does not obviate the need for I/O performance. It doesn't
even really reduce the need for I/O performance. The gains made by
caching can't generally be made by gains in I/O performance, and vice
versa. Caching is an important system consideration that is more or
less orthogonal to I/O performance.
2. Caching is a price/performance tradeoff that has a point of
diminishing returns. At some point, adding more cache costs more than
it provides. That suggests that there is a "right" amount of cache in
a system, beyond which it is a waste of resources to add more (a rough
illustration follows these points).
Finding the "right" point is a very complex and difficult task, which
is why there are system programmers who specialize in performance
management.
3. Caching is a system problem, not an application problem.
Applications should not do elaborate caching (beyond basic
buffering). Mainframe applications should minimize their impact on
system resources by using as little main storage as possible,
generally just for state data that has tight locality of
reference, and keep the data they operate on in data sets. System
programmers or other users can decide whether those data sets should
reside on tape, DASD, hiperspace, main storage (VIO), etc.
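To put rough numbers on the diminishing returns in point 2 (made-up
figures, purely for illustration): suppose a cache hit costs 0.1 ms and
a miss costs 10 ms. Going from no cache to a 90% hit rate cuts the
average access from 10 ms to about 1.1 ms, roughly a nine-fold gain.
Doubling the cache might push the hit rate to 95%, which only gets you
from 1.1 ms to about 0.6 ms, and the next doubling buys even less, even
though each increment costs as much as the last.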
I believe that this discussion began with somebody saying that a good
way to improve I/O performance is to have a lot of cache RAM so as to
avoid doing any I/O at all. My response was meant to be something to
the effect of this: That's not really a good way to improve I/O
performance. It is simply a way to use more expensive hardware that
is faster instead of using cheaper hardware that is slower.
In other words, given a choice between a PC system with 256 MB of RAM
and fifteen 4 GB SCSI drives, and a PC system with 8 GB RAM and a single
60 GB IDE drive, I think the former would be capable of much greater
mainframe workloads with Hercules, considering that most commercial
workloads have a comparatively low reference locality and tend to be
I/O bound. I think the 15-fold increase in I/O parallelism buys more
scalability than the 16-fold increase in RAM.
When I say "greater workloads", I am talking about in a scenario
where the machine is doing many things at once.
To illustrate:
Imagine a job that processes 1 GB of data stored in a dataset, and
stores the 1 GB result in another dataset. Both datasets are
permanent. The job processes the records sequentially.
No matter how much caching the system can do, it is still necessary
to read 1 GB of data from DASD and ultimately to write 1 GB of data
back to DASD. The latter might happen in a "lazy writeback" system,
but it must still be done in order to ensure the data's consistency
if the system should suddenly crash or lose power. If you only ever
ran this one job on the system, the second time you ran it you would
avoid the need to read the data, but not the need to write the data.
Run the job cold on both of those systems. It may complete sooner on
the one with lots of memory, but it's not really complete because the
system still must write back all of the cached data to disk.
Ultimately, the systems perform somewhere pretty close to equally
with that job.
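To put a rough number on that (ballpark only): at something like the
20 MB/s a commodity drive sustains, reading 1 GB takes on the order of
50 seconds, and writing the 1 GB result takes about another 50. Cache
can reorder or postpone that roughly 100 seconds of drive work, but it
cannot make it go away.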
Now imagine you have 15 jobs like that. Each of them reads a
*different* 1 GB of data, and each of them produces a separate
dataset with 1 GB of data. Running in isolation, no job runs at
greater than 5% CPU utilization.
Run all 15 jobs at once on the system with 1 disk drive. The system
must allocate 30 datasets on the same drive (maybe different volumes,
but the same physical drive). Running 15 jobs at once has reduced the
amount of memory available for caching as well. The system must still
read in 15 GB of data, but it must all come from the one drive. The
single drive bottleneck means the CPU cannot be utilized to its full
potential (which should be 75% in this case). Caching cannot improve
this, because all of that data must be brought in from DASD before it
is in the cache, and we are only going to read it once. Once all of
the jobs finish generating their result sets, 15 GB of data must be
written back to the single drive. Caching can't improve that either,
it can only delay it. Even though the system has 16 times as much
memory available, it will still take at least 15 times as long to
complete all 15 jobs as it would have to complete one due to the fact
that they are all sharing a single drive. Matters are likely made
worse by the fact that the drive actuator is thrashing more.
Run them on the system with 15 drives, with a different drive per
job. Each drive has two datasets allocated to it. On this system, CPU
utilization goes right up to 75%, and each job utilizes a single
drive to the same extent that it would have if it were the only job
running on the system. All 15 jobs complete in the same amount of
time that it would have taken to complete only one of them.
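Rough arithmetic, again assuming something like 20 MB/s per drive: the
15 jobs together move about 30 GB. Funneled through one drive, that is
at least 1500 seconds of serialized drive time (30,000 MB / 20 MB/s)
before counting the extra seeking, so the batch cannot finish in much
under 25 minutes no matter how large the cache is. Spread over 15
drives, each spindle moves about 2 GB, roughly 100 seconds of work, and
the drives run in parallel, so the whole batch finishes in about the
time a single job takes.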
But what if you preloaded all of the data and locked it in memory
(e.g. in hiperspaces or some such)? What if you used VIO for the
output data sets instead of DASD? Well, then you're just using more
expensive storage to do the job. Yes, any single job will run faster
if you throw more money at it. But if your goal is to put together a
multi-purpose system with the idea that it should be able to do as
much work as possible for your money (i.e. "bang for the buck"), a
high-performance DASD subsystem is a whole lot more cost effective
than a bunch of RAM. Also consider that a regular PC can't really
address 8 GB of RAM. It runs into a limitation at 4 GB that requires
special hardware to surpass, costing even more.
I tend to view it like this: If I have a single job that is I/O
bound, and it completes in an acceptable amount of time using DASD
I/O, then I can run some large number of those jobs in parallel on
the same system as long as I have enough drives. Each will still
complete in about the same time it would have if nothing else were
happening on the system. As long as that number is acceptable, the
system is scalable in a way that is much more deterministic (i.e.
guaranteed) than trying to throw a lot of RAM at the problem. It's
not a question of trying to get one job to run as quickly as
possible. It's a question of trying to get the most possible work out
of the system.
Post by Greg Smith
I'm a bottom-up type of programmer. If my choice is coding read()/
write() or performing a search on some in-storage array that might
already have my data, then I'll burn the cpu to search the array as long
as the ratio of disk access time vs cpu time is great enough.
So would I. But the choice of whether to have all of the data in an
array in the first place is a higher level design decision. When
processing data of some arbitrary size, do I dynamically allocate a
big buffer, pull it in from disk, and then do a bunch of work on it
in memory, or do I seek around the data on disk and do the work on
small chunks of it brought into fixed sized buffers? The former
trades machine resources to get speed. It will execute faster, and it
will take more memory. If the data set is truly arbitrarily sized,
then that makes it much worse because the memory usage of the program
is open-ended, meaning its worst case usage cannot be predicted at
design time. Neither choice is always right, but it should be
considered seriously at design time with an eye to the tradeoffs
involved. If the latter approach allows the program to complete in an
acceptable period of time, then it is probably a much better approach
since it makes more efficient use of machine resources.
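To make the tradeoff concrete, here is a minimal sketch in C (my own
illustration, not code from any real product). Both routines copy a file
and leave room for a transformation step; one works through a fixed
64 KB buffer, the other pulls the whole input into a single malloc'd
buffer whose size depends on the data. The names and sizes are
arbitrary.

/* Chunked vs. slurped processing: a sketch of the memory/I-O tradeoff. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK (64 * 1024)   /* fixed working buffer: bounded memory use */

/* Process the data a chunk at a time: more I/O calls, constant memory. */
static int copy_chunked(FILE *in, FILE *out)
{
    char buf[CHUNK];
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        /* ... transform buf[0..n) in place here ... */
        if (fwrite(buf, 1, n, out) != n)
            return -1;
    }
    return ferror(in) ? -1 : 0;
}

/* Slurp everything into one dynamically sized buffer: fewer I/O calls,
   but memory use grows with the input, so the worst case is open-ended. */
static int copy_slurped(FILE *in, FILE *out, size_t size)
{
    char *buf = malloc(size);
    if (buf == NULL)
        return -1;

    if (fread(buf, 1, size, in) != size)   { free(buf); return -1; }
    /* ... transform buf[0..size) in place here ... */
    if (fwrite(buf, 1, size, out) != size) { free(buf); return -1; }

    free(buf);
    return 0;
}

int main(int argc, char **argv)
{
    FILE *in, *out;
    int rc;

    if (argc < 3) {
        fprintf(stderr, "usage: %s infile outfile [-slurp]\n", argv[0]);
        return 1;
    }
    if ((in = fopen(argv[1], "rb")) == NULL ||
        (out = fopen(argv[2], "wb")) == NULL) {
        perror("fopen");
        return 1;
    }
    if (argc > 3 && strcmp(argv[3], "-slurp") == 0) {
        long size;
        fseek(in, 0, SEEK_END);      /* find the input size ...         */
        size = ftell(in);
        fseek(in, 0, SEEK_SET);      /* ... then rewind and read it all */
        rc = copy_slurped(in, out, (size_t)size);
    } else {
        rc = copy_chunked(in, out);
    }
    fclose(in);
    fclose(out);
    return rc == 0 ? 0 : 1;
}

Run it both ways on the same large input and watch the process's working
set: the chunked version stays flat no matter how big the file gets,
while the slurped version grows with it.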
Post by Greg Smith
I bought
1G memory for my 3 yr old dual piii 850mhz machine a while ago for
130usd. I don't consider that *that* expensive.
Memory is expensive in many ways. Its cost per byte is still much
greater than disk space. Then there is the fact that the system can
only address a small, finite quantity of it (4 GB). If you want to
use the machine to process more than 4 GB of data at once, some of
that data will have to be in some other storage medium. At that
point, it is a good idea to keep the more important things in memory
and the less important things on disk. If you already have that
discipline in your application programming, then you already have a
system that scales up much bigger. Then there is the locality of
reference issue. Accessing a larger amount of memory at once results
in more cache misses, which dramatically slows the processor's
instruction rate. Cache misses are synchronous hits to CPU execution
(meaning they have to be considered a cost in terms of CPU cycles),
while I/O is always asynchronous. There is the allocator overhead.
Since RAM costs more per byte, it is desirable to use sophisticated
schemes to reduce or eliminate slack space. Those allocation
algorithms tend to have a much greater cost in CPU cycles than DASD
storage management, since it is acceptable to waste more of the
latter in order to reduce CPU usage.
There is also the important point that using a lot of memory does not
change the fact that the data must end up on disk anyway in order to
be in a permanent form, so there is some I/O involved even if you
wanted to have it all in memory all the time.
In general, system designs have evolved along a line of counting main
storage as a relatively small, finite, temporary, relatively
expensive storage medium, with a hierarchy of cheaper and more
permanent, but slower, storage media beyond it.
I think in many places there has been a trend toward programs (and
even system designs) that consider memory to be cheap and disdain I/O
as being expensive, and I think this trend has had a negative impact
on the overall efficiency, cost, and scalability of our systems.
Post by Greg Smith
You are right in the sense that cache shouldn't be blindly applied to
solve a problem. But, it seems, you are making judgement calls against
code that you admittedly haven't even looked at.
Fair enough. But I didn't think we were talking about a specific
piece of code. This discussion began when I observed that I/O system
performance is important to the performance of an emulated mainframe,
and somebody suggested that perhaps having a lot of RAM would be a
better use of your money (when putting together a Hercules system)
than SCSI drives, etc. I only meant to say I disagree with that
statement.
Post by Greg Smith
I don't blindly make
coding decisions. I take measurements, I trace the code, I examine the
assembler. In some complicated tasks, like garbage collection, my
intuition as to what should work best is shown wrong.
I don't know you very well, but just from talking with you I would
tend to assume you are careful and astute. I never meant to suggest
otherwise.
I've seen a lot of people put long hours into profiling and tweaking
something so that its execution time in a vacuum is as short as
possible. I think that's the wrong thing to be profiling and
optimizing in the first place. There is a tradeoff between turnaround
time and resource usage that should be worked until the code in
question uses the least system resources it can for an
acceptable turnaround time in real-world usage. That's a much more
complicated problem than getting it to go as fast as possible in
isolation, but I suggest it is "the stuff" of performance management.
Post by Greg Smith
If you are serious that caching may be misapplied in hercules code
then please cite some examples.
I never meant to suggest that. I was trying to say that if I were to
advise where to put your money into an emulated mainframe system to
get good performance, I'd spend more on the I/O and disk drives than
on the memory. That's for any emulated mainframe, whether Hercules or
FLEX-ES. I haven't looked at the Hercules code, but it seems to work
quite well for the limited amount of stuff I've done with it so far.
Post by Greg Smith
Remember, hercules can run on, eg, linux-390. I can define my emulated
disks to be on a raid0 filesystem that spans multiple volumes across
multiple controllers and chpids. Or everything can be on a `lousy' ide
controller on my pc, which gets, btw, about 20MB/s.
True. Even better, you can define them to be on individual disks. I
am going with IDE RAID for my Hercules box, but I think you'd get
better performance going SCSI with a bunch of smaller drives (say 4-8
GB), and splitting your DASD between them. RAID is a case of taking a
bunch of slow, parallel things and converting them to a single fast,
serial thing. I think they are more advantageous as slow, parallel
things. For example, every drive in the array must seek on every
access in a RAID system. If you split DASD between the drives, a
single program can process data sequentially on a single drive
without a seek between each read or write, and without affecting the
performance of other programs at all. Also, disk units nowadays have
caches and read-ahead logic that works much better when each disk is
dedicated to a small number of tasks.
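Some rough numbers on that point (ballpark figures only): with an
average seek of around 9 ms and a 64 KB stripe, a drive in a shared
striped array that keeps jumping between 15 interleaved sequential
streams spends roughly 9 ms positioning for every ~3 ms of transfer at
20 MB/s, so its effective throughput falls to a small fraction of its
rated speed. A drive serving a single sequential stream seeks only
rarely and stays near its full transfer rate.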
One of my favorite analogies is the laundry. If you were designing a
public laundry facility that could handle 6 customers per hour, would
it be better to have one washer that completes a load in 10 minutes,
or 6 washers, each of which can complete a load in an hour, assuming
the cost is the same either way?
--Dan