Discussion:
OpenVMS Clusters - maximum data redundancy factors
(too old to reply)
IanD
2017-02-08 11:26:42 UTC
Permalink
I've just been watching some video propaganda put out by Nutanix

I have to say it's pretty dam impressive




Looking at how they do their clustering, I was thinking about VMS clustering, especially in relation to how Natanix use erasure encoding instead of RAID, which allows a far more efficient use of disk space and better disk utilization

Then I got to thinking about VMS clusters and while we can have 96 nodes in a cluster, really the maximum number of workable nodes in a cluster that would absolutely guarantee data redundancy from a VMS management perspective is really only 6 nodes, a far cry from 96 (maximum number of disks in a shadow set under 8.4 is 6)

Is 6 nodes really the maximum concrete data redundancy factor in VMS clusters?

Sure, one could replicate SAN's etc but I'm meaning the maximum disk redundancy as seen and managed by VMS

Is there some other combination of disk / cluster members that I am overlooking that would be fully manageable by VMS and would improve on this redundancy?

Since one cannot shadow, a shadow set as such, this to me means that to have 100% data redundancy that is managed fully by VMS, there is little point going beyond a 6 node cluster (each node mounting a shadow set disk member into a clustered virtual shadow set volume)

Is there any point expanding shadowing to go beyond 6 shadow set members or would VMS be better served going down the same path as Nutanix and look at erasure encoding for future disk management and data redundancy? Erasure encoding scales well beyond where raid fears to tread

As VMS grows up and starts to work with larger and larger disk sizes going forward, are concepts like erasure encoding factors under consideration or is the market segment that VMS will be pitched for, not needing this type of data technology and scale?

Looking at Nutanix's offerings, they seem to do a lot of what VMS does with it's cluster server and then some

I wonder how difficult it would be to tease out the VMS cluster manager processes the same way Nutanix does with their CVM processes. VMS might one day be able to deploy it's clustering technology over other non-VMS platforms :-)

Of course, when we see things like huge memory machines then even concepts like Nutanix may be obsoleted but they would be the closest to being able to adapt to such a memory machine that I have seen so far
Phillip Helbig (undress to reply)
2017-02-08 13:12:13 UTC
Permalink
In article <d12ac2c2-39a5-4e8b-b89d-***@googlegroups.com>, IanD
<***@gmail.com> writes:

First, please post short lines (whatever it looks like in your client)
to avoid the automatic base-64 encoding, visible in the equal signs
below where lines have been broken.
Then I got to thinking about VMS clusters and while we can have 96 nodes in=
a cluster, really the maximum number of workable nodes in a cluster that w=
ould absolutely guarantee data redundancy from a VMS management perspective=
is really only 6 nodes, a far cry from 96 (maximum number of disks in a sh=
adow set under 8.4 is 6)
Is 6 nodes really the maximum concrete data redundancy factor in VMS cluste=
rs?
Of course, data redundancy is not the only reason to have many nodes.
Is there some other combination of disk / cluster members that I am overloo=
king that would be fully manageable by VMS and would improve on this redund=
ancy?
Since one cannot shadow, a shadow set as such, this to me means that to hav=
e 100% data redundancy that is managed fully by VMS, there is little point =
going beyond a 6 node cluster (each node mounting a shadow set disk member =
into a clustered virtual shadow set volume)=20
Is there any point expanding shadowing to go beyond 6 shadow set members or=
would VMS be better served going down the same path as Nutanix and look at=
erasure encoding for future disk management and data redundancy? Erasure e=
ncoding scales well beyond where raid fears to tread
I don't see why you need more than 6. In fact, for most cases, 3 is
enough. Of course, each member should be at a different location. If
you lose more than one location, then you probably have even bigger
problems. (Of course, the locations shouldn't be too close together.
There was a "disaster-tolerant" cluster with two sites: one in each
building of the World Trade Center.)
m***@googlemail.com
2017-02-09 02:18:58 UTC
Permalink
On Wednesday, February 8, 2017 at 9:12:14 PM UTC+8, Phillip Helbig (undress >
Post by Phillip Helbig (undress to reply)
I don't see why you need more than 6.
:-(

Scalability, on-demand resource allocation increased resilience DR etc load-balancing. Geographic locality latency reduction.
Phillip Helbig (undress to reply)
2017-02-09 09:42:09 UTC
Permalink
Post by m***@googlemail.com
On Wednesday, February 8, 2017 at 9:12:14 PM UTC+8, Phillip Helbig (undress >
Post by Phillip Helbig (undress to reply)
I don't see why you need more than 6.
:-(
Scalability, on-demand resource allocation increased resilience DR etc
load-balancing. Geographic locality latency reduction.
Yes, but not redundancy (mentioned about half a dozen times in the
original post), which was the original question. Of course, local
access can boost read speeds, but the more members, the slower the write
speeds.
Stephen Hoffman
2017-02-10 22:12:28 UTC
Permalink
Post by m***@googlemail.com
On Wednesday, February 8, 2017 at 9:12:14 PM UTC+8, Phillip Helbig (undress >
Post by Phillip Helbig (undress to reply)
I don't see why you need more than 6.
:-(
Scalability, on-demand resource allocation increased resilience DR etc
load-balancing.
Ayup...

HBVS helps for maintaining availability of disk data, certainly.

HBVS is also helpful for distributing the read I/O load, but not so
good as the write I/O load increases.

All writes have to be written to and complete to all members, and the
acknowledgements have to return to the writer. Your locking activity
also climbs.

As you add disks to your shadowset and as you increase your network
configuration complexity and network distances, your write I/O capacity
correspondingly drops.
Post by m***@googlemail.com
Geographic locality latency reduction.
Which doesn't work well with HBVS, particularly as the write I/O
activity slows processing and increases network load.

HBVS is great and works well and it's still an unusual and useful
feature of OpenVMS, but I'm not at all certain that software RAID is
the path forward here.
--
Pure Personal Opinion | HoffmanLabs LLC
Kerry Main
2017-02-11 14:11:13 UTC
Permalink
-----Original Message-----
Stephen Hoffman via Info-vax
Sent: February 10, 2017 5:12 PM
Subject: Re: [Info-vax] OpenVMS Clusters - maximum data
redundancy factors
Post by m***@googlemail.com
On Wednesday, February 8, 2017 at 9:12:14 PM UTC+8, Phillip Helbig (undress >
Post by Phillip Helbig (undress to reply)
I don't see why you need more than 6.
:-(
Scalability, on-demand resource allocation increased resilience DR etc
load-balancing.
Ayup...
HBVS helps for maintaining availability of disk data, certainly.
HBVS is also helpful for distributing the read I/O load, but not so
good
as the write I/O load increases.
All writes have to be written to and complete to all members, and the
acknowledgements have to return to the writer. Your locking
activity
also climbs.
The same applies to HW RAID and associated sync technologies.
As you add disks to your shadowset and as you increase your network
configuration complexity and network distances, your write I/O
capacity correspondingly drops.
The same applies to HW RAID and associated sync technologies.
Post by m***@googlemail.com
Geographic locality latency reduction.
Which doesn't work well with HBVS, particularly as the write I/O
activity slows processing and increases network load.
The same applies to HW RAID and associated sync technologies.
HBVS is great and works well and it's still an unusual and useful
feature
of OpenVMS, but I'm not at all certain that software RAID is the
path
forward here.
I disagree.

Multi-site data consistency is a critical component of many business
critical solutions today - especially when one requires active-active
multi-site solutions. In todays global world, combined with
Public/Private cloud considerations, the push is increasingly moving
to "always on, always available" solutions.

For data consistency, there are really only 2 options - HW or SW
based. Both are either async (data buffers write to remote) or sync
(writes completes at both sides before considered complete) or some
combination. Regardless of the vendor, both typically have licensing
costs associated with them.

[side note - In a future VSI pricing model, in order to vastly
simplify overall licensing complexity, I would love to see
HBVS/Clustering (+ other components) integrated into an "enterprise"
monthly support OpenVMS model cost.]

Certainly if data consistency and low RPO (recovery point objective)
requirements are NOT a critical part of the App design, then concepts
like "eventual data consistency" (love that concept) and async
replication solutions can be applied. Many distributed app designs use
this "eventual data consistency" model as their only real data
consistency option is async data replication. Typically such an app
can be made to work, but has major challenges when looking at mission
critical multi-site DR options with low RPO requirements.

If data consistency is critical requirement, then HBVS (SW sync) is
absolutely a good option to consider.

Having stated this, and the same is true for HW RAID/Sync solutions,
if the read to write ratio of the App is 50-50 (or worse with higher %
of writes), then the acceptable distance between the sites becomes
much less. This is because the latency delays of the remote writes
becomes a bigger impact on the overall performance.

Can HBVS features be improved - absolutely?

There is overhead for multi-volume HBVS sets, but perhaps a future
option might include this and other cluster DLM traffic being handled
via RoCEV2 (on VSI's latest exploratory technologies road map)?

I would also like to see HBVS features (licensed or purchased or ??)
to include async options as well as the ability to remotely shadow
only sub parts of a disk e.g. perhaps a directory only.

This is what a Third Party Host Based Shadowing Option for OpenVMS
provides: RemoteShadow from Advanced Systems Concepts (km - not sure
if still available?)
https://www.advsyscon.com/en-us/products/remoteshadow/remoteshadow-des
cription.aspx
https://www.advsyscon.com/home/products/rso/pdf/remoteshadow%202006%20
website.pdf

Interesting read on OpenVMS HBVS history:
http://www.hpl.hp.com/hpjournal/dtj/vol6num1/vol6num1art3.pdf

Regards,

Kerry Main
Kerry dot main at starkgaming dot com
Stephen Hoffman
2017-02-11 18:05:34 UTC
Permalink
Post by Kerry Main
The same applies to HW RAID and associated sync technologies.
The difference being that the RAID coordination is not running
remotely; across links. Within HBVS, all writes have to complete.
With larger and more complex HBVS configurations and particularly with
remote network links, those write I/O operations further reduce the
application performance, or some of the certainty has to be deferred.

As Stark starts to build up and inherently also build out given your
business scope and the locations of your customers, you'll be
encountering these issues. At least if you're continuing with the
designs from your presentation of a few years ago. Which may well
result in Stark going with geographic centers and deferred
synchronization of the data.
Post by Kerry Main
Multi-site data consistency is a critical component of many business
critical solutions today - especially when one requires active-active
multi-site solutions. In todays global world, combined with
Public/Private cloud considerations, the push is increasingly moving to
"always on, always available" solutions.
This isn't an easy problem, and HBVS is a good solution for small
configurations and with local servers and a few remote sites, all with
high-bandwidth and low-latency network pipes. But HBVS will still run
into increasingly expensive network links and basic physics, and users
will be figuring out how to split up your load. There's no way around
that happening with HBVS, either. Sure, for a traditional — what's
now small — configuration, HBVS works really well. But as the server
counts and the geographic spans increase — and as the I/O loads
increase the link costs — there'll be a push away from HBVS. But to
CAP it all off, this whole area tends to become an ACID trip.
--
Pure Personal Opinion | HoffmanLabs LLC
David Froble
2017-02-11 20:19:52 UTC
Permalink
Post by Kerry Main
The same applies to HW RAID and associated sync technologies.
The difference being that the RAID coordination is not running remotely;
across links. Within HBVS, all writes have to complete. With larger
and more complex HBVS configurations and particularly with remote
network links, those write I/O operations further reduce the application
performance, or some of the certainty has to be deferred.
As Stark starts to build up and inherently also build out given your
business scope and the locations of your customers, you'll be
encountering these issues. At least if you're continuing with the
designs from your presentation of a few years ago. Which may well
result in Stark going with geographic centers and deferred
synchronization of the data.
Post by Kerry Main
Multi-site data consistency is a critical component of many business
critical solutions today - especially when one requires active-active
multi-site solutions. In todays global world, combined with
Public/Private cloud considerations, the push is increasingly moving
to "always on, always available" solutions.
This isn't an easy problem, and HBVS is a good solution for small
configurations and with local servers and a few remote sites, all with
high-bandwidth and low-latency network pipes. But HBVS will still run
into increasingly expensive network links and basic physics, and users
will be figuring out how to split up your load. There's no way around
that happening with HBVS, either. Sure, for a traditional — what's now
small — configuration, HBVS works really well. But as the server
counts and the geographic spans increase — and as the I/O loads increase
the link costs — there'll be a push away from HBVS. But to CAP it all
off, this whole area tends to become an ACID trip.
Well, this is mainly a storage discussion.

How about when the request comes down to reorganize (similar to RMS CONVERT) a
data file while still open and being accessed? I looked at it for a while, and
then "just said NO!".

There is much more to data integrity than just secure storage.
Kerry Main
2017-02-11 20:15:48 UTC
Permalink
-----Original Message-----
Stephen Hoffman via Info-vax
Sent: February 11, 2017 1:06 PM
Subject: Re: [Info-vax] OpenVMS Clusters - maximum data
redundancy factors
Post by Kerry Main
The same applies to HW RAID and associated sync technologies.
The difference being that the RAID coordination is not running
remotely; across links. Within HBVS, all writes have to complete.
With larger and more complex HBVS configurations and particularly
with remote network links, those write I/O operations further reduce
the application performance, or some of the certainty has to be
deferred.
The same applies to SAN HW multi-site sync/asynch replication technologies .. they operate on network links as well.

And on a side note - pricing for EMC/HPE multi-site storage replication technologies are typically much greater than HBVS licensing.
As Stark starts to build up and inherently also build out given your
business scope and the locations of your customers, you'll be
encountering these issues. At least if you're continuing with the
designs from your presentation of a few years ago. Which may well
result in Stark going with geographic centers and deferred
synchronization of the data.
Any inter-DC site solution whereby sites are more than 100km apart will typically require creative solutions that look at overall solution latency - not just the network WAN.

One way of reducing the impact of todays slow LAN latency is to reduce the number of many small servers (current thinking) and replace them with much fewer, but much larger servers interconnected with much higher bandwidth, much tighter cluster communications and much lower latency interconnections such as RoCEV2 or Infiniband.

Infiniband Breaks the 200G Barrier:
https://www.nextplatform.com/2016/11/10/infiniband-breaks-200g-barrier/
" It is hard for people to let anything but Ethernet into their networks, but once they do, it probably gets a lot easier. The support of RDMA over Converged Ethernet (RoCE) mitigated this to a certain extent, but when HPC shops, hyperscalers, and cloud builders see they can get 200 Gb/sec InfiniBand in 2017 and 400 Gb/sec InfiniBand in 2019, they might have a rethink."
Post by Kerry Main
Multi-site data consistency is a critical component of many business
critical solutions today - especially when one requires active-active
multi-site solutions. In todays global world, combined with
Public/Private cloud considerations, the push is increasingly moving
to "always on, always available" solutions.
This isn't an easy problem, and HBVS is a good solution for small
configurations and with local servers and a few remote sites, all with
high-bandwidth and low-latency network pipes. But HBVS will still run
into increasingly expensive network links and basic physics, and users
will be figuring out how to split up your load. There's no way around
that happening with HBVS, either. Sure, for a traditional — what's
now small — configuration, HBVS works really well. But as the server
counts and the geographic spans increase — and as the I/O loads
increase the link costs — there'll be a push away from HBVS. But to
CAP it all off, this whole area tends to become an ACID trip.
There is no one solution that is best for all Application environments.

I do know of one past Cust that used Advanced Systems Concepts RemoteShadow (see my prev post) third party shadowing product on their mission critical OpenVMS cluster.
https://www.advsyscon.com/en-us/products/remoteshadow/remoteshadow-overview.aspx

Key reasons they used RemoteShadow for was:
- provided both sync (HBVS like) and async (replication) shadowing features. Made for extremely flexible local and remote configurations using the same product.
- provided capability to remotely shadow (sync or async) smaller parts of a disk e.g. just a directory
- had nice logging and troubleshooting features.

They also minimized the issues of host based interaction when a drive failed by using RemoteShadow with each volume presented to OpenVMS being a HW RAID device. Hence, a drive failure was replaced and rebuilt at the SAN level only. No host intervention required.

Note - at the time, it was also very stable i.e. the Cust stated they seldom has support issues with this product.

Regards,

Kerry Main
Kerry dot main at starkgaming dot com
Stephen Hoffman
2017-02-12 02:51:45 UTC
Permalink
Post by Kerry Main
-----Original Message-----
Stephen Hoffman via Info-vax
Sent: February 11, 2017 1:06 PM
Subject: Re: [Info-vax] OpenVMS Clusters - maximum data
redundancy factors
Post by Kerry Main
The same applies to HW RAID and associated sync technologies.
The difference being that the RAID coordination is not running
remotely; across links. Within HBVS, all writes have to complete.
With larger and more complex HBVS configurations and particularly with
remote network links, those write I/O operations further reduce the
application performance, or some of the certainty has to be deferred.
The same applies to SAN HW multi-site sync/asynch replication
technologies .. they operate on network links as well.
HBVS is an old design. A good one, certainly. Many of those old
designs do work, too. But sooner or later, various of those older
designs work less well in newer situations. I don't think HBVS is the
path forward. It's fundamentally file-based sharing, for starters.
It's passing around application data in units of a half kilobyte or
(hopefully soon) four kilobytes, or more. This irrespective of how
much application data is involved, or how many hosts are active in the
configuration.. It's entirely synchronous writes across all member
volumes. Into external hardware storage caches on the far end of an
I/O bus, across a network — which the remote application then has to go
fetch, particularly when running from memory. No obvious mechanism
for retrofitting RDMA or otherwise mirroring or journaling changes to
in-memory data structures, either. No data compression. No means
for rollback and recovery, either — minicopy and minimerge are nice,
but they're utterly divorced from what the applications are doing.
Where HBVS works and where it's a comfortable and familiar abstraction,
have at. But this whole HBVS abstraction reminds me of X11, in terms
of being ripe for a replacement design for synchronizing data across
hosts, and preferably separating out the file system from the
synchronization. How this replacement might work and what APIs are
presented, I don't know. And to be absolutely clear, I see no reason
to remove HBVS.
--
Pure Personal Opinion | HoffmanLabs LLC
Michael Moroney
2017-02-12 05:19:54 UTC
Permalink
Post by Stephen Hoffman
It's passing around application data in units of a half kilobyte or
(hopefully soon) four kilobytes, or more.
The data chunk is up to the max block count of the member drives.
Post by Stephen Hoffman
It's entirely synchronous writes across all member
volumes.
Writes are issued asychronously so writes complete at the speed of the
slowest drive (+ network latency)
Kerry Main
2017-02-12 15:14:56 UTC
Permalink
-----Original Message-----
Stephen Hoffman via Info-vax
Sent: February 11, 2017 9:52 PM
Subject: Re: [Info-vax] OpenVMS Clusters - maximum data
redundancy factors
Post by Kerry Main
-----Original Message-----
Stephen Hoffman via Info-vax
Sent: February 11, 2017 1:06 PM
Subject: Re: [Info-vax] OpenVMS Clusters - maximum data
redundancy
Post by Kerry Main
factors
Post by Kerry Main
The same applies to HW RAID and associated sync technologies.
The difference being that the RAID coordination is not running
remotely; across links. Within HBVS, all writes have to complete.
With larger and more complex HBVS configurations and particularly
with remote network links, those write I/O operations further
reduce
Post by Kerry Main
the application performance, or some of the certainty has to be
deferred.
Post by Kerry Main
The same applies to SAN HW multi-site sync/asynch replication
technologies .. they operate on network links as well.
HBVS is an old design. A good one, certainly. Many of those old
designs do work, too. But sooner or later, various of those older
designs work less well in newer situations. I don't think HBVS is the
path forward. It's fundamentally file-based sharing, for starters.
It's passing around application data in units of a half kilobyte or
(hopefully soon) four kilobytes, or more. This irrespective of how
much application data is involved, or how many hosts are active in the
configuration.. It's entirely synchronous writes across all member
volumes. Into external hardware storage caches on the far end of an
I/O bus, across a network — which the remote application then has to go
fetch, particularly when running from memory. No obvious
mechanism
for retrofitting RDMA or otherwise mirroring or journaling changes to
in-memory data structures, either. No data compression. No means
for rollback and recovery, either — minicopy and minimerge are nice,
but they're utterly divorced from what the applications are doing.
Where HBVS works and where it's a comfortable and familiar
abstraction,
have at. But this whole HBVS abstraction reminds me of X11, in terms
of being ripe for a replacement design for synchronizing data across
hosts, and preferably separating out the file system from the
synchronization. How this replacement might work and what APIs are
presented, I don't know. And to be absolutely clear, I see no reason
to remove HBVS.
Imho, the real question is how to enhance HBVS to create additional value add features?

Hence, the primary reasons (async option, directory level replication option, shadow detailed reporting) I suggest taking a look at the Virtuoso / RemoteShadow products from Advanced Systems Concepts:

https://www.advsyscon.com/en-us/products/virtuoso

https://www.advsyscon.com/en-us/products/remoteshadow/remoteshadow-overview.aspx

https://www.advsyscon.com/home/products/rso/pdf/shadow%20for%20openvms%20spd.pdf

Not sure where this product is in terms of availability today, but as mentioned before, I know one mission critical OpenVMS Cust that ran it in their very busy cluster and really liked it.

Regards,

Kerry Main
Kerry dot main at starkgaming dot com
Stephen Hoffman
2017-02-12 19:27:03 UTC
Permalink
Post by Kerry Main
-----Original Message-----
Stephen Hoffman via Info-vax
Sent: February 11, 2017 9:52 PM
Subject: Re: [Info-vax] OpenVMS Clusters - maximum data
redundancy factors
HBVS is an old design. A good one, certainly. Many of those old
designs do work, too. But sooner or later, various of those older
designs work less well in newer situations. I don't think HBVS is the
path forward. It's fundamentally file-based sharing, for starters.
It's passing around application data in units of a half kilobyte or
(hopefully soon) four kilobytes,
My bad: kibibytes, not kilobytes. That's with 512 bytes for old
disks and AF 512E disks, and 4KiB for current-generation HDD and SSD AF
storage.
Post by Kerry Main
or more. This irrespective of how much application data is involved,
or how many hosts are active in the configuration.. It's entirely
synchronous writes across all member volumes. Into external hardware
storage caches on the far end of an I/O bus, across a network — which
the remote application then has to go fetch, particularly when running
from memory. No obvious mechanism for retrofitting RDMA or otherwise
mirroring or journaling changes to in-memory data structures, either.
No data compression. No means for rollback and recovery, either —
minicopy and minimerge are nice, but they're utterly divorced from what
the applications are doing. Where HBVS works and where it's a
comfortable and familiar abstraction, have at. But this whole HBVS
abstraction reminds me of X11, in terms of being ripe for a replacement
design for synchronizing data across hosts, and preferably separating
out the file system from the synchronization. How this replacement
might work and what APIs are presented, I don't know. And to be
absolutely clear, I see no reason to remove HBVS.
Imho, the real question is how to enhance HBVS to create additional value add features?
"We shape our tools and thereafter our tools shape us."

I'd prefer to shape a new tool here, and not try to further bend the
storage abstraction used by HBVS into the basis for distributed
programming on OpenVMS.

For the sorts of apps and designs I'm increasingly working with, I
don't find the abstractions provided by HBVS to be particularly
appropriate. It's what's available, and it does work for those cases
where I need sector-based storage-level sharing, so I do use HBVS for
those cases and reasons. DECdtm, message queues, RTR, distributed
logging, those are where the apps I'm dealing with are headed, or are
already using. RDMA, as and where that is available. None of which
are particularly tied to classic OpenVMS clustering, either. Even
with OpenVMS and due in no small part to the license prices, I'm
increasingly finding myself operating across and replicating
non-clustered servers. Where HBVS doesn't do much better than
hardware RAID.

As for enhancing HBVS, I'd be interested in ZFS for that role. As a
wholesale replacement. But that's rather further into some potential
future, given current schedules and constraints at VSI.

But to be absolutely clear again, I see no reason to remove HBVS.
--
Pure Personal Opinion | HoffmanLabs LLC
Bart Zorn
2017-02-12 08:17:48 UTC
Permalink
On Saturday, February 11, 2017 at 3:15:04 PM UTC+1, Kerry Main wrote:

[ S n i p . . . ]
Post by Kerry Main
http://www.hpl.hp.com/hpjournal/dtj/vol6num1/vol6num1art3.pdf
Regards,
Kerry Main
Kerry dot main at starkgaming dot com
A version of this article which includes the figures, can be found at:

https://www.linux-mips.org/pub/linux/mips/people/macro/DEC/DTJ/DTJD03/DTJD03PF.PDF

Regards,

Bart
Hans Bachner
2017-02-09 22:59:07 UTC
Permalink
[snip] >
Then I got to thinking about VMS clusters and while we can have 96 nodes in a cluster, really the maximum number of workable nodes in a cluster that would absolutely guarantee data redundancy from a VMS management perspective is really only 6 nodes, a far cry from 96 (maximum number of disks in a shadow set under 8.4 is 6)
Is 6 nodes really the maximum concrete data redundancy factor in VMS clusters?
I don't see why you would limit the useful number of nodes in a cluster
to the maximum number of shadow set members.

As far as redundancy goes, six is the maximum number of copies of data
you can have with VMS. But then, in today's world a shadow set member
has additional redundancy built into the storage system providing the
LUN for this member.

Hans.
IanD
2017-02-15 02:14:15 UTC
Permalink
Post by Hans Bachner
[snip] >
Then I got to thinking about VMS clusters and while we can have 96 nodes in a cluster, really the maximum number of workable nodes in a cluster that would absolutely guarantee data redundancy from a VMS management perspective is really only 6 nodes, a far cry from 96 (maximum number of disks in a shadow set under 8.4 is 6)
Is 6 nodes really the maximum concrete data redundancy factor in VMS clusters?
I don't see why you would limit the useful number of nodes in a cluster
to the maximum number of shadow set members.
But I didn't :-)

What I said was as far a data redundancy is concerned

i.e. if i have a 90 node cluster but my data is replicated a maximum of 6 times, then isn't 6 the maximum amount of data redundancy I can have?

I was trying to ascertain if 6 was indeed the maximum or whether there was some other combination I was unaware of 'as far as VMS is concerned (.e. under direct VMS control)
Post by Hans Bachner
As far as redundancy goes, six is the maximum number of copies of data
you can have with VMS. But then, in today's world a shadow set member
has additional redundancy built into the storage system providing the
LUN for this member.
Hans.
I did mention 'as far as VMS was concerned'. I'm quite aware of storage replication etc

As far as VMS nodes and merely adding nodes to a cluster, if one ignores the data redundancy aspect (which is what I was interested in), then clusters themselves don't get you any process redundancy either, since there is no process failover

You get application availability to be sure but lots of other systems give you that now anyhow
Stephen Hoffman
2017-02-15 19:52:13 UTC
Permalink
Post by IanD
i.e. if i have a 90 node cluster but my data is replicated a maximum of
6 times, then isn't 6 the maximum amount of data redundancy I can have?
If your design is centered on host-based volume shadowing and using the
file system and SSD or HDD storage as a persistent network store layer
— running all your data onto the file system and then reading that data
back out to your other cluster members — then yes. Six member volumes
is the maximum for HBVS, with V8.4 and later. The limit was three on
earlier releases.

While SSD really helps here, everybody writing to the same disks — HBVS
or otherwise — invariably and inevitably runs into a very solid
performance wall, too. That access has to be coordinated across all
the hosts involved, which means the lock manager gets Really Busy,
assuming the traffic doesn't simply throttle based on the bandwidth of
some associated storage device or storage link. This is then where
data gets sharded, or where writes are sent via a subset of the servers.

You're also headed for problems getting a consistent backup and
recovery of this approach, too. I really like what HBVS provides —
it's a powerful and really easy abstraction — but I also really like
the ability to get a consistent and restorable backup of my data, and
I'm not too fond of running all my sharing traffic through the file
system. HBVS can work, if you can quiesce the applications. But
neither HBVS nor the file system provides any sort of transactional
integration, short of adding RMS journaling or various add-on databases
into the mix.

Performance and archival backup and recovery considerations aside,
there aren't many folks that can afford (and are willing to pay for) a
90 node cluster, either.
--
Pure Personal Opinion | HoffmanLabs LLC
Kerry Main
2017-02-16 02:22:04 UTC
Permalink
-----Original Message-----
Stephen Hoffman via Info-vax
Sent: February 15, 2017 2:52 PM
Subject: Re: [Info-vax] OpenVMS Clusters - maximum data
redundancy factors
Post by IanD
i.e. if i have a 90 node cluster but my data is replicated a maximum of
6 times, then isn't 6 the maximum amount of data redundancy I can
have?
If your design is centered on host-based volume shadowing and using
the file system and SSD or HDD storage as a persistent network store
layer — running all your data onto the file system and then reading
that data
back out to your other cluster members — then yes. Six member volumes
is the maximum for HBVS, with V8.4 and later. The limit was three on
earlier releases.
While SSD really helps here, everybody writing to the same disks —
HBVS or otherwise — invariably and inevitably runs into a very solid
performance wall, too. That access has to be coordinated across all the
hosts involved, which means the lock manager gets Really Busy,
assuming the traffic doesn't simply throttle based on the bandwidth of
some associated storage device or storage link. This is then where
data gets sharded, or where writes are sent via a subset of the servers.
The same performance issue can be seen with a shared nothing approach (UNIX, Windows etc.) as well i.e. data sharding is split across many nodes, but if one aspect of the application all of a sudden gets real busy with updates, then the node hosting that particular subset of data can get hammered.

This scenario is the 800lb gorilla concern with things like NonStop applications. In such distributed applications, you need to very carefully plan your application and have a very good understanding of what data is hosted where. If a hotspot creeps in with updating one node's data, then the only option is to upgrade that node/storage or re-distribute the workload in the cluster (potentially major downtime).
You're also headed for problems getting a consistent backup and
recovery of this approach, too. I really like what HBVS provides — it's a
powerful and really easy abstraction — but I also really like the ability
to get a consistent and restorable backup of my data, and I'm not too
fond of running all my sharing traffic through the file
system. HBVS can work, if you can quiesce the applications. But
neither HBVS nor the file system provides any sort of transactional
integration, short of adding RMS journaling or various add-on
databases into the mix.
Performance and archival backup and recovery considerations aside,
there aren't many folks that can afford (and are willing to pay for) a
90 node cluster, either.
I do agree the cluster licensing model needs to be addressed. When one of the core strengths of a platform is not utilized because it is to expensive, then that is an issue.

Btw, on a separate thought - lets not forget that because disk is so cheap now, many sites will use a combination of HW Raid and HBVS in their environment. The advantage is that when a drive fails in a HW raid device that looks like one volume to HBVS, the drive can be replaced and rebuilt at the SAN level with zero impact on the host servers/SAN controllers. With battery backed up SAN controllers, you can also implement write back strategies as well for improved write performance in the HW raid volumes.

Regards,

Kerry Main
Kerry dot main at starkgaming dot com
Stephen Hoffman
2017-02-16 15:48:07 UTC
Permalink
Post by Kerry Main
Btw, on a separate thought - lets not forget that because disk is so
cheap now, many sites will use a combination of HW Raid and HBVS in
their environment. The advantage is that when a drive fails in a HW
raid device that looks like one volume to HBVS, the drive can be
replaced and rebuilt at the SAN level with zero impact on the host
servers/SAN controllers. With battery backed up SAN controllers, you
can also implement write back strategies as well for improved write
performance in the HW raid volumes.
We're just not headed that way. I'm increasingly working from and
running apps entirely from memory, not from disk. SSD for faster
storage and faster local journaling, HDD for cheaper and slower
journaling, networking for sharing. As the price of byte-addressable
non-volatile storage drops and as virtual and physical memory
configurations continue to increase, other apps will follow the trend.
I just don't see us replicating via SAN storage sharing through the
file system, and traditional I/O will continue to be relegated to
journaling and bulk storage where that's still needed and where that's
cost-effective. I do see the need for faster host-to-host networking
connections, RDMA and such, and new APIs that better allow this to
happen for applications, yes. Scaling up HBVS as the unit of shared
storage? Not so much. That'll remain good for what it does provide,
though.

As for SSD and HDD and the rest, maybe some interest in refreshing and
integrating the HSM layered product into OpenVMS, but that's not a
product that's seen much use in OpenVMS in the last decades. HSM is
akin to what Apple does with Fusion drives, and what some of the more
expensive SAN controllers provide between their memory and NV caches
and SSD and HDD storage. HSN isn't specific to sharing data across
hosts, though. But I'm sure VSI has more than a few other projects to
work on, and most are far more important than optimizing file storage
across available I/O devices.

As for larger address spaces becoming available, here's the current
virtual and physical memory design for Intel x86-64 servers:
https://software.intel.com/sites/default/files/managed/2b/80/5-level_paging_white_paper.pdf
Populating all 52 bits is going to be really expensive for the next
several years at least, though.
--
Pure Personal Opinion | HoffmanLabs LLC
Kerry Main
2017-02-17 02:09:27 UTC
Permalink
-----Original Message-----
Stephen Hoffman via Info-vax
Sent: February 16, 2017 10:48 AM
Subject: Re: [Info-vax] OpenVMS Clusters - maximum data
redundancy factors
Post by Kerry Main
Btw, on a separate thought - lets not forget that because disk is so
cheap now, many sites will use a combination of HW Raid and HBVS
in
Post by Kerry Main
their environment. The advantage is that when a drive fails in a HW
raid device that looks like one volume to HBVS, the drive can be
replaced and rebuilt at the SAN level with zero impact on the host
servers/SAN controllers. With battery backed up SAN controllers,
you
Post by Kerry Main
can also implement write back strategies as well for improved write
performance in the HW raid volumes.
We're just not headed that way. I'm increasingly working from and
running apps entirely from memory, not from disk. SSD for faster
storage and faster local journaling, HDD for cheaper and slower
journaling, networking for sharing. As the price of byte-addressable
non-volatile storage drops and as virtual and physical memory
configurations continue to increase, other apps will follow the trend.
I just don't see us replicating via SAN storage sharing through the file
system, and traditional I/O will continue to be relegated to journaling
and bulk storage where that's still needed and where that's cost-
effective. I do see the need for faster host-to-host networking
connections, RDMA and such, and new APIs that better allow this to
happen for applications, yes. Scaling up HBVS as the unit of shared
storage? Not so much. That'll remain good for what it does provide,
though.
So what happens when you have multiple systems that need to share data?

RoCEV2 (RDMA) and Infiniband are good for high bandwidth, low latency inter-node connections, but there is still going to be a need for sharing large amounts of data (TB/PB given single 60TB disk are now available) among large numbers of servers - albeit, fewer, but, much larger servers imho.
As for SSD and HDD and the rest, maybe some interest in refreshing
and integrating the HSM layered product into OpenVMS, but that's not
a
product that's seen much use in OpenVMS in the last decades. HSM is
akin to what Apple does with Fusion drives, and what some of the
more expensive SAN controllers provide between their memory and
NV caches
and SSD and HDD storage. HSN isn't specific to sharing data across
hosts, though. But I'm sure VSI has more than a few other projects to
work on, and most are far more important than optimizing file storage
across available I/O devices.
I have been involved in VMware and new emerging hyper-converged technologies as part of a separate DC project and I have to say, I am very impressed with where VMware clustering is today and where it is headed.

In a nut shell, they are building SAN drivers, network appliance functionality (NSX, FW's etc.) into their host based bare metal hypervisor technologies.

Lets be clear - a bare metal hypervisor is really just another name for an optimized OS that runs on a server and hosts many other Apps in individual OS containers called VM's. Using that scenario, you could position OpenVMS as a bare metal hypervisor that runs many other App's within a single integrated OS instance - thereby addressing the VM sprawl issues. Heck, you can run a full versions of OpenVMS in less than 1GB memory. With TB level non-volatile memory coming, 1GB is a rounding error.

The issue with VMware is that their architecture does nothing to address VM sprawl which is becoming a huge cost, complexity and pricing issue for many Customers. The other negative is that VMware pricing is rapidly increasing as well (following the age old adage "if we are popular, we can charge what we want" trap that IBM, DEC, SUN and more recently MS and Oracle continue to fall into).

Point being - lots of opportunity for differentiation in future post V9+ versions of OpenVMS.
As for larger address spaces becoming available, here's the current
https://software.intel.com/sites/default/files/managed/2b/80/5-level_paging_white_paper.pdf
Populating all 52 bits is going to be really expensive for the next
several years at least, though.
Which imho, supports the future positioning for much fewer, much larger servers with tight cluster coupling and very high bandwidth, low latency compute architectures.

😊


Regards,

Kerry Main
Kerry dot main at starkgaming dot com
David Froble
2017-02-17 03:33:08 UTC
Permalink
Post by Kerry Main
-----Original Message-----
Stephen Hoffman via Info-vax
Sent: February 16, 2017 10:48 AM
Subject: Re: [Info-vax] OpenVMS Clusters - maximum data
redundancy factors
Post by Kerry Main
Btw, on a separate thought - lets not forget that because disk is so
cheap now, many sites will use a combination of HW Raid and HBVS
in
Post by Kerry Main
their environment. The advantage is that when a drive fails in a HW
raid device that looks like one volume to HBVS, the drive can be
replaced and rebuilt at the SAN level with zero impact on the host
servers/SAN controllers. With battery backed up SAN controllers,
you
Post by Kerry Main
can also implement write back strategies as well for improved write
performance in the HW raid volumes.
We're just not headed that way. I'm increasingly working from and
running apps entirely from memory, not from disk. SSD for faster
storage and faster local journaling, HDD for cheaper and slower
journaling, networking for sharing. As the price of byte-addressable
non-volatile storage drops and as virtual and physical memory
configurations continue to increase, other apps will follow the trend.
I just don't see us replicating via SAN storage sharing through the file
system, and traditional I/O will continue to be relegated to journaling
and bulk storage where that's still needed and where that's cost-
effective. I do see the need for faster host-to-host networking
connections, RDMA and such, and new APIs that better allow this to
happen for applications, yes. Scaling up HBVS as the unit of shared
storage? Not so much. That'll remain good for what it does provide,
though.
So what happens when you have multiple systems that need to share data?
Well, how much slower is it to share data from memory, than disk?

What would be better is shared memory, and that would benefit from the recent
proposal I sent to VSI for enhancements to the DLM.

Replace the SAN with very large, multi-ported NV memory. Do activity right in
memory. Do away with the transfers.

Note, I doubt we could ever totally do away with transfers. For instance, when
you're selecting data for further processing.
Stephen Hoffman
2017-02-17 15:53:21 UTC
Permalink
Post by David Froble
Well, how much slower is it to share data from memory, than disk?
What would be better is shared memory, and that would benefit from the
recent proposal I sent to VSI for enhancements to the DLM.
Replace the SAN with very large, multi-ported NV memory. Do activity
right in memory. Do away with the transfers.
Note, I doubt we could ever totally do away with transfers. For
instance, when you're selecting data for further processing.
Ayup. Or run from memory with replicated local persistent journals
allowing for quick recovery. Run directly from memory for most
processing. More than a few of us do that already, after all.

And yes, the file system provides a nice abstraction for this, but it's
not the only way to do this. Running everything through I/O buses and
SANs is a trade-off, intended for when you don't have sufficiently
reliable servers or (as we're starting to see) persisting memory.
Akin to how many folks see virtual memory as a way to avoid dealing
with the more limited physical memory, though virtual memory does add
some other benefits.

As for the trade-offs here, this all gets down to the usual ACID / BASE
discussions, and what sorts of recovery time and latency requirements
are involved. Sometimes you have to run multiple servers, because you
need the lower latency. Different requirements for different apps, of
course. Persisting data to traditional HDD (and HDD-emulating SSD)
storage works for many apps, and it's a very familiar model for OpenVMS
developers.

Key to any of this — whether redundant servers, or persisting to disks
— is having the ability to get good journals or backups of running
applications of course, and that's not a good place for OpenVMS itself,
though the apps and RMS journaling and databases can help.

But then clustering (as currently implemented) and HBVS (as currently
implemented) ain't all that and a bag of chips, going forward. Not
with what sorts of new hardware we're already seeing in the pipeline.
It'll be good at what it has always been good at, and — for about the
bazillionth time I have to include this or somebody will misinterpret
my intent (again) — I'm not suggesting removing clustering or HBVS
here. Supplanting parts and adding newer approaches, yes.
--
Pure Personal Opinion | HoffmanLabs LLC
Jan-Erik Soderholm
2017-02-17 17:01:30 UTC
Permalink
Post by Stephen Hoffman
Post by David Froble
Well, how much slower is it to share data from memory, than disk?
What would be better is shared memory, and that would benefit from the
recent proposal I sent to VSI for enhancements to the DLM.
Replace the SAN with very large, multi-ported NV memory. Do activity
right in memory. Do away with the transfers.
Note, I doubt we could ever totally do away with transfers. For
instance, when you're selecting data for further processing.
Ayup. Or run from memory with replicated local persistent journals
allowing for quick recovery. Run directly from memory for most
processing. More than a few of us do that already, after all.
Sure. Less then 0.5% of the requests for a database page ends up in
a physical access from the disks. The rest is done in memory. This is
totaly over a 50 day period.
David Froble
2017-02-17 21:55:04 UTC
Permalink
Post by Jan-Erik Soderholm
Post by Stephen Hoffman
Post by David Froble
Well, how much slower is it to share data from memory, than disk?
What would be better is shared memory, and that would benefit from the
recent proposal I sent to VSI for enhancements to the DLM.
Replace the SAN with very large, multi-ported NV memory. Do activity
right in memory. Do away with the transfers.
Note, I doubt we could ever totally do away with transfers. For
instance, when you're selecting data for further processing.
Ayup. Or run from memory with replicated local persistent journals
allowing for quick recovery. Run directly from memory for most
processing. More than a few of us do that already, after all.
Sure. Less then 0.5% of the requests for a database page ends up in
a physical access from the disks. The rest is done in memory. This is
totaly over a 50 day period.
Agreed. That's what makes us so happy with the current systems. The entire
database can be held in cache. Makes it rather quick. But with a write-thru
cache, there is still disk writes. And the cache on one VMS system is, as far
as I know, limited to just that system.

But that wasn't what I was talking about. Cache is just a copy from disk. I'm
talking about the large amount of NV memory being the storage, and having
multiple ports so multiple VMS systems can access it directly. I think that
could be called "next generation".
Stephen Hoffman
2017-02-17 22:08:23 UTC
Permalink
I'm talking about the large amount of NV memory being the storage, and
having multiple ports so multiple VMS systems can access it directly.
I think that could be called "next generation".
Which is where I'm headed, and where I'd like new APIs. Futzing
around with disk-based abstractions for data sharing Gets Old.

Not so sure that most of us will be working on full-blown multi-port
memory configurations, as building bigger shared-memory boxes gets much
more expensive. VSI isn't in the hardware market, either.

RAIS / HBMS on the other hand... Support for local byte-addressable
non-volatile storage if (as?) that becomes available, and RDMA (or
whatever) for remote operations. Preferably with a reasonable-to-use
API, and a way to both allow the updates to free-run when eventual
consistency works for the particular app, and to be able to operate
within a transaction across the updates when consistency is required.
--
Pure Personal Opinion | HoffmanLabs LLC
David Froble
2017-02-18 04:09:04 UTC
Permalink
I'm talking about the large amount of NV memory being the storage, and
having multiple ports so multiple VMS systems can access it directly.
I think that could be called "next generation".
Which is where I'm headed, and where I'd like new APIs. Futzing around
with disk-based abstractions for data sharing Gets Old.
Not so sure that most of us will be working on full-blown multi-port
memory configurations, as building bigger shared-memory boxes gets much
more expensive. VSI isn't in the hardware market, either.
Some people aren't afraid to spend money on their VMS systems, or any system,
for that matter. It depends upon their needs. Look at how the SAN vendors came
about. Where there is a need, there is someone who wants the business.

Perhaps a current SAN vendor would introduce such products. You know once one
does, the current SAN market is on death row.
RAIS / HBMS on the other hand... Support for local byte-addressable
non-volatile storage if (as?) that becomes available, and RDMA (or
whatever) for remote operations. Preferably with a reasonable-to-use
API, and a way to both allow the updates to free-run when eventual
consistency works for the particular app, and to be able to operate
within a transaction across the updates when consistency is required.
Still thinking about the concept.

Not sure what you mean by API. What I'm envisioning is a low level API for
accessing the storage. Then higher level API(s) for more specific things, such
as a particular database, and other uses.

As an example, the QIO(W) used to access disk files, and disks. Actually, there
is a lower level for the disks. And for locking.

Then a database, or anything else, can be looked at as a "file", and used
accordingly. At the higher level(s), applications would access the memory
similar to how it's done now.

One subject is how to allocate the NV memory. Now, you create a file, and the
disk space is allocated, and such. I'm thinking I'd like to think a bit more on
the subject before falling back to that example. Not saying I can think of
anything better. Hoping some good ideas surface.
Stephen Hoffman
2017-02-21 15:58:28 UTC
Permalink
Post by David Froble
Post by Stephen Hoffman
I'm talking about the large amount of NV memory being the storage, and
having multiple ports so multiple VMS systems can access it directly.
I think that could be called "next generation".
Which is where I'm headed, and where I'd like new APIs. Futzing
around with disk-based abstractions for data sharing Gets Old.
Not so sure that most of us will be working on full-blown multi-port
memory configurations, as building bigger shared-memory boxes gets much
more expensive. VSI isn't in the hardware market, either.
Some people aren't afraid to spend money on their VMS systems, or any
system, for that matter. It depends upon their needs. Look at how the
SAN vendors came about. Where there is a need, there is someone who
wants the business.
Perhaps a current SAN vendor would introduce such products. You know
once one does, the current SAN market is on death row.
The business of racks and racks of shelves of HDDs — what we all had to
use to get any sort of decent I/O bandwidth with what we had for
storage with HDDs, and with some very fancy controller firmware to
spread that I/O load — is gone. EMC branched out into products and
services beyond their classic SAN storage offerings, and has now been
acquired by Dell, for instance.
Post by David Froble
Post by Stephen Hoffman
RAIS / HBMS on the other hand... Support for local byte-addressable
non-volatile storage if (as?) that becomes available, and RDMA (or
whatever) for remote operations. Preferably with a reasonable-to-use
API, and a way to both allow the updates to free-run when eventual
consistency works for the particular app, and to be able to operate
within a transaction across the updates when consistency is required.
Still thinking about the concept.
Not sure what you mean by API. What I'm envisioning is a low level API
for accessing the storage. Then higher level API(s) for more specific
things, such as a particular database, and other uses.
As an example, the QIO(W) used to access disk files, and disks.
Actually, there is a lower level for the disks. And for locking.
Then a database, or anything else, can be looked at as a "file", and
used accordingly. At the higher level(s), applications would access
the memory similar to how it's done now.
One subject is how to allocate the NV memory. Now, you create a file,
and the disk space is allocated, and such. I'm thinking I'd like to
think a bit more on the subject before falling back to that example.
Not saying I can think of anything better. Hoping some good ideas
surface.
I'd much rather work with an API abstraction designed from the other
direction; from what the developer needs to do and presently has to
deal with to do that, rather than presenting the constituent components
piecemeal. Presenting up pieces and parts works — when you approach
all the pieces with some consistency, and this consistency was one of
the classic OpenVMS strengths — but presenting functions piecemeal with
a mixture of system services such as $qio[w] and $io_perform[w] — two
of the most arcane calls — and DLM calls — another round of very
powerful and very arcane calls – and a mixture of RTL calls and
not-RTL-calls like the TLS support and who-knows-what layered
products.... gets messy to deal with. There's no good way to drain
this existing mess, either. But there's no good foundation for new
applications to start with, either — there's no clear picture for new
developers, and damned little consistency between all the parts we now
need to integrate and use.

Take clustering, for instance. Great concept. The central
abstraction is the file. That's not the only way to share data, and
that data is not where I really want it — in memory — and I'd really
rather not have to make a round-trip through physical storage and
marshalling and unmarshalling to share that data, and I'd rather not
have to do my own change notifications (via DLM or multicast) when the
data changes. There's all sorts of stuff missing from even the
present-day file-based sharing abstraction underneath clustering, such
as data encryption and authentication and isolation. We're also
moving data around a lot more, particularly on the network. With a
cluster, we're supposedly inside a single security domain, which is an
assumption from the last millennium and one that is looking rather more
problematic now — I have no good way to isolate a file parser or a
network parser for instance, and I really don't entirely trust even
existing OpenVMS tools to be proof against malicious volumes or
malicious files. No good way to bridge clusters, and the LDAP
integration is... bad. Cluster management is most charitably called
an afterthought, too.

How this application sharing and replication all fits together and
where clusters might go, I don't know. But I do know that what we
have with clustering — file-based sharing — is not a particularly easy
product to really use, nor one that clearly sells all that well.
--
Pure Personal Opinion | HoffmanLabs LLC
Stephen Hoffman
2017-02-21 16:14:31 UTC
Permalink
Post by Stephen Hoffman
How this application sharing and replication all fits together and
where clusters might go, I don't know. But I do know that what we
have with clustering — file-based sharing — is not a particularly easy
product to really use, nor one that clearly sells all that well.
ps: apropos the discussion of SANs and APIs...
http://storagemojo.com/2017/02/20/why-isnt-storage-innovation-coming-from-the-storage-industry/
--
Pure Personal Opinion | HoffmanLabs LLC
David Froble
2017-02-17 17:10:13 UTC
Permalink
Post by Stephen Hoffman
Post by David Froble
Well, how much slower is it to share data from memory, than disk?
What would be better is shared memory, and that would benefit from the
recent proposal I sent to VSI for enhancements to the DLM.
Replace the SAN with very large, multi-ported NV memory. Do activity
right in memory. Do away with the transfers.
Note, I doubt we could ever totally do away with transfers. For
instance, when you're selecting data for further processing.
Ayup. Or run from memory with replicated local persistent journals
allowing for quick recovery. Run directly from memory for most
processing. More than a few of us do that already, after all.
And yes, the file system provides a nice abstraction for this, but it's
not the only way to do this. Running everything through I/O buses and
SANs is a trade-off, intended for when you don't have sufficiently
reliable servers or (as we're starting to see) persisting memory. Akin
to how many folks see virtual memory as a way to avoid dealing with the
more limited physical memory, though virtual memory does add some other
benefits.
As for the trade-offs here, this all gets down to the usual ACID / BASE
discussions, and what sorts of recovery time and latency requirements
are involved. Sometimes you have to run multiple servers, because you
need the lower latency. Different requirements for different apps, of
course. Persisting data to traditional HDD (and HDD-emulating SSD)
storage works for many apps, and it's a very familiar model for OpenVMS
developers.
Key to any of this — whether redundant servers, or persisting to disks —
is having the ability to get good journals or backups of running
applications of course, and that's not a good place for OpenVMS itself,
though the apps and RMS journaling and databases can help.
But then clustering (as currently implemented) and HBVS (as currently
implemented) ain't all that and a bag of chips, going forward. Not with
what sorts of new hardware we're already seeing in the pipeline. It'll
be good at what it has always been good at, and — for about the
bazillionth time I have to include this or somebody will misinterpret my
intent (again) — I'm not suggesting removing clustering or HBVS here.
Supplanting parts and adding newer approaches, yes.
Well, there are desires, and there is reality. I don't get out much, so if
there is some method for getting a good backup from a running system, I'm not
aware of such. Remember, it's not just the backup, it's the continuing flow of
data outside the storage system.

Not sure there is an alternative to shutting down the flow of incoming data,
stopping all processing, and taking a backup of the static system, and for other
maintenance. At least for the programs I implement, there is too much chance
for a running program holding some information that might be "out of date" after
some maintenance on the data storage. Like wanting to write a block of data to
a "known" location, but the location has been moved.

It gets rather confusing ....
Jan-Erik Soderholm
2017-02-17 17:19:37 UTC
Permalink
Post by David Froble
Post by Stephen Hoffman
Post by David Froble
Well, how much slower is it to share data from memory, than disk?
What would be better is shared memory, and that would benefit from the
recent proposal I sent to VSI for enhancements to the DLM.
Replace the SAN with very large, multi-ported NV memory. Do activity
right in memory. Do away with the transfers.
Note, I doubt we could ever totally do away with transfers. For
instance, when you're selecting data for further processing.
Ayup. Or run from memory with replicated local persistent journals
allowing for quick recovery. Run directly from memory for most
processing. More than a few of us do that already, after all.
And yes, the file system provides a nice abstraction for this, but it's
not the only way to do this. Running everything through I/O buses and
SANs is a trade-off, intended for when you don't have sufficiently
reliable servers or (as we're starting to see) persisting memory. Akin
to how many folks see virtual memory as a way to avoid dealing with the
more limited physical memory, though virtual memory does add some other
benefits.
As for the trade-offs here, this all gets down to the usual ACID / BASE
discussions, and what sorts of recovery time and latency requirements are
involved. Sometimes you have to run multiple servers, because you need
the lower latency. Different requirements for different apps, of
course. Persisting data to traditional HDD (and HDD-emulating SSD)
storage works for many apps, and it's a very familiar model for OpenVMS
developers.
Key to any of this — whether redundant servers, or persisting to disks —
is having the ability to get good journals or backups of running
applications of course, and that's not a good place for OpenVMS itself,
though the apps and RMS journaling and databases can help.
But then clustering (as currently implemented) and HBVS (as currently
implemented) ain't all that and a bag of chips, going forward. Not with
what sorts of new hardware we're already seeing in the pipeline. It'll
be good at what it has always been good at, and — for about the
bazillionth time I have to include this or somebody will misinterpret my
intent (again) — I'm not suggesting removing clustering or HBVS here.
Supplanting parts and adding newer approaches, yes.
Well, there are desires, and there is reality. I don't get out much, so if
there is some method for getting a good backup from a running system, I'm
not aware of such. Remember, it's not just the backup, it's the continuing
flow of data outside the storage system.
If you are running a RDMS (such as Rdb but Rdb is not unique), you can
always get a consistent backup from a specific timestamp (when the
backup operation was started). Without stopping any processing.

Yes, of course, 10 seconds later there might have been new updates but
that will always be like that no matter how you backup your data. And
your transaction journals will catch that new data anyway.

If you have other releted and important data that is stored outside
of the RDMS, you are on your own, as they say.
Post by David Froble
Not sure there is an alternative to shutting down the flow of incoming
data, stopping all processing, and taking a backup of the static system,
and for other maintenance. At least for the programs I implement, there is
too much chance for a running program holding some information that might
be "out of date" after some maintenance on the data storage. Like wanting
to write a block of data to a "known" location, but the location has been
moved.
It gets rather confusing ....
Stephen Hoffman
2017-02-17 21:17:17 UTC
Permalink
Post by David Froble
Post by Stephen Hoffman
Key to any of this — whether redundant servers, or persisting to disks
— is having the ability to get good journals or backups of running
applications of course, and that's not a good place for OpenVMS itself,
though the apps and RMS journaling and databases can help.
Well, there are desires, and there is reality. I don't get out much,
so if there is some method for getting a good backup from a running
system, I'm not aware of such. Remember, it's not just the backup,
it's the continuing flow of data outside the storage system.
Not sure there is an alternative to shutting down the flow of incoming
data, stopping all processing, and taking a backup of the static
system, and for other maintenance. At least for the programs I
implement, there is too much chance for a running program holding some
information that might be "out of date" after some maintenance on the
data storage. Like wanting to write a block of data to a "known"
location, but the location has been moved.
It gets rather confusing ....
If archival operations are approached based on transactional
processing, a consistent archive becomes obtainable. Various
databases use this approach. If the archival operations are
uncoordinated with what the application is doing — the classic and
normal case when using OpenVMS BACKUP — not so much. Journals are a
useful way to keep track of what's happened, allowing a recovery closer
to the failure. This akin to applying incremental BACKUP savesets to
the last full OpenVMS BACKUP. (RMS journaling, or database
journals.) These abilities of databases are a part of why I'd like
to see better databases integrated into OpenVMS.
--
Pure Personal Opinion | HoffmanLabs LLC
David Froble
2017-02-17 22:05:10 UTC
Permalink
Post by David Froble
Post by Stephen Hoffman
Key to any of this — whether redundant servers, or persisting to
disks — is having the ability to get good journals or backups of
running applications of course, and that's not a good place for
OpenVMS itself, though the apps and RMS journaling and databases can
help.
Well, there are desires, and there is reality. I don't get out much,
so if there is some method for getting a good backup from a running
system, I'm not aware of such. Remember, it's not just the backup,
it's the continuing flow of data outside the storage system.
Not sure there is an alternative to shutting down the flow of incoming
data, stopping all processing, and taking a backup of the static
system, and for other maintenance. At least for the programs I
implement, there is too much chance for a running program holding some
information that might be "out of date" after some maintenance on the
data storage. Like wanting to write a block of data to a "known"
location, but the location has been moved.
It gets rather confusing ....
If archival operations are approached based on transactional processing,
a consistent archive becomes obtainable. Various databases use this
approach. If the archival operations are uncoordinated with what the
application is doing — the classic and normal case when using OpenVMS
BACKUP — not so much. Journals are a useful way to keep track of
what's happened, allowing a recovery closer to the failure. This akin
to applying incremental BACKUP savesets to the last full OpenVMS
BACKUP. (RMS journaling, or database journals.) These abilities of
databases are a part of why I'd like to see better databases integrated
into OpenVMS.
Well, yeah, but if not?

It might be nice to be able to sit back, point fingers, and say "you should have
been using Rdb". Regardless, not everyone is, and therefore their options are
limited.

Might be much cheaper for some people to say, shut down the applications, remove
shadow set members, or just do the BACKUP, and then re-start operations.
Frankly, while I don't get out much, I'm not aware of too many instances where
exactly that can happen without actual problems.

Without some type of journaling, if necessary to go back to the backup, there is
still the activity that happened after the backup that could be lost. It's not
all just about getting some type of backup.

From a practical perspective, I asked, and was told that out of the entire set
of Codis customers, once in the last 25 years was it necessary to restore from
backup. Lot of people can live with that.
Stephen Hoffman
2017-02-18 00:06:47 UTC
Permalink
Post by David Froble
It might be nice to be able to sit back, point fingers, and say "you
should have been using Rdb". Regardless, not everyone is, and
therefore their options are limited.
I've used Oracle Rdb, and it works. Wouldn't be my first choice for a
whole pile of common uses for databases, and also wouldn't be my first
choice for ubiquitous inclusion into OpenVMS.
Post by David Froble
Might be much cheaper for some people to say, shut down the
applications, remove shadow set members, or just do the BACKUP, and
then re-start operations. Frankly, while I don't get out much, I'm not
aware of too many instances where exactly that can happen without
actual problems.
That's the current model. Works great for organizations that have
nightly batch windows or analogous processing and archival intervals,
but those offline windows are getting increasingly scarce.
--
Pure Personal Opinion | HoffmanLabs LLC
Kerry Main
2017-02-18 02:40:33 UTC
Permalink
-----Original Message-----
Stephen Hoffman via Info-vax
Sent: February 17, 2017 7:07 PM
Subject: Re: [Info-vax] OpenVMS Clusters - maximum data
redundancy factors
Post by David Froble
It might be nice to be able to sit back, point fingers, and say "you
should have been using Rdb". Regardless, not everyone is, and
therefore their options are limited.
I've used Oracle Rdb, and it works. Wouldn't be my first choice for a
whole pile of common uses for databases, and also wouldn't be my
first choice for ubiquitous inclusion into OpenVMS.
Post by David Froble
Might be much cheaper for some people to say, shut down the
applications, remove shadow set members, or just do the BACKUP,
and
Post by David Froble
then re-start operations. Frankly, while I don't get out much, I'm not
aware of too many instances where exactly that can happen without
actual problems.
That's the current model. Works great for organizations that have
nightly batch windows or analogous processing and archival
intervals,
but those offline windows are getting increasingly scarce.
Just to clarify - Oracle Rdb fully supports online backups and has
done so for decades.


Regards,

Kerry Main
Kerry dot main at starkgaming dot com
Richard Maher
2017-02-18 02:22:21 UTC
Permalink
Post by David Froble
Well, how much slower is it to share data from memory, than disk?
What would be better is shared memory, and that would benefit from the
recent proposal I sent to VSI for enhancements to the DLM.
Oracle Cache-Fusion and passing the data with the lock? I think Oracle
has had it for 15 years?
Post by David Froble
Replace the SAN with very large, multi-ported NV memory. Do activity
right in memory. Do away with the transfers.
Still have the SAN but no disk write required to change ownership of the
lock/updates. Still need REDO/AIJ writes.
Stephen Hoffman
2017-02-17 15:38:32 UTC
Permalink
Post by Kerry Main
-----Original Message-----
Stephen Hoffman via Info-vax
Sent: February 16, 2017 10:48 AM
Subject: Re: [Info-vax] OpenVMS Clusters - maximum data
redundancy factors
So what happens when you have multiple systems that need to share data?
I share the data. I don't necessarily use storage — which lacks an
interprocess change notification mechanism, leaving the dev to do that
— to implement this sharing with add-on code and a patchwork of APIs.
Post by Kerry Main
RoCEV2 (RDMA) and Infiniband are good for high bandwidth, low latency
inter-node connections, but there is still going to be a need for
sharing large amounts of data (TB/PB given single 60TB disk are now
available) among large numbers of servers - albeit, fewer, but, much
larger servers imho.
Ayup. And a traditional disk-based file system with distributed
arbitration — and which lacks a notification mechanism — is not the
only way to share.

I'd prefer better APIs for this whole area, but this stuff can and does
work for sharing, and there are tools — various of which are available
on OpenVMS — that allow this to happen.

Routing shared traffic through external persistent SSD or HDD storage
and the file system — which is what the OpenVMS and cluster model is
based on — is not where I'd want to be going forward, though.

I've previously pointed at tools that allow a distributed persistent
storage, and I expect that VSI can certainly adopt and improve upon
these areas and these capabilities. In particular, I'd hope that VSI
— aiming for 2022 or 2027 — is looking at how to move replication and
the associated replication APIs and capabilities up to the server
level, and away from traditional (HDD, and its SSD replacement) storage
and the file system.
Post by Kerry Main
As for larger address spaces becoming available, here's the current
Which imho, supports the future positioning for much fewer, much larger
servers with tight cluster coupling and very high bandwidth, low
latency compute architectures.
If OpenVMS is to fit in where it was originally intended, and to be a
mainframe-ish solution at a single site, or across sites within a
DT-ish configuration, sure. Except where basic physics and the
occasional outage gets in the way of those approaches and designs, and
in the way of latency requirements of some newer applications. Which
is a case that Stark and any other organization serving a world-wide
market will encounter, BTW. For some apps, lower latencies and better
availability wins. DVCSs are a specific case and a subset of this
distribution, too. More than a few of these cases also involve mobile
clients, and which are the other end of more than a little of what
folks are doing with their servers.
--
Pure Personal Opinion | HoffmanLabs LLC
Hans Bachner
2017-02-15 22:36:17 UTC
Permalink
Post by IanD
Post by Hans Bachner
[snip] >
Then I got to thinking about VMS clusters and while we can have 96 nodes in a cluster, really the maximum number of workable nodes in a cluster that would absolutely guarantee data redundancy from a VMS management perspective is really only 6 nodes, a far cry from 96 (maximum number of disks in a shadow set under 8.4 is 6)
Is 6 nodes really the maximum concrete data redundancy factor in VMS clusters?
I don't see why you would limit the useful number of nodes in a cluster
to the maximum number of shadow set members.
But I didn't :-)
Post by Hans Bachner
the maximum number of workable *nodes* in a cluster
and
Post by IanD
Post by Hans Bachner
Is 6 *nodes* really the maximum
(emphasis mine)
Post by IanD
What I said was as far a data redundancy is concerned
i.e. if i have a 90 node cluster but my data is replicated a maximum of 6 times, then isn't 6 the maximum amount of data redundancy I can have?
I was trying to ascertain if 6 was indeed the maximum or whether there was some other combination I was unaware of 'as far as VMS is concerned (.e. under direct VMS control)
sure it is, but that has nothing to do with the (useful) number of
nodes, it's just the maximum number of redundant copies - which can be
used by far more nodes.
Post by IanD
Post by Hans Bachner
As far as redundancy goes, six is the maximum number of copies of data
you can have with VMS. But then, in today's world a shadow set member
has additional redundancy built into the storage system providing the
LUN for this member.
Hans.
I did mention 'as far as VMS was concerned'. I'm quite aware of storage replication etc
As far as VMS nodes and merely adding nodes to a cluster, if one ignores the data redundancy aspect (which is what I was interested in), then clusters themselves don't get you any process redundancy either, since there is no process failover
You get application availability to be sure but lots of other systems give you that now anyhow
But these "other systems" have a bottleneck in most cases - the "writer
node" for a specific data set/partition/disk. And that's frequently not
even part of the cluster.

Hans.
Kerry Main
2017-02-16 01:42:41 UTC
Permalink
-----Original Message-----
Hans Bachner via Info-vax
Sent: February 15, 2017 5:36 PM
Subject: Re: [Info-vax] OpenVMS Clusters - maximum data
redundancy factors
On Friday, February 10, 2017 at 9:59:11 AM UTC+11, Hans Bachner
Post by Hans Bachner
[snip] >
Then I got to thinking about VMS clusters and while we can have
96
Post by Hans Bachner
nodes in a cluster, really the maximum number of workable nodes
in a
Post by Hans Bachner
cluster that would absolutely guarantee data redundancy from a
VMS
Post by Hans Bachner
management perspective is really only 6 nodes, a far cry from 96
(maximum number of disks in a shadow set under 8.4 is 6)
Is 6 nodes really the maximum concrete data redundancy factor in
VMS clusters?
Post by Hans Bachner
I don't see why you would limit the useful number of nodes in a
cluster to the maximum number of shadow set members.
But I didn't :-)
Post by Hans Bachner
the maximum number of workable *nodes* in a cluster and >>>
Is 6 *nodes* really the maximum (emphasis mine)
What I said was as far a data redundancy is concerned
i.e. if i have a 90 node cluster but my data is replicated a
maximum of
6 times, then isn't 6 the maximum amount of data redundancy I can
have?
I was trying to ascertain if 6 was indeed the maximum or whether
there
was some other combination I was unaware of 'as far as VMS is
concerned (.e. under direct VMS control)
sure it is, but that has nothing to do with the (useful) number of
nodes, it's just the maximum number of redundant copies - which can
be used by far more nodes.
Post by Hans Bachner
As far as redundancy goes, six is the maximum number of copies of
data you can have with VMS. But then, in today's world a shadow
set
Post by Hans Bachner
member has additional redundancy built into the storage system
providing the LUN for this member.
Hans.
I did mention 'as far as VMS was concerned'. I'm quite aware of storage replication etc
As far as VMS nodes and merely adding nodes to a cluster, if one
ignores the data redundancy aspect (which is what I was interested
in), then clusters themselves don't get you any process redundancy
either, since there is no process failover
You get application availability to be sure but lots of other
systems
give you that now anyhow
But these "other systems" have a bottleneck in most cases - the
"writer node" for a specific data set/partition/disk. And that's
frequently not even part of the cluster.
Hans.
Hans - I agree.

Ian - your question is not specific to cluster implementations of
OpenVMS vs other platforms, but rather a question of how a shared disk
cluster architecture (OpenVMS, Linux/GFS, z/OS) compares to a shared
nothing cluster architecture (OpenVMS, UNIX, Linux, Windows, NonStop).

Check out this whitepaper for a good comparison: (there are pros and
con's with each architecture)

http://www.scaledb.com/wp-content/uploads/2015/11/Shared-Nohing-vs-Sha
red-Disk-WP_SDvSN.pdf

http://bit.ly/2dScx9k
"Comparing shared-nothing and shared-disk in benchmarks is analogous
to comparing a dragster and a Porsche. The dragster, like the
hand-tuned shared-nothing database, will beat the Porsche in a
straight quarter mile race. However, the Porsche, like a shared-disk
database, will easily beat the dragster on regular roads. If your
selected benchmark is a quarter mile straightaway that tests all out
speed, like Sysbench, a shared-nothing database will win. However,
shared-disk will perform better in real world environments."

So - do you want a dragster or a Porsche?

Regards,

Kerry Main
Kerry dot main at starkgaming dot com
Stephen Hoffman
2017-02-10 17:35:14 UTC
Permalink
Post by IanD
Is 6 nodes really the maximum concrete data redundancy factor in VMS clusters?
HBVS is RAID. RAID is neither backup, nor is it clustering. Nor is
HBVS particularly integral to application data redundancy.

We're already using applications that replicate data without HBVS.
I'm already starting to see applications that replicate processing at
the server level, too. Increasingly with the data stored in memory,
with local storage for journals and recovery storage for each and with
local RAID for storage and HBVS for the cluster configuration and
management morass and shared config files; what others are using LDAP
for.

Replication at the HDD or SSD does works, but — even with minimerge and
minicopy — it's I/O intensive and with synchronous I/O for completion,
and the HBVS behavior not all that customizable, and you're still
dealing with crashes and backups as an application developer — again
HBVS is RAID and RAID targets disk failures, not data losses around app
failures or system crashes.

Since this is all at least five or ten years out into the future, also
add in non-volatile byte-addressable memory, and where HDD or SSD disks
increasingly become offline or archival storage, or doorstops. If I
can replicate across geographically isolated servers, what do I care
about HBVS? Sure, local RAID is still useful for failing SSDs and HDDs
and as a half-baked and sketchy online backup, but why do I want or
need to replicate blobs of data at the disk sector level, across hosts
and host-to-host-speed links? I'd prefer to replicate — and quite
possibly compressed — just my application data, at most. Store config
data in LDAP, and let the LDAP servers replicate that.

Looking forward, beyond some IP-level load-distribution giblets,
clustering has fairly little to offer for server-level replication,
with DECdtm and RTR and the message queuing tools and LDAP and Kerberos
are probably the closest fit for applications that are looking forward.
Maybe the little-known RMS journaling product, for those using RMS and
not some other database. DLM is certainly handy, but the programming
APIs for some of the most common tasks folks use DLM for — such as
electing primary and fallback server processes — are gnarly at best.
And there's no range-lock mechanism documented within DLM, as has been
mentioned, for those that want to coordinate memory ranges — as
differentiated from ranges within files — across hosts.

TL;DR: to continue to be interesting, clustering will likely receive
some changes well beyond draining the swamp of cluster configuration
and management. HBVS will undoubtedly still have a part in that
future, but not as large and as central a part as it once had. I'd
hope to see VSI migrating toward newer hardware and software and
application and system designs, but that'll be post-port, and it'll
take more than a little time and thought, and it might well break
compatibility with a few existing apps. It'll involve better
integrating OpenVMS with existing distributed services such as LDAP,
too. Different applications have different requirements, too. I'm
certainly only looking at a subset of what's out there.
--
Pure Personal Opinion | HoffmanLabs LLC
Loading...