Discussion:
VUPS.COM relevance for modern CPUs
Mark Daniel
2022-12-16 11:57:30 UTC
Now, before everyone piles on, I understand the procedure provides an
indicative/comparative/finger-in-the-air measurement of the relative
performance of a VMS CPU relative to "the original VAX processor".

It states as much in the prologue:

|$! Provides an estimate of system CPU performance on OpenVMS systems.
|$! Use at your own risk.

To clarify the platform of a particular run I added:

|$ write sys$output f$fao("!AS with !UL CPU and !ULMB running VMS !AS",-
|f$edit(f$getsyi("hw_name"),"compress,trim"),-
|f$getsyi("availcpu_cnt"),-
|(f$getsyi("memsize")*(f$getsyi("page_size")/512)/2048),-
|f$edit(f$getsyi("version"),"collapse"))

which provides the likes of:

|HP rx2660 (1.40GHz/6.0MB) with 4 CPU and 14335MB running VMS V8.4-2L1
|Digital Personal WorkStation with 1 CPU and 1536MB running VMS V8.4-2L1
|innotek GmbH VirtualBox with 2 CPU and 7574MB running VMS V9.2

It seems to be implemented as a tight DCL loop that executes almost
entirely in inner modes (I'm sure Brian can explain why).

$ start_cputime = f$getjpi(0,"CPUTIM")
$ loop_index = 0
$ 10$:
$ loop_index = loop_index + 1
$ if loop_index .ne. init_loop_maximum then goto 10$
$ end_cputime = f$getjpi(0,"CPUTIM")

|Combined for  2 CPUs         0         50        100       150      200
| Interrupt State             |                                       |
| MP Synchronization          |                                       |
| Kernel Mode              21 |▒▒▒▒                                   |
| Executive Mode           21 |▒▒▒▒                                   |
| Supervisor Mode          58 |▒▒▒▒▒▒▒▒▒▒▒                            |
| User Mode                   |                                       |
| Compatibility Mode          |                                       |
| Idle Time                99 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒                    |

100% (or all but) of this execution appears to be in inner modes,
although X86 (above) seems to have much more Kernel mode than other
architectures (e.g. IA64, below). There is no USER mode displayed in either.

|Combined for  4 CPUs         0         100       200       300      400
| Interrupt State           1 |                                       |
| MP Synchronization          |                                       |
| Kernel Mode               5 |                                       |
| Executive Mode           18 |▒                                      |
| Supervisor Mode          78 |▒▒▒▒▒▒▒                                |
| User Mode                   |                                       |
| Not Available               |                                       |
| Idle Time               299 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒          |

There appear to be (at least) two versions of these procedures. The
later one contains:

|$! Modified: MAY-2010: Code updated by Volker Halle to address the
|$! following issues:

and tweaks a few of the calculations.

There also appear to be earlier tweaks allowing for Alpha processors

|$ cpu_multiplier = 10 ! VAX = 10 - Alpha/AXP = 40
|$ cpu_round_add = 1 ! VAX = 1 - Alpha/AXP = 9

but none for Itanium.

Are the Alpha tweaks sufficient to allow relevance for all 64-bit CPUs?

Are further tweaks required to make measurements on Itania relevant?

And of course the same question for the successor to all three
architectures?
--
Anyone, who using social-media, forms an opinion regarding anything
other than the relative cuteness of this or that puppy-dog, needs
seriously to examine their critical thinking.
Volker Halle
2022-12-16 13:24:44 UTC
Mark,

I've been using this procedure in the past (and fixed some problems with faster Alpha CPUs) to obtain CPU speed estimates during migration projects to the Stromasys CHARON emulators. In the beginning, those emulators did not perform well enough, CPU-wise, for emulating the faster Alphas (1 GHz or above). It was never a problem for VAX.

Running a plain DCL procedure was the easiest way to get an estimate of the CPU speed of the customer's system to be migrated. Yes, I know, system performance is more than CPU performance, but the CPU speed of the early emulators was an expected bottleneck.

Itanium was never a target for running this procedure, as there was/is no Itanium emulator.

The procedure tries to execute a tight DCL loop to avoid I/Os as much as possible, as they would disturb the CPU speed estimate. This is why you see only Supervisor mode (or higher) and no User mode; User mode would require an executable image.

There are certainly better and more exact CPU performance measurement tools, but a DCL procedure was the easiest thing to transfer to the customer's system and run to provide a CPU speed estimate.

Volker.
Simon Clubley
2022-12-16 13:31:37 UTC
Post by Mark Daniel
It seems to be implemented as a tight DCL loop that executes almost
entirely in inner modes (I'm sure Brian can explain why).
Unless CPU instructions execute at a different rate in inner modes,
that by itself should make no difference. However, there is one area
in which that could maybe matter for x86-64. See below.
Post by Mark Daniel
$ start_cputime = f$getjpi(0,"CPUTIM")
$ loop_index = 0
$ 10$:
$ loop_index = loop_index + 1
$ if loop_index .ne. init_loop_maximum then goto 10$
$ end_cputime = f$getjpi(0,"CPUTIM")
|Combined for  2 CPUs         0         50        100       150      200
| Interrupt State             |                                       |
| MP Synchronization          |                                       |
| Kernel Mode              21 |▒▒▒▒                                   |
| Executive Mode           21 |▒▒▒▒                                   |
| Supervisor Mode          58 |▒▒▒▒▒▒▒▒▒▒▒                            |
| User Mode                   |                                       |
| Compatibility Mode          |                                       |
| Idle Time                99 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒                    |
100% (or all but) of this execution appears to be in inner modes,
although X86 (above) seems to have much more Kernel mode than other
architectures (e.g. IA64, below). There is no USER mode displayed in either.
And that's the difference (and what you are seeing is what I would expect).

Don't forget that Executive and Supervisor modes on x86-64 VMS are purely
an illusion and don't forget that effort is expended in Kernel mode to
switch to and from the emulation of those modes.
Post by Mark Daniel
Are the Alpha tweaks sufficient to allow relevance for all 64bit CPUs?
Are further tweaks required to make measurements on Itania relevant?
And of course the same question for the successor to all three
architectures?
I think you are now seeing the limits of trying to do this in DCL.

Given all the Executive and Supervisor mode overheads in x86-64 VMS,
I think the only sensible solution is to implement this as a user-mode
compiled program so that you really are testing the relative CPU performance.

You are certainly going to see issues when trying to compare Intel with
AMD processors for example (due to the lack of PCID on AMD processors).

I was going to compare VUPS.COM to the BogoMIPS measurement used on Linux,
but there are issues addressed during the calculation of BogoMIPS that
cannot be addressed when trying to do this at DCL level. The following is
a pretty good summary of how BogoMIPS works:

https://en.wikipedia.org/wiki/BogoMips

BTW, there is one very interesting thing mentioned in the above article
that I had not considered until now when it comes to VMS on x86-64:

Does x86-64 VMS implement dynamic frequency scaling or does it run the
CPUs flat out at 100% of maximum speed all the time ?

As a final note, VUPS.COM only tests relative CPU performance. Should there
now be additional testing programs to test I/O subsystem performance ?

Simon.
--
Simon Clubley, ***@remove_me.eisner.decus.org-Earth.UFP
Walking destinations on a map are further away than they appear.
Dave Froble
2022-12-16 18:44:27 UTC
Post by Simon Clubley
Post by Mark Daniel
It seems to be implemented as a tight DCL loop that executes almost
entirely in inner modes (I'm sure Brian can explain why).
Unless CPU instructions execute at a different rate in inner modes,
that by itself should make no difference. However, there is one area
in which that could maybe matter for x86-64. See below.
Post by Mark Daniel
$ start_cputime = f$getjpi(0,"CPUTIM")
$ loop_index = 0
$ 10$:
$ loop_index = loop_index + 1
$ if loop_index .ne. init_loop_maximum then goto 10$
$ end_cputime = f$getjpi(0,"CPUTIM")
|Combined for  2 CPUs         0         50        100       150      200
| Interrupt State             |                                       |
| MP Synchronization          |                                       |
| Kernel Mode              21 |▒▒▒▒                                   |
| Executive Mode           21 |▒▒▒▒                                   |
| Supervisor Mode          58 |▒▒▒▒▒▒▒▒▒▒▒                            |
| User Mode                   |                                       |
| Compatibility Mode          |                                       |
| Idle Time                99 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒                    |
100% (or all but) of this execution appears to be in inner modes,
although X86 (above) seems to have much more Kernel mode than other
architectures (e.g. IA64, below). There is no USER mode displayed in either.
And that's the difference (and what you are seeing is what I would expect).
Don't forget that Executive and Supervisor modes on x86-64 VMS are purely
an illusion and don't forget that effort is expended in Kernel mode to
switch to and from the emulation of those modes.
Post by Mark Daniel
Are the Alpha tweaks sufficient to allow relevance for all 64bit CPUs?
Are further tweaks required to make measurements on Itania relevant?
And of course the same question for the successor to all three
architectures?
I think you are now seeing the limits of trying to do this in DCL.
Given all the Executive and Supervisor mode overheads in x86-64 VMS,
I think the only sensible solution is to implement this as a user-mode
compiled program so that you really are testing the relative CPU performance.
You are certainly going to see issues when trying to compare Intel with
AMD processors for example (due to the lack of PCID on AMD processors).
I was going to compare VUPS.COM to the BogoMIPS measurement used on Linux,
but there are issues addressed during the calculation of BogoMIPS that
cannot be addressed when trying to do this at DCL level. The following is
https://en.wikipedia.org/wiki/BogoMips
BTW, there is one very interesting thing mentioned in the above article
Does x86-64 VMS implement dynamic frequency scaling or does it run the
CPUs flat out at 100% of maximum speed all the time ?
As a final note, VUPS.COM only tests relative CPU performance. Should there
now be additional testing programs to test I/O subsystem performance ?
Simon.
Of course there can be more specific, and accurate, means of such measurements.
But as Volker mentioned, what was desired was a rough wild-ass guesstimate of what
one might expect. For example, if the same code ran on system "B" twice as
fast as on system "A", then one might reasonably expect that system "B" would do
more work than system "A". Many times that is adequate.
--
David Froble Tel: 724-529-0450
Dave Froble Enterprises, Inc. E-Mail: ***@tsoft-inc.com
DFE Ultralights, Inc.
170 Grimplin Road
Vanderbilt, PA 15486
chris
2022-12-16 14:11:20 UTC
Post by Mark Daniel
Now, before everyone piles on, I understand the procedure provides an
indicative/comparative/finger-in-the-air measurement of the relative
performance of a VMS CPU relative to "the original VAX processor".
The spec.org site used to be the best place for that sort of thing
and where, IIRC, the original work to define 1 VUP was done. It's still
running and might be worth looking at. The truth is, though, that gains
have been incrementally minimal for single core, with multicore taking
up the banner since.

Here, in the days when I used the Tex package, a good rough guide
was the number of source pages compiled per minute. Microvax II,
about 4 ppm, First Sun workstation, about 20 ppm...

Chris
Post by Mark Daniel
|$! Provides an estimate of system CPU performance on OpenVMS systems.
|$! Use at your own risk.
|$ write sys$output f$fao("!AS with !UL CPU and !ULMB running VMS !AS",-
|f$edit(f$getsyi("hw_name"),"compress,trim"),-
|f$getsyi("availcpu_cnt"),-
|(f$getsyi("memsize")*(f$getsyi("page_size")/512)/2048),-
|f$edit(f$getsyi("version"),"collapse"))
|HP rx2660 (1.40GHz/6.0MB) with 4 CPU and 14335MB running VMS V8.4-2L1
|Digital Personal WorkStation with 1 CPU and 1536MB running VMS V8.4-2L1
|innotek GmbH VirtualBox with 2 CPU and 7574MB running VMS V9.2
It seems to be implemented as a tight DCL loop that executes almost
entirely in inner modes (I'm sure Brian can explain why).
$ start_cputime = f$getjpi(0,"CPUTIM")
$ loop_index = 0
$ 10$:
$ loop_index = loop_index + 1
$ if loop_index .ne. init_loop_maximum then goto 10$
$ end_cputime = f$getjpi(0,"CPUTIM")
|Combined for  2 CPUs         0         50        100       150      200
| Interrupt State             |                                       |
| MP Synchronization          |                                       |
| Kernel Mode              21 |▒▒▒▒                                   |
| Executive Mode           21 |▒▒▒▒                                   |
| Supervisor Mode          58 |▒▒▒▒▒▒▒▒▒▒▒                            |
| User Mode                   |                                       |
| Compatibility Mode          |                                       |
| Idle Time                99 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒                    |
100% (or all but) of this execution appears to be in inner modes,
although X86 (above) seems to have much more Kernel mode than other
architectures (e.g. IA64, below).  There is no USER mode displayed in
either.
|Combined for  4 CPUs         0         100       200       300      400
| Interrupt State           1 |                                       |
| MP Synchronization          |                                       |
| Kernel Mode               5 |                                       |
| Executive Mode           18 |▒                                      |
| Supervisor Mode          78 |▒▒▒▒▒▒▒                                |
| User Mode                   |                                       |
| Not Available               |                                       |
| Idle Time               299 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒          |
There appear to be (at least) two versions of these procedures.  The
|$! Modified: MAY-2010: Code updated by Volker Halle to address the
and tweaks a few of the calculations.
There also appear to be earlier tweaks allowing for Alpha processors
|$ cpu_multiplier = 10 ! VAX = 10 - Alpha/AXP = 40
|$ cpu_round_add = 1 ! VAX = 1 - Alpha/AXP = 9
but none for Itanium.
Are the Alpha tweaks sufficient to allow relevance for all 64bit CPUs?
Are further tweaks required to make measurements on Itania relevant?
And of course the same question for the successor to all three
architectures?
Arne Vajhøj
2022-12-16 14:49:22 UTC
Post by chris
Post by Mark Daniel
Now, before everyone piles on, I understand the procedure provides an
indicative/comparative/finger-in-the-air measurement of the relative
performance of a VMS CPU relative to "the original VAX processor".
The spec.org site used to be the best place for that sort of thing
and where iirc, the original work to define 1 vup was done. Still
running and might be worth looking at. Truth is though, gains have
been incrementally minimal for single core, with multicore
taking up the banner since.
Here, in the days when I used the Tex package, a good rough guide
was the number of source pages compiled per minute. Microvax II,
about 4 ppm, First Sun workstation, about 20 ppm...
SPEC is different from VUPS.

But VUPS, SPEC 89 and SPEC 92 are all VAX 780 based.

SPEC 95 is SparcStation 10 and SPEC 2000 is Sparc Ultra 10 based.

(all according to old notes I made like 20 years ago)

Arne
chris
2022-12-16 15:07:37 UTC
Post by Arne Vajhøj
Post by chris
Post by Mark Daniel
Now, before everyone piles on, I understand the procedure provides an
indicative/comparative/finger-in-the-air measurement of the relative
performance of a VMS CPU relative to "the original VAX processor".
The spec.org site used to be the best place for that sort of thing
and where iirc, the original work to define 1 vup was done. Still
running and might be worth looking at. Truth is though, gains have
been incrementally minimal for single core, with multicore
taking up the banner since.
Here, in the days when I used the Tex package, a good rough guide
was the number of source pages compiled per minute. Microvax II,
about 4 ppm, First Sun workstation, about 20 ppm...
SPEC is different from VUPS.
But VUPS, SPEC 89 and SPEC 92 are all VAX 780 based.
SPEC 95 is SparcStation 10 and SPEC 2000 is Sparc Ultra 10 based.
(all according to old notes I made like 20 years ago)
Arne
I should try looking again. Pretty much bang up to date afaics...

Chris
Arne Vajhøj
2022-12-16 14:44:26 UTC
Post by Mark Daniel
Now, before everyone piles on, I understand the procedure provides an
indicative/comparative/finger-in-the-air measurement of the relative
performance of a VMS CPU relative to "the original VAX processor".
Yes.

I think the order of preference would be:
1) measuring the actual application in question
2) measuring a broad suite of programs (SPEC style)
3) measuring using a simple thing like this

But even the last will do if it is only the magnitude that is
relevant.
Post by Mark Daniel
It seems to be implemented as a tight DCL loop that executes almost
entirely in inner modes (I'm sure Brian can explain why).
$ start_cputime = f$getjpi(0,"CPUTIM")
$ loop_index = 0
$ 10$:
$ loop_index = loop_index + 1
$ if loop_index .ne. init_loop_maximum then goto 10$
$ end_cputime = f$getjpi(0,"CPUTIM")
|Combined for  2 CPUs         0         50        100       150      200
| Interrupt State             |                                       |
| MP Synchronization          |                                       |
| Kernel Mode              21 |▒▒▒▒                                   |
| Executive Mode           21 |▒▒▒▒                                   |
| Supervisor Mode          58 |▒▒▒▒▒▒▒▒▒▒▒                            |
| User Mode                   |                                       |
| Compatibility Mode          |                                       |
| Idle Time                99 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒                    |
100% (or all but) of this execution appears to be in inner modes,
although X86 (above) seems to have much more Kernel mode than other
architectures (e.g. IA64, below).  There is no USER mode displayed in
either.
|Combined for  4 CPUs         0         100       200       300      400
| Interrupt State           1 |                                       |
| MP Synchronization          |                                       |
| Kernel Mode               5 |                                       |
| Executive Mode           18 |▒                                      |
| Supervisor Mode          78 |▒▒▒▒▒▒▒                                |
| User Mode                   |                                       |
| Not Available               |                                       |
| Idle Time               299 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒          |
Given that DCL runs in supervisor mode, not user mode, the supervisor
mode percentage being high and user mode being zero does not surprise
me.

I don't know why kernel and executive modes are higher on x86-64 than on
Itanium. But one possible explanation could be that kernel and executive
mode time is partly fixed, while supervisor mode time (time spent
interpreting the execution of the loop) is strictly related to CPU
speed - which means a higher K+E percentage and a lower S percentage
on faster CPUs.

But a bit outside my area of expertise.
Post by Mark Daniel
There also appear to be earlier tweaks allowing for Alpha processors
|$ cpu_multiplier = 10 ! VAX = 10 - Alpha/AXP = 40
|$ cpu_round_add = 1 ! VAX = 1 - Alpha/AXP = 9
but none for Itanium.
Are the Alpha tweaks sufficient to allow relevance for all 64bit CPUs?
Are further tweaks required to make measurements on Itania relevant?
And of course the same question for the successor to all three
architectures?
One would need to look at how those constants are used.

But if we assume that they make the test do more iterations
to provide a better test on faster systems (that is quite
common in speed tests), then I would expect Itanium
constants to be higher than Alpha constants and x86-64
constants to be higher than Itanium constants.

If they relate to something 32-bit vs 64-bit then the above would be
totally wrong, but DCL math is 32-bit, AFAIK.

Arne
Stephen Hoffman
2022-12-16 20:13:55 UTC
Post by Mark Daniel
Now, before everyone piles on, I understand the procedure provides an
indicative/comparative/finger-in-the-air measurement of the relative
performance of a VMS CPU relative to "the original VAX processor".
Around the introduction of Alpha, DEC punted on VUPS.

VUPS wasn't particularly representative across VAX. With Alpha, less so.

There was a while when SPEC was occasionally used, and IIRC
occasionally even some TPC benchmarks for database-related work, but
these tended to misrepresent system performance.

This was all part of the genesis of the DEC customer benchmarking and
test lab at DEC ZKO Nashua; test the actual customer apps on the actual
servers.

For those that do want pictures, some performance charts:
http://www.cs.columbia.edu/~martha/courses/3827/au14/advanced-topics.pdf
http://www.jcmit.net/cpu-performance.htm
http://www.roylongbottom.org.uk/whetstone.htm

And then there's this:
Raspberry Pi 2 at 1 GHz offers 4,744 Dhrystone MIPS, as compared with a
VAX-11/780 offering a single, solitary MIP.
--
Pure Personal Opinion | HoffmanLabs LLC
Mark Daniel
2022-12-16 23:43:05 UTC
Post by Mark Daniel
Now, before everyone piles on, I understand the procedure provides an
indicative/comparative/finger-in-the-air measurement of the relative
performance of a VMS CPU relative to "the original VAX processor".
8< snip 8<

Thanks to all those who contributed to this thread.

The followup that caught my eye was from Simon Clubley who provided an
entirely convincing explanation for the elevated X86 Kernel mode.

Also his pointer to BogoMips. Most interesting. I read the FAQ and
accessed the GitHub code. Quite straightforward. Might be a good
replacement as a general performance metric.

https://github.com/vitalyvch/Bogo/blob/BogoMIPS_v1.3/bogomips.c

So, taking the example general-use implementation, I modified it for
building on VMS (and added some bits I find useful).

The only bit I am less sure of is the suppression of optimisation. The
above example uses a GCC-ism

| for (i = loops; !!(i > 0); --i)
| asm volatile ("" ::: "memory");

which I replaced with

|#pragma optimize save
|#pragma optimize level=0
|/* portable version */
|static void delay(long loops)
|{
| long i;
| for (i = loops; !!(i > 0); --i);
|}
|#pragma optimize restore

Anyway, the results were interesting (to say the least):

|$ mcr []bogomips
|HP rx2660 (1.40GHz/6.0MB) 4 CPUs 14335MB V8.4-2L1
|Calibrating delay loop.. ok - 692.73 BogoMips
|$ @vups
|HP rx2660 (1.40GHz/6.0MB) with 4 CPU and 14335MB running VMS V8.4-2L1
|INFO: Preventing endless loop (10$) on fast CPUs
|Approximate System VUPs Rating : 486.3 ( min: 483.8 max: 488.8 )

**
|$ mcr []bogomips
|Digital Personal WorkStation 1 CPU 1536MB V8.4-2L1
|Calibrating delay loop.. ok - 497.10 BogoMips
|$ @vups
|Digital Personal WorkStation with 1 CPU and 1536MB running VMS V8.4-2L1
|Approximate System VUPs Rating : 150.9 ( min: 149.4 max: 151.8 )

|$ mcr []bogomips
|AlphaServer DS20 500 MHz 2 CPUs 1536MB V8.4-2L2
|Calibrating delay loop.. ok - 488.06 BogoMips
|$ @vups
|AlphaServer DS20 500 MHz with 2 CPU and 1536MB running VMS V8.4-2L2
|Approximate System VUPs Rating : 250.7 ( min: 249.8 max: 251.2 )

|$ mcr []bogomips
|innotek GmbH VirtualBox 2 CPUs 7574MB V9.2
|Calibrating delay loop.. ok - 185.12 BogoMips
|$ @dvups
|innotek GmbH VirtualBox with 2 CPU and 7574MB running VMS V9.2
|Approximate System VUPs Rating : 275.9 ( min: 269.2 max: 286.6 )

For anyone interested my VMS bogomips.c can be found at

https://wasd.vsm.com.au/wasd_tmp/bogomips.c

** Interestingly, the PWS BogoMips matches the same model reported in
https://www.clifton.nl/index.html?bogomips.html
--
Anyone, who using social-media, forms an opinion regarding anything
other than the relative cuteness of this or that puppy-dog, needs
seriously to examine their critical thinking.
Mark Daniel
2022-12-19 05:01:41 UTC
Post by Mark Daniel
Post by Mark Daniel
Now, before everyone piles on, I understand the procedure provides an
indicative/comparative/finger-in-the-air measurement of the relative
performance of a VMS CPU relative to "the original VAX processor".
8< snip 8<
Thanks to all those who contributed to this thread.
The followup that caught my eye was from Simon Clubley who provided an
entirely convincing explanation for the elevated X86 Kernel mode.
Also his pointer to BogoMips.  Most interesting.  I read the FAQ and
accessed the GitHub code.  Quite straightforward.  Might be a good
replacement as a general performance metric.
https://github.com/vitalyvch/Bogo/blob/BogoMIPS_v1.3/bogomips.c
8< snip 8<

For general VMS comparative usage something that measures VMS itself is
needed. I looked around the 'net and nothing sprang out. I wonder what
VSI are using for metrics on X86 development? Anything lurking in the
DECUS/Freeware repositories that I missed?

Anyway, in the absence of anything else, I was thinking about what may
consume "non-productive" VMS cycles (i.e. non-USER mode crunching :-)
and all I could think of were the transitions between USER, EXEC and
KERNEL modes. As required by RMS, $QIO, drivers, etc., etc. No SUPER
modes measured here.

With this in mind I knocked up a small program that repeatedly calls a
function using $CMEXEC, which calls a function using $CMKRNL, and that
is that. It measures how much effort is required compared to the simple
USER mode loop and reports it as b[ogo]VUPs.

https://wasd.vsm.com.au/wasd_tmp/bogovups.c

The real disappointment is my X86 VM. The rest of the results seem in
line with expectations.

PS. Looking for ideas, suggestions, criticism(s), etc. here...
--
Anyone, who using social-media, forms an opinion regarding anything
other than the relative cuteness of this or that puppy-dog, needs
seriously to examine their critical thinking.
Simon Clubley
2022-12-19 13:23:21 UTC
Post by Mark Daniel
With this in mind I knocked-up a small program to repeatedly call a
function using $CMEXEC which calls a function using $CMKRNL and that is
that. It measures how much effort is required compared to the simple
USER mode loop and reports it as b[ogo]VUPs.
https://wasd.vsm.com.au/wasd_tmp/bogovups.c
The real disappointment is my X86 VM. The rest of the results seem in
line with expectations.
What numbers did you see ?
Post by Mark Daniel
PS. Looking for ideas, suggestions, criticism(s), etc. here...
The tests still "feel" rather artificial.

My suggested alternative would be to write actual files away to disk
using RMS sequential files and also indexed files. Maybe read them
back as well.

Repeat the sequential files test using direct QIO access and see what
performance difference that gives when you bypass the transition to
executive mode.

(Yes, I know, RMS will add its own fixed overheads, but you will still
be able to see a percentage difference across the various machines you
are testing on and whether x86-64 VMS imposes a much higher performance
overhead.)

One obvious problem is that you will have to address issues round
caching in your tests.

Simon.
--
Simon Clubley, ***@remove_me.eisner.decus.org-Earth.UFP
Walking destinations on a map are further away than they appear.
Jan-Erik Söderholm
2022-12-19 14:43:08 UTC
Post by Simon Clubley
Post by Mark Daniel
https://wasd.vsm.com.au/wasd_tmp/bogovups.c
What numbers did you see ?
They are in the C file on the link above.
Simon Clubley
2022-12-19 18:42:42 UTC
Post by Jan-Erik Söderholm
Post by Simon Clubley
Post by Mark Daniel
https://wasd.vsm.com.au/wasd_tmp/bogovups.c
What numbers did you see ?
They are in the C file on the link above.
Dammit. I somehow managed to miss that in the comments while looking at
the rest of the code. :-(

Sorry.

Thanks, Jan-Erik.

Simon.
--
Simon Clubley, ***@remove_me.eisner.decus.org-Earth.UFP
Walking destinations on a map are further away than they appear.
abrsvc
2022-12-19 14:59:13 UTC
Post by Simon Clubley
Post by Mark Daniel
With this in mind I knocked-up a small program to repeatedly call a
function using $CMEXEC which calls a function using $CMKRNL and that is
that. It measures how much effort is required compared to the simple
USER mode loop and reports it as b[ogo]VUPs.
https://wasd.vsm.com.au/wasd_tmp/bogovups.c
The real disappointment is my X86 VM. The rest of the results seem in
line with expectations.
What numbers did you see ?
Post by Mark Daniel
PS. Looking for ideas, suggestions, criticism(s), etc. here...
The tests still "feel" rather artificial.
My suggested alternative would be to write actual files away to disk
using RMS sequential files and also indexed files. Maybe read them
back as well.
Repeat the sequential files test using direct QIO access and see what
performance difference that gives when you bypass the transition to
executive mode.
(Yes, I know, RMS will add its own fixed overheads, but you will still
be able to see a percentage difference across the various machines you
are testing on and whether x86-64 VMS imposes a much higher performance
overhead.)
One obvious problem is that you will have to address issues round
caching in your tests.
Simon.
--
Walking destinations on a map are further away than they appear.
Why would you want I/O involved in a measurement of relative CPU power? That makes no sense. The VUPs rating has always been a relative CPU performance test. You can argue whether User mode vs other modes makes sense. I suppose that using Mark's updated program may make sense, given that the newer versions of OpenVMS use software for some functions (modes, really). I don't believe that this is accurate, as it will compare hardware vs software between a few models.

I would be interested in (and will likely test) this new option in the emulated environment. It may be a more consistent comparison in this environment.

Dan
Simon Clubley
2022-12-19 19:10:40 UTC
Post by abrsvc
Post by Simon Clubley
Post by Mark Daniel
With this in mind I knocked-up a small program to repeatedly call a
function using $CMEXEC which calls a function using $CMKRNL and that is
that. It measures how much effort is required compared to the simple
USER mode loop and reports it as b[ogo]VUPs.
https://wasd.vsm.com.au/wasd_tmp/bogovups.c
The real disappointment is my X86 VM. The rest of the results seem in
line with expectations.
What numbers did you see ?
Post by Mark Daniel
PS. Looking for ideas, suggestions, criticism(s), etc. here...
The tests still "feel" rather artificial.
My suggested alternative would be to write actual files away to disk
using RMS sequential files and also indexed files. Maybe read them
back as well.
Repeat the sequential files test using direct QIO access and see what
performance difference that gives when you bypass the transition to
executive mode.
(Yes, I know, RMS will add its own fixed overheads, but you will still
be able to see a percentage difference across the various machines you
are testing on and whether x86-64 VMS imposes a much higher performance
overhead.)
One obvious problem is that you will have to address issues round
caching in your tests.
Why would you want I/O involved in a measurement of relative CPU power? That makes no sense. The VUPs rating has always been a relative CPU performance test. You can argue whether User mode vs other modes makes sense. I suppose that using Mark's updated program may make sense given that the newer versions of OpenVMS use software for some functions (modes really). I don't believe that this is accurate as it will compare hardware vs software between a few models.
Because Mark was reporting bad performance on x86-64 VMS compared to his
other machines, but are the results artificially bad ?

We already know there is going to be an overhead because of the need to
emulate executive and supervisor modes on x86-64 VMS, but Mark's results
don't cover the elapsed time taken by doing "real" work while in executive
mode.

Now that I have seen the numbers (thanks Jan-Erik :-)), they would appear
to be not only bad in general, but bad even when compared to Alpha (provided
Mark's VirtualBox instance isn't somehow constraining performance and
assuming he is running it on modern x86-64 hardware).

As such, doing real work in executive mode, and then comparing that
performance to direct QIO calls, might give better insights into more
real-world performance results.

Also, to help eliminate differences in disk hardware performance, might
it be a good idea to run those tests on a RAM disk instead of a physical
disk ?

Regarding your comment about testing in an emulated environment, don't
forget that VirtualBox is NOT an emulator, but a virtualisation mechanism.

That means operating systems running under it should be running at close
to native hardware performance speeds and it means you should be testing
the x86-64 performance against physical Alpha machines, not emulated ones.

Simon.
--
Simon Clubley, ***@remove_me.eisner.decus.org-Earth.UFP
Walking destinations on a map are further away than they appear.
abrsvc
2022-12-19 19:19:29 UTC
Post by Simon Clubley
Regarding your comment about testing in an emulated environment, don't
forget that VirtualBox is NOT an emulator, but a virtualisation mechanism.
That means operating systems running under it should be running at close
to native hardware performance speeds and it means you should be testing
the x86-64 performance against physical Alpha machines, not emulated ones.
Simon.
I never said that the tests run by Mark were done in an emulated environment. Don't infer things that are not there.

What I suggested was that it would be interesting to see what the use of his "new" option would show on an emulated system. No mention of anything other than that. I still think it would be interesting to see if the correlation of "real" to emulated holds using Mark's test when compared to using the original VUPS procedure.

Dan
Simon Clubley
2022-12-19 19:25:19 UTC
Post by abrsvc
Post by Simon Clubley
Regarding your comment about testing in an emulated environment, don't
forget that VirtualBox is NOT an emulator, but a virtualisation mechanism.
That means operating systems running under it should be running at close
to native hardware performance speeds and it means you should be testing
the x86-64 performance against physical Alpha machines, not emulated ones.
Simon.
I never said that the tests run by Mark were done in an emulated environment. Don't infer things that are not there.
What I suggested was that it would be interesting to see what the use of his "new" option would show on an emulated system. No mention of anything other than that. I still think it would be interesting to see if the correlation of "real" to emulated holds using Mark's test when compared to using the original VUPS procedure.
Dan, you are _way_ too quick to read bad things into my comments. :-(

Read the quoted comment above again, and you will see I am talking about
the emulator tests _you_ are planning to do and I was pointing out that
Mark's tests should have been done at somewhere close to his native
hardware speed.

Simon.
--
Simon Clubley, ***@remove_me.eisner.decus.org-Earth.UFP
Walking destinations on a map are further away than they appear.
Mark Daniel
2022-12-19 23:06:04 UTC
Post by Simon Clubley
Post by abrsvc
Post by Simon Clubley
Post by Mark Daniel
With this in mind I knocked-up a small program to repeatedly call a
function using $CMEXEC which calls a function using $CMKRNL and that is
that. It measures how much effort is required compared to the simple
USER mode loop and reports it as b[ogo]VUPs.
https://wasd.vsm.com.au/wasd_tmp/bogovups.c
The real disappointment is my X86 VM. The rest of the results seem in
line with expectations.
What numbers did you see ?
Post by Mark Daniel
PS. Looking for ideas, suggestions, criticism(s), etc. here...
The tests still "feel" rather artificial.
Certainly they are artificial. I really didn't want to write a test
suite, just a simple measuring stick for a quick comparison of
platforms. For performance testing the only real metric is "the
application" or a representative code base.
Post by Simon Clubley
Post by abrsvc
Post by Simon Clubley
My suggested alternative would be to write actual files away to disk
using RMS sequential files and also indexed files. Maybe read them
back as well.
Repeat the sequential files test using direct QIO access and see what
performance difference that gives when you bypass the transition to
executive mode.
(Yes, I know, RMS will add its own fixed overheads, but you will still
be able to see a percentage difference across the various machines you
are testing on and whether x86-64 VMS imposes a much higher performance
overhead.)
One obvious problem is that you will have to address issues round
caching in your tests.
Why would you want I/O involved in a measurement of relative CPU power? That makes no sense. The VUPs rating has always been a relative CPU performance test. You can argue whether User mode vs other modes makes sense. I suppose that using Mark's updated program may make sense given that the newer versions of OpenVMS use software for some functions (modes really). I don't believe that this is accurate as it will compare hardware vs software between a few models.
Agreed. Not for a "simple measuring stick" (to quote myself).

Disagree. It compares *VMS* performance. Not necessarily just CPU
power (which is the thrust of BogoMips) but a combination of
hardware/bus/software.
Post by Simon Clubley
Because Mark was reporting bad performance on x86-64 VMS compared to his
other machines, but are the results artificially bad ?
X86 actually performs quite responsively, so this is a fair question.

My EXEC mode function calling a KERNEL mode function is pretty intense.
Perhaps it would be better calling KERNEL mode every tenth time. Or
some combination such as that. I have no information on the respective
call rates apart from @VUPS.COM which on my Alpha shows

| Executive Mode 18 |▒▒▒▒▒▒▒
| Supervisor Mode 81 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒

and on the X86 VM

Combined for 2 CPUs
| Kernel Mode 22 |▒▒▒▒
| Executive Mode 16 |▒▒▒
| Supervisor Mode 62 |▒▒▒▒▒▒▒▒▒▒▒▒
| Idle Time 100 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒

Of course there is no SUPER mode in my code.

The numbers certainly support the idea that X86 inner-modes are
"emulated" in some way, mediated by KERNEL mode processing.
Post by Simon Clubley
We already know there is going to be an overhead because of the need to
emulate executive and supervisor modes on x86-64 VMS, but Mark's results
don't cover the elapsed time taken by doing "real" work while in executive
mode.
Quite deliberately. It was intended to measure only the overhead of
inner-mode calls. Is this fair and equitable? Who knows?
Post by Simon Clubley
Now that I have seen the numbers (thanks Jan-Erik :-)), they would appear
to be not only bad in general, but bad even when compared to Alpha (provided
Mark's VirtualBox instance isn't somehow constraining performance and
assuming he is running it on modern x86-64 hardware).
A pre-loved AU$300.00 Dell SFF bought via eBay -- Optiplex 9020,
Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz RAM 16.0 GB, Windows 10 Pro
Version 22H2 Installed 17/03/2021, VirtualBox 6.1.40. A nice little
system for the money.
Post by Simon Clubley
As such, doing real work in executive mode, and then comparing that
performance to direct QIO calls, might give better insights into more
real-world performance results.
Also, to help eliminate differences in disk hardware performance, might
it be a good idea to run those tests on a RAM disk instead of a physical
disk ?
Not interested in I/O.

Doesn't meet the project criterion of quick (and dirty) platform
comparison (a la VUPS.COM). Perhaps this approach is fraught with all
sorts of limitations and attention should be redirected to VUPS.COM.
Post by Simon Clubley
Regarding your comment about testing in an emulated environment, don't
forget that VirtualBox is NOT an emulator, but a virtualisation mechanism.
Certainly.
Post by Simon Clubley
That means operating systems running under it should be running at close
to native hardware performance speeds and it means you should be testing
the x86-64 performance against physical Alpha machines, not emulated ones.
As a performance comparison tool, why not?
Post by Simon Clubley
Simon.
Thanks for your (plural) interest.
--
Anyone, who using social-media, forms an opinion regarding anything
other than the relative cuteness of this or that puppy-dog, needs
seriously to examine their critical thinking.
Volker Halle
2022-12-20 12:13:07 UTC
Mark,

on my Intel i5-9600K @3.7 GHz with 6 Cores, Windows 10 Pro 22H2 with VMware Player 16, running my VUPS.COM reports about 730 VUPS with VSI OpenVMS x86-64 V9.2 with 2 vCPUs.

$ MONI MODE/INT=1/AVERAGE reports about 10% Kernel, 21% Exec and 51% Supervisor mode while running VUPS.COM

Running a simple LOOP.COM

$ i=0
$loop:
$ i=i+1
$ GOTO loop

consumes 22% Kernel, 33% Exec and 45% Supervisor mode

Running the same LOOP.COM procedure on an OpenVMS V8.2 rx2600 1.3 GHz (1 CPU) consumes: 8% Kernel, 31% Exec and 61% Supervisor mode.
Running VUPS.COM on that Itanium reports about 1762 VUPS with 5% Kernel, 17% Exec and 77% Supervisor mode.

The 'CPU work' is done in Supervisor mode in a small loop in VUPS.COM, Exec and Kernel could be considered (necessary) 'overhead' and this 'overhead' is more significant on x86-64.

AFAIK VSI has not yet publicized any performance data and may still be concentrating on function vs. performance.

Volker.
Mark Daniel
2022-12-20 15:51:02 UTC
Post by Volker Halle
Mark,
Nice. Perhaps a little more than AU$300.00 :-} Mine a paltry 280 VUPs.

I'm looking forward to retiring my PWS for the Dell (or something
equivalent); 145W x 24hr x 365 days, cf. 15W idle 35W processing.
Post by Volker Halle
$ MONI MODE/INT=1/AVERAGE reports about 10% Kernel, 21% Exec and 51% Supervisor mode while running VUPS.COM
Running a simple LOOP.COM
$ i=0
$ i=i+1
$ GOTO loop
consumes 22% Kernel, 33% Exec and 45% Supervisor mode
Running the same LOOP.COM procedure on an OpenVMS V8.2 rx2600 1.3 GHz (1 CPU) consumes: 8% Kernel, 31% Exec and 61% Supervisor mode.
Running VUPS.COM on that Itanium reports about 1762 VUPS with 5% Kernel, 17% Exec and 77% Supervisor mode.
The 'CPU work' is done in Supervisor mode in a small loop in VUPS.COM, Exec and Kernel could be considered (necessary) 'overhead' and this 'overhead' is more significant on x86-64.
AFAIK VSI has not yet publicized any performance data and may still be concentrating on function vs. performance.
Volker.
Again, thanks to all who contributed. Unfortunately, I thoughtlessly
began this thread at the wrong end of the year. I'll be occupied with
other things for a couple of weeks.
Post by Volker Halle
The 'CPU work' is done in Supervisor mode in a small loop in VUPS.COM, Exec and Kernel could be considered (necessary) 'overhead' and this 'overhead' is more significant on x86-64.
AFAIK VSI has not yet publicized any performance data and may still be concentrating on function vs. performance.
Understand these comments, made by several posters. Not trying to say
anything about the X86 port, per se. Just trying to get a *feel* for
the relative performance/behaviour of the respective platforms
(including X86).

My summary...

The loop described above

$ i=0
$loop:
$ i=i+1
$ GOTO loop

is the essence of VUPS.COM and seems as good a finger-in-the-air as any
other. Just forget the "VUPS" as a unit and compare with other
platforms as the measure.

Using the above DCL and MON MODES/INT=1/AVE, my X86 shows

|Combined for 2 CPUs
| Kernel Mode 26 |▒▒▒▒▒
| Executive Mode 26 |▒▒▒▒▒
| Supervisor Mode 43 |▒▒▒▒▒▒▒▒
| Idle Time 103 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒

after a couple of minutes. The PWS 500

| Executive Mode 26 |▒▒▒▒▒▒▒▒▒▒
| Supervisor Mode 70 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒
| Idle Time 3 |▒

And the RX2600

|Combined for 4 CPUs
| Executive Mode 28 |▒▒
| Supervisor Mode 61 |▒▒▒▒▒▒
| User Mode 4 |
| Idle Time 299 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒

And the reported @VUPS.COM "VUPs" correspond to expected platform
performance.

Approximate System VUPs Rating : 286.6 ( min: 285.4 max: 287.8 )
Approximate System VUPs Rating : 135.8 ( min: 135.8 max: 135.8 )
Approximate System VUPs Rating : 486.3 ( min: 483.8 max: 488.8 )

This also shows the (granted, unoptimised) X86 performance to be
none-too-shoddy, especially for AU$300.00

I withdraw the bogoVUPs.c suggestion.
--
Anyone, who using social-media, forms an opinion regarding anything
other than the relative cuteness of this or that puppy-dog, needs
seriously to examine their critical thinking.
Simon Clubley
2022-12-20 14:00:32 UTC
Post by Mark Daniel
Post by Simon Clubley
That means operating systems running under it should be running at close
to native hardware performance speeds and it means you should be testing
the x86-64 performance against physical Alpha machines, not emulated ones.
As a performance comparison tool, why not?
Because I was expecting an emulated Alpha to be slower than a real Alpha
so the results Dan gets might not be like-for-like.

Simon.
--
Simon Clubley, ***@remove_me.eisner.decus.org-Earth.UFP
Walking destinations on a map are further away than they appear.
abrsvc
2022-12-20 14:11:17 UTC
Post by Simon Clubley
Post by Mark Daniel
Post by Simon Clubley
That means operating systems running under it should be running at close
to native hardware performance speeds and it means you should be testing
the x86-64 performance against physical Alpha machines, not emulated ones.
As a performance comparison tool, why not?
Because I was expecting an emulated Alpha to be slower than a real Alpha
so the results Dan gets might not be like-for-like.
Simon.
--
Walking destinations on a map are further away than they appear.
Using this test for V9.x is not the same as using it to test emulated environments. As stated prior, the current state of compiler code generation is such that performance tests are not relevant yet.

I would expect emulated Alphas on 3.5GHz or faster machines to deliver performance levels similar to real Alpha hardware (except for some of the larger GS-class machines). I am aware of ES40- and ES45-class emulated machines achieving the same or better performance than the real Alpha equivalents: CPU speed on par, with I/O and network performance better than the real Alphas.

Dan
Johnny Billquist
2022-12-20 13:53:24 UTC
Post by Simon Clubley
Post by abrsvc
Post by Simon Clubley
Post by Mark Daniel
With this in mind I knocked-up a small program to repeatedly call a
function using $CMEXEC which calls a function using $CMKRNL and that is
that. It measures how much effort is required compared to the simple
USER mode loop and reports it as b[ogo]VUPs.
https://wasd.vsm.com.au/wasd_tmp/bogovups.c
The real disappointment is my X86 VM. The rest of the results seem in
line with expectations.
What numbers did you see ?
Post by Mark Daniel
PS. Looking for ideas, suggestions, criticism(s), etc. here...
The tests still "feel" rather artificial.
My suggested alternative would be to write actual files away to disk
using RMS sequential files and also indexed files. Maybe read them
back as well.
Repeat the sequential files test using direct QIO access and see what
performance difference that gives when you bypass the transition to
executive mode.
(Yes, I know, RMS will add its own fixed overheads, but you will still
be able to see a percentage difference across the various machines you
are testing on and whether x86-64 VMS imposes a much higher performance
overhead.)
One obvious problem is that you will have to address issues round
caching in your tests.
Why would you want I/O involved in a measurement of relative CPU power? That makes no sense. The VUPs rating has always been a relative CPU performance test. You can argue whether User mode vs other modes makes sense. I suppose that using Mark's updated program may make sense given that the newer versions of OpenVMS use software for some functions (modes really). I don't believe that this is accurate as it will compare hardware vs software between a few models.
Because Mark was reporting bad performance on x86-64 VMS compared to his
other machines, but are the results artifically bad ?
We already know there is going to be an overhead because of the need to
emulate executive and supervisor modes on x86-64 VMS, but Mark's results
don't cover the elapsed time taken by doing "real" work while in executive
mode.
The overhead would only be something that hits a little at mode
switching. While just executing code, the cost of not having executive
and supervisor in hardware should be zero.

And I agree that doing I/O usually means you won't see much of what the
CPU performance actually looks like.

Johnny
Craig A. Berry
2022-12-19 19:56:36 UTC
Post by Mark Daniel
Post by Mark Daniel
Post by Mark Daniel
Now, before everyone piles on, I understand the procedure provides an
indicative/comparative/finger-in-the-air measurement of the relative
performance of a VMS CPU relative to "the original VAX processor".
8< snip 8<
Thanks to all those who contributed to this thread.
The followup that caught my eye was from Simon Clubley who provided an
entirely convincing explanation for the elevated X86 Kernel mode.
Also his pointer to BogoMips.  Most interesting.  I read the FAQ and
accessed the github code.  Quite straightforward.  Might be a good
replacement as a general performance metric.
https://github.com/vitalyvch/Bogo/blob/BogoMIPS_v1.3/bogomips.c
8< snip 8<
For general VMS comparative usage something more VMS-measuring is
needed.  I looked about the 'net and nothing sprang out.  I wonder what
VSI are using for metrics on X86 development?  Anything lurking in the
DECUS/Freeware repositories I missed?
Anyway, in the absence of anything else, I was thinking about what may
consume "non-productive" VMS cycles (i.e. non-USER mode crunching :-)
and all I could think of were the transitions between USER, EXEC and
KERNEL modes.  As required by RMS, $QIO, drivers, etc., etc.  No SUPER
modes measured here.
With this in mind I knocked-up a small program to repeatedly call a
function using $CMEXEC which calls a function using $CMKRNL and that is
that.  It measures how much effort is required compared to the simple
USER mode loop and reports it as b[ogo]VUPs.
https://wasd.vsm.com.au/wasd_tmp/bogovups.c
The real disappointment is my X86 VM.  The rest of the results seem in
line with expectations.
PS.  Looking for ideas, suggestions, criticism(s), etc. here...
It's pretty hard to know what this means, if anything. I'll assume your
VirtualBox host is a speedy recent Intel machine with all the cpu
capabilities VMS prefers? (not an M1 Mac emulating x86 I hope!).

It may be that the cross compiler doesn't have optimizations turned on.
It may be that there is a ton of debugging code in the OS that will
eventually get turned off. It may be mode switching is genuinely slow
on x86.

If the latter is the main culprit, I suspect it will play out the same
way alignment faults did on Itanium. They were horrible for the things
affected, but not everything was affected, and not everything that was
affected was affected equally. And multiple different fixes to
compilers, products, and the OS itself eventually mitigated them to
where they became a non-problem for most people most of the time, but it
took a couple years.

The last time I remember anyone from VSI saying anything about
performance was that they hadn't even started looking at it yet. That
may have been over a year ago, but I suspect they still haven't -- just
too much else to do. The word "performance" does not appear on the roadmap.

Somehow I got the impression that enabling compiler optimizations would
be deferred until after native compilers were available. (I may have
that wrong, possibly from a vague memory of the order in which things
happened for previous ports). Not that one couldn't enable optimizations
in a cross compiler, but just that they are working in priority order,
and having native compilers (and more compilers, such as BASIC) is a
much higher priority than performance at this point. You can't even
port from Alpha to x86 right now without buying an Itanium, and I doubt
very many Alpha customers would consider doing that.
Arne Vajhøj
2022-12-20 00:19:36 UTC
Post by Mark Daniel
Post by Mark Daniel
Also his pointer to BogoMips.  Most interesting.  I read the FAQ and
accessed the github code.  Quite straightforward.  Might be a good
replacement as a general performance metric.
https://github.com/vitalyvch/Bogo/blob/BogoMIPS_v1.3/bogomips.c
8< snip 8<
For general VMS comparative usage something more VMS-measuring is
needed.  I looked about the 'net and nothing sprang out.  I wonder what
VSI are using for metrics on X86 development?  Anything lurking in the
DECUS/Freeware repositories I missed?
Anyway, in the absence of anything else, I was thinking about what may
consume "non-productive" VMS cycles (i.e. non-USER mode crunching :-)
and all I could think of were the transitions between USER, EXEC and
KERNEL modes.  As required by RMS, $QIO, drivers, etc., etc.  No SUPER
modes measured here.
With this in mind I knocked-up a small program to repeatedly call a
function using $CMEXEC which calls a function using $CMKRNL and that is
that.  It measures how much effort is required compared to the simple
USER mode loop and reports it as b[ogo]VUPs.
https://wasd.vsm.com.au/wasd_tmp/bogovups.c
The real disappointment is my X86 VM.  The rest of the results seem in
line with expectations.
PS.  Looking for ideas, suggestions, criticism(s), etc. here...
First, I assume neither your code nor VMS
itself is optimized - I believe John Reagan said that
the cross compilers do not optimize much.

But besides that I am not convinced that the time spent to
do mode switches is a particularly relevant test. It should
never be a large enough part of total CPU usage to
matter much.

In general the CPU bottlenecks should be in
user mode. So back to BogoMips or DhryStone/WhetStone
or SPEC or whatever.

If something in VMS should be tested then I think a more
relevant test would be to test what the scheduler can handle.
When does overhead start hurting throughput - 500 threads?
1000 threads? 2000 threads? 4000 threads? 8000 threads?

Arne
Bob
2022-12-20 03:10:08 UTC
Years (decades?) ago I grabbed a copy of VUPS.COM out of the startup code for a layered product; Diskeeper IIRC. It's not very large, but I'm not sure if it's proper to share what must have been copyrighted code.

Results didn't exactly match published specs, but they helped fill in the gaps for systems that were never on the old VUPS list.

-Bob
Andreas Gruhl
2022-12-20 13:56:38 UTC
Post by Mark Daniel
Post by Mark Daniel
Post by Mark Daniel
Now, before everyone piles on, I understand the procedure provides an
indicative/comparative/finger-in-the-air measurement of the relative
performance of a VMS CPU relative to "the original VAX processor".
8< snip 8<
Thanks to all those who contributed to this thread.
The followup that caught my eye was from Simon Clubley who provided an
entirely convincing explanation for the elevated X86 Kernel mode.
Also his pointer to BogoMips. Most interesting. I read the FAQ and
accessed the github code. Quite straightforward. Might be a good
replacement as a general performance metric.
https://github.com/vitalyvch/Bogo/blob/BogoMIPS_v1.3/bogomips.c
8< snip 8<
For general VMS comparative usage something more VMS-measuring is
needed. I looked about the 'net and nothing sprang out. I wonder what
VSI are using for metrics on X86 development? Anything lurking in the
DECUS/Freeware repositories I missed?
Anyway, in the absence of anything else, I was thinking about what may
consume "non-productive" VMS cycles (i.e. non-USER mode crunching :-)
and all I could think of were the transitions between USER, EXEC and
KERNEL modes. As required by RMS, $QIO, drivers, etc., etc. No SUPER
modes measured here.
With this in mind I knocked-up a small program to repeatedly call a
function using $CMEXEC which calls a function using $CMKRNL and that is
that. It measures how much effort is required compared to the simple
USER mode loop and reports it as b[ogo]VUPs.
https://wasd.vsm.com.au/wasd_tmp/bogovups.c
The real disappointment is my X86 VM. The rest of the results seem in
line with expectations.
PS. Looking for ideas, suggestions, criticism(s), etc. here...
--
I ran your program in two of our own environments with the following results:

|AlphaServer ES45 Model 1B 4 CPUs 16384MB V8.4-2L2
|Calculating IPS...
|321.000000 cpu-ticks, 3.279087 seconds, 609925873 / second
|Calculating bVUPs...
|1645.000000 cpu-ticks, 16.614445 seconds, 370.8 bVUPs
|9157.000000 cpu-ticks, 91.599084 seconds, 10.8 bVUPs

|HP DL380 2.6 GHz VMware, Inc. 4 CPUs 15868MB V9.2
|Calculating IPS...
|459.000000 cpu-ticks, 4.589954 seconds, 435734205 / second
|Calculating bVUPs...
|7470.000000 cpu-ticks, 74.799252 seconds, 58.3 bVUPs

Sorry to say, but your bVUPs computation is not very useful.
This is because you divide dins [instructions/sec] by dticks [0.01 sec].
Instead of being dimensionless (like a good VUP ought to be)
your result has the physical unit of [ 1/sec²].
My own interpretation of the figures given above is:
X86 integer performance reaches 71% of ES45 (321/459 Ticks)
X86 change mode performance reaches 22% of ES45 (1645/7470 Ticks)

This is not bad given the fact that the C cross-compiler currently attempts no
optimization at all (as I know from an extraordinarily well informed source).
Andreas
chris
2022-12-20 15:39:16 UTC
Post by Andreas Gruhl
Post by Mark Daniel
Post by Mark Daniel
Post by Mark Daniel
Now, before everyone piles on, I understand the procedure provides an
indicative/comparative/finger-in-the-air measurement of the relative
performance of a VMS CPU relative to "the original VAX processor".
8< snip 8<
Thanks to all those who contributed to this thread.
The followup that caught my eye was from Simon Clubley who provided an
entirely convincing explanation for the elevated X86 Kernel mode.
Also his pointer to BogoMips. Most interesting. I read the FAQ and
accessed the github code. Quite straighforward. Might be a good
replacement as a general performance metric.
https://github.com/vitalyvch/Bogo/blob/BogoMIPS_v1.3/bogomips.c
8< snip 8<
For general VMS comparative usage something more VMS-measuring is
needed. I looked about the 'net and nothing sprang out. I wonder what
VSI are using for metrics on X86 development? Anything lurking in the
DECUS/Freeware repositories I missed?
Anyway, in the absence of anything else, I was thinking about what may
consume "non-productive" VMS cycles (i.e. non-USER mode crunching :-)
and all I could think of were the transitions between USER, EXEC and
KERNEL modes. As required by RMS, $QIO, drivers, etc., etc. No SUPER
modes measured here.
With this in mind I knocked-up a small program to repeatedly call a
function using $CMEXEC which calls a function using $CMKRNL and that is
that. It measures how much effort is required compared to the simple
USER mode loop and reports it as b[ogo]VUPs.
https://wasd.vsm.com.au/wasd_tmp/bogovups.c
The real disappointment is my X86 VM. The rest of the results seem in
line with expectations.
PS. Looking for ideas, suggestions, criticism(s), etc. here...
--
|AlphaServer ES45 Model 1B 4 CPUs 16384MB V8.4-2L2
|Calculating IPS...
|321.000000 cpu-ticks, 3.279087 seconds, 609925873 / second
|Calculating bVUPs...
|1645.000000 cpu-ticks, 16.614445 seconds, 370.8 bVUPs
|9157.000000 cpu-ticks, 91.599084 seconds, 10.8 bVUPs
|HP DL380 2.6 GHz VMware, Inc. 4 CPUs 15868MB V9.2
|Calculating IPS...
|459.000000 cpu-ticks, 4.589954 seconds, 435734205 / second
|Calculating bVUPs...
|7470.000000 cpu-ticks, 74.799252 seconds, 58.3 bVUPs
Sorry to say, but your bVUPs computation is not very useful.
This is because you divide dins [instructions/sec] by dticks [0.01 sec].
Instead of being dimensionless (like a good VUP ought to be)
your result has the physical unit of [ 1/sec²].
X86 integer performance reaches 71% of ES45 (321/459 Ticks)
X86 change mode performance reaches 22% of ES45 (1645/7470 Ticks)
This is not bad given the fact that the C cross-compiler currently attempts no
optimization at all (as I know from an extraordinarily well informed source).
Andreas
None of this makes much sense. spec.org have been devising cpu tests
for decades and have specialist tests for different workloads. That
includes all the info on compilers and code used. Probably the most
accurate data around and is supported by system and cpu vendors as
well. Too many variables involved, so some sort of level playing
field approach is the only way to get accuracy.

Can be fun devising simple tests, but would never use that as a
basis for purchasing decisions...

Chris
abrsvc
2022-12-20 15:50:40 UTC
Post by chris
None of this makes much sense. spec.org have been devising cpu tests
for decades and have specialist tests for different workloads. That
includes all the info on compilers and code used. Probably the most
accurate data around and is supported by system and cpu vendors as
well. Too many variables involved, so some sort of level playing
field approach is the only way to get accuracy.
Can be fun devising simple tests, but would never use that as a
basis for purchasing decisions...
Chris
The big problem with these standard benchmarks is that some compilers will look for these and insert some "special" optimizations specifically for those benchmarks. You are better served using a homegrown benchmark of some type that more closely reflects your application environment.

Dan
chris
2022-12-20 16:33:08 UTC
Post by abrsvc
Post by chris
None of this makes much sense. spec.org have been devising cpu tests
for decades and have specialist tests for different workloads. That
includes all the info on compilers and code used. Probably the most
accurate data around and is supported by system and cpu vendors as
well. Too many variables involved, so some sort of level playing
field approach is the only way to get accuracy.
Can be fun devising simple tests, but would never use that as a
basis for purchasing decisions...
Chris
The big problem with these standard benchmarks is that some compilers will look for these and insert some "special" optimizations specifically for those benchmarks. You are better served using a homegrown benchmark of some type that more closely reflects your application environment.
Dan
All the conditions are published, including compiler flags,
which compiler and more. Must be more accurate than a home
grown ad hoc test which ignores so many variables that could
influence the results.

If you want to measure something, use the best and most
accurate tools available...

Chris
abrsvc
2022-12-20 16:43:20 UTC
Post by chris
Post by abrsvc
Post by chris
None of this makes much sense. spec.org have been devising cpu tests
for decades and have specialist tests for different workloads. That
includes all the info on compilers and code used. Probably the most
accurate data around and is supported by system and cpu vendors as
well. Too many variables involved, so some sort of level playing
field approach is the only way to get accuracy.
Can be fun devising simple tests, but would never use that as a
basis for purchasing decisions...
Chris
The big problem with these standard benchmarks is that some compilers will look for these and insert some "special" optimizations specifically for those benchmarks. You are better served using a homegrown benchmark of some type that more closely reflects your application environment.
Dan
All the conditions are published, including compiler flags,
which compiler and more. Must be more accurate than a home
grown ad hoc test which ignores so many variables that could
influence the results.
If you want to measure something, use the best and most
accurate tools available...
Chris
I will disagree. How many standard benchmarks bear any relevance to an actual application? I suppose you can use them for relative machine performance information, but without knowing how your own application performs relative to those, they are useless. SPEC benchmarks mean little to I/O-bound applications. Great, my new machine can perform calculations 10 times as fast. But... the application is bound by disk performance limits, so I see little to nothing for the speed improvement. Just one extreme example.
chris
2022-12-20 17:22:35 UTC
Permalink
Post by abrsvc
Post by chris
Post by abrsvc
Post by chris
None of this makes much sense. spec.org have been devising cpu tests
for decades and have specialist tests for different workloads. That
includes all the info on compilers and code used. Probably the most
accurate data around and is supported by system and cpu vendors as
well. Too many variables involved, so some sort of level playing
field approach is the only way to get accuracy.
Can be fun devising simple tests, but would never used that as a
basis for purchasing decisions...
Chris
The big problem with these standard benchmarks is that some compilers will look for these and insert some "special" optimizations specifically for those benchmarks. You are better served using a homegrown benchmark of some type that more closely reflects your application environment.
Dan
All the conditions are published, including compiler flags,
which compiler and more. Must be more accurate than a home
grown ad hoc test which ignores so many variables that could
influence the results.
If you want to measure something, use the best and most
accurate tools available...
Chris
I will disagree. How many standard benchmarks bear any relevance to an actual application? I suppose you can use them for relative machine performance information, but without knowing how your own application performs relative to those, they are useless. SPEC benchmarks mean little to I/O bound applications. Great, my new machine can perform calculations 10 times as fast. But... the application is bound by disk performance limits, so I see little to nothing for the speed improvement. just one extreme example.
The spec tests do target various workloads, database, web,
scientific and more, so why not use them? I'm sure there
must be other sites that do similar work, though I haven't looked
recently.

If you want to find out where the bottlenecks are, you need to
start with single-core throughput, to establish a baseline. That
means without I/O, which would otherwise dominate most measurements
by orders of magnitude. How can you determine anything
by measuring at the VM level only, where you can have no idea
whether it's the CPU, OS or VM layer having the most influence?
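For what it's worth, that CPU-versus-wait split can be teased apart even in a homegrown test. The sketch below (Python, with a sleep standing in for I/O wait) uses time.process_time(), which counts CPU only, against time.perf_counter(), which also counts waiting:

```python
import time

def cpu_work(n):
    # pure integer work on one core, no I/O
    total = 0
    for i in range(n):
        total += i * i
    return total

wall0, cpu0 = time.perf_counter(), time.process_time()
cpu_work(1_000_000)
time.sleep(0.2)          # stands in for I/O wait
wall1, cpu1 = time.perf_counter(), time.process_time()

wall, cpu = wall1 - wall0, cpu1 - cpu0
# the ~0.2 s of sleep shows up in wall time but not in CPU time
print(f"wall {wall:.2f}s, cpu {cpu:.2f}s")
```

A test that only reports wall-clock time would charge the whole interval to the CPU; separating the two clocks is the first step in deciding whether a workload is compute bound or wait bound.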

Suspect there is little difference between most server vendors,
since they are all using the same cpu ranges and designs and it will
be some variant of the cpu vendors reference design anyway. Same
for disk and network io, as they are all using common controller chips
and vendors as well. Ever increasing complexity and R&D cost means
that only those like IBM can afford to go their own way. Not worth
the investment otherwise.

What will probably make the most difference is the OS and the
intimate knowhow that allows a vendor to make best use of the
processor, cache size and more. Same for application software
as well, some better than others. Too many variables
really to allow any meaningful results based on ad hoc tests.

So you dream up an ad hoc test and get results; what does that
tell you in comparison to anything else?...

Chris
Arne Vajhøj
2022-12-21 00:07:41 UTC
Permalink
Post by abrsvc
Post by chris
Post by abrsvc
Post by chris
None of this makes much sense. spec.org have been devising cpu tests
for decades and have specialist tests for different workloads. That
includes all the info on compilers and code used. Probably the most
accurate data around and is supported by system and cpu vendors as
well. Too many variables involved, so some sort of level playing
field approach is the only way to get accuracy.
Can be fun devising simple tests, but would never used that as a
basis for purchasing decisions...
The big problem with these standard benchmarks is that some
compilers will look for these and insert some "special"
optimizations specifically for those benchmarks. You are better
served using a homegrown benchmark of some type that more closely
reflects your application environment.
All the conditions are published, including compiler flags,
which compiler and more. Must be more accurate than a home
grown ad hoc test which ignores so many variables that could
influence the results.
If you want to measure something, use the best and most
accurate tools available...
I will disagree. How many standard benchmarks bear any relevance to
an actual application? I suppose you can use them for relative
machine performance information, but without knowing how your own
application performs relative to those, they are useless. SPEC
benchmarks mean little to I/O bound applications. Great, my new
machine can perform calculations 10 times as fast. But... the
application is bound by disk performance limits, so I see little to
nothing for the speed improvement. just one extreme example.
Testing with the actual application instead of an
artificial benchmark is obviously better.

But given how much effort has gone into developing
the modern benchmarks, then they should be better
than a simple homegrown benchmark unless one has a rather
unique context.

Obviously one needs to pick the right benchmark. Like:

CPU integer => SPEC CPU SPECint
CPU floating point => SPEC CPU SPECfp
CPU floating point linear algebra => LINPACK
Database OLTP => TPC-C
Database DWH => TPC-H
Java app servers => SPECjEnterprise

If we talk old 1980s benchmarks like Dhrystone/Whetstone, then
it is probably not too much work to come up with a homegrown benchmark
as good or better.
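In that spirit, here is a minimal counting benchmark sketched in Python, the same shape as the DCL loop quoted earlier in the thread. The reference time is a hypothetical placeholder; a real VUPs-style rating would need a measured baseline machine:

```python
import time

LOOPS = 5_000_000
BASELINE_SECONDS = 5.0   # hypothetical: loop time on some chosen reference machine

def counting_loop(n):
    # same shape as the DCL loop: increment and compare until the limit
    i = 0
    while i != n:
        i += 1
    return i

t0 = time.process_time()
counting_loop(LOOPS)
elapsed = time.process_time() - t0

rating = BASELINE_SECONDS / elapsed   # >1.0 means faster than the reference
print(f"{elapsed:.2f}s for {LOOPS} iterations, rating {rating:.1f}x reference")
```

Like VUPS.COM itself, this measures interpreter loop overhead more than anything else, which is exactly why such numbers are indicative rather than predictive.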

Arne
abrsvc
2022-12-21 12:42:01 UTC
Permalink
Post by Arne Vajhøj
Post by chris
Post by abrsvc
Post by chris
None of this makes much sense. spec.org have been devising cpu tests
for decades and have specialist tests for different workloads. That
includes all the info on compilers and code used. Probably the most
accurate data around and is supported by system and cpu vendors as
well. Too many variables involved, so some sort of level playing
field approach is the only way to get accuracy.
Can be fun devising simple tests, but would never used that as a
basis for purchasing decisions...
The big problem with these standard benchmarks is that some
compilers will look for these and insert some "special"
optimizations specifically for those benchmarks. You are better
served using a homegrown benchmark of some type that more closely
reflects your application environment.
All the conditions are published, including compiler flags,
which compiler and more. Must be more accurate than a home
grown ad hoc test which ignores so many variables that could
influence the results.
If you want to measure something, use the best and most
accurate tools available...
I will disagree. How many standard benchmarks bear any relevance to
an actual application? I suppose you can use them for relative
machine performance information, but without knowing how your own
application performs relative to those, they are useless. SPEC
benchmarks mean little to I/O bound applications. Great, my new
machine can perform calculations 10 times as fast. But... the
application is bound by disk performance limits, so I see little to
nothing for the speed improvement. just one extreme example.
Testing with the actual application instead of an
artificial benchmark is obviously better.
But given how much effort has gone into developing
the modern benchmarks, then they should be better
than a simple homegrown benchmark unless one has a rather
unique context.
CPU integer => SPEC CPU SPECint
CPU floating point => SPEC CPU SPECfp
CPU floating point linear algebra => LINPACK
Database OLTP => TPC-C
Database DWH => TPC-H
Java app servers => SPECjEnterprise
If we talk old 1980s benchmarks like Dhrystone/Whetstone, then
it is probably not too much work to come up with a homegrown benchmark
as good or better.
Arne
Perhaps the point I was trying to make has not been clear.

Standard benchmarks can provide raw throughput numbers for certain classes of functions (CPU, raw I/O, database functions, etc.).
But... knowing how these relate to a real application environment is required in order to use them to predict performance of a system. A home-grown benchmark is less a raw performance indicator than a more accurate predictor of the specific application environment on any new hardware. If you know the relationship, then I would guess that industry-standard benchmarks are useful. In many cases where I have been involved, no simple correlation could be made. Your mileage will vary...
Arne Vajhøj
2022-12-22 00:04:02 UTC
Permalink
Post by abrsvc
Post by Arne Vajhøj
I will disagree. How many standard benchmarks bear any relevance to
an actual application? I suppose you can use them for relative
machine performance information, but without knowing how your own
application performs relative to those, they are useless. SPEC
benchmarks mean little to I/O bound applications. Great, my new
machine can perform calculations 10 times as fast. But... the
application is bound by disk performance limits, so I see little to
nothing for the speed improvement. just one extreme example.
Testing with the actual application instead of an
artificial benchmark is obviously better.
But given how much effort has gone into developing
the modern benchmarks, then they should be better
than a simple homegrown benchmark unless one has a rather
unique context.
CPU integer => SPEC CPU SPECint
CPU floating point => SPEC CPU SPECfp
CPU floating point linear algebra => LINPACK
Database OLTP => TPC-C
Database DWH => TPC-H
Java app servers => SPECjEnterprise
If we talk old 1980's benchmarks like Dhrystone/Whetstone, then
it is probably not too much work to come up a homegrown benchmark
as good or better.
Standard benchmarks can provide raw throughput numbers for certain
classes of functions (CPU, raw I/O , database functions, etc.).
But... How these relate to a real application environment is
required in order to use these to predict performance of a system. A
home grown benchmark is less of a raw performance indicator than a
more accurate predictor of the specific application environment for
any new hardware. If you know the relationship, then I would guess
that industry standard benchmarks are useful. In many cases where I
have been involved, no simple correlation could be made. Your mileage
will vary...
If ones application does not match one of the standard benchmarks then
one has to create a custom benchmark.

But creating good benchmarks is not easy. Most quickly put
together custom benchmarks are pretty bad. There is a long
history of bad custom benchmarks producing misleading results.
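A sketch of one way quickly assembled benchmarks mislead: a single cold timing folds in warm-up and whatever else the machine happens to be doing, which the usual repeat-and-take-the-best idiom (here via Python's stdlib timeit) defends against:

```python
import timeit

def work():
    # a small, deterministic unit of CPU work
    return sum(i * i for i in range(10_000))

# one cold run: includes warm-up and background noise
single = timeit.timeit(work, number=1)

# repeated runs, best-of: closer to the code's attainable speed
best = min(timeit.repeat(work, number=100, repeat=5)) / 100

print(f"single run {single * 1e6:.0f}us, best-of {best * 1e6:.0f}us")
```

The best-of figure approximates the lower bound on runtime; the gap between the two numbers is a rough measure of how noisy the measurement environment is.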

Arne
John Dallman
2022-12-20 17:32:00 UTC
Permalink
Post by chris
All the conditions are published, including compiler flags,
which compiler and more. Must be more accurate than a home
grown ad hoc test which ignores so many variables that could
influence the results.
If you want to measure something, use the best and most
accurate tools available...
That depends what you are measuring, which depends what you're interested
in. If you're buying a machine to run some regular processing, in a
specific environment, then getting a sample one and running the workload
in your environment gives you a way of finding out if it's fast enough
without a lot of analysis.

Most customers don't have the ability to do the analysis themselves, and
getting a consultant who will do it properly is hard (and expensive).
Many consultants find it easier to just re-edit their last paper to give
a conclusion that's based on how much hardware they think they can
sell.

John
Bob Gezelter
2022-12-21 13:14:16 UTC
Permalink
Post by Mark Daniel
Now, before everyone piles on, I understand the procedure provides an
indicative/comparative/finger-in-the-air measurement of the relative
performance of a VMS CPU relative to "the original VAX processor".
|$! Provides an estimate of system CPU performance on OpenVMS systems.
|$! Use at your own risk.
|$ write sys$output f$fao("!AS with !UL CPU and !ULMB running VMS !AS",-
|f$edit(f$getsyi("hw_name"),"compress,trim"),-
|f$getsyi("availcpu_cnt"),-
|(f$getsyi("memsize")*(f$getsyi("page_size")/512)/2048),-
|f$edit(f$getsyi("version"),"collapse"))
|HP rx2660 (1.40GHz/6.0MB) with 4 CPU and 14335MB running VMS V8.4-2L1
|Digital Personal WorkStation with 1 CPU and 1536MB running VMS V8.4-2L1
|innotek GmbH VirtualBox with 2 CPU and 7574MB running VMS V9.2
It seems to be implemented as a tight DCL loop that executes almost
entirely in inner modes (I'm sure Brian can explain why).
$ start_cputime = f$getjpi(0,"CPUTIM")
$ loop_index = 0
$ 10$:
$ loop_index = loop_index + 1
$ if loop_index .ne. init_loop_maximum then goto 10$
$ end_cputime = f$getjpi(0,"CPUTIM")
|Combined for 2 CPUs 0 50 100 150 200
| Interrupt State | |
| MP Synchronization | |
| Kernel Mode 21 |▒▒▒▒ |
| Executive Mode 21 |▒▒▒▒ |
| Supervisor Mode 58 |▒▒▒▒▒▒▒▒▒▒▒ |
| User Mode | |
| Compatibility Mode | |
| Idle Time 99 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒ |
100% (or all but) of this execution appears to be in inner modes,
although X86 (above) seems to have much more Kernel than other
architectures (e.g. IA64 below). There is no USER mode displayed in either.
|Combined for 4 CPUs 0 100 200 300 400
| Interrupt State 1 | |
| MP Synchronization | |
| Kernel Mode 5 | |
| Executive Mode 18 |▒ |
| Supervisor Mode 78 |▒▒▒▒▒▒▒ |
| User Mode | |
| Not Available | |
| Idle Time 299 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒ |
There appear to be (at least) two versions of these procedures. The
|$! Modified: MAY-2010: Code updated by Volker Halle to address the
and tweaks a few of the calculations.
There also appear to be earlier tweaks allowing for Alpha processors
|$ cpu_multiplier = 10 ! VAX = 10 - Alpha/AXP = 40
|$ cpu_round_add = 1 ! VAX = 1 - Alpha/AXP = 9
but none for Itanium.
Are the Alpha tweaks sufficient to allow relevance for all 64bit CPUs?
Are further tweaks required to make measurements on Itania relevant?
And of course the same question for the successor to all three
architectures?
--
Anyone, who using social-media, forms an opinion regarding anything
other than the relative cuteness of this or that puppy-dog, needs
seriously to examine their critical thinking.
Been a bit busy the past few weeks with various things.

The best quote on benchmarks has been the US Environmental Protection Agency's disclaimer on its automobile dynamometer-based fuel economy ratings, "Your mileage may vary", generally rendered as "YMMV".

When CPUs, and for that matter mass storage devices, were simple devices without pipelines, caches, and the like, one could do simple benchmarks and obtain a useful result.

As far back as the late 1970s, pipelines and caches created a benchmark terrain full of cliffs, sinkholes, and plateaus. Back then, my research team saw benchmarks of the CDC 6600 vs the IBM System/370 Model 168 Submodel 3. Depending on the benchmark, the comparison was a factor of 300%; both ways. In other words, essentially an order of magnitude range.

Toss in three levels of CPU/memory caches, some of which are shared; virtualized mass storage at various levels; and other factors. The sum is that getting a raw benchmark is only the beginning of the journey. Tuning can, and often does, have a very large field to explore.
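That cache terrain is visible even from a high-level language. The sketch below sums the same data sequentially and in shuffled order; the totals are identical, but the memory-access patterns (and typically the times) are not. A sketch, not a calibrated benchmark:

```python
import random
import time

N = 2_000_000
data = list(range(N))
shuffled_idx = list(range(N))
random.shuffle(shuffled_idx)

def timed_sum(indices):
    # sum the same elements, in whatever order 'indices' dictates
    t0 = time.process_time()
    total = 0
    for i in indices:
        total += data[i]
    return total, time.process_time() - t0

seq_total, seq_t = timed_sum(range(N))
rnd_total, rnd_t = timed_sum(shuffled_idx)

# identical answer, different access pattern
print(f"sequential {seq_t:.2f}s, shuffled {rnd_t:.2f}s, totals equal: {seq_total == rnd_total}")
```

Which side of a cache cliff a given stride or working-set size lands on is exactly the kind of thing a raw benchmark number cannot tell you in advance.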

- Bob Gezelter, http://www.rlgsc.com
David Jones
2022-12-21 21:59:22 UTC
Permalink
I use the old Bytemark benchmark to do crude CPU comparisons (old as in
'normalized to a 90 MHz Pentium'). The results are very sensitive to the compiler
and optimization levels, more so for gcc on X86 than the DEC OVMS compilers
on Alpha. Gcc defaults to no optimization and can improve 3-4 times, while VSI
C defaults to a fairly high level but doesn't improve as much over /NOOPTIMIZE.

The current C cross compiler does no optimization, and the result I've seen for
a 1.8 GHz gen 8 Xeon (released 2012) is about what I got for a 617 MHz DS10
in 2001. Bytemark on my Mac mini (3 GHz i5) shows integer performance 20
times better and floating point 10 times better than the DS10.
pos
2022-12-22 09:44:30 UTC
Permalink
Post by Mark Daniel
Now, before everyone piles on, I understand the procedure provides an
indicative/comparative/finger-in-the-air measurement of the relative
performance of a VMS CPU relative to "the original VAX processor".
|$! Provides an estimate of system CPU performance on OpenVMS systems.
|$! Use at your own risk.
|$ write sys$output f$fao("!AS with !UL CPU and !ULMB running VMS !AS",-
|f$edit(f$getsyi("hw_name"),"compress,trim"),-
|f$getsyi("availcpu_cnt"),-
|(f$getsyi("memsize")*(f$getsyi("page_size")/512)/2048),-
|f$edit(f$getsyi("version"),"collapse"))
|HP rx2660 (1.40GHz/6.0MB) with 4 CPU and 14335MB running VMS V8.4-2L1
|Digital Personal WorkStation with 1 CPU and 1536MB running VMS V8.4-2L1
|innotek GmbH VirtualBox with 2 CPU and 7574MB running VMS V9.2
It seems to be implemented as a tight DCL loop that executes almost
entirely in inner modes (I'm sure Brian can explain why).
$ start_cputime = f$getjpi(0,"CPUTIM")
$ loop_index = 0
$ 10$:
$ loop_index = loop_index + 1
$ if loop_index .ne. init_loop_maximum then goto 10$
$ end_cputime = f$getjpi(0,"CPUTIM")
|Combined for 2 CPUs 0 50 100 150 200
| Interrupt State | |
| MP Synchronization | |
| Kernel Mode 21 |▒▒▒▒ |
| Executive Mode 21 |▒▒▒▒ |
| Supervisor Mode 58 |▒▒▒▒▒▒▒▒▒▒▒ |
| User Mode | |
| Compatibility Mode | |
| Idle Time 99 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒ |
100% (or all but) of this execution appears to be in inner modes,
although X86 (above) seems to have much more Kernel than other
architectures (e.g. IA64 below). There is no USER mode displayed in either.
|Combined for 4 CPUs 0 100 200 300 400
| Interrupt State 1 | |
| MP Synchronization | |
| Kernel Mode 5 | |
| Executive Mode 18 |▒ |
| Supervisor Mode 78 |▒▒▒▒▒▒▒ |
| User Mode | |
| Not Available | |
| Idle Time 299 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒ |
There appear to be (at least) two versions of these procedures. The
|$! Modified: MAY-2010: Code updated by Volker Halle to address the
and tweaks a few of the calculations.
There also appear to be earlier tweaks allowing for Alpha processors
|$ cpu_multiplier = 10 ! VAX = 10 - Alpha/AXP = 40
|$ cpu_round_add = 1 ! VAX = 1 - Alpha/AXP = 9
but none for Itanium.
Are the Alpha tweaks sufficient to allow relevance for all 64bit CPUs?
Are further tweaks required to make measurements on Itania relevant?
And of course the same question for the successor to all three
architectures?
--
Anyone, who using social-media, forms an opinion regarding anything
other than the relative cuteness of this or that puppy-dog, needs
seriously to examine their critical thinking.
Not sure if people remember these, but the DEC/Compaq Enterprise Capacity Planner had *.DBA files that listed SPEC values for pretty much every vendor, with normalised data. The Enterprise Capacity Planner was bought out by its engineers in 2001 (as that was the only part of Polycenter (PSPA/PSDC/PSCP) not given to CA in 1997) and today lists most vendors and CPUs, for SPEC 95 through SPEC 2017. Both SPECint and SPECrate are included. www.perfcap.com. I still use the ECP today, which was ported to Linux. The data is interesting: performance is getting wider (more cores), not faster (faster cores). The files are user-editable to add future predicted CPU speeds for modelling purposes.
Merry Christmas one and all.
Paul.
Stephen Hoffman
2022-12-26 20:05:12 UTC
Permalink
... The Enterprise Capacity Planner was bought out by its engineers in
2001 (as that was the only part of Polycenter (PSPA/PSDC/PSCP) not
given to CA in 1997) ...
Pedantic: POLYCENTER Software Installation Utility (PCSI) was (also) retained.
--
Pure Personal Opinion | HoffmanLabs LLC
Robert A. Brooks
2022-12-27 03:56:23 UTC
Permalink
Post by Stephen Hoffman
... The Enterprise Capacity Planner was bought out by its engineers in 2001
(as that was the only part of Polycenter (PSPA/PSDC/PSCP) not given to CA in
1997) ...
Pedantic: POLYCENTER Software Installation Utility (PCSI) was (also) retained.
POLYCENTER File Optimizer was retained as well.
--
--- Rob