Discussion:
Diagnostics?
Gareth Evans
2020-04-16 17:26:57 UTC
Remembering the world of DEC minicomputers from nearly 50 years ago,
it seemed that whenever you bought a computer, you got a
tray of diagnostic paper tapes, including such tests as
confirming that the ADD instruction was working.

Later on in the early microcomputer days there were
diagnostic programs to test the efficacy of memory
chips, galloping and walking ones and zeros come to mind.

Has the need for such provisions now gone the way of
the dinosaurs, possibly because of the reliability
of VERY large scale integration?
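
For readers who never ran one, here is a minimal sketch of a walking ones/zeros test in Python. It runs against a toy memory model, not real hardware; the names walk and StuckBitMemory, the 8-bit word width, and the stuck-at-0 fault are assumptions made up for illustration.

```python
def walk(read, write, addr, width=8, ones=True):
    """Walk a single 1 (or a single 0) through one word.

    Returns a bit mask of the positions that ever read back wrong.
    """
    mask = (1 << width) - 1
    failing = 0
    for bit in range(width):
        pattern = 1 << bit          # walking one: 00000001, 00000010, ...
        if not ones:
            pattern ^= mask         # walking zero: 11111110, 11111101, ...
        write(addr, pattern)
        failing |= read(addr) ^ pattern
    return failing


class StuckBitMemory:
    """Toy RAM model (hypothetical): optionally one data bit is stuck at 0."""

    def __init__(self, size, stuck_bit=None):
        self.cells = [0] * size
        self.stuck_bit = stuck_bit

    def write(self, addr, value):
        if self.stuck_bit is not None:
            value &= ~(1 << self.stuck_bit)   # the faulty bit never stores a 1
        self.cells[addr] = value

    def read(self, addr):
        return self.cells[addr]


good = StuckBitMemory(16)
faulty = StuckBitMemory(16, stuck_bit=3)
for name, ram in (("good", good), ("faulty", faulty)):
    masks = {walk(ram.read, ram.write, a) for a in range(16)}
    print(name, [bin(m) for m in sorted(masks)])
```

The galloping variants add a read of every other cell after each write, which also catches coupling between cells at the cost of a much longer run time; this sketch only checks each word in isolation.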
Peter Flass
2020-04-16 19:11:52 UTC
Post by Gareth Evans
Remembering the world of DEC minicomputers from nearly 50 years ago,
it seemed that whenever you bought a computer, you got a
tray of diagnostic paper tapes, including such tests as
confirming that the ADD instruction was working.
Later on in the early microcomputer days there were
diagnostic programs to test the efficacy of memory
chips, galloping and walking ones and zeros come to mind.
Has the need for such provisions now gone the way of
the dinosaurs, possibly because of the reliability
of VERY large scale integration?
Certainly useful to test emulators. I suspect current chips have a lot of
self-checking.
--
Pete
Bob Eager
2020-04-16 19:15:34 UTC
Post by Peter Flass
Post by Gareth Evans
Remembering the world of DEC minicomputers from nearly 50 years ago,
it seemed that whenever you bought a computer, you got a tray of
diagnostic paper tapes, including such tests as confirming that the ADD
instruction was working.
Later on in the early microcomputer days there were diagnostic programs
to test the efficacy of memory chips, galloping and walking ones and
zeros come to mind.
Has the need for such provisions now gone the way of the dinosaurs,
possibly because of the reliability of VERY large scale integration?
Certainly useful to test emulators. I suspect current chips have a lot
of self-checking.
Every IBM machine (desktop, server) I've ever owned came with diagnostics.

I have a load of HP Microservers here at home, and there is a
comprehensive set of diagnostics in the Flash ROM.
--
Using UNIX since v6 (1975)...

Use the BIG mirror service in the UK:
http://www.mirrorservice.org
Michael LeVine
2020-04-16 20:21:26 UTC
Post by Bob Eager
Post by Peter Flass
Post by Gareth Evans
Remembering the world of DEC minicomputers from nearly 50 years ago,
it seemed that whenever you bought a computer, you got a tray of
diagnostic paper tapes, including such tests as confirming that the ADD
instruction was working.
Later on in the early microcomputer days there were diagnostic programs
to test the efficacy of memory chips, galloping and walking ones and
zeros come to mind.
Has the need for such provisions now gone the way of the dinosaurs,
possibly because of the reliability of VERY large scale integration?
Certainly useful to test emulators. I suspect current chips have a lot
of self-checking.
Every IBM machine (desktop, server) I've ever owned came with diagnostics.
I have a load of HP Microservers here at home, and there is a
comprehensive set of diagnostics in the Flash ROM.
Same with my iMac and MacBook Air.
--
Michael LeVine
***@redshift.com

Politics is the art of looking for trouble,
finding it everywhere,
diagnosing it incorrectly,
and applying the wrong remedies.
Groucho Marx
Scott Lurndal
2020-04-16 20:25:16 UTC
Post by Peter Flass
Post by Gareth Evans
Remembering the world of DEC minicomputers from nearly 50 years ago,
it seemed that whenever you bought a computer, you got a
tray of diagnostic paper tapes, including such tests as
confirming that the ADD instruction was working.
Later on in the early microcomputer days there were
diagnostic programs to test the efficacy of memory
chips, galloping and walking ones and zeros come to mind.
Has the need for such provisions now gone the way of
the dinosaurs, possibly because of the reliability
of VERY large scale integration?
Certainly useful to test emulators. I suspect current chips have a lot of
self-checking.
Given the nanoscale of modern monolithic devices, current chips have both
BIST and BISR capabilities, in addition to scan chains that allow access
to individual gates through JTAG shift chains for debugging processor issues.

BIST - Built-in Self Test
BISR - Built-in Self Repair

The Burroughs mainframes used scan chains extensively in the late 70's
and early 80's LSI systems, accessible from the maintenance processor;
very useful for both operating system and hardware debug.
Dennis Boone
2020-04-17 00:41:24 UTC
Post by Gareth Evans
Has the need for such provisions now gone the way of
the dinosaurs, possibly because of the reliability
of VERY large scale integration?
You don't want to know how many corrected errors your memory system
doesn't tell you about. Or your disk drive. Or for that matter,
probably how many undetected errors just pass silently.

De
Ahem A Rivet's Shot
2020-04-18 08:22:54 UTC
On Thu, 16 Apr 2020 19:41:24 -0500
Post by Dennis Boone
Post by Gareth Evans
Has the need for such provisions now gone the way of
the dinosaurs, possibly because of the reliability
of VERY large scale integration?
You don't want to know how many corrected errors your memory system
doesn't tell you about. Or your disk drive. Or for that matter,
probably how many undetected errors just pass silently.
We are not aware of any undetected errors in our systems.

It is however possible to calculate the expected undetected error
rate of any system, and it can usually be found in the manufacturer's specs if
you look closely. Generally ECC memory is very good but hard disc CRC
rather less so, which is why filesystem block level checksums, redundancy
and monk-like regular checking are all important if you want data to remain
as you stored it. Even so no matter what you do a mean time to data loss
can be calculated for any storage system - you just want to ensure that it
is much longer than it needs to be and pray that you don't wind up on the
wrong side of the bell curve.
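
As a rough illustration of that kind of calculation, here is a sketch in Python. It assumes independent failures at constant rates, and every number in it (the 1-in-10^14 unrecoverable bit error rate, the MTTF and rebuild time) is a placeholder to be replaced with figures from a real data sheet. The mirrored-pair formula MTTDL = MTTF^2 / (2 * MTTR) is the usual first-order approximation, not an exact model.

```python
import math


def p_unrecoverable_read(bytes_read, errors_per_bit):
    """Probability of at least one unrecoverable error while reading bytes_read.

    Computes 1 - (1 - p)^n via log1p/expm1 so that tiny p does not round away.
    """
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-errors_per_bit))


def mttdl_mirror_hours(mttf_hours, mttr_hours):
    """First-order mean time to data loss for a two-disk mirror.

    Data is lost when the second disk fails while the first is being rebuilt.
    """
    return mttf_hours ** 2 / (2 * mttr_hours)


# Placeholder figures, not taken from any particular drive's specification.
p = p_unrecoverable_read(1_000_000_000_000, 1e-14)   # read 1 TB end to end
years = mttdl_mirror_hours(1_000_000, 24) / (24 * 365)
print(f"P(unrecoverable error reading 1 TB) = {p:.3f}")
print(f"mirror MTTDL = {years:,.0f} years")
```

The very large MTTDL that falls out of the mirror formula is a reminder of how optimistic the independence assumption is: correlated failures (same batch, same power supply, same operator) are what the "wrong side of the bell curve" usually looks like in practice.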
--
Steve O'Hara-Smith | Directable Mirror Arrays
C:\>WIN | A better way to focus the sun
The computer obeys and wins. | licences available see
You lose and Bill collects. | http://www.sohara.org/
Peter Flass
2020-04-18 18:33:29 UTC
Post by Ahem A Rivet's Shot
On Thu, 16 Apr 2020 19:41:24 -0500
Post by Dennis Boone
Post by Gareth Evans
Has the need for such provisions now gone the way of
the dinosaurs, possibly because of the reliability
of VERY large scale integration?
You don't want to know how many corrected errors your memory system
doesn't tell you about. Or your disk drive. Or for that matter,
probably how many undetected errors just pass silently.
We are not aware of any undetected errors in our systems.
At least, none that you’ve detected ;-)
--
Pete
J. Clarke
2020-04-18 19:04:08 UTC
On Sat, 18 Apr 2020 11:33:29 -0700, Peter Flass
Post by Peter Flass
Post by Ahem A Rivet's Shot
On Thu, 16 Apr 2020 19:41:24 -0500
Post by Dennis Boone
Post by Gareth Evans
Has the need for such provisions now gone the way of
the dinosaurs, possibly because of the reliability
of VERY large scale integration?
You don't want to know how many corrected errors your memory system
doesn't tell you about. Or your disk drive. Or for that matter,
probably how many undetected errors just pass silently.
We are not aware of any undetected errors in our systems.
At least, none that you’ve detected ;-)
Incidentally a disk drive will tell you how many errors it has
corrected. And if it gets them regularly it will map out a track. On
a modern drive you're not in trouble until it runs out of spares.

I have a real world experience with a memory error by the way. I set
up a new server, started installing the OS, and it died with a parity
error. So I took it to the place I had bought it and asked them to
fix or replace it. They kept it for a week, and this time no parity
error. So I installed the OS and put the thing into service in a
video store. They were doing manual data entry of a bunch of video
tapes and to their surprise, the names would change after they had
been keyed. I confirmed this. Turns out that the idiot tech had just
pulled the parity jumper instead of replacing the defective chip(s).

While I'm about it I should mention the issue with the first computer
I owned. It was a Heath H89 with 48K of RAM. After a while I got the
upgrade to 64K which came on a board. The result was weird. There
was a range of addresses for which the same data appeared in several
places. I went poking into it and finally found the problem--a
resistor pack on the address lines was in backwards (not my
screw-up--it was a factory-assembled board) which was effectively
tying several address lines together. I pulled it and soldered it
back the right way around and the problem went away.
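
That kind of fault (the same data showing up at several addresses) is what an address-line test is meant to catch. A minimal sketch in Python against a toy model follows; TiedLinesRAM, the 5-bit address space, and the choice to model a short as the OR of the two lines are all assumptions for illustration, not a description of the H89 board.

```python
def address_line_test(read, write, addr_bits):
    """Write a unique value at address 0 and at each power-of-two address,
    then read them all back. Done in ascending and descending write order so
    that both members of an aliased pair get flagged.

    Returns the sorted probe addresses that read back the wrong value.
    """
    probes = [0] + [1 << i for i in range(addr_bits)]
    expected = {addr: n + 1 for n, addr in enumerate(probes)}
    suspect = set()
    for order in (probes, probes[::-1]):
        for addr in order:
            write(addr, expected[addr])
        for addr in probes:
            if read(addr) != expected[addr]:
                suspect.add(addr)
    return sorted(suspect)


class TiedLinesRAM:
    """Toy RAM (hypothetical) where two address lines may be shorted together."""

    def __init__(self, addr_bits, tied=None):
        self.cells = [0] * (1 << addr_bits)
        self.tied = tied                      # e.g. (2, 3), or None for no fault

    def _decode(self, addr):
        if self.tied is None:
            return addr
        a, b = self.tied
        level = ((addr >> a) | (addr >> b)) & 1   # shorted lines share one level
        addr &= ~((1 << a) | (1 << b))
        return addr | (level << a) | (level << b)

    def write(self, addr, value):
        self.cells[self._decode(addr)] = value

    def read(self, addr):
        return self.cells[self._decode(addr)]


good = TiedLinesRAM(5)
bad = TiedLinesRAM(5, tied=(2, 3))
print(address_line_test(good.read, good.write, 5))
print(address_line_test(bad.read, bad.write, 5))
```

A data-pattern test such as walking ones can pass on every individual word of a board like this, which is why memory diagnostics usually run an address test as a separate step.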
Charlie Gibbs
2020-04-18 19:27:38 UTC
Post by J. Clarke
While I'm about it I should mention the issue with the first computer
I owned. It was a Heath H89 with 48K of RAM. After a while I got the
upgrade to 64K which came on a board. The result was weird. There
was a range of addresses for which the same data appeared in several
places. I went poking into it and finally found the problem--a
resistor pack on the address lines was in backwards (not my
screw-up--it was a factory-assembled board) which was effectively
tying several address lines together. I pulled it and soldered it
back the right way around and the problem went away.
My first computer, an IMSAI 8080 which I built from kits, came with
a defective CPU chip. It worked perfectly except for the conditional
return instructions, all of which were taken unconditionally. I went
to the local supplier to get another 8080 chip. The guy behind the
counter shook a new chip out of a tube (imagine, a tube full of CPUs!);
I popped it into my machine and all was well.

Some years later, I got a call from a customer on whose machine (a
Pentium-powered Wintel box) our software was crashing. I spent most
of an afternoon running tests with no luck. The program was running
perfectly in many other sites, and I couldn't find any pattern to the
crashes. Finally, in an attempt to prove to the customer that our
software was OK, I installed it in another of their machines, where
it ran fine. The customer's tech swapped CPU chips between the two
machines, and the production machine started working.
--
/~\ Charlie Gibbs | Microsoft is a dictatorship.
\ / <***@kltpzyxm.invalid> | Apple is a cult.
X I'm really at ac.dekanfrus | Linux is anarchy.
/ \ if you read it the right way. | Pick your poison.
Scott Lurndal
2020-04-19 01:52:51 UTC
Post by J. Clarke
On Sat, 18 Apr 2020 11:33:29 -0700, Peter Flass
Post by Ahem A Rivet's Shot
On Thu, 16 Apr 2020 19:41:24 -0500
Post by Dennis Boone
Post by Gareth Evans
Has the need for such provisions now gone the way of
the dinosaurs, possibly because of the reliability
of VERY large scale integration?
You don't want to know how many corrected errors your memory system
doesn't tell you about. Or your disk drive. Or for that matter,
probably how many undetected errors just pass silently.
We are not aware of any undetected errors in our systems.
At least, none that you’ve detected ;-)
Incidentally a disk drive will tell you how many errors it has
corrected. And if it gets them regularly it will map out a track. On
a modern drive you're not in trouble until it runs out of spares.
Indeed. On my spinning rust drive:

=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.14 (AF)
Device Model: ST1000DM003-1CH162
Serial Number: 42341244
LU WWN Device Id: 5 000c50 061bvasd1e
Firmware Version: HP34
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Apr 18 18:50:02 2020
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   120   099   006    Pre-fail  Always   -           243495808
  3 Spin_Up_Time            0x0023   097   097   000    Pre-fail  Always   -           0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always   -           69
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always   -           0
  7 Seek_Error_Rate         0x002f   078   060   030    Pre-fail  Always   -           67650010
  9 Power_On_Hours          0x0032   044   044   000    Old_age   Always   -           49785
 10 Spin_Retry_Count        0x0033   100   100   097    Pre-fail  Always   -           0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always   -           70
180 Unknown_HDD_Attribute   0x002a   100   100   000    Old_age   Always   -           2094417780
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always   -           0
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always   -           0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always   -           0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always   -           0 0 0
189 High_Fly_Writes         0x003a   092   092   000    Old_age   Always   -           8
190 Airflow_Temperature_Cel 0x0022   068   054   045    Old_age   Always   -           32 (Min/Max 16/34)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always   -           0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always   -           28
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always   -           238330
194 Temperature_Celsius     0x0022   032   046   000    Old_age   Always   -           32 (0 14 0 0 0)
196 Reallocated_Event_Count 0x0032   100   100   036    Old_age   Always   -           0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always   -           0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always   -           0
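
A table like this is easy to check mechanically. The sketch below, in Python, parses attribute lines in the layout shown above (for example, saved from smartctl -A) and reports the reallocation and uncorrectable counters if any is non-zero. The function names and the list of attributes to watch are assumptions for illustration. Raw values are vendor-specific, so a large raw number (as on Raw_Read_Error_Rate here) is not necessarily a count of errors.

```python
SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 068 054 045 Old_age Always - 32 (Min/Max 16/34)
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
"""

# Attributes picked for this example; which ones matter is a judgment call.
WATCH = ("Reallocated_Sector_Ct", "Reported_Uncorrect",
         "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_smart_attributes(text):
    """Turn attribute-table lines into {name: {...}}; other lines are skipped."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split(None, 9)          # the raw value may contain spaces
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs[fields[1]] = {
            "id": int(fields[0]),
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": fields[9].strip(),
        }
    return attrs


def worrying(attrs):
    """Return the watched attributes whose leading raw number is non-zero."""
    return {name: attrs[name]["raw"] for name in WATCH
            if name in attrs and int(attrs[name]["raw"].split()[0]) > 0}


attrs = parse_smart_attributes(SAMPLE)
print(worrying(attrs) or "no reallocated or pending sectors reported")
```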
r***@gmail.com
2020-04-19 06:33:45 UTC
Post by J. Clarke
On Sat, 18 Apr 2020 11:33:29 -0700, Peter Flass
Post by Peter Flass
Post by Ahem A Rivet's Shot
On Thu, 16 Apr 2020 19:41:24 -0500
Post by Dennis Boone
Post by Gareth Evans
Has the need for such provisions now gone the way of
the dinosaurs, possibly because of the reliability
of VERY large scale integration?
You don't want to know how many corrected errors your memory system
doesn't tell you about. Or your disk drive. Or for that matter,
probably how many undetected errors just pass silently.
We are not aware of any undetected errors in our systems.
At least, none that you’ve detected ;-)
Incidentally a disk drive will tell you how many errors it has
corrected. And if it gets them regularly it will map out a track. On
a modern drive you're not in trouble until it runs out of spares.
I have a real world experience with a memory error by the way. I set
up a new server, started installing the OS, and it died with a parity
error. So I took it to the place I had bought it and asked them to
fix or replace it. They kept it for a week, and this time no parity
error. So I installed the OS and put the thing into service in a
video store. They were doing manual data entry of a bunch of video
tapes and to their surprise, the names would change after they had
been keyed. I confirmed this. Turns out that the idiot tech had just
pulled the parity jumper instead of replacing the defective chip(s).
That's on a par with some memory boards for the PC.
Some el-cheapo manufacturers included a chip to generate
parity for the 8-bit memory chips populating the board.
You'd never get a parity error that way!!
Post by J. Clarke
While I'm about it I should mention the issue with the first computer
I owned. It was a Heath H89 with 48K of RAM. After a while I got the
upgrade to 64K which came on a board. The result was weird. There
was a range of addresses for which the same data appeared in several
places. I went poking into it and finally found the problem--a
resistor pack on the address lines was in backwards (not my
screw-up--it was a factory-assembled board) which was effectively
tying several address lines together. I pulled it and soldered it
back the right way around and the problem went away.
Bob Eager
2020-04-18 20:32:29 UTC
Post by Peter Flass
Post by Ahem A Rivet's Shot
Post by Dennis Boone
Post by Gareth Evans
Has the need for such provisions now gone the way of the dinosaurs,
possibly because of the reliability of VERY large scale integration?
You don't want to know how many corrected errors your memory system
doesn't tell you about. Or your disk drive. Or for that matter,
probably how many undetected errors just pass silently.
We are not aware of any undetected errors in our systems.
At least, none that you’ve detected ;-)
Years ago we had a mainframe where we worked. OK, an ICL 2960. The memory
was all ECC, and it corrected and reported single bit errors. The ICL
operating system did *nothing* with those reports. Eventually another bit
failed, and we had a detectable (but unrecoverable) two bit error. I
don't think they ever improved that.

We gave up and moved to a home-brew operating system from Edinburgh
University. It logged all the error reports, and did a daily analysis
with a printout for the site engineer. The report stated which board had
the problem, and which chip to change.
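
The daily analysis amounts to counting corrected-error reports per location. A small sketch of the idea in Python follows; the record layout (board number, chip designator), the threshold, and the name daily_report are invented for illustration and say nothing about how the Edinburgh system actually stored its logs.

```python
from collections import Counter


def daily_report(events, threshold=3):
    """Group corrected-error events by (board, chip) and list repeat offenders.

    events: iterable of dicts with hypothetical keys "board" and "chip".
    Returns (board, chip, count) tuples, worst first, for counts >= threshold.
    """
    counts = Counter((e["board"], e["chip"]) for e in events)
    return [(board, chip, n)
            for (board, chip), n in counts.most_common()
            if n >= threshold]


# Made-up log: one chip throwing repeated single-bit errors, one stray event.
log = [{"board": 3, "chip": "U17"}] * 4 + [{"board": 1, "chip": "U2"}]
for board, chip, n in daily_report(log):
    print(f"board {board}, chip {chip}: {n} corrected errors, replace")
```

The point of the threshold is the one made above: a chip that keeps producing correctable errors should be swapped before a second bit in the same word fails.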
--
Using UNIX since v6 (1975)...

Use the BIG mirror service in the UK:
http://www.mirrorservice.org
Terry Kennedy
2020-04-19 01:13:35 UTC
Commenting on many separate posts...

Back when the only 8080-compatible CPU was the 8080A, processor diagnostics were mostly superfluous (I only say mostly because static damage can cause some odd problems). Once there were compatible processors (8085, Z-80, etc) they became more useful for architecture implementation correctness than finding fault with specific ICs in the field. DEC wrote separate CPU/memory diagnostics for their PDP-11 processors, partially due to implementation differences such as parity control registers but also because the CPUs were different in various subtle ways which were generally not of concern, even to operating system developers.
...
The VAX 8600/8650 had a very interesting scan chain. Each MCA (ECL Macrocell Array) had an associated SIP which collected "interesting" logic signals and put them on a diagnostic bus. I had heard that this was planned to be removed in production systems, but I don't know if that was true. In any event, it shipped in all systems and the diagnostics made good use of it. You could fit 2 snapshots on the 10MB disk drive on the console processor, and the diagnostics would read through a snapshot and suggest the most likely board to need replacing. The 8600 had a very overextended development cycle and wasn't as fast as planned. You can see this by the 8600 -> 8650 upgrade being only 2 boards out of dozens. The first 8600s shipped without the console disk pack, the idea being that by the time the systems arrived at the purchaser, DEC would have had more time to work on the microcode and could overnight packs in time for the startup phase of installation.
...
SAS disk drives are a lot less bashful in revealing their error correction counts. SATA drives with SMART tend to hide those until just (hopefully) before the problem becomes catastrophic.
...
Worse than a parity jumper, once early IBM PC clones started using SIMM memory, some memory manufacturers produced modules with "logic parity", which is a fancy way of saying the module calculated parity on whatever was in it and placed that fake parity bit on the memory bus.
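
To make the difference concrete, here is a toy comparison in Python of stored parity against parity regenerated at read time. ParityRAM and FakeParityRAM are hypothetical models written for this example, and even parity over one byte is an assumption about the scheme.

```python
def parity_bit(byte):
    """Even parity: 1 if the byte has an odd number of 1 bits, else 0."""
    return bin(byte & 0xFF).count("1") & 1


class ParityRAM:
    """Real parity: a ninth bit is stored at write time and checked on read."""

    def __init__(self, size):
        self.data = [0] * size
        self.parity = [0] * size

    def write(self, addr, byte):
        self.data[addr] = byte & 0xFF
        self.parity[addr] = parity_bit(byte)

    def read(self, addr):
        byte = self.data[addr]
        return byte, parity_bit(byte) == self.parity[addr]


class FakeParityRAM(ParityRAM):
    """'Logic parity': no ninth bit is stored; parity is generated on read
    from whatever the data bits currently hold, so the check cannot fail."""

    def read(self, addr):
        byte = self.data[addr]
        generated = parity_bit(byte)
        return byte, parity_bit(byte) == generated


for ram in (ParityRAM(4), FakeParityRAM(4)):
    ram.write(0, 0x41)
    ram.data[0] ^= 0x01        # simulate a single-bit failure in the array
    print(type(ram).__name__, ram.read(0))
```

Both models return the same corrupted byte; only the one with a stored parity bit has any way of noticing.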
...
SECDED - Single Error Correction, Double Error Detection. Very common, even today. The idea is that regular background memory scrubs will detect single-bit errors and correct them (or map in spare RAM) before they grow into double-bit errors. After all, if 2 bits are corrupted in a single word, it is likely that more have also failed and it is time to throw up the white flag and replace the module / board. This is all specified in (for example) the Intel Machine Check Architecture documentation.
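
A small worked example of SECDED, in Python: a Hamming(7,4) code plus one overall parity bit, protecting 4 data bits with 4 check bits. This is a sketch of the principle only. Real memory controllers typically protect 64 data bits with 8 check bits, and the function names and bit layout here are choices made for this example.

```python
DATA_POS = (3, 5, 6, 7)      # Hamming positions that carry data bits


def encode(nibble):
    """Return an 8-element bit list: index 0 is overall parity,
    indexes 1, 2, 4 are Hamming check bits, the rest are data."""
    bits = [0] * 8
    for i, pos in enumerate(DATA_POS):
        bits[pos] = (nibble >> i) & 1
    for p in (1, 2, 4):
        # Check bit p covers every position whose index has bit p set.
        bits[p] = sum(bits[i] for i in range(1, 8) if i & p) & 1
    bits[0] = sum(bits) & 1  # make the whole word even parity
    return bits


def decode(word):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = list(word)
    syndrome = 0
    for i in range(1, 8):
        if bits[i]:
            syndrome ^= i    # XOR of set positions is 0 for a valid codeword
    overall = sum(bits) & 1
    if overall:
        # Odd number of flips, assumed to be one: the syndrome is its position
        # (0 means the overall parity bit itself was hit).
        bits[syndrome] ^= 1
        status = "corrected"
    elif syndrome:
        # Even number of flips but a non-zero syndrome: two bits are wrong.
        return None, "uncorrectable"
    else:
        status = "ok"
    nibble = sum(bits[pos] << i for i, pos in enumerate(DATA_POS))
    return nibble, status


word = encode(0b1011)
one_flip = list(word); one_flip[5] ^= 1
two_flips = list(one_flip); two_flips[6] ^= 1
print(decode(word), decode(one_flip), decode(two_flips))
```

Three or more flipped bits can be miscorrected or missed entirely, which is the reason for scrubbing single-bit errors before they have a chance to accumulate.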