Diagnostics?

Post by Gareth Evans
Remembering the world of DEC minicomputers from nearly 50 years ago,
it seemed that whenever you bought a computer, you got a
tray of diagnostic paper tapes, including such tests as
confirming that the ADD instruction was working.
Later on in the early microcomputer days there were
diagnostic programs to test the efficacy of memory
chips; galloping and walking ones and zeros come to mind.
Has the need for such provisions now gone the way of
the dinosaurs, possibly because of the reliability
of VERY large scale integration?

Certainly useful to test emulators. I suspect current chips have a lot of
self-checking.
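
For anyone who has not met a walking-ones test, here is a minimal sketch in Python against a simulated memory. The names `walking_ones` and `StuckBitMemory` are made up for illustration and are not taken from any DEC or vendor diagnostic; a real test would of course run on the bare hardware.

```python
def walking_ones(mem, width=8):
    """Write a single 1 bit in each position of each cell and read it back.

    Returns a list of (address, expected, actual) for every mismatch.
    """
    faults = []
    for addr in range(len(mem)):
        for bit in range(width):
            pattern = 1 << bit
            mem[addr] = pattern
            actual = mem[addr]
            if actual != pattern:
                faults.append((addr, pattern, actual))
    return faults


class StuckBitMemory:
    """Simulated RAM (hypothetical) with one bit stuck at 0 in one cell."""

    def __init__(self, size, bad_addr, bad_bit):
        self.cells = [0] * size
        self.bad_addr = bad_addr
        self.mask = ~(1 << bad_bit)

    def __len__(self):
        return len(self.cells)

    def __setitem__(self, addr, value):
        if addr == self.bad_addr:
            value &= self.mask  # the faulty bit never stores a 1
        self.cells[addr] = value

    def __getitem__(self, addr):
        return self.cells[addr]
```

Running `walking_ones(StuckBitMemory(16, 5, 3))` reports the one cell and bit that cannot hold a 1, while a plain list of zeros passes cleanly.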

--
Pete

Bob Eager

2020-04-16 19:15:34 UTC

Post by Gareth Evans
Remembering the world of DEC minicomputers from nearly 50 years ago,
it seemed that whenever you bought a computer, you got a tray of
diagnostic paper tapes, including such tests as confirming that the ADD
instruction was working.
Later on in the early microcomputer days there were diagnostic programs
to test the efficacy of memory chips, galloping and walking ones and
zeros come to mind.
Has the need for such provisions now gone the way of the dinosaurs,
possibly because of the reliability of VERY large scale integration?

Certainly useful to test emulators. I suspect current chips have a lot
of self-checking.

Every IBM machine (desktop, server) I've ever owned came with diagnostics.

I have a load of HP Microservers here at home, and there is a
comprehensive set of diagnostics in the Flash ROM.

--
Using UNIX since v6 (1975)...

Use the BIG mirror service in the UK:
http://www.mirrorservice.org

Michael LeVine

2020-04-16 20:21:26 UTC

Post by Bob Eager

Post by Gareth Evans
Remembering the world of DEC minicomputers from nearly 50 years ago,
it seemed that whenever you bought a computer, you got a tray of
diagnostic paper tapes, including such tests as confirming that the ADD
instruction was working.
Later on in the early microcomputer days there were diagnostic programs
to test the efficacy of memory chips, galloping and walking ones and
zeros come to mind.
Has the need for such provisions now gone the way of the dinosaurs,
possibly because of the reliability of VERY large scale integration?

Certainly useful to test emulators. I suspect current chips have a lot
of self-checking.

Every IBM machine (desktop, server) I've ever owned came with diagnostics.
I have a load of HP Microservers here at home, and there is a
comprehensive set of diagnostics in the Flash ROM.

Same with my iMac and MacBook Air.

--
Michael LeVine
***@redshift.com

Politics is the art of looking for trouble,
finding it everywhere,
diagnosing it incorrectly,
and applying the wrong remedies.
Groucho Marx

Scott Lurndal

2020-04-16 20:25:16 UTC

Certainly useful to test emulators. I suspect current chips have a lot of
self-checking.

Given the nanoscale of modern monolithic devices, current chips have both
BIST and BISR capabilities, in addition to scan chains that allow access
to individual gates through JTAG shift chains for debugging processor issues.

BIST - Built-in Self Test
BISR - Built-in Self Repair

The Burroughs mainframes used scan chains extensively in the late 70's
and early 80's LSI systems, accessible from the maintenance processor;
very useful for both operating system and hardware debug.

Dennis Boone

2020-04-17 00:41:24 UTC

Post by Gareth Evans
Has the need for such provisions now gone the way of
the dinosaurs, possibly because of the reliability
of VERY large scale integration?

You don't want to know how many corrected errors your memory system
doesn't tell you about. Or your disk drive. Or for that matter,
probably how many undetected errors just pass silently.

De

Ahem A Rivet's Shot

2020-04-18 08:22:54 UTC

On Thu, 16 Apr 2020 19:41:24 -0500

Post by Gareth Evans
Has the need for such provisions now gone the way of
the dinosaurs, possibly because of the reliability
of VERY large scale integration?

You don't want to know how many corrected errors your memory system
doesn't tell you about. Or your disk drive. Or for that matter,
probably how many undetected errors just pass silently.

We are not aware of any undetected errors in our systems.

It is however possible to calculate the expected undetected error
rate of any system, and it can usually be found in the manufacturer's specs if
you look closely. Generally ECC memory is very good but hard disc CRC
rather less so, which is why filesystem block level checksums, redundancy
and monk-like regular checking are all important if you want data to remain
as you stored it. Even so no matter what you do a mean time to data loss
can be calculated for any storage system - you just want to ensure that it
is much longer than it needs to be and pray that you don't wind up on the
wrong side of the bell curve.
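
As a rough illustration of the kind of calculation meant here, the sketch below uses the commonly quoted approximation for a two-disk mirror, MTTDL ≈ MTTF² / (2 × MTTR), plus the expected number of unrecoverable read errors for a given amount of data read. The function names and all the figures passed in are assumptions chosen for the example, not numbers from any particular manufacturer's spec sheet.

```python
HOURS_PER_YEAR = 24 * 365


def mirror_mttdl_hours(mttf_hours, mttr_hours):
    """Approximate mean time to data loss for a two-disk mirror.

    Assumes independent failures and a repair time much shorter
    than the time between failures.
    """
    return mttf_hours ** 2 / (2 * mttr_hours)


def expected_read_errors(bytes_read, errors_per_bit):
    """Expected count of unrecoverable read errors for a given volume read."""
    return bytes_read * 8 * errors_per_bit


if __name__ == "__main__":
    # Assumed figures: 1,000,000 hour MTTF, 24 hour rebuild time.
    mttdl = mirror_mttdl_hours(1_000_000, 24)
    print(f"MTTDL: {mttdl / HOURS_PER_YEAR:,.0f} years")
    # Assumed figure: one unrecoverable error per 1e14 bits read.
    print(f"Expected errors reading 1 TB: {expected_read_errors(10**12, 1e-14):.2f}")
```

With those assumed inputs, reading a full terabyte gives an expectation of about 0.08 unrecoverable errors, which is why a single pass over a large array is not a negligible risk.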

--
Steve O'Hara-Smith | Directable Mirror Arrays
C:\>WIN | A better way to focus the sun
The computer obeys and wins. | licences available see
You lose and Bill collects. | http://www.sohara.org/

Peter Flass

2020-04-18 18:33:29 UTC

Post by Ahem A Rivet's Shot
On Thu, 16 Apr 2020 19:41:24 -0500

Post by Gareth Evans
Has the need for such provisions now gone the way of
the dinosaurs, possibly because of the reliability
of VERY large scale integration?

You don't want to know how many corrected errors your memory system
doesn't tell you about. Or your disk drive. Or for that matter,
probably how many undetected errors just pass silently.

We are not aware of any undetected errors in our systems.

At least, none that you’ve detected ;-)

--
Pete

J. Clarke

2020-04-18 19:04:08 UTC

On Sat, 18 Apr 2020 11:33:29 -0700, Peter Flass

Post by Ahem A Rivet's Shot
On Thu, 16 Apr 2020 19:41:24 -0500

Post by Gareth Evans
Has the need for such provisions now gone the way of
the dinosaurs, possibly because of the reliability
of VERY large scale integration?

You don't want to know how many corrected errors your memory system
doesn't tell you about. Or your disk drive. Or for that matter,
probably how many undetected errors just pass silently.

We are not aware of any undetected errors in our systems.

At least, none that you’ve detected ;-)

Incidentally a disk drive will tell you how many errors it has
corrected. And if it gets them regularly it will map out a track. On
a modern drive you're not in trouble until it runs out of spares.

I have a real world experience with a memory error by the way. I set
up a new server, started installing the OS, and it died with a parity
error. So I took it to the place I had bought it and asked them to
fix or replace it. They kept it for a week, and this time no parity
error. So I installed the OS and put the thing into service in a
video store. They were doing manual data entry of a bunch of video
tapes and to their surprise, the names would change after they had
been keyed. I confirmed this. Turns out that the idiot tech had just
pulled the parity jumper instead of replacing the defective chip(s).

While I'm about it I should mention the issue with the first computer
I owned. It was a Heath H89 with 48K of RAM. After a while I got the
upgrade to 64K which came on a board. The result was weird. There
was a range of addresses for which the same data appeared in several
places. I went poking into it and finally found the problem--a
resistor pack on the address lines was in backwards (not my
screw-up--it was a factory-assembled board) which was effectively
tying several address lines together. I pulled it and soldered it
back the right way around and the problem went away.
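
That symptom (the same data showing up at several addresses) is what an address-uniqueness test looks for. Here is a minimal sketch against a simulated memory; the `ShortedAddressMemory` class, the `find_aliases` function and the wired-AND model of the tied lines are all assumptions for illustration, not a description of the actual H89 board.

```python
class ShortedAddressMemory:
    """Simulated RAM (hypothetical) with two address lines tied together."""

    def __init__(self, addr_bits, line_a, line_b):
        self.size = 1 << addr_bits
        self.cells = [0] * self.size
        self.both = (1 << line_a) | (1 << line_b)

    def _decode(self, addr):
        # Assumption: the tied lines behave as a wired-AND, so both read
        # as 1 only when both are driven high; otherwise both read as 0.
        if addr & self.both == self.both:
            return addr
        return addr & ~self.both

    def write(self, addr, value):
        self.cells[self._decode(addr)] = value

    def read(self, addr):
        return self.cells[self._decode(addr)]


def find_aliases(mem, size):
    """Store each address in its own cell, then report cells that changed.

    Returns a list of (address, value_read_back) for every mismatch.
    """
    for addr in range(size):
        mem.write(addr, addr)
    return [(addr, mem.read(addr)) for addr in range(size)
            if mem.read(addr) != addr]
```

Writing every location before reading any of them back is the important part: a test that writes and immediately reads one cell at a time would pass on this fault.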

Charlie Gibbs

2020-04-18 19:27:38 UTC

Post by J. Clarke
While I'm about it I should mention the issue with the first computer
I owned. It was a Heath H89 with 48K of RAM. After a while I got the
upgrade to 64K which came on a board. The result was weird. There
was a range of addresses for which the same data appeared in several
places. I went poking into it and finally found the problem--a
resistor pack on the address lines was in backwards (not my
screw-up--it was a factory-assembled board) which was effectively
tying several address lines together. I pulled it and soldered it
back the right way around and the problem went away.

My first computer, an IMSAI 8080 which I built from kits, came with
a defective CPU chip. It worked perfectly except for the conditional
return instructions, all of which were taken unconditionally. I went
to the local supplier to get another 8080 chip. The guy behind the
counter shook a new chip out of a tube (imagine, a tube full of CPUs!);
I popped it into my machine and all was well.

Some years later, I got a call from a customer on whose machine (a
Pentium-powered Wintel box) our software was crashing. I spent most
of an afternoon running tests with no luck. The program was running
perfectly in many other sites, and I couldn't find any pattern to the
crashes. Finally, in an attempt to prove to the customer that our
software was OK, I installed it in another of their machines, where
it ran fine. The customer's tech swapped CPU chips between the two
machines, and the production machine started working.

--
/~\ Charlie Gibbs | Microsoft is a dictatorship.
\ / <***@kltpzyxm.invalid> | Apple is a cult.
X I'm really at ac.dekanfrus | Linux is anarchy.
/ \ if you read it the right way. | Pick your poison.

Scott Lurndal

2020-04-19 01:52:51 UTC

Post by J. Clarke
On Sat, 18 Apr 2020 11:33:29 -0700, Peter Flass

Post by Ahem A Rivet's Shot
On Thu, 16 Apr 2020 19:41:24 -0500

Post by Gareth Evans
Has the need for such provisions now gone the way of
the dinosaurs, possibly because of the reliability
of VERY large scale integration?

You don't want to know how many corrected errors your memory system
doesn't tell you about. Or your disk drive. Or for that matter,
probably how many undetected errors just pass silently.

We are not aware of any undetected errors in our systems.

At least, none that you’ve detected ;-)

Incidentally a disk drive will tell you how many errors it has
corrected. And if it gets them regularly it will map out a track. On
a modern drive you're not in trouble until it runs out of spares.

Indeed. On my spinning rust drive:

=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.14 (AF)
Device Model: ST1000DM003-1CH162
Serial Number: 42341244
LU WWN Device Id: 5 000c50 061bvasd1e
Firmware Version: HP34
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Apr 18 18:50:02 2020
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   120   099   006    Pre-fail  Always       -       243495808
  3 Spin_Up_Time            0x0023   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       69
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002f   078   060   030    Pre-fail  Always       -       67650010
  9 Power_On_Hours          0x0032   044   044   000    Old_age   Always       -       49785
 10 Spin_Retry_Count        0x0033   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       70
180 Unknown_HDD_Attribute   0x002a   100   100   000    Old_age   Always       -       2094417780
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   092   092   000    Old_age   Always       -       8
190 Airflow_Temperature_Cel 0x0022   068   054   045    Old_age   Always       -       32 (Min/Max 16/34)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       28
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       238330
194 Temperature_Celsius     0x0022   032   046   000    Old_age   Always       -       32 (0 14 0 0 0)
196 Reallocated_Event_Count 0x0032   100   100   036    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
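
If you want to keep an eye on the handful of counters that matter, a table like the one above is easy to pick apart. The sketch below only parses text that has already been captured; it does not run smartctl itself. The function names and the choice of which attributes to watch are my own assumptions, and the raw values of some attributes (the Seagate error rates in particular) are vendor-encoded and not simple counts.

```python
SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
189 High_Fly_Writes 0x003a 092 092 000 Old_age Always - 8
190 Airflow_Temperature_Cel 0x0022 068 054 045 Old_age Always - 32 (Min/Max 16/34)
"""

# Attributes whose raw value is a plain count of bad or suspect sectors
# (an assumed shortlist, not an official one).
WATCH = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
         "Offline_Uncorrectable", "Reported_Uncorrect")


def parse_smart_attributes(text):
    """Parse attribute rows into a dict keyed by attribute name."""
    attrs = {}
    for line in text.splitlines():
        # Ten columns; the last (RAW_VALUE) may itself contain spaces.
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # header, blank line, or information-section line
        attrs[fields[1]] = {
            "id": int(fields[0]),
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": fields[9],
        }
    return attrs


def worrying_counts(attrs):
    """Return watched attributes whose raw count is not zero."""
    return {name: attrs[name]["raw"] for name in WATCH
            if name in attrs and attrs[name]["raw"].split()[0] != "0"}
```

On the drive shown above all four watched counters are zero, so `worrying_counts` would come back empty.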

r***@gmail.com

2020-04-19 06:33:45 UTC

Post by J. Clarke
On Sat, 18 Apr 2020 11:33:29 -0700, Peter Flass

Post by Ahem A Rivet's Shot
On Thu, 16 Apr 2020 19:41:24 -0500

Post by Gareth Evans
Has the need for such provisions now gone the way of
the dinosaurs, possibly because of the reliability
of VERY large scale integration?

You don't want to know how many corrected errors your memory system
doesn't tell you about. Or your disk drive. Or for that matter,
probably how many undetected errors just pass silently.

We are not aware of any undetected errors in our systems.

At least, none that you’ve detected ;-)

That's on a par with some memory boards for the PC.
Some el-cheapo manufacturers included a chip to generate
parity for the 8-bit memory chips populating the board.
Never get a parity error that way !!
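
To see why that trick hides every fault, compare a parity bit stored at write time with one regenerated from whatever comes back on the read. This is only a sketch of the logic; the function names are invented and no particular board is being modelled.

```python
def parity_bit(byte):
    """Parity bit that makes the total number of 1 bits even."""
    return bin(byte & 0xFF).count("1") & 1


def check_stored_parity(data_read, parity_stored_at_write):
    """Real parity: compare against the bit saved when the byte was written."""
    return parity_bit(data_read) == parity_stored_at_write


def check_generated_parity(data_read):
    """Fake parity: the 'stored' bit is regenerated from the data just read,
    so the comparison can never fail."""
    regenerated = parity_bit(data_read)
    return parity_bit(data_read) == regenerated


if __name__ == "__main__":
    written = 0x41
    saved = parity_bit(written)
    corrupted = written ^ 0x04  # one bit flips in the memory chip
    print(check_stored_parity(corrupted, saved))   # False: error caught
    print(check_generated_parity(corrupted))       # True: error hidden
```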

Bob Eager

2020-04-18 20:32:29 UTC

Post by Ahem A Rivet's Shot

Post by Gareth Evans
Has the need for such provisions now gone the way of the dinosaurs,
possibly because of the reliability of VERY large scale integration?

You don't want to know how many corrected errors your memory system
doesn't tell you about. Or your disk drive. Or for that matter,
probably how many undetected errors just pass silently.

We are not aware of any undetected errors in our systems.

At least, none that you’ve detected ;-)

Years ago we had a mainframe where we worked. OK, an ICL 2960. The memory
was all ECC, and it corrected and reported single bit errors. The ICL
operating system did *nothing* with those reports. Eventually another bit
failed, and we had a detectable (but unrecoverable) two bit error. I
don't think they ever improved that.

We gave up and moved to a home-brew operating system from Edinburgh
University. It logged all the error reports, and did a daily analysis
with a printout for the site engineer. The report stated which board had
the problem, and which chip to change.
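
The daily analysis amounts to counting corrected-error reports per location and flagging the repeat offenders before a second bit goes. A minimal sketch of that idea follows; the record format (`board` and `chip` keys), the function name and the threshold are assumptions for illustration and say nothing about how the Edinburgh system actually stored its logs.

```python
from collections import Counter


def daily_report(events, threshold=3):
    """Count corrected single-bit errors per (board, chip).

    Returns a sorted list of ((board, chip), count) for every location
    that reached the threshold.
    """
    counts = Counter((e["board"], e["chip"]) for e in events)
    return sorted((loc, n) for loc, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    # Hypothetical day's worth of corrected-error reports.
    log = [
        {"board": 2, "chip": "U17"},
        {"board": 2, "chip": "U17"},
        {"board": 5, "chip": "U03"},
        {"board": 2, "chip": "U17"},
    ]
    for (board, chip), n in daily_report(log):
        print(f"board {board} chip {chip}: {n} corrected errors")
```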

--
Using UNIX since v6 (1975)...

Use the BIG mirror service in the UK:
http://www.mirrorservice.org

Terry Kennedy

2020-04-19 01:13:35 UTC