Software Breaks the Hardware
- Posted: 4/30/10
- Category: Electronics Software Anecdote
- Topics: Minicomputer Honeywell RBOS
My first programming job was for the American Express Space Bank at the end of the 1960s. “Space Bank” was a relatively short-lived hotel and motel reservation system run by the credit card people.
A mainframe computer, an IBM System/360 Model 50, contained the database of lodgings and hosted a network of low-speed teletype-like devices connecting the human reservation agents with the hotels.
But the job of keeping track of all the rooms and also communicating with those hundreds, later thousands, of teletype devices was proving to be too much for the mainframe. The decision was made to off-load the telecommunications and that’s where the group I was in entered the picture.
Prior to that point, I had only written a single program for the company, in an archaic language called COBOL, and I’d hated it. In school we’d written a lot of things for different computers in assembler language. And we’d also studied the hardware and how it worked. My degree was in electronics and the relatively simple gates and latches used in digital computers were a no-brainer for me.
Yes, it is accurate to say I’m a geek. I really do understand digital computers.
So when the opportunity arose to switch to the group taking on this interesting new idea of front-ending a mainframe, I jumped at the chance.
We took some Honeywell minicomputers (DDP-516 and others) and wrote everything from the ground up to take over the teletype communications and interact with the IBM mainframe only when needed.
We called the minicomputer’s software the Red Baron Operating System for reasons that I’ll explain in another installment. But by today’s use of terminology, it really was not an “operating system.” Indeed, it was barely more than a kernel, and a darn small one at that.
The 516 we started with had 16K words of memory, 16 bits per word. Everything was written in assembler language and, with only that small amount of memory available, we had to resort to several “tricks” that would be frowned upon today. For example, if a constant was needed somewhere in the program, it was common to look at the values of the instructions to see if one of them happened to have the needed value. If one did, you referred to the location containing that instruction rather than setting aside a whole word just for the constant.
Also, some instructions could be “OR’d together” with other instructions because they used different parts of the computer, so two instructions could sometimes be executed at the same time. The “boot loader” we keyed in through the front panel had to fit in a tiny space and used both of these tricks to keep its size to an absolute minimum. (Forty years later I still remember some of that code!)
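If you’re curious what that first trick looks like in practice, here’s a toy sketch in Python rather than the 516’s assembler (which I won’t try to reconstruct from memory). The instruction words are invented for the example; the point is only the search for a matching bit pattern:

```python
# Toy illustration of "constant scavenging": instead of reserving a word
# of memory for a constant, find an instruction word whose bit pattern
# already equals the needed value and reference that address instead.

def find_constant(assembled_words, needed_value):
    """Return the address of a word whose value equals needed_value,
    or None if no instruction happens to match."""
    for address, word in enumerate(assembled_words):
        if word == needed_value:
            return address
    return None

# Hypothetical assembled program: one 16-bit instruction word per entry.
program = [0o101040, 0o000017, 0o140040, 0o002000]

addr = find_constant(program, 0o000017)
if addr is not None:
    print(f"Borrow the word at address {addr} as the constant 0o17")
else:
    print("No match; a whole word must be set aside for the constant")
```

The obvious hazard, of course, is that editing an instruction can silently change a “constant” that some other part of the program was borrowing.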
All the development programs ran on the same machine. During development, we would punch cards with the source code and then load the assembler program onto the machine and feed in the punch cards. The cards were run through twice for one assembly: the first pass assigned an address to every symbol, and the second pass generated the object code. If you wanted a hard-copy listing, that was a third pass that printed the source code along with the assigned instruction and data addresses.
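For a feel of why the deck went through twice, here’s a minimal two-pass assembler sketch in Python. The three-instruction language and its encodings are invented for illustration; they are not the Honeywell’s:

```python
# Why two passes: the first pass assigns an address to every label, so
# the second pass can encode forward references to labels not yet seen.

OPCODES = {"LDA": 0o1, "STA": 0o2, "JMP": 0o3}   # invented encodings

def assemble(source_lines):
    # Pass 1: record the address of each label.
    symbols, address = {}, 0
    for line in source_lines:
        if ":" in line:
            symbols[line.partition(":")[0].strip()] = address
        address += 1

    # Pass 2: emit one word per instruction, resolving operands.
    words = []
    for line in source_lines:
        stmt = line.partition(":")[2] if ":" in line else line
        op, operand = stmt.split()
        target = symbols[operand] if operand in symbols else int(operand, 8)
        words.append((OPCODES[op] << 12) | target)
    return words

print([oct(w) for w in assemble(["START: LDA 100", "JMP START"])])
```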
When everything was ready, that same machine went “on-line” by loading in the object code, starting it and connecting with the mainframe. My part was the mainframe connection and I remember an extraordinary number of frustrating hours spent in debugging until the vendor (finally) produced some adequate documentation for the odd device that cross-connected mainframe and minicomputer.
We had two Honeywell 516s, one on-line and one that was available for development. The development machine was also available to take over should the on-line machine fail.
And when we first went on-line, within a couple of days it did exactly that. It failed.
Back at that time, computers used core memory. Each “core” was a donut-shaped piece of magnetic material and it held one bit of information. Each core could be magnetized with the field going clockwise or counter-clockwise. One direction meant the bit was a “1” while the other direction meant “0”. The computer programs and all their data were stored in an array of these tiny cores – 32,768 words of 17 cores each.
That extra core, number 17, held a “parity” bit and helped detect when one of the other 16 cores in a word failed.
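In code, the idea behind that seventeenth core looks something like the sketch below. Whether the 516 used odd or even parity is a detail I won’t vouch for; odd parity is assumed here:

```python
# Sketch of word parity: a 17th "check" bit stored with each 16-bit word
# so that any single failed core is detectable when the word is read back.
# Assumes odd parity: the check bit makes the total count of 1s odd.

def parity_bit(word):
    ones = bin(word & 0xFFFF).count("1")
    return 0 if ones % 2 == 1 else 1      # force an odd 17-bit total

def parity_ok(word, check):
    """True if the word plus its check bit still totals an odd number of 1s."""
    return (bin(word & 0xFFFF).count("1") + check) % 2 == 1

w = 0o052525
p = parity_bit(w)
assert parity_ok(w, p)                    # a freshly written word passes
assert not parity_ok(w ^ 0o000400, p)     # flip any single bit: parity error
```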
Several days after going on-line, the 516 failed. The stand-by machine was started and placed on-line while the hardware engineer from Honeywell – Stan was his name – came out and ran a diagnostic program. It said there was a parity error in the broken machine. That meant one of the little cores wasn’t working.
Stan called the factory and ordered a replacement.
A few days later the replacement core array arrived and Stan installed it. Stan deemed the machine repaired and ready for use.
And that was lucky because, within a handful or two of days, the second machine, the one that had gone on-line to replace the broken one, failed. And again, it was a parity error.
The machines were swapped and another core array ordered.
And a week later, you guessed it, another parity error in the then on-line machine.
We’d had these computers for months while we wrote and tested the software, a few hours at a time. They had worked flawlessly during all that time without a single problem.
But now it seemed that, regardless of which machine was on-line, the mere fact of it being on-line guaranteed it would fail within a few days or weeks.
Stan, the Honeywell engineer, said he maintained computers for several other companies in the area and he had never before seen a parity error on this model, much less three in quick succession.
I remember standing outside his on-site office one day with the computer listing of the RBOS software in my arms. I wanted to ask him a question about some now-forgotten detail but Stan was on the phone with the factory and I had to wait.
“I don’t understand it,” I could hear him saying. “I’ve got three core arrays here on the table in front of me and all of them have a bad memory location.”
“And what’s really weird,” he went on, “is that all three have failed at the exact same memory address: 1534.”
Overhearing this, naturally I looked up that address in the listing.
It was the instruction that the RBOS used when there was nothing to do: a “jump to self” instruction.
For the next part of the story, there’s some technical detail that’s essential.
Core memory, as I mentioned, is magnetized in one direction or the other to signify a “1” or a “0”.
But to read whether a core holds a “1” or a “0”, you have to try to magnetize it to a “0”. If it was previously a “0”, you get a small “bump” of electrical current in what’s called the “sense wire”; if it was a “1”, you get a much bigger bump. The electronics tell a “1” from a “0” by the size of the bump.
But a side effect is that “reading” a memory location actually erases it, so you then have to go back and write the correct value back in. (That also meant that “read” operations took about twice as long as “write” operations.)
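A tiny simulation makes that read-restore cycle concrete. This models the behavior just described, not the 516’s actual circuitry:

```python
# Simulating a destructive core read: sensing a bit forces it toward 0,
# so the memory controller must immediately write the old value back.

class CoreBit:
    def __init__(self, value=0):
        self.value = value
        self.pulses = 0              # magnetize operations, i.e. heat pulses

    def sense(self):
        """Drive the core to 0 and report what it previously held."""
        old, self.value = self.value, 0
        self.pulses += 1
        return old

    def write(self, value):
        self.value = value
        self.pulses += 1

def read(core):
    """A full read cycle: destructive sense, then restore the old value."""
    v = core.sense()
    core.write(v)                    # the restore is why reads cost ~2x
    return v

bit = CoreBit(1)
assert read(bit) == 1 and bit.value == 1
print("magnetize pulses for one read:", bit.pulses)    # prints 2
```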
Although made from very high quality materials to exacting standards, the little core donuts were not 100% efficient. To write a bit into one of the donuts, or to read it out and then re-write the value back in, you had to use slightly more energy than the donut would store in its magnetic field. A little bit of energy was lost, and it turned into a tiny amount of heat. And that heat had to go somewhere.
So, back to the story.
There I am standing outside Stan’s office with the computer listing in my hands looking at the “jump to self” instruction and thinking about that tiny amount of excess heat. I knew that, when the computer had no real work to do, it would be “idle” and reading out that same memory location millions and millions of times over several seconds. Each time it did, a little bit of energy turned into heat, and that heat was absorbed by the little core donut.
But when the computer got busy doing something and was no longer idle, the donuts at that “jump to self” memory location weren’t used and they would have time to cool off.
As I’m sure you know, when most physical objects change temperature, they expand or contract.
They flex.
And when things are flexed, they fatigue and, in some cases, they eventually break.
So I went back to the other software engineers and explained my budding theory of how the software was breaking the hardware: accessing that one memory location millions of times heated it up; then the machine went off to do something else and let it cool; and the repeated expansions and contractions eventually fatigued the donut cores at that one memory address more than any other, until one of them cracked.
That, I explained, is what’s causing the parity errors and, because that one “jump to self” instruction is unique to the on-line software, that’s why we’re only seeing the failure in the on-line system.
So I proposed, “Let’s change the idle to a loop of four instructions: no operation, no operation, no operation, and jump back to the first no operation.” (“No operation” is a dummy instruction typically spelled NOP and, as the name indicates, it doesn’t do anything.)
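A back-of-the-envelope simulation shows why that change mattered. Treating each instruction fetch as one heat pulse in one core word, the jump-to-self idle concentrates every pulse on a single address, while the four-instruction loop spreads the same traffic four ways. I’m treating the story’s address 1534 as octal here, and the neighboring addresses are assumed:

```python
# Compare where idle-loop instruction fetches land in memory. Each fetch
# is a destructive read, i.e. one heating pulse in that word's cores.

from collections import Counter

def run_idle(loop_addresses, fetches):
    """Count fetches per address for an idle loop over the given addresses."""
    heat = Counter()
    for i in range(fetches):
        heat[oct(loop_addresses[i % len(loop_addresses)])] += 1
    return dict(heat)

FETCHES = 1_000_000                            # a few seconds of idling

print(run_idle([0o1534], FETCHES))             # jump-to-self: one hot word
print(run_idle([0o1534, 0o1535, 0o1536, 0o1537], FETCHES))  # the NOP loop
```

Each word now absorbs a quarter of the pulses and has three fetch-times to cool between them; evidently that was enough to keep the thermal cycling below the fatigue point.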
The other engineers were skeptical but intrigued so we made the change and at the next opportunity, put the change into the on-line system but didn’t tell anyone outside our little software group for fear they’d laugh.
But there was never another parity error.
Stan, if you’re out there somewhere reading this, now you know: The software really did break the hardware!