Hacker News
by thewonderidiot 2492 days ago
Sadly most of what you read by googling these problems is misinformation. It was actually an incredibly sinister systems integration bug, that wasn't well described even at the time.

It wasn't a switch misconfiguration; the Apollo 11 astronauts were flying to the checklist, and did as they had simulated. The Rendezvous Radar switch has three settings -- LGC, AUTO TRACK, and SLEW. In LGC mode, the AGC controls the positioning of the antenna; in AUTO TRACK, the radar automatically tracks the CSM based on return strength; and in SLEW it is manually positioned.

The trouble came from how the trunnion and shaft angles of the antenna were measured. They used "resolvers", which are sort of like variable transformers. Resolvers look like motors, and attached to the shaft there are two windings positioned 90 degrees apart from each other. An AC "reference voltage" is applied to an outer winding in the case, and that voltage couples onto the two inner windings with a magnitude that depends on the angle of the shaft. One winding (the "sine" winding) produces an output equal to Vref*sin(theta) and the other ("cosine") winding produces an output equal to Vref*cos(theta), where Vref is the reference voltage and theta is the angle of the shaft. The amplitude and phase of both windings can be used to determine exactly what the theta was that produced them.
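
As a rough illustration of that last point (just the trigonometry, not the flight hardware), the two winding amplitudes can be turned back into an angle with atan2. The function names are made up for this sketch, and the sign of each amplitude stands in for its phase relative to the reference:

```python
import math

def resolver_outputs(theta_rad, vref=28.0):
    """Signed amplitudes on the sine and cosine windings for a shaft angle."""
    return vref * math.sin(theta_rad), vref * math.cos(theta_rad)

def decode_angle(v_sin, v_cos):
    """Recover the shaft angle in [0, 2*pi) from the two winding amplitudes."""
    # atan2 uses both signs to pick the quadrant; Vref cancels out of the ratio.
    return math.atan2(v_sin, v_cos) % (2.0 * math.pi)

theta = math.radians(135.0)
v_sin, v_cos = resolver_outputs(theta)
print(math.degrees(decode_angle(v_sin, v_cos)))
```

Note that the decode only works because both windings are compared against the same reference, which is exactly the assumption that breaks later in this story.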

The circuitry to do this is a bit involved though and lived outside the computer, in a device called the CDU, or Coupling Data Unit. The CDU constantly maintained its own idea of what the angle was ("psi") in a digital register. It translated the incoming sine and cosine voltages into a digital representation by mechanizing the equation +-sin(theta-psi) = +-sin(theta)cos(psi) -+ cos(theta)sin(psi). It did so by using the bits of its digital register containing psi to switch on and off resistor dividers that effected cos(psi) and sin(psi) onto the incoming signals, which were then added together with a summing amplifier. The goal of the CDU is to zero this sum; to accomplish this, it "counts" the angle register up or down to reduce the magnitude of the sum. As it counts, switches are changed, which switch out resistors in the circuit, which in turn change cos(psi) and sin(psi) in the above equation. And also, with every other increment, a pulse is transmitted to the AGC to indicate that the angle has changed slightly.
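
A minimal software sketch of that counting loop, under loud assumptions: the 16-bit register size (COUNTS_PER_REV), the stopping threshold, and the cosine check that rejects the false null 180 degrees away are all choices made for the illustration, not details of the real CDU.

```python
import math

COUNTS_PER_REV = 1 << 16  # hypothetical register resolution, chosen for the sketch

def cdu_track(theta_rad, psi_counts=0, max_steps=100_000):
    """Count psi up or down until sin(theta - psi) is nulled.

    Returns (final psi in counts, number of pulses sent to the computer).
    """
    lsb = 2.0 * math.pi / COUNTS_PER_REV
    steps = 0
    pulses = 0
    for _ in range(max_steps):
        psi = psi_counts * lsb
        # The mechanized identity: sin(theta)cos(psi) - cos(theta)sin(psi)
        error = (math.sin(theta_rad) * math.cos(psi)
                 - math.cos(theta_rad) * math.sin(psi))
        # Sketch-only check so the loop does not settle 180 degrees away
        aligned = math.cos(theta_rad - psi) > 0
        if abs(error) < math.sin(lsb) and aligned:
            break  # sum is nulled to within one count
        psi_counts = (psi_counts + (1 if error > 0 else -1)) % COUNTS_PER_REV
        steps += 1
        if steps % 2 == 0:  # a pulse goes out on every other increment
            pulses += 1
    return psi_counts, pulses

psi_counts, pulses = cdu_track(math.radians(30.0))
print(psi_counts * 360.0 / COUNTS_PER_REV, pulses)
```

When the shaft is steady the loop settles and the pulses stop, which is the normal, quiet case.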

The problem comes in because in addition to the above, the CDU also, for many angles, added to the sum some fraction of the reference voltage directly. This is fine when the switch is in the LGC position; the resolvers are supplied with the same 28V, 800Hz reference voltage that is used inside the CDU. However, when the switch is put in either of the other two positions, the reference voltage for the RR resolvers is switched to a separate 15V 800Hz source. Critically, this 15V reference has no defined phase relationship with the CDU's 28V 800Hz reference. Whatever phase offset exists between the two is locked in by the exact millisecond at which you power up your subsystems.

So when the switch is changed, the sine and cosine outputs from the resolver are suddenly derived from the 15V reference -- they are much lower than before and at a random phase. The CDU doesn't know that this has happened, and still tries to perform the summing as before. However, for many theta/phase relationships, it becomes impossible for the CDU to actually null the sum. In these cases, the CDU becomes "manic", and starts seeking back and forth, frantically changing switches to try to figure out what the angle is, but never succeeding.
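
Here is a toy phasor model of why the null can disappear. It is a sketch under heavy assumptions, not the CDU's actual detector: the directly added reference is treated as a constant fraction (ref_fraction, a hypothetical value), the resolver carrier as a complex phasor scaled and rotated relative to the CDU's own reference, and "nulling" as driving the magnitude of the sum to zero.

```python
import cmath
import math

def best_null(theta_rad, amp_ratio, phase_rad, ref_fraction, counts=4096):
    """Smallest |sum| reachable over every value of psi.

    amp_ratio    : resolver excitation relative to the CDU's own reference
    phase_rad    : phase offset between the two references
    ref_fraction : hypothetical share of the CDU reference added directly
    """
    resolver = amp_ratio * cmath.exp(1j * phase_rad)  # resolver carrier phasor
    best = float("inf")
    for n in range(counts):
        psi = 2.0 * math.pi * n / counts
        total = resolver * math.sin(theta_rad - psi) + ref_fraction
        best = min(best, abs(total))
    return best

# Same 28V reference, in phase: some psi drives the sum close to zero.
print(best_null(1.0, 1.0, 0.0, 0.25))
# 15V reference at an arbitrary 60 degree offset: no psi gets near zero.
print(best_null(1.0, 15.0 / 28.0, math.radians(60.0), 0.25))
```

In this model the resolver term and the directly added term no longer lie along the same phasor direction once the phase is off, so counting psi can shrink one component of the sum but never both. A loop that keeps stepping toward a null that does not exist would seek indefinitely.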

This causes a huge flurry of +1 and -1 pulses to the AGC. In order to minimize circuitry, the AGC implemented what were called "unprogrammed" or "cycle-stealing" instructions. The computer only contains a single adder, and adding or subtracting 1 from the current angle requires use of that adder and a memory cycle. Rather than generating a full interrupt, which would require many memory cycles and instructions to handle, the computer simply transparently inserts a single-cycle instruction in between two "programmed" instructions that performs the addition or subtraction. This is totally transparent to software, normally. But with a manic CDU that is incessantly seeking on both RR angles, the AGC receives something close to 12,800 pulses per second, which translates into something around 15% of its total computational time. The landing software had only been designed with a margin of 10% or so.
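
The 15% figure can be checked with one multiplication. The only number not taken from the text above is the AGC's memory cycle time of about 11.72 microseconds, which I am supplying here; each stolen cycle costs one of those.

```python
MEMORY_CYCLE_S = 11.72e-6  # AGC memory cycle time; supplied here, not stated above
PULSES_PER_S = 12_800      # figure quoted above for a manic CDU on both RR angles
DESIGN_MARGIN = 0.10       # "10% or so", as quoted above

stolen_fraction = PULSES_PER_S * MEMORY_CYCLE_S
print(f"{stolen_fraction:.1%} of computer time goes to counter updates")
# prints "15.0% of computer time goes to counter updates"
print("exceeds margin:", stolen_fraction > DESIGN_MARGIN)
```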

The 1202s were also a lot less benign than is often reported. They occurred because of the fixed two-second guidance cycle in the landing software. That is, once every two seconds, a job called the SERVICER would start. SERVICER had many tasks during the landing. In order: navigation, guidance, commanding throttle, commanding attitude, and updating displays. With an excessive load as caused by the CDU, new SERVICERs were starting before old ones could finish. Eventually there would be too many old SERVICERs hanging around, and when the time came to start a new one, there would be no slots for new jobs available. When this happened, the EXECUTIVE (job scheduler) would issue a 1201 or 1202 alarm and cause a soft restart of the computer. Every job and task was flushed, and the computer started up fresh, resuming from its last checkpoint. It was essentially a full-on crash and restart, rather than a graceful cancellation of a few jobs. And contrary to what is often said, the computer wasn't dropping low-priority things; it was failing to complete the most critical job of the landing, the SERVICER.
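
To see how a small overload turns into an alarm, here is a toy model of the slot exhaustion. Everything in it is an assumption made for illustration: the slot count of 7, the 1.8 seconds of work per SERVICER (two seconds less the 10% margin), and the rule that the newest job always runs first. It is not the EXECUTIVE's real scheduling logic.

```python
def first_overflow(work_per_job, stolen_fraction, slots=7, period=2.0, cycles=50):
    """Return the guidance cycle at which no slot is free, or None if it never happens."""
    pending = []  # CPU-seconds still owed to each unfinished job, oldest first
    for cycle in range(cycles):
        if len(pending) >= slots:
            return cycle  # no free slot: this is where the alarm and restart occur
        pending.append(work_per_job)
        budget = period * (1.0 - stolen_fraction)  # CPU time left after cycle stealing
        # The newest job runs first; leftover time goes to older, stale jobs.
        while pending and budget > 0:
            used = min(budget, pending[-1])
            pending[-1] -= used
            budget -= used
            if pending[-1] <= 1e-12:
                pending.pop()  # job finished, slot released
    return None

print(first_overflow(work_per_job=1.8, stolen_fraction=0.0))   # prints None
print(first_overflow(work_per_job=1.8, stolen_fraction=0.15))  # prints 7
```

With nothing stolen, every job finishes inside its two seconds. With 15% stolen, each job ends a sliver short, that sliver is never caught up, and one more slot is lost every cycle until none are left.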

Luckily, the load was light enough that the old SERVICER had usually reached its last duty, the display-updating code, by the time it was preempted by a new SERVICER. This caused times in the descent when the display stopped updating entirely, but the flight proceeded mostly as usual. However, with slightly more load, it was entirely possible that the SERVICER could have been preempted in the attitude control portion of the code, or worse yet, the throttle control portion. Since each SERVICER shared the same memory locations as the last one (there was only ever supposed to be one running at a time), this could have led to violent attitude or throttle excursions, which would certainly have called for an abort. Luckily, this didn't happen -- and the flight controllers didn't abort the mission not because 1202s were always safe, but because they didn't understand just how bad it could be, were the load just a tiny bit higher.

3 comments

Could I ask how you know so much about this, or where I can read something more detailed than the usual story that's reported? Thanks.

Many years now of research and simulation of the system (I led the restoration of the computer mentioned in the article). There's not a single place where you can read everything, unfortunately, aside from the comment above. We're planning on making a video on it in the future. But I can cite sources:

CDU theory of operation (starting PDF page 15): http://www.ibiblio.org/apollo/Documents/HSI-208435-003.pdf

CDU coarse module schematic: https://archive.org/stream/apertureCardBox462NARASW_images#p...

Grumman memo (from 1968!) describing the problem, and mentioning it is due to the reference switching to a 15V 800Hz source: https://www.ibiblio.org/apollo/Documents/Memo-GAEC_LMO_541_1...

Excerpt from the LM-8 Systems Handbook showing the reference voltage RR switch wiring: https://i.imgur.com/fMsQ7RI.png

Don Eyles describes the software side best in his book Sunburst and Luminary (which I highly recommend) but he also talks about it in some detail on his website: https://doneyles.com/LM/Tales.html

> Sadly most of what you read by googling these problems is misinformation

Thanks for such a detailed account. Unfortunately, it will be added to the trove of otherwise categorized -information that Google returns.

Thanks also, very much, for the links you included below.

It is insane how many unknowable little variables need to go impossibly right on the journey to land on the moon and come back with that old technology.