New Horizons tripped up but recovered itself without a nasty spill last week. This event occurred on the afternoon of March 19, precisely 14 months to the day since we launched.
What do I mean by saying that the spacecraft “tripped”? What actually happened was that an uncorrectable error was detected in the memory of our primary Command and Data Handling (C&DH) computer, which is the “brains” of New Horizons. Although onboard error-detection routines can and did recognize the error within seconds of its occurrence, it was so severe (a double-bit error in a single memory word) that our error-correction algorithm had no way to unambiguously restore the correct series of 1’s and 0’s in this memory location. (Our memory, like that on many other spacecraft, is encoded such that a single-bit error can be both detected and corrected; a double-bit error can be detected, but there isn’t sufficient information encoded to be sure how to correct it.)
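For readers curious how a memory can fix one flipped bit yet only flag two, here is a minimal sketch of the general idea, using a textbook extended Hamming code on a 4-bit value. It is an illustration only: the function names, the word size, and the bit layout are assumptions made for this example, not the encoding actually used in the New Horizons C&DH memory.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit codeword (toy extended Hamming code)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0: overall parity, 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # overall parity over the whole word
    return code


def decode(code):
    """Return (nibble, status); nibble is None when the error is uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    overall = sum(code) % 2             # 0 for any valid codeword

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assume one. The syndrome names the bad position
        # (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity looks fine but the syndrome is nonzero: two bits flipped,
        # and nothing in the word says which two.
        return None, "uncorrectable"

    d = [code[3], code[5], code[6], code[7]]
    return sum(bit << i for i, bit in enumerate(d)), status
```

Flipping any one bit of `encode(0b1011)` and decoding returns the original value with status `"corrected"`; flipping any two returns `(None, "uncorrectable")`, which is the situation described above.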
Since a bad word in C&DH memory could invoke an unpredictable spacecraft action, C&DH is programmed to command itself to reboot (and thus restore memory from a boot PROM) whenever such a double-bit error is detected. But whenever C&DH resets, our onboard autonomous fault-detection and -protection system declares an emergency and commands the spacecraft to suspend all current activities and go to a “Safe Mode.”
Given the spacecraft mode and state on the afternoon of March 19, the result was that the bird was spun up from 3-axis control to a stable 5-RPM spin, and its antenna was pointed to Earth and commanded to call home for help. New Horizons also commanded itself to shut off unessential power loads (like the PEPSSI and SWAP instruments, which had been collecting Jupiter magnetotail data) and go to an emergency (low) bit rate for downlink.
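The chain of events just described can be summarized in a short sketch. This is a toy model written only to show the order of the steps; every class, field, and function name here is hypothetical, and it bears no relation to the actual flight autonomy software.

```python
from dataclasses import dataclass, field

# Hypothetical list of loads treated as nonessential in this toy model.
NONESSENTIAL = {"PEPSSI", "SWAP"}


@dataclass
class Spacecraft:
    # All field names and starting values are illustrative assumptions.
    mode: str = "nominal"
    attitude: str = "3-axis"
    spin_rpm: float = 0.0
    antenna_on_earth: bool = False
    bit_rate: str = "nominal"
    powered: set = field(default_factory=lambda: {"PEPSSI", "SWAP"})
    log: list = field(default_factory=list)


def go_safe(sc):
    """Apply the Safe Mode steps in the order the text lists them."""
    sc.mode = "safe"
    sc.attitude = "spin"
    sc.spin_rpm = 5.0                 # stable 5-RPM spin
    sc.antenna_on_earth = True        # point the antenna home
    sc.powered -= NONESSENTIAL        # shed nonessential power loads
    sc.bit_rate = "emergency"         # low downlink rate
    sc.log.append("Go Safe complete")


def handle_memory_error(sc, bits_in_error):
    """One flipped bit is corrected in place; two force a reboot and Go Safe."""
    if bits_in_error == 0:
        return
    if bits_in_error == 1:
        sc.log.append("single-bit error corrected")
        return
    sc.log.append("C&DH reboot, memory restored from boot PROM")
    go_safe(sc)
```

Calling `handle_memory_error(Spacecraft(), 2)` leaves the model spinning at 5 RPM in safe mode with the two instruments off, while a single-bit error leaves it in its nominal state.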
In an amazing stroke of luck, the NASA Deep Space Network and our control center at the Johns Hopkins Applied Physics Lab were actually in contact with the spacecraft when this event occurred, so our ground team saw – in real time – the double-bit error, the resulting C&DH computer reset, and the spacecraft commanding itself to “Go Safe.”
The Go Safe maneuver itself resulted in a temporary loss of contact with our baby, but within about 90 minutes, New Horizons was back in communication with Earth, and shortly thereafter, the ground control team at APL had re-established commanding capability.
This was the first time New Horizons had commanded itself to Go Safe in flight, and both the spacecraft and the APL ground team responded expertly. As a result, we regained spacecraft control quickly, and we were back in a nominal operations configuration – taking science data again – in less than two days.
What actually caused the spacecraft C&DH memory to be corrupted with a two-bit error in a single C&DH address? We’re still trying to determine that, but early indications are that it was related to a burst of four bit errors within a short time, which may have been due to radiation from the RTG or from the natural space environment. Such multi-event bursts have not been uncommon in flight, but only once before has one resulted in a double-bit error, and that was in the Guidance and Control (G&C) processor, not the C&DH. The event is less critical in the G&C processor, because the spacecraft can operate through such an event, so no Go Safe was required.
Will such Go Safes happen again? Quite possibly. Can we find a way to better protect against such events so that they occur less often? Maybe, and we’re looking into it. Will the spacecraft take care of itself as it did this time? Our confidence is high that it will – extensive ground testing of the autonomy system and its Go Safe response paid off on March 19, and because of the test of the Go Safe function in flight last Monday, we have even greater confidence in our “autopilot” than we did from ground testing alone.
Of course, no one – and most particularly this mission PI – wanted such an in-flight test of contingency procedures. But New Horizons didn’t ask our permission, and we got our Go Safe test, like it or not. What we learned as a result is that our flight system – both the silicon part in space and the carbon part in Maryland – responded with grace and precision to recover without causing any real injury.
This event is a reminder of the very real risks of space flight and the long journey we have ahead in order to accomplish our goal of reconnoitering the Pluto system at the far end of the planetary frontier. So we proceed with both confidence and a renewed sense of the fact that we are playing for keeps. We are now also back to downlinking Jupiter encounter data; back to taking Jovian magnetotail measurements; and back to preparing to initiate hibernation operations this summer. Onward we go, into the cold, yawning abyss that is the outer solar system, with our eyes, minds and hearts firmly fixed on our goal of a history-making scientific exploration of worlds where no one has gone before!
Well, that’s all I wanted to tell you about this time. I’ll be back with more news in another update in April. In the meantime, keep on exploring, just as we do.
– Alan Stern