Sunday, July 13, 2014

System Ready

In a previous post I described the power requirements of the Cray J916 and the importance of the Central Control Unit (CCU), which monitors for faults. This is critical, as the machine won't function properly unless these hardware status signals check out.

Back of the IOS and cabling
The back of the I/O Subsystem in the Peripheral Cabinet. This is the cleanest cable management system I've ever seen, but tracing cable paths while re-wiring the system was time-consuming.

With the system re-cabled, we can power it up and begin testing the hardware. We noticed that the CCU contains rechargeable D-size batteries. Yes, we received a donated Cray and batteries were included! They were, of course, very dead. The system had been unplugged for a long time.


CCU batteries
Batteries not included!
We pulled the batteries out and gave them an overnight charge. We figured that the panel would do something with the AC power off, as there are a number of PWR FAULT LEDs and there's otherwise no reason to have rechargeable batteries. But there was nothing. It took a couple of days before we realized that we had unplugged the battery pack from the CCU circuit board and forgotten to plug it back in... Oops.


Fault lights on the CCU
The CCU fault lights are lit even though the power is off. It is running off of battery backup.
Much better. Several lights come on even though the cabinet circuit breakers are still off. The lamp test button works. The buzzer sounds an alarm.

Lamp test on the CCU
LED and Alarm Test on the CCU panel. There is a periodic chirp from the piezoelectric buzzer.
Everything checks out and the big green System Ready light on the front panel lights up! This light mirrors the status of the small one on the CCU. You can see it glowing dimly behind the door, just to the lower left of the Cray logo.

Cray System Ready light
The Processing Cabinet with a big green System Ready light.
Photo by Dave Fischer
There's a switch on the back of the CCU to turn the batteries off. They only power the lights for a few hours if the system is turned off. I keep forgetting to turn them off when I'm done for the day.

2 comments:

  1. You mention that you forget to turn the batteries off when you "are done for the day"; this implies you may be power cycling the machine daily, something these machines were not designed for. A power cycle means a thermal cycle, and a thermal cycle includes expansion and contraction of components, which results in solder fatigue. Mechanical cycles are also hard on these old SCSI and IPI disks.

    1. We typically have a work session about once per week where we power (parts of) the system up for around 8 hours at a time. So, we're not cycling the power every day. Each subsystem has a DC inhibit so we can selectively power on only what we are working on. We haven't done any work on the disks yet, and so those have been left off most of the time since the system arrived. We are installing a newer SCSI array to use in place of the stock drives. We only plan on using the original drives to recover software and to benchmark performance.

      At http://rcsri.org we've been restoring and running early machines for 20 years now, and some of our members worked on these machines as field service techs for years before that, when the machines were new. We're cautious about how we operate these historic systems. Even for machines from the early 1970s, the majority of failures that we've seen are things like power supply capacitors drying out and tape drive capstans physically degrading. In theory you might expect solder joints to fail, but in practice we only occasionally see intermittent failures from expansion causing cards with edge connectors to become slightly unseated.

      For the Cray, there is a large amount of thermal mass in the backplane and also in each of the modules. The cooling system is very effective, and after powering the system down the components are barely warm to the touch. We've been collaborating with our colleagues at http://www.cray-cyber.org, who have restored similar systems, and they have not reported any problems of the kind that you describe.
