Copyright Peter G. Neumann, June 1999,
although freely quotable with appropriate acknowledgement.
THIS DOCUMENT includes NEW MATERIAL that supplements the fourth printing of the book, Computer-Related Risks, Addison-Wesley, 1995, ISBN 0-201-55805-X. (An ERRATA LIST for the first three printings is available at http://www.csl.sri.com/neumann.html in case you have an earlier printing.)
For the foreseeable future, there will be no second edition. Instead, subsequent printings will simply refer to more recent on-line material. I have edited several Inside Risks articles, and shown how they relate to the existing book chapters. This version includes material through May 1999. Recent material will continue to be added progressively.
For my convenience, if not for yours, the chapter and section numbers are coordinated with the printed book. The indexed page references are for this incremental draft.
I am grateful to Otfried Cheong's Hyperlatex for making this on-line browsable version so easy to produce. Cross-references, references, and index items are all clickable. The unbound cross-references to existing sections of the printed book will be correctly inserted later. This on-line supplement to the book is intended to be a living document that will continue to grow, and which is intended to supplant the need for a second edition. There is no other way to keep the printed book up-to-date.
The reader is encouraged to look at my on-line one-liner RISKS index,
Illustrative Risks to the Public in the Use of Computers and Related
Systems, which is accessible in html form at
http://www.csl.sri.com/neumann/illustrative.html and also in .pdf and
.ps form. Newer references favor the on-line Risks Forum directly, rather
than the less accessible RISKS Sections of the ACM Software Engineering
Notes. The on-line RISKS archives are available with a classy search
engine at http://catless.ncl.ac.uk/Risks/.
PREFACE TO THE SECOND EDITION
Much has happened since this book originally went to press. There have been many new instances of old problems previously documented, but relatively few new types of problems. In some cases, the technology has progressed - although in those cases the threats, vulnerabilities, risks, and expectations of system capabilities have also escalated. On the other hand, sociopolitical considerations have not contributed noticeably to any lessening of the risks. Basically, all of the conclusions of the book seem to be just as relevant now - if not more so. Thus, the second edition amplifies and clarifies rather than modifies.
Several recent successes are noted. For example, subsequent to the retrofits to overcome the initial mirror flaw in the Hubble Space Telescope (see Page 204 of the book), further hardware and instrumentation upgrades have enabled some spectacular new discoveries. (However, one of three NICMOS cameras remains out of focus.) Also, recent Space Shuttle missions seem to have had fewer problems than those recorded for the earlier flights - although minor problems have continued, and the Space Station has undergone many difficulties.
Here are just a few specific recent problems of particular note:
There are many additional cases that can be characterized as "more of the same."
A summary of essentially all interesting RISKS cases can be found at ftp://ftp.csl.sri.com/illustrative.PS. My Web site http://www.csl.sri.com/neumann.html contains further information, including my 1996 testimony for the Permanent Subcommittee on Investigations of the Senate Committee on Governmental Affairs [26] on security risks in the infrastructure, analysis of risks relating to the Social Security Administration's PEBES (Personal Earnings and Benefit Estimate Statement) Web site and related identity-related risks, and my 1997 testimony for the Senate Judiciary Committee on risks in key-recovery.
If you wish to catch up with recent events and you are able to browse the Internet, you are encouraged to peruse the RISKS archives - at ftp://ftp.sri.com/risks or http://catless.ncl.ac.uk/Risks/VL.IS.html (where VL is the VoLume number and IS the ISsue number). The ftp.sri.com site directory "risks" and ftp.csl.sri.com both contain the most recent PostScript copy of my comprehensive historical summary of mostly one-line descriptions of the RISKS cases in the file, illustrative.PS. Further instructions for on-line access are given in Appendix . Additional material can also be found in each regular issue of the ACM Software Engineering Notes and in the Inside Risks column in each issue of the Communications of the ACM.
Internet routing black hole. On April 23, 1997, at 11:14 a.m. EDT, Internet service providers lost contact with nearly all of the U.S. Internet backbone operators. As a result, much of the Internet was disconnected, some parts for 20 minutes, some for up to 3 hours. The problem was attributed to MAI Network Services in McLean, Virginia (www.mai.net), which provided Sprint and other backbone providers with incorrect routing tables, the result of which was that MAI was flooded with traffic. In addition, the InterNIC directory incorrectly listed Florida Internet Exchange as the owner of the routing tables. A "technical bug" was also blamed for causing one of MAI's Bay Networks routers not to detect the erroneous data. Furthermore, the routing tables Sprint received were designated as optimal, which gave them higher credibility than otherwise. Something like 50,000 routing addresses all pointed to MAI.1
Internet nameserver problem affects .com and .net domains. Around 11:30 p.m. EDT on July 16, 1997, Network Solutions Inc. attempted to run the autogeneration of the top-level domain zone files, which resulted in the failure of a program converting Ingres data into the DNS tables, corrupting the .com and .net domains in the top-level domain name server (DNS), maintained by NSI. Quality-assurance alarms were evidently ignored and the corrupted files were released at 2:30 a.m. EDT on July 17 -- with widespread effects. Other servers copied the corrupted files from the NSI version. Corrected files were issued four hours later, although there were various lingering problems after that.2
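One way to keep quality-assurance alarms from being ignored is to make them a hard gate on release rather than an advisory. The following is a minimal sketch under stated assumptions: the function name, the record counts, and the 5% threshold are hypothetical illustrations and are not drawn from NSI's actual procedures.

```python
def safe_to_publish(previous_records: int, new_records: int,
                    max_drop: float = 0.05) -> bool:
    """Return True only if the regenerated zone looks plausible.

    Hypothetical release gate: refuse to publish when the new zone is
    empty or has lost more than max_drop of the previous record count.
    """
    if new_records <= 0:
        return False  # an empty zone is never plausible
    if previous_records <= 0:
        return True   # nothing to compare against
    drop = (previous_records - new_records) / previous_records
    return drop <= max_drop


if __name__ == "__main__":
    # Made-up counts for illustration only.
    print(safe_to_publish(1_000_000, 998_500))  # small change
    print(safe_to_publish(1_000_000, 400_000))  # large loss
```

A check this crude would not catch every kind of corruption, but a failed check blocks the release instead of merely raising an alarm that an operator can overlook.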
Network Solutions goof bumps Nasdaq off the Internet (Will Rodger, RISKS-19.34)
The Nasdaq stock exchange was knocked off much of the Internet for several hours on 19 Aug 1997 as a result of administrative errors at the InterNIC, a centralized Internet address clearinghouse run by Network Solutions Inc. of Herndon, Va. Though the problem was initially invisible to Nasdaq, which maintains its own database of Internet addresses, the temporary suspension of access to the exchange's site blocked users of major computer networks - including those owned by IBM Corp., MCI Communications Corp., PSINet Inc. and UUnet Technologies Inc. As a result, Nasdaq was unreachable to most Internet users for at least several hours Tuesday morning. Problems with the Web site had no effect on the functioning of Nasdaq itself. The snafu was due to a clerical error at NSI, which evidently lost track of Nasdaq's $50 fee, submitted in October 1996. [PGN Abstracting, from article by Will Rodger, in Inter@ctive Week Online, 21 Aug 1997]
Will remarked that things like this seem to be occurring more often. The weekend before, more than 5,000 Web sites were blocked for over 24 hours, when Web Communication Inc and other domains were bumped from the Internet after a screw-up in routine InterNIC maintenance.
Redundant virtual circuits both fail. A report from Finland indicated that the main and reserve lines between Oulu and Kajaani went through the same physical circuit, despite an agreement with Finnnet that they should be separate.3
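Whether two logically separate circuits are physically diverse can be checked mechanically if the carrier discloses the route of each. The sketch below assumes such route data is available; the segment identifiers are invented for illustration and do not describe the actual Oulu-Kajaani lines.

```python
def shared_segments(primary_path, backup_path):
    """Return the physical segments that both logical circuits traverse."""
    return sorted(set(primary_path) & set(backup_path))


# Hypothetical segment identifiers, for illustration only.
main_line = ["oulu-pop", "duct-17", "kajaani-pop"]
reserve_line = ["oulu-pop-b", "duct-17", "kajaani-pop-b"]

if __name__ == "__main__":
    common = shared_segments(main_line, reserve_line)
    if common:
        print("Not diverse; shared segments:", common)
    else:
        print("No shared physical segments found")
```

Any non-empty result means that a single cut can take out both the main and the reserve line.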
MCI Internet gateways choked. MCI's inbound Internet gateways were saturated during July 1994, resulting in days of delay in delivering e-mail to MCI customers. A fix was considered to be months in the offing.4
Vandals cut cable, slow MCI service. MCI's telephone traffic between New York City and Washington was disrupted for almost four hours when vandals removed a 20-foot section of fiber-optic cable in Newark on August 26, 1994.5
Netcom crash. Netcom, Inc. (now part of ICG Communications Inc.) went down for more than 14 hours during the week of June 17, 1996, because of an extra "&" in the border gateway protocol code in the MAE-East router in the Washington, D.C., area. Recovery required that all of the more than 100 routers be brought down.6
Prodigiously prodigal Prodigy commercial. Alan Wexelblat reported seeing a commercial for Prodigy's on-line computer service during Game 6 of the 1994 Stanley Cup finals on ESPN. The ad cut to a live computer screen showing Prodigy. Suddenly, a big window came up on the screen, saying communication error. The ad was talking about how great the hockey game was, but that it didn't compare to the excitement available on Prodigy. Apparently, at that time Prodigy users observed that the system locked up for almost a minute, and then their screens went completely blank. ESPN quickly cut away to another commercial. The curse of the live demo!7
Prodigy misdirects or loses e-mail messages. A software glitch on March 10, 1995, caused Prodigy's e-mail system to send 473 e-mail messages to incorrect recipients and to lose 4,901 other messages. The system had to be shut down for five hours.8
Microsoft, AT&T, AOL netwoes. Microsoft shut down its nationwide network on June 23, 1996, for 10 hours as part of an intended backup power-supply upgrade, but the upgrade failed and they had to try again.
AT&T had to shut down its Internet access for up to 8 hours each week, for maintenance.
America Online was out of service for an hour on June 19, 1996, when a planned system software upgrade backfired.9 AOL's computer systems (near the Dulles Airport facility in Virginia) went down at 4 a.m. EDT on August 7, 1996. Service was reportedly restored sporadically 19 hours later, around 11 p.m. EDT. The crash was caused by new software installed during a scheduled maintenance update. Earlier in the same week an AOL representative had said that AOL computers are "virtually immune" to this kind of outage.10
On December 2, 1996, AOL's main server building flooded, knocking out the entire AOL network for hours and denying E-mail service for hours more after that.11 On February 5, 1997, AOL's network succumbed to a problem during a software upgrade, and was off the air for more than two hours.12 More extensive AOL e-mail outages were required in early April 1997, when service was suspended for several days in order to do an upgrade.13
Explosion causes Internet blackout in New England (Edupage, R 19 29-30)
More than 200 New England businesses experienced a four-hour Internet blackout on 7 Aug 1997 after an explosion knocked out electrical power in the Boston area. One person was killed in the blast, which overloaded a panel switch at MIT, causing a fire and cutting off Internet access to BBN Planet customers. Access resumed around 10:00. The speed with which the incident happened made it impossible to reroute traffic, said a BBN spokesman. (TechWire, 8 Aug 1997; Edupage, 10 Aug 1997)
No network, no demo (Martin Minow)
Larry Ellison, CEO of Oracle Inc, and a strong proponent of network computers, was demo-ing his network computer at the Oracle OpenWorld conference. Unfortunately, the network crashed and the application hung "and Ellison was left hanging on stage."
Attack on fiber-optic cables causes Lufthansa delays. On February 1, 1995, unknown attackers severed 7 fiber-optic cables near the Frankfurt/Main airport. About 15,000 telephone lines were interrupted. The cables also carried data for Lufthansa's booking computers; consequently, new reservations had to be made manually. As Lufthansa's main computers (at Frankfurt airport) were cut off for some time, delays of up to 30 minutes were caused.14
Ground-cable removal blows Iowa City phone system upgrade. On November 19, 1994, Iowa City's US West telephone system shut down at about 3:30 p.m., local time, and service was gradually restored between 7:30 and 9:30 p.m., affecting about 60,000 people. Analysis showed that a new switching system had been installed in July 1994. In removing the old system, an electrical grounding cable had been inadvertently removed.15
Garbage-truck worker wipes out telephone service. A cowboy garbage-truck driver in Oregon playing the game of "swing the cables" with his fork lift accidentally severed a cable that disrupted service for a wide area of subscribers.16
Disruption from stolen cables. In Ulan-Ude, Russia, a man harvested 60 meters of cable, disabling external phone service on June 19, 1997. Previously, 2 thieves in eastern Kazakhstan were electrocuted trying to steal high-voltage copper wires. In a much older case recalled by Cliff Krieger, a computer backup system failed when it was needed because a cable had been stolen at the Korat Royal Thai Air Force Base in 1973.17
Swedish telephone outage (Danny Kohn) (R 20 29)
After a number of ISDN outages last year and some this year in the country, our nationally owned telco Telia had two big outages in the capital of Stockholm. It happened the first time 15 Mar 1999, when millions of phone lines including the police headquarters' PBX were unusable for 8 hours! The outage was repeated exactly a week later between 10:25am and 11:05am, when incoming calls to the police PBX and to another 250 business PBXs were blocked.
The second outage is explained as an intermittent error that disturbed the communication between PBXs and the telco equipment. In addition, the software that would localize the problem had a bug so that the error would not display.
What comes to mind is that telco exchanges are often purchased in international competition. A telco operator cannot see through the software. But given the complexity, neither can the producer - we might not have bugs if they could. So, if an intruder paid by some nearby country wanted to, he could program some code "detonating" as part of a war attack.
Computer error costs MCI $millions. MCI reported that they will refund approximately $40 million due to a computer error. This was the aftermath (!) of a slight billing error uncovered by investigative reporters from a local television station, WRIC in Richmond, Virginia, who in pursuing it found that it was a widespread phenomenon.18
Bell Atlantic 411 outage. On November 25, 1996, Bell Atlantic had an outage of several hours in its telephone directory-assistance service, due apparently to an errant operating-system upgrade on a database server. For unknown reasons, the backup system also failed. The result was that for several hours 60% of the 2000 telephone operators at 36 sites had to take callers' requests and telephone numbers, look up the requested information in printed directories, and call the callers back with the information. Apparently, the problem was solved by backing out the software upgrade. This was reportedly the most extensive such failure since operators began using computerized directory assistance.19
MFS Communications switch fails, with widespread effects (Steven Bellovin)
Around 7 p.m. on the evening of 8 Sep 1997, the main MFS Communications switch (MFS Switch One) failed, downing UK telecommunications links provided by MFS, Worldcom, and First Telecom. The outage also affected most of CompuServe's UK customers, whose access is typically via an MFS phone number. [PGN Stark Abstracting. Evening usage is not necessarily off-peak, because it is an excellent time to access computers in the U.S. No one has yet reported how long it took to restore service. PGN]
Satellite transmission snafu leads to diplomatic incident (Nick Brown)
On 19 Jul 1997, a "technical error" caused the contents of a channel on a satellite (operated by France Telecom) to be transmitted on another channel, for about twenty minutes. Normally this would have been merely annoying for the viewers. However, these viewers were in (among other places) Saudi Arabia, the channel they expected to be watching was the French government-run, general interest and news station, Canal France International (CFI), and the program which replaced it was a hard-core pornographic movie that should have been shown on the subscription-only, encrypted French domestic station, Canal Plus. As a result, Arabsat cancelled its contract with France Telecom, claiming that France Telecom had not "honoured its commitment to respect Arabic and Islamic values." The French Foreign Ministry and the French Ambassador in Riyadh are trying to calm what has become a diplomatic incident.
Indian satellite failure (Scott Lucero)
According to the 6 Oct 1997 Daily Brief, officials in India say the country's most advanced communications satellite was abandoned on 5 Oct 1997 due to a power failure aboard the craft. The loss of the satellite reportedly affected communications to remote parts of the nation and the operation of satellite-dependent functioning of India's stock exchange. This appears to be an example of the familiar RISK of having a single point of failure, or, more colloquially, putting all your eggs in one basket.
Blown fuse takes out 911 system. A blown fuse took out a large portion of Iowa's 911 emergency phone system for three hours over the 1996 Thanksgiving weekend. U.S. West could not say how many 911 calls went unanswered. A spokesperson said that the problem came from the complexity of the system.20
San Francisco 911 system woes. San Francisco tried for at least three years to upgrade its 911 system, but computer outages and unanswered calls remain rampant. For example, on October 12, 1995, the dispatch system crashed for over 30 minutes in the midst of a search for an armed suspect (who escaped). The dispatch system was installed two months before as a temporary fix to the recurrent problems, and it too suffered unexplained breakdowns. Screens freeze; vital information vanishes; and roughly twice a week the system crashes. Dispatchers are not able to answer between 100 and 200 calls a day. Many nonemergency calls are also being lost. The reported extremely stressful working conditions seem similar to those experienced by air-traffic controllers. The 911 system collapsed again on November 4, 1995, for an hour; the absence of an alarm left the collapse undetected for 20 minutes.21
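The 20 minutes during which the collapse went unnoticed is the kind of gap that a heartbeat watchdog is meant to close: the monitored system reports in periodically, and silence itself triggers the alarm. The following is only a sketch; the class name and the 60-second timeout are assumptions, and nothing here reflects how the San Francisco dispatch system was actually built.

```python
class Watchdog:
    """Flags a monitored system as stale when heartbeats stop arriving."""

    def __init__(self, timeout_seconds: float, started_at: float = 0.0):
        self.timeout = timeout_seconds
        self.last_beat = started_at

    def beat(self, now: float) -> None:
        """Record a heartbeat received at time `now` (in seconds)."""
        self.last_beat = now

    def is_stale(self, now: float) -> bool:
        """True if no heartbeat has arrived within the timeout."""
        return (now - self.last_beat) > self.timeout


if __name__ == "__main__":
    dog = Watchdog(timeout_seconds=60.0)
    dog.beat(now=10.0)
    for t in (50.0, 71.0):
        print(t, "ALARM" if dog.is_stale(t) else "ok")
```

Times are passed in explicitly so that the logic can be exercised without a real clock; the watchdog itself must of course run on hardware independent of the system it monitors.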
Software bug cripples Singapore phone lines. A bug in newly-installed computer software corrupted one of the two common channel signaling systems, affecting 26 out of 28 exchanges, and knocking out two-thirds of Singapore's telephone lines on October 12, 1994. Handphones, fax machines, pagers and credit cards were all hit by the disruption, which began at 11:31 a.m. in the City Exchange. It took Singapore Telecom's engineers about five hours to get services back to normal again. Fortunately the old backup system was still running side by side with the new system.22
Calling-Number ID ghosts calls. In early March 1995, a Detroit area woman looked at her Calling-Number Identification unit (misnamed Caller ID) and was puzzled to notice that it indicated 19 received calls that evening, even though only one person had called. Then she checked the names listed. John F. Kennedy. Thomas Paine. Harry S Truman. John Hancock. Ulysses S. Grant. Samuel Clemens. Ronald Reagan. And many others. Most of the phone numbers were non-working, but a few were. A neighbor had also been plagued with phone calls for Abraham Lincoln. Ameritech believes the Caller ID box was probably a pre-programmed demonstration model, although a telecommunications consultant suspected the work of a phone hacker.23
Does CNID blocking really give you anonymity? From the time of an upgrade on January 1 until January 26, 1997, the mechanisms that were supposed to block Calling Number ID failed in the 510 and 415 area codes. Numerous businesses with PBXs were able to obtain calling numbers despite presumed blocking.24
Telstar 401 catastrophic failure. On January 11, 1997, AT&T's Telstar 401 satellite went dead, with a full complement of both C and Ku band transponders. Technicians were unable to reestablish contact. The satellite normally carries both broadcast network and syndicated television programming. The networks, as "platinum" customers, were quickly switched to an alternative bird. Almost everyone else was scrambling to find transponder space for their programming. The risk? Don't assume that a satellite will always be there!25
SpaceCom technician disables millions of pagers. At the SpaceCom uplink facility in Tulsa, Oklahoma, an operator accidentally sent out a command shutting down the satellite receivers used by pager systems throughout the country, affecting millions of pagers. SpaceCom supports 5 of the largest 10 paging outfits. This happened at 1 a.m. on September 26, 1995, and each receiver had to be manually reprogrammed -- which took all day until most of the service could be restored. Apparently, the operator omitted a carriage return at the end of a line, which is sort of the inverse of intending to type rm *.log but accidentally fat-fingering the carriage return just after the asterisk.26
Playboy strikes again. TCI's cable-TV provider in Springfield, Missouri, was testing its planned inclusion of the Playboy Channel (to begin in February 1997), when the Cartoon Network Channel suddenly began airing the Playboy video along with the regularly programmed Flintstones' audio. The results were perhaps more noticeable than they might have been, because bad weather had closed the local schools and children were at home.27
There seems to be something magnetically RISKS-attractive about the Playboy Channel, which appeared unscrambled in the Palo Alto area. A city-wide power outage (see Section ) on August 13, 1996 fried the Palo Alto Cable Co-op circuit board that normally scrambles the Playboy Channel, despite surge protection. When power was restored, the Playboy Channel went out unscrambled. To make matters worse, Co-op's phone system had died when the standby batteries ran down.28
A Playboy Channel program (PC is a nicely overloaded acronym, because the Personal Computer program was presumably not Politically Correct!) had previously appeared in the Jeopardy time-slot in the Chicago area for 10 minutes, due to a screwup.29
Woodpeckers delay shuttle launch. Yellow-shafted flicker woodpeckers chipped away at the insulating foam on the space shuttle Discovery's external fuel tank, causing at least 71 holes, from half-inch to four inches in diameter, and delaying the scheduled launch.30
Ariane-5 problems. Following the failure of the main cryogenic motor during an attempted Ariane-5 launch on May 5, 1995, and the death of two technicians resulting from asphyxiation due to a nitrogen leak (in Cayenne, at the French Guiana Space Centre), another test on May 30, 1995, was aborted by the computer control system several seconds after ignition of the new European rocket.31
On June 4, 1996, another Ariane-5 exploded, due to faulty software in the inertial guidance system. Software from Ariane-4 had been reused in Ariane-5 without testing. When subjected to the higher accelerations produced by the Ariane-5 booster, the software (calibrated for an Ariane-4) ordered an "abrupt turn 30 seconds after liftoff", causing the airframe to fail. Apparently, conversion of a value from a 64-bit floating-point representation to a 16-bit signed integer representation caused an Operand Error.32
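The hazard in a narrowing conversion of this kind is that a value that was always in range on one vehicle may not be on another. The sketch below is an illustration in Python, not the flight code; the function names are hypothetical. It shows two explicit policies for a value that does not fit in 16 signed bits: reject it, or clamp it.

```python
import math

INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_checked(value: float) -> int:
    """Convert to the 16-bit signed range, raising if the value does not fit."""
    if math.isnan(value) or not (INT16_MIN <= value <= INT16_MAX):
        raise OverflowError(f"{value!r} does not fit in a 16-bit signed integer")
    return int(value)  # truncates toward zero


def to_int16_saturating(value: float) -> int:
    """Convert by clamping to the nearest representable 16-bit bound."""
    if math.isnan(value):
        raise ValueError("cannot convert NaN")
    return int(max(INT16_MIN, min(INT16_MAX, value)))


if __name__ == "__main__":
    for reading in (123.9, 40000.0):
        try:
            print(reading, "->", to_int16_checked(reading))
        except OverflowError as err:
            print("rejected:", err)
        print(reading, "saturates to", to_int16_saturating(reading))
```

Neither policy is right in general. The point is that the choice has to be made deliberately, and then tested against the range of values the new environment can produce.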
Final report on the SOHO spacecraft problems
We reported earlier on the NASA/European Space Agency Solar and
Heliospheric Observatory (SOHO) spacecraft on 24 Jun 1998 (R 19 87).
Nancy Leveson gave a preliminary analysis (R 19 90), followed by a
later note from Craig DeForest (R 19 94) summarizing the final report
of the Investigative Board, as follows. The proximal cause of the
loss was a mis-identification of a faulty gyroscope: two redundant
gyroscopes, one of which had been spun down(!), gave conflicting
signals about the spacecraft roll rate, and the ops team switched off
the functioning gyro. The spun-down gyro became SOHO's only
information about roll attitude, causing SOHO to spin itself up on the
roll axis until the pre-programmed pitch and yaw control laws became
unstable. This was the last in a series of glitches in the
operational timeline on 24 Jun; the full story is on-line
(http://umbra.nascom.nasa.gov/soho/SOHO_final_report.html).
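The gyroscope mix-up illustrates a general limit of two-way redundancy: two sources can reveal that they disagree, but not which one is wrong. The sketch below is a generic illustration with made-up readings and hypothetical function names, not SOHO's control logic. It contrasts pairwise comparison with median selection over three independent sources.

```python
def disagree(a: float, b: float, tolerance: float) -> bool:
    """With two sources, all we can learn is whether they differ."""
    return abs(a - b) > tolerance


def median_of_three(a: float, b: float, c: float) -> float:
    """With three sources, the median outvotes a single bad reading."""
    return sorted((a, b, c))[1]


if __name__ == "__main__":
    # Made-up roll-rate readings; the first source reads zero because
    # it has been spun down.
    spun_down, gyro_b, gyro_c = 0.0, 0.52, 0.50
    print("pair disagrees:", disagree(spun_down, gyro_b, tolerance=0.05))
    print("voted value:", median_of_three(spun_down, gyro_b, gyro_c))
```

Median selection masks only a single faulty source, and only if all three are actually running and independent.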
There were many other factors leading to the loss. The report reads like a roll call of well-known risky behaviors, including a staffing level too low for periods of intensive operations; lack of fully trained personnel due to staffing turnover; an overly ambitious operational schedule; individual procedure changes made without adequate systems level review; lack of validation and testing of the planned sequence of operations; failure to carefully consider discrepancies in available data; and emphasis on science return at the expense of spacecraft safety.
[Contact with SOHO was subsequently re-established, and - following thawing of the frozen hydrazine rocket fuel on board - full attitude control seems to have been restored, allowing recommissioning and testing of the spacecraft and instruments.]
Titan IV explodes with Vortex satellite; total cost over $1B
The Lockheed-Martin Titan IV that began self-destructing at 20,000 feet only 40 seconds after liftoff from Cape Canaveral carried a top-secret satellite (code-named Vortex) for the U.S. National Reconnaissance Office. It was destroyed on ground command two seconds later. The Air Force gave no information on the cause. This was the final launch for this Titan IV model; future launches are already scheduled to use an improved model. [Source: Reuters item, 13 Aug 1998; PGN Abstracting]
Only two failures out of 25 launches is reportedly thought to be a reasonably good record, although this loss is expensive - $300M for the Titan, and between $800M and $1B for the satellite. Associated Press noted that a previous Titan IV failure occurred from Vandenberg AFB in August 1993. (There was also a Titan IV motor that blew up on the test stand on 1 April 1991 (R 12 09), as a result of a problem that seemingly could have been caught in simulation.) Further commentary in (R 19 93).
More satellite woes: Ikonos 1 lost, Titan 4B puts Milstar in worthless orbit; Delta III does same for Orion (PGN)
In 1994, the U.S. Government authorized Space Imaging to launch a private imaging satellite, for beneficial public uses. Ikonos 1 was finally launched on 27 Apr 1999, but contact was mysteriously lost 8 minutes later (R 20 36). No further details have emerged.
A $433M Titan 4B rocket launched on 30 Apr 1999 apparently triggered separation of the payload four hours early, and placed an $800M Milstar satellite in a low elliptical orbit rather than a geostationary one (R 20 36). The blame was placed on Lockheed Martin engineers loading faulty software (R 20 39). This was the third Titan failure in a row - following the Titan 4A with a Vortex satellite last August 1998 in a mission with comparable costs (R 19 91), and a missile warning satellite on 9 Apr 1999 stuck in a useless orbit.
Then, on 4 May 1999, a Boeing Delta III rocket launch dumped Loral's Orion intended geostationary communications satellite in an elliptical orbit with a max of 862 miles. A previous launch try two weeks before had gone to the countdown of zero, but a software flaw prevented ignition (R 19 38). The first Delta III launch ended after 71 seconds when the rocket exploded because of a software flaw that caused the hydraulic fuel to be expended prematurely.
Russian rocket blows 12 Globalstar satellites
Globalstar (42% owned by Loral Space and Communications) used a Yuzhnoye (Ukraine) rocket for the 10 Sep 1998 launch from Baikonur (Kazakhstan) of 12 Globalstar satellites intended to be part of a world-wide wireless phone network. Two separate computer faults 4.5 minutes after launch reportedly resulted in the complete loss of the rocket and the satellites. [Source: Dan Fost, San Francisco Chronicle, 11 Sept 1998, A1 (R 19 95)]
Missing bounds check? Off-by-one error? Hardware? All your eggs in one basket? Not really. Globalstar is shooting for 52 low-orbit satellites. Cheaper by the dozen? This one cost $270M for the satellites ($190M expected to be covered by insurance!), and about $100M for the rocket.
Peter Ladkin (R 19 97) discussed reports that the malfunction resulted in the failure of the Zenit booster. The Energomash second-stage booster was shut down prematurely. Apparently, two of the three primary flight-control computers shut down.
Cruise Missile software bugs. During the Iraqi war, bomb damage assessment of the initial cruise-missile strike indicated that three of the 10 targets attacked by 13 Air Force CALCMs (Conventional Air-Launched Cruise Missiles) emerged with `no detectable damage.' The Boeing CALCMs (earlier intended as nuclear weapons) had been adapted for being launched from B-52H bombers over the Persian Gulf, but without making software changes necessary for the new uses.33
Accidental missile launch: color-code mixup. (R 18 40) The Canadian Navy mistakenly launched an unarmed missile at a town near Victoria, B.C. on August 28, 1996, hitting a residential garage and narrowly missing a food store and day-care center. Sailors were testing weapons systems aboard the HMCS Regina at 11 a.m. when the missile was fired at the town of View Royal on Vancouver Island. Apparently, an unarmed live missile had been substituted for the intended dummy, because of a mixup relating to the color-coding of the missiles. While the test called for a green "inert test set," which contains no propellant and therefore could not launch, a blue "inert practice round" was mistakenly used. The military has since suspended all such testing on both coasts and ordered an inquiry. Although nobody was injured, residents of the bedroom community of 6,000 people say things could have been much worse. Thirty-two children were a half-block away at the Tiny Tots Day Care Centre when the incident occurred.34
Navy software problems (Michael Stutz via Jim Horning)
If you think Windows 98 is an upgrade nightmare, consider the task of adding a new combat system to a Navy cruiser. Last week the US Navy acknowledged that two prized battle cruisers (the USS Hue City and the USS Vicksburg) will be out of commission until further notice as engineers try to integrate new onboard weapons-control systems. "Microsoft comes out with upgrades every three years, and they crash all the time," said one Navy source, who spoke on condition of anonymity. "The Navy comes out with upgrades every five years, but we can't afford for our systems to have any glitches, so we have to make sure that we get it just right."
The heart of the problem lies with two new systems being built into the ships. The Aegis Baseline 6 system helps defend the vessels against air attacks, and the Cooperative Engagement Capability (CEC) system gathers and shares radar data from multiple ships. Engineers are having trouble getting the new systems to work with each other and with the ships' legacy software.
[Aegis is written in Ada and C++ and other languages, with the latest
upgrade reaching 8M lines of code, up from 3M. Installation is taking much
longer than expected. The problems are largely in integration and
interoperation, including a new display system, and are compounded by the
Navy not having source code. PGN Abstracting from "Navy Software Dead in
the Water" by Michael Stutz, 16 Jul 1998
<http://www.wired.com/news/news/technology/story/13758.html>]
USS Yorktown dead in water after divide by zero (R 19 88)
The Navy's Smart Ship technology is being considered a success, because it has resulted in reductions in manpower, workloads, maintenance and costs for sailors aboard the Aegis missile cruiser USS Yorktown. However, in September 1997, the Yorktown suffered a systems failure during maneuvers off the coast of Cape Charles, VA., apparently as a result of the failure to prevent a divide by zero in a Windows NT application. The zero seems to have been an erroneous data item that was entered manually. Atlantic Fleet officials said the ship was dead in the water for about 2 hours and 45 minutes. A previous loss of propulsion occurred on 2 May 1997, also due to software. Other system collapses were also indicated. (One quote suggested the ship had to be towed, but another refuted that.) [Source: Gregory Slabodkin, Software glitches leave Navy Smart Ship dead in the water, Government Computer News, 13 Jul 1998, PGN Stark Abstracting] Discussion in RISKS included further comments about Windows memory management, the use of NT, smart-ship technology, and COTS in battle-critical applications (R 19 88-92); doubts about official reports (R 19 91) and confusions therein (R 19 94), as well as speculations on the hardware behavior (R 19 92-93), and still more discussion (R 19 94). This case holds many lessons for the future, in the true spirit of RISKS, including a reminder from the 19th Century British Navy (R 19 89).
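A manually entered zero reaching a division is a classic case for validating input at the point of entry. The following minimal sketch assumes a hypothetical text field feeding a ratio calculation; the function and field names are invented, and nothing here is taken from the Yorktown's software.

```python
import math


def parse_nonzero(text: str, field_name: str) -> float:
    """Parse operator input, rejecting anything unsafe to divide by."""
    try:
        value = float(text)
    except ValueError:
        raise ValueError(f"{field_name}: {text!r} is not a number") from None
    if not math.isfinite(value) or value == 0.0:
        raise ValueError(f"{field_name}: must be a finite, nonzero number")
    return value


def per_unit(total: float, divisor_text: str) -> float:
    """Divide only after the manually entered divisor has been validated."""
    divisor = parse_nonzero(divisor_text, "divisor")
    return total / divisor


if __name__ == "__main__":
    print(per_unit(10.0, "4"))
    try:
        per_unit(10.0, "0")
    except ValueError as err:
        print("rejected:", err)
```

Rejecting the entry at the boundary keeps one bad data item from propagating into every later calculation that depends on it.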
Revisiting the USS Yorktown dead in the water (Mike Martin, R 20 37)
The March 1999 Scientific American included a letter from Harvey McKelvey, former director of Navy programs for CAE Electronics, the firm which apparently built the misbehaving Windows NT application on the Yorktown (R 19 88 ff.), whose failure was widely attributed to an unchecked divide by zero. [PGN-ed]
McKelvey writes that the failure "was not the result of any system software or design deficiency but rather a decision to allow the ship to manipulate the software to stimulate [sic] machinery casualties for training purposes and the `tuning' of propulsion machinery operating parameters. In the usual shipboard installation, this capability is not allowed." McKelvey adds that CAE Electronics expressed "serious concern" when this test was proposed.
So it seems that as long as there are no "machinery casualties", everything will be fine. Then again, the incident may have provided useful information to improve system robustness. (Mike Martin)
Chinook helicopter engine software implicated? (Mike Ellims) (R 19 51)
In 1994, a Chinook helicopter crashed into hills on an island off the coast of Scotland, killing 29 people. At the time the engine control software was absolved of blame, although problems with it were known to exist. The Minister of Defense was quoted as saying of the software that 485 observations were made but none was considered safety-critical.
In recent weeks Channel 4 in Britain raised the question of whether or not there were actually serious problems with the software, via a leaked report from EDS-Scicon. This report listed 56 category-1 errors (most serious), which indicate either a coding error or non-compliance with documentation. A further 193 errors were listed as category-2 errors, which relate to the quality of the code. It was further alleged on Channel 4 that the RAF test pilots who develop operation procedures etc. for new aircraft refused to fly the helicopter. The aircraft was introduced into operational service, but with restrictions on load that do not apply to the Mark-1 version. The official line is that there is no shred of evidence to suggest that anything other than pilot negligence caused the crash. However, there is some possibility that another investigation into the crash may occur.
Stansfield Turner's new book includes near-war risk (R 19 43)
Admiral Stansfield Turner's book, Caging the Nuclear Genie, describes an incident that occurred on 3 June 1980 when he was President Carter's CIA director. Colonel William Odom alerted Zbigniew Brzezinski at 2:26 a.m. that the warning system was predicting a 220-missile nuclear attack on the U.S. It was revised shortly thereafter to be an all-out attack of 2200 missiles. Just before Brzezinski was about to wake up the President, it was learned that the "attack" was an illusion - which Turner says was caused by "a computer error in the system." His book makes various suggestions that would greatly reduce the threats of accidental nuclear war. "We have had thousands of false alarms of impending missile attacks on the United States, and a few could have spun out of control." [Source: Keay Davidson, San Francisco Examiner, in the San Francisco Sunday Examiner and Chronicle, 19 Oct 1997, p. A-17.]
Missile passes American Airlines Flight 1170 over Wallops Island. At 1:45pm on August 29, 1996, American Airlines flight 1170 was flying over Wallops Island, Virginia, en route from San Juan to Boston, when the captain reported "a missile off the right wing". The location is close to the Wallops Flight Facility (Section ), with nearby Navy installations at Norfolk and Lexington Park. 35
F-16 risky incidents involving TCAS. Several incidents involved F-16s and commercial airliners with Traffic Alert and Collision Avoidance System (TCAS).
1. On February 5, 1997, two Air Force F-16s closed on a Nation's Air Boeing 727 passenger jet heading for New York's JFK Airport. A TCAS alarm caused the 727 pilot to take evasive action, flooring three passengers and crew members. This occurred in a fairly large restricted area through which the 727 had been cleared to fly. One of the F-16 pilots had earlier identified the 727 as a passenger plane, but continued to chase it "as an intruder into his airspace". The instructor pilot told his trainee pilot to stay out of the way "till this, uh, bozo gets out of the airspace." He was eventually ordered to stop the chase, but "the command may have been delayed because the fighter pilot was on the wrong frequency" (according to the Air Force report).
2. On February 7, 1997, four Air National Guard F-16s from Andrews Air Force Base passed an American Eagle commuter plane bound from Raleigh to NY. Three of the F-16s were above the commuter plane, one below. A TCAS alarm caused the American Eagle pilot to take evasive action.
3. On the same day, two Air Force F-16s entered the safety zone around an American Airlines jet over Palacios, Texas.
4. On the same day, two Air Force F-16s entered the safety zone around a Northwest Airlines jet over Clovis, New Mexico.
The Air Force insists that none of these cases was a close call (that is, with less than 500 feet separation), and that such close encounters have happened routinely in the past without causing concern -- before the advent of TCAS. So, we can chalk this up either as an indication that TCAS works (albeit too well?), or as a failure of the Air Force to understand the risks of false alarms in someone else's safety system!36
Another TCAS incident. An erroneous command in TCAS nearly resulted in a midair collision on June 4, 1995, involving a United 737 and a Viscount Air Services 737, both on approach. After both aircraft received a TCAS warning, the United airplane began to climb from 10,000 feet to 12,000 feet while the Viscount plane started to descend to 10,000 feet. The two aircraft came within 200 feet of each other before controllers instructed the Viscount flight to return to 11,000 feet. The incident occurred in the "Northwestern portion of the U.S."37
***APPEND TO A FLY-BY-WIRE ITEM:*** 38
NY Air Route Traffic Control Center computer failure. The NY ARTCC computer lost significant service capability twice on the evening of May 20, 1996 -- the first time for 23 minutes, and the second time for about an hour, one hour later. The FAA had installed new software four days earlier.39
Power outage downs Pacific Northwest air traffic.
A technician accidentally pulled the wrong circuit board, cutting off all
power to an air-traffic control center for five minutes on January 6, 1996.
150 flights were delayed for more than an hour, throughout the Pacific
Northwest. Controllers used car cell-phones to communicate with pilots via
other air-traffic control centers. Backup power failed because the damaged
unit also controlled switchover.40
More ATC problems, fall 1998:
New air-traffic control radar systems fail, losing aircraft at O'Hare
(R 20 07); Dallas-Fort Worth ARTS 6.05 TRACON gives ghost planes,
loses planes (one for 10 miles), one plane on screen at 10,000 feet
handed off and showing up at 3,900 feet! 200 controller complaints
ignored, system finally backed off to 6.04 (R 20 07); near-collision
off Long Island attributed to failure at Nashua NH control center
(R 20 11); TCAS system failures for near-collision over Albany NY
(R 20 11); two more TCAS-related incidents reported (R 20 12);
landing-takeoff near-miss on runway at LaGuardia in NY (R 20 13);
discussion on trustworthiness of TCAS by Andres Zellweger, former
FAA Advanced Automation head (R 20 13)
Dulles radar fails for half-hour 23 Nov 1998 (R 20 10);
discussion of air-traffic control safety implications (R 20 11),
and ensuing comments from a controller (R 20 12)
Computer glitches foul up flights at Chicago airports (Keith Rhodes)
The Chicago area TRACON in Elgin was testing new software on 5 May 1999 that displays aircraft sizewise. As a result of problems, there were serious traffic problems at O'Hare and Midway. Even after fixes were made, delays continued. United cancelled 25% of its afternoon flights, American 13%. [Source: Associated Press, 5 May 1999; PGN-ed; R 20 38]
Brief KC power outage triggers national air-traffic snarl (R 19 51)
Power went out at Kansas City's Olathe Air Route Traffic Control Center at 9:03 a.m. CST on 18 Dec 1997, resulting in a "brief and supposedly impossible power failure" [1]. A technician routed power through half of the redundant "uninterruptible" power system, preparatory to performing annual preventive maintenance on the other half. Unfortunately, he apparently pulled the wrong circuit board, and took down the remaining half as well. The maintenance procedure also bypassed the standby generators and emergency batteries. The resulting outage took out radio communications with aircraft, radar information, and phone lines to other control centers. Power was out for only 4 minutes, communications were restored shortly thereafter, and backup radar was working by 9:20 a.m. However, at least 300 planes were in the Olathe-controlled airspace at the time, and the effects piled up nationwide. Hundreds of flights were cancelled, diverted, or delayed. There were delays of up to 2 hours, and delays continued into the evening. [Sources: 1. Matthew L. Wald, The New York Times, 19 Dec 1997; 2. Kansas City Star, 19 Dec 1997.]
The Times article noted that this is the latest in an "improbable series of problems". The NY Terminal Radar Approach Control (TRACON) was shut down almost completely on 15 Oct 1997, because of dust from ceiling tiles, and a similar situation occurred at the Jacksonville center. The TRACONs at Dulles and O'Hare were closed when fumes invaded the ventilation systems. A response from Bill Murray (R 19 52) suggested these events are not improbable at all.
Review of air-traffic control outages (Peter B. Ladkin)
Outages (complete failures, as distinct from degradation of service due to partial failures) of air traffic control computer systems, particularly those at the U.S. Air Route Traffic Control Centers (ARTCCs), have been a subject of continuing interest in RISKS (a keyword search on the archive showed well over a hundred references, many of which refer to partial failures or outages).
The U.S. National Transportation Safety Board (NTSB) prepared a report in January 1996 (NTSB/SIR-96-01) on ATC system outages, dealing with incidents between September 12 1994 and September 12 1995 and assessing the FAA's modernisation program. There is a significant `legacy' problem with some of the systems, and the scope of the FAA's Advanced Automation System (AAS), which has been in development since 1981, was significantly revised downwards when the contract for the `be all to end all' system was cancelled by the FAA in mid-1994 because of continual schedule slippage and cost overruns. The NTSB report discusses the architecture of the display systems in the ARTCCs, the nature of the outages (4 power failures, 7 computer problems), and the FAA's upgrade plans (which crudely amount to replacement of legacy systems in an evolutionary manner, rather than a redesign). In the 11 incidents, only one operational error (loss of separation) was reported, although all involved degradation of service (i.e. delays) ranging from the trivial (1) to the extreme (485). The report also notes that many controllers do not appear to be aware of the full range of functions still available to them during partial degradation. The board concludes that the system remains `very safe', even though the failures have a significant economic impact, but is concerned about the safety implications of the increasing number of failures of the older equipment.
The AAS is considered a `high-risk program' by the U.S. General Accounting Office (GAO), which has produced a series of reports, the latest from 1997 being on the WWW. A `high-risk program' is one at `high risk for waste, fraud, abuse and mismanagement' (!).
The NTSB report is now available on the WWW in the compendium `Computer-Related Incidents with Commercial Aircraft', which also contains links to the GAO reports: http://www.rvs.uni-bielefeld.de, click on `Computer-Related Incidents' then `U.S. Air Traffic Control Center Outages and the Advanced Automation System'.
[Other recent additions to the compendium include the Rapport Preliminaire of the French DGA on the A330 test flight accident in Toulouse (Mellor, RISKS-16.19; Jackson, Ladkin, 16.22, Ladkin, 16.23; Hollnagel, 16.31; Ladkin, 16.39); the final report on the Lauda Air B767 accident (Leyland, 11.78; Grodberg, Kopetz, Morris, Philipson, 11.82; Neumann, 11.84; Mellor, 11.95; Leveson, 12.16; Leveson, 12.69); the 1985 China Air B747 accident (Trei, 3.79); and the 1983 Eastern Airlines L1011 Common Mode Failure incident (not itself computer-related, but I believe relevant for understanding common mode failures resulting from imperfect maintenance, as in the 1996 Aeroperu B757 accident, Ladkin, 18.51; Neumann, 18.57; Ladkin, 18.59).]
I am very grateful to Hiroshi Sogame of the Safety Promotion Committee of All-Nippon Airways for his public service in preparing various of these and other reports for the compendium.
Boeing 777 alarms triggered by fruit/frog cargo. False alarms on the Boeing 777 have been triggered by unusual humidity and temperature conditions in cargo holds. For example, a London-bound Emirates aircraft was diverted to Cyprus, due to heavy-breathing mangos, and a Cathay aircraft was evacuated and the fire-suppression system activated -- due to a combination of fruit and frogs. Apparently, tropical fruit (and especially durian fruit) generates enough humidity to be detected as smoke -- thereby triggering the alarms.41
Aviation risks using Windows NT avionics systems (Peter B. Ladkin) (R 19 46)
An article `Windows added to cockpit choices' in Flight International, 5-11 November 1997, p 25, explains that the US company Avidyne has certificated an avionics system based on Windows NT. The hardware supplier is Electronic Designs, who has recently received approval from the FAA (approval for what is not specified). Avidyne is apparently working on Level-C approval, which will allow use of its moving-map display for IFR navigation. One of the benefits is said to be the wide range of interfaces available to other devices.
This is for general aviation. The first Supplemental Type Certificate (required FAA documentation for installation) is for a Mooney piston single.
One major drawback could arise from the hardware. As noted here earlier, the Pentium and Pentium MMX chips may be halted by execution of a single instruction in any mode, independent of any memory protection in the operating system. This instruction (in machine language) is F0 0F C7 C8 in hexadecimal.
If Electronic Designs' box is Pentium-based, the FAA could therefore shortly be asked to certificate a design for IFR flight that can be halted in mid-use. Unavoidably. By a few lines of software that are trivial to write. I would hope I am not alone in feeling very uncomfortable about the precedent this might set for acceptance procedures for COTS products in safety-related environments.
This is a static bug, so programs are already available (see RISKS-19.45 for one) which sweep through your software to determine if this instruction is somewhere therein. But I wonder if the FAA will insist that Avidyne install such programs and make it a required part of the use of the equipment that this program is run as part of the pre-flight check before flight under IFR? However, even this does not guard against programs which dynamically generate this instruction.
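As a rough illustration of what such a static sweep involves, here is a minimal sketch that searches a byte image for the sequence F0 0F C7 C8 quoted above. It is not the program mentioned in RISKS-19.45; the names `HALT_SEQUENCE` and `find_sequence` are hypothetical. A plain byte search cannot tell instructions from data that happen to contain the same bytes, so a hit is only a prompt for closer inspection, and (as noted above) it says nothing about code generated at run time.

```python
# Byte sequence quoted in the text, written as raw bytes.
HALT_SEQUENCE = bytes.fromhex("F00FC7C8")


def find_sequence(image: bytes, pattern: bytes = HALT_SEQUENCE) -> list[int]:
    """Return every offset at which `pattern` occurs in `image`."""
    offsets = []
    start = image.find(pattern)
    while start != -1:
        offsets.append(start)
        # Resume one byte later so that overlapping matches are not skipped.
        start = image.find(pattern, start + 1)
    return offsets
```

In use, the image would be the contents of an executable read in binary mode, for example `find_sequence(open(path, "rb").read())`, where `path` is whatever file is being checked.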
For the history of a dynamically-generated instruction that halted the Shuttle flight-control software in 1981, recounted at length in the Communications of the ACM 27(9), September 1984, pp.874-900, see our compendium `Computer-Related Incidents with Commercial Aircraft' (http://www.rvs.uni-bielefeld.de).
Another air-traffic-controller spoofer. Someone using a hand-held, battery-operated transmitter gave out false information to aircraft landing at Manchester airport in the U.K.42
Radar blip lost Air Force One (Doneel Edelson)
The Federal Aviation Administration is investigating whether an air-traffic tracking system went out amid reports that Air Force One vanished from radar screens for 24 seconds. Broadcast reports said the airplane disappeared from radar screens on the morning of 10 Mar 1998 while President Clinton traveled to Connecticut. The long-range radar system at the center reportedly has a history of momentary blips. [USA Today, 11 Mar 1998]
Air-traffic control upgrade problems. The Northeast Air Traffic Control Center in Nashua, New Hampshire, reverted to the old voice-and-paper-slip backup system for 37 minutes on 19 Aug 1998, because of a computer failure. 350 planes were being handled at the time. The system also failed again the next day. Over 100 system failures have been reported already this year at that center. William Johannes, president of the National Air Traffic Controllers Association, said, "It's like a Chevy with 485,000 miles on it and you are trying to stretch it. The longer it goes, the more times we are going to have failures." The mainframes ("aging equipment") are supposed to be replaced beginning in 1999, with a new display system expected in 2000. [Source: David Tirrell-Wysocki, Computer crash cripples New Hampshire air traffic controllers, Associated Press, 21 Aug 1998] (Are we hoping that the Y2K impact on the ATC system will last only 37 minutes?)
The attempted upgrade to ARTS 6.05 at the Dallas-Fort Worth Air Route Traffic Control Center reportedly had ghost planes appearing on screens, and real planes missing from screens. Eventually, the FAA admitted there were problems and backed off to the previous version, ARTS 6.04 (R 20 07). But this is the same software in use experimentally in Chicago and several other heavy-traffic centers. The new software is supposed to solve the Y2K compatibility problem, as well as allow double-stacking of planes flying into Chicago's O'Hare (R 20 07). But controller complaints of malfunctions in the Airport Surveillance Radar-9 system forced Chicago to back up to the earlier version (which is non-Y2K-compliant). After the outage in the new system, a backup system was activated - but it had a 20-mile blind spot to the north. [Source: Gilber Jiminez, Chicago Sun-Times, 14 Nov 1998.] The new software was still being used in New York, Denver, and Southern California when we went to press. Of course, the standard statement is that "everything is perfectly safe" - although the increased stress on controllers should not be ignored. FAA Administrator Jane Garvey says not to worry, and she will fly cross-country on New Year's Day 2000. Several Internet wags suggested that will present no problems at all - because her plane may be the only one in the air at the time!
By the way, the Salt Lake ATC center lost both primary and backup radar for about a minute on 4 Nov 1998, with the blackout affecting 200 planes in the air over Utah, Nevada, Idaho, Montana, and Wyoming (R 20 05).
Western states ATC glitches. Radio communications between pilots and air-traffic controllers vanished for one minute on August 11, 1995 (until the backup system could be engaged), over a 200,000 square-mile area including all of Washington state and parts of Oregon, California, Nevada, Montana, and Idaho. The problem resulted from a software glitch in a 2-month-old $1.4 billion computer system at the regional center in Auburn, Washington. "The FAA says the new system, which replaces one dating from the 1950s, is more reliable and flexible, safer, easier to repair and provides better voice quality when controllers talk to pilots."43
More air-traffic control problems. Further air-traffic-control snafus occurred in Chicago, Miami, Washington DC, Dallas-Fort Worth, Cleveland, New York, Pittsburgh, and Oakland, California, in a very short period in the summer of 1995. These cases are documented in the RISKS archives. There were three outages in the Chicago center in one week in July 1995. In one of two Oakland outages, on August 9, 1995, the ATC system lost all radar and radio contact with airborne planes, during maintenance. In Miami on August 12, 1995, lightning knocked out the main power and the backup for more than an hour. Chicago failed again on September 12, 1995, and Oakland twice more on September 13, 1995, when a microwave link failed. Pittsburgh briefly lost radio and radar contact on September 23, 1995.44 The main systems are in many cases over 30 years old, and the backups even older.
Aeroperu 757 crash. The fatal crash of Aeroperu Flight 603 was blamed on the fact that masking tape used in maintenance had not been removed from the left-side static port sensors.45
Korean Airlines KAL 801 accident in Guam
The Guam KAL 801 crash killed 225 of 254 on board. A bug was uncovered in upgraded software that had existed worldwide (R 19 29), relating to incorrect barometric altimetry in the Ground Proximity Warning System (GPWS). See a detailed analysis by Peter B. Ladkin and other discussion (R 19 37-38).
Dominican Republic 757 crash. Investigators are examining the February 6, 1996, Boeing 757 flight that ended in the ocean, killing all 189 people aboard. Early reports suggest that the disaster may have been due to a faulty airspeed indicator that misled pilots, leading them to believe that their speed was adequate when they were flying at 7000 feet.46
Airbus autopilot failure? (Chuck Weinstock)
On 19 Apr 1999, an Air India Airbus 320 en route from Singapore to Bombay via New Delhi apparently had an autopilot failure at 27,000 feet, resulting in a dive that injured three crew members (two seriously) and an infant. The pilot was able to regain control, and manually flew the jet to Bombay. [Source: AFP, 19 Apr 1999]
Another London train crash. A London commuter train carrying about 400 passengers from Euston Station crashed into an empty train heading into Euston Station, killing one passenger and injuring about 100 near Watford Junction in Hertfordshire, 20 miles north of London, in the afternoon rush-hour on August 8, 1996. Signaling and train systems apparently worked properly.47
Stack overflow shut down new Altona switch tower on its first day. Klaus Brunnstein reported that on Sunday evening, March 12, 1995, the Bundesbahn (German Railway) attempted to replace its old railway switch tower at the heavily used Hamburg-Altona station, installing a fully computerized system from Siemens' railway technology branch. However, the central computer failed immediately. Two days later, Siemens' experts identified a stack-overflow condition that resulted in a dead loop, and the bug was finally fixed by Wednesday morning. Nevertheless, because the switchmen were not accustomed to the new system, there was still only restricted traffic days later. Apparently, the programmers had assumed that the stack-overflow routine would never be used!48
S-Bahn stopped by new switching software. Debora Weber-Wulff reported that in October 1996, the Berlin S-Bahn installed new light-rail switching software on the same weekend that the light rail was moved back from the regular train track to its own tracks -- which had been under repair. The tracks were cut off all weekend, with buses attempting to move passengers. The software was installed at a central switching board, so that the transportation company can save the money they would otherwise pay people to manually move the switches. The software kicked in, and all went well until rush hour -- when a stack overflow occurred, as in Hamburg! (Siemens also wrote this software -- perhaps it was the same code?) It took hours to get the system back up.49
New York City subway crash. A New York City subway train crashed into the rear end of another train on the Williamsburg Bridge on June 5, 1995. The motorman apparently ran through a red light. The safety system did apply emergency brakes, as expected. However, the safety parameters and signal spacing were set in 1918, when trains were shorter, lighter, and slower, and the emergency brake system could not stop the train in time.50
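The reasoning behind this item can be made concrete with the constant-deceleration stopping formula d = v^2 / (2a): doubling the speed quadruples the distance needed, so signal spacing chosen for slower trains can fall short. The sketch below is illustrative only; the function names and every numeric value used with it are hypothetical and are not taken from the accident report.

```python
def stopping_distance(speed_m_s: float, decel_m_s2: float) -> float:
    """Distance covered while braking to rest at constant deceleration: v^2 / (2a)."""
    if decel_m_s2 <= 0:
        raise ValueError("deceleration must be positive")
    return speed_m_s ** 2 / (2 * decel_m_s2)


def can_stop_within(speed_m_s: float, decel_m_s2: float, available_m: float) -> bool:
    """True if an emergency brake application stops the train in the available distance."""
    return stopping_distance(speed_m_s, decel_m_s2) <= available_m
```

With a hypothetical 100 m between the signal and the train ahead and a hypothetical deceleration of 1.0 m/s^2, a train at 10 m/s needs 50 m and stops in time, whereas a train at 15 m/s needs 112.5 m and does not. A heavier train that brakes less effectively (a smaller `decel_m_s2`) makes the shortfall worse.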
Amtrak mainline train collision in Maryland. A train wreck occurred in February 1996, when an Amtrak train leaving the Washington area was switched around a stopped freight train; it then had a head-on collision with an inbound MARC train that had failed to slow for a warning signal and was going twice its expected speed. The warning signal for the inbound train had previously been moved to a position before the station stop, from its earlier position after the station.51
Amtrak ticket system breaks down. On Friday, November 29, 1996, Amtrak's nationwide reservation and ticketing system was rendered almost useless by a breakdown in the network, during the Thanksgiving weekend -- usually the heaviest travel weekend of the year. The outage caused enormous confusion and delays, because agents typically had no printed schedules and fare tables, and had to issue tickets by hand!52
Washington D.C. Metro crash kills operator. In the monster Washington D.C. snowstorm in early February 1996, a Metro train operator was killed when his train ran into the back of a parked train at the Shady Grove station, while he was taking the train out of service. There was considerable early confusion about whether the train was running on automatic, whether the operator had requested cutover to manual control, and whether that request had been denied. Apparently the request was made and denied, on the grounds of conforming to standard practice. That standard practice has now been changed.53
MARTA train jumps track. On June 1, 1996, a commuter train operated by the Metro Atlanta Regional Transit Authority (MARTA) had a car leave the track, causing injuries to 19 people and much embarrassment for the "Official Spectator Transportation System" for the Olympic games. According to local TV news and newspaper reports, the train had stopped before a red signal, apparently on automatic control. The operator called dispatch, requesting permission to go to manual. Permission was granted, and the operator proceeded through the red signal -- setting off alarms. The train was stopped and put into reverse. As one of the middle cars passed over a crossover switch some or all of its wheels were lifted and displaced. The train stopped very suddenly, tossing the operator and 18 passengers from their seats.54
Trains fail to trigger computerized crossing gates. The Long Island Rail Road tested three level crossings after a train passed one of them and its driver had noticed that the gates did not operate. These three crossings in Sayville all use the same computer system and are the only such systems on the LIRR. The failure proved to be reproducible at two out of the three.55
Union Pacific rolling (?) stock (Daniel P. B. Smith)
Following Union Pacific's assimilation of Southern Pacific, to form the nation's largest railroad, UP has been unable to accurately track its freight cars, resulting in gridlocks and lost trains - most visibly in the southern corridor from LA to Texas, the Gulf Coast region, and the central corridor from Oakland to Chicago. There are major bottlenecks in LA, North Platte, Chicago, and Houston. Integrating the computer systems was reportedly "more difficult than anticipated."
There are many horror stories, including a load of liquid gas that had "virtually evaporated into thin air by the time it arrived;" a load of plastic resin that took 51 days to ship from Dallas to Fort Worth; and a shipment from Memphis to California that went by way of Little Rock, then Memphis, then Little Rock, then Memphis, then Little Rock, then El Paso... Mr. Lundgren of Englin Cotton Oil Mill reported watching one of his own freight cars on UP tracks barreling past his office. "A few days later, he saw it pass again in the opposite direction." [Culled by PGN from Daniel's submitted item by Anna Wilde Mathes and Daniel Machalaba, Wall Street Journal, Monday, 13 Oct 1997, p. B1, and another detailed item by Carl Nolte and Kenneth Howe, San Francisco Chronicle, 11 Oct 1997, D1. Massive grain backlogs and storage problems were also noted (R 19 43).]
Computer crash impacts Washington DC Metro (Epstein Family) (R 19 50)
According to The Washington Post (17 Dec 1997), a computer failure caused 20-minute delays on Metro's Red Line. "The problem occurred when workers in Metro's downtown central control room tried to add an accessory to the main computer that monitors trains' positions. The computer crashed and came back on line only when the accessory was detached, Metro officials said." No indication what the "accessory" was or why it caused a crash.
Washington D.C. Metro stops payments on troubled computer (Scott Lucero) (R 19 71)
The Washington Post (29 April 1998) reported that Washington DC's Metrorail stopped payment on a system that pinpoints the position and operation of every train in the 92-mile system and controls 470 switches and 500 signals. Metro officials say that the system has crashed 50 times in the last 15 months. Screens go black, images jiggle, duplicate train numbers and slow response occur frequently according to officials. According to the Metro General Manager, "First we couldn't get the source code from [the contractor]. Then when we got it, it was in foreign language because they had a contractor work on it overseas... They've had people come and go. There has not been total continuity." A familiar RISK, not having developers close to the system. I used to think that not having the escalators work was a big deal - it appears they've got bigger problems.
Runaway train on Capitol Hill (Thomas A. Russ) (R 20 13)
There is a runaway train on Capitol Hill. The automatic brakes on the Senate subway between the Russell Office Building and the Capitol failed in December 1998, sending the train crashing into a wall and slightly injuring the operator and the two other people on board. In the best congressional spirit, a spokesman for the architect stressed that "there was no operator fault involved. It's all automatic, and it's supposed to stop by itself." [Source: Los Angeles Times, 16 Dec 1998. Especially intriguing are the spokesman's comments. There is also the nagging question of why there is an operator on a fully automated system in the first place. TAR]
Taipei subway computer crash. Taipei's only subway-line service was completely disrupted on June 3, 1996, due to the simultaneous shutdown of both the main computer and the backup system. At 9:27 a.m., the main control computer suddenly printed out 14 pages of extraneous program code. Eight minutes later, both the main control computer and the backup system went down. The control center ordered an emergency shutdown of the entire system (without incident). Maintenance engineers, with the help of a Matra engineer, were unable to reboot either system. Digital engineers arrived shortly and discovered that one of the rebooting programs was missing. They reloaded the rebooting program from backup media, and the subway system returned to normal after four hours and thirty-four minutes. (Incidentally, Matra also made the software for Ariane-5, whose crash the very next day is noted in Section .)56
BART ghost train, software crash, system delays; old cable. Bay Area Rapid Transit (BART) had another bad day on December 19, 1996. At 7 a.m., a ghost train appeared in the computer system at the San Francisco 24th Street station, requiring manual operation through that station. Independently, three trains had to be taken out of service because of mechanical problems. All of this caused a 15-minute delay systemwide. Later, a computer crash caused delays up to 30 minutes systemwide, from 5:50 p.m. to 9:45 p.m.
BART also had a serious power cable outage in the transbay tunnel on December 12, 1996. That cable problem was traced to sloppy maintenance after the cable was damaged when it was initially installed in the early 1970s. BART management observed that prior to that outage a complete cable overhaul had been considered to be an urgent step in upgrading the aging infrastructure.57
Channel Tunnel Syndrome: unexpected ghost trains and emergency stops. On June 8, 1994, a train traveling through the Channel Tunnel from England to France was evacuated after an emergency light came on in the driver's cabin. The drivers of the 10 lorries on the train were evacuated to the English end of the Tunnel, through the access tunnel. This was the first would-be emergency on the Chunnel (which officially opened in May 1994), although it turned out to be a false alarm.58
Apparently, unanticipated high levels of sea water on the tracks in the Chunnel triggered alarms to drivers and train controllers, forcing an emergency stop and manual inspection -- typically causing a 20-minute delay. In April 1995, there were on average about 5 emergency stops a week, out of 100 trains a day. A train going through the Chunnel at 100 mph raises a mist of salt water behind it, which short-circuits a low-voltage connection between the rails and mimics the appearance of a train on the tracks. It appears that engineers underestimated the effect of sea water, an excellent conductor of electricity, on trackside electronic equipment. John Wodehouse suggests they may also have underestimated the corrosive effects of salt water.59
Phantom trains down Miami's Metromover inner loop. The downtown 1.9-mile inner loop of Miami's Metromover was closed for more than two days because of "phantom" trains on the track, until the afternoon of April 26, 1995. Metro-Dade Transit Agency technicians attributed the problem to a faulty transmitter in a computer. "Phantom" trains have been a recurring Metromover glitch, one of a long string of computer-related problems that have plagued the system and are likely to continue. MDTA disclosed that in the spring and fall of 1994, sunshine sometimes tripped safety sensors that detect the presence of trains. Those sensors have been realigned to shield them from the sun.60
Computer crash halts train traffic in 8 states. A computer crash caused various effects on CSX rail service, freezing passenger and freight trains in their tracks for 2 hours in 8 states in the Southeast US, during the evening rush hour on March 27, 1995. This affected 2100 Amtrak passengers and 5000 Tri-Rail commuters in south Florida, and freight trains from Louisiana to North Carolina. Service was restored "under human direction".61
***Section 2.5.1, Insert at end (p.56) before the last sentence.***
The Big One belittled? A similar roller-coaster accident occurred in England on July 7, 1994, on The Big One -- the world's highest and fastest roller-coaster, at Blackpool's Pleasure Beach. Two trains on the new £12 million ride collided 30 feet above ground. Eight passengers had to be cut free, trapped by jammed safety bars. (The bars worked correctly.) 27 people were taken to hospital with minor injuries, while others were treated for shock. One train (going much more slowly than its top speed of 85 miles an hour) collided with the rear of another, which had been slowed by the braking system. Earlier, on the roller-coaster's first day on May 28, 1994, 30 people were trapped 235 feet up after a fault in the computer system.62
Mad-bus disease (Geert Jan van Oldenborgh) (R 19 40)
Nine people were injured, one seriously, when a Dutch long-distance bus suddenly accelerated from the bus terminal behind Eindhoven Central Station, and ran into the station restaurant. The builder acknowledged that these sudden accelerations were a known problem; he suspected interference with the electronic accelerator pedal from the communications equipment: the 2-way radio, the mobile telephone, and/or the little box that operates traffic lights. No technical shortcomings had been found in previous inspections, but the buses still careen out of control every now and then... The worst-affected 22 out of 178 have now been taken out of service. [Source: NRC Handelsblad, 25 and 26 Sep 1997].
Two out-of-band comments: in case you wondered, a long-distance bus is defined locally as one that goes more than 50km. The linear dimensions of our country are about 200km... Secondly, with regards to the computer-operated storm-surge barrier I reported on earlier, a week later it transpired that the software was not yet ready in fact, and would become operational this autumn. Until then a human would decide when to close off Rotterdam harbour. Fairly typical I assume... GJ
Bright Field crash in New Orleans. According to John Hammerschmidt of the National Transportation Safety Board, preliminary investigations into the freighter Bright Field crashing into the Riverwalk in New Orleans in 1996 suggest that an oil-pump failure caused the ship's computer to automatically reduce speed. A standby pump kicked in, but under reduced power the ship's maneuverability was decreased. The impact cut a 200-foot swath into shops and a hotel condominium complex, and the pedestrian walkway. A language barrier between the Chinese-speaking captain (and crew) and the English-speaking pilot reportedly may also have contributed. The Liberian-registered 69,000-ton ship was not equipped with a U.S.-recommended voice recorder, and a second voice recorder was not functioning. Coast Guard Captain Gordon Marsh confirmed that large ships lose steering power as often as once a week. Michael Quinlan noted, "The captain also acknowledged forgetting he had a computer override button on his console that could have allowed him to bypass the computer and increase the ship's speed and maneuverability."63
The Royal Majesty. The cruise ship Royal Majesty ran aground off Nantucket in 1995. The explanation ultimately given is that the GPS antenna failed and the alarm was not loud enough to alert the crew to switch to Loran.64
Denver hi-tech baggage handling problems. The opening of the new Denver airport was seriously delayed (with losses estimated at $1 million each day), primarily because of difficulties in getting the $200 million automated baggage-handling system to work adequately. With about 100 computers, there were mechanical problems and some software glitches.
*** ADD TO EXISTING ITEM on Seattle Evergreen Drawbridge: After a second incident involving a death, the Evergreen draw span was rebuilt in 1994. The old mechanical system has been replaced by computer controls with a series of safety features that must be manually overseen by the bridge operator.65
Massive failure of Washington D.C. traffic lights. Most of the traffic lights in downtown Washington D.C. went onto their weekend pattern (typically 15 seconds of green per light), rather than their rush-hour pattern (typically 50 seconds of green per light) during the morning rush hour on May 8, 1996. This problem was reportedly caused by a new version of software installed in the central control system. This caused mile-long traffic jams. By the afternoon rush hour, the software glitch had been "fixed". It wasn't clear whether that meant they reloaded the old software or fixed the bug.66
Computer malfunction floods Boulder garages and basements (S.J. Hutto) (R 19 34)
"Officials blamed a malfunctioning computer for five water main breaks late Saturday that cut service to about 40 homes, flooded basements and garages and turned city streets into rushing streams." A computer controlling water pressure gave inaccurate readings, prompting a city worker to open up the mains. [Source: Rocky Mountain News, 25 Aug 1997]
*** Section 2.8, replace last para, Tempo AnDante, p. 67, with
Tempo AnDante? The crawl of two robots. The two Dante robots provide a saga of what can go wrong in a hostile environment. Dante I was descending for exploration inside the Mount Erebus volcano when its fiber-optic control cable snapped only 21 feet from the top of the volcano, immobilizing the robot.67 Dante II, its successor, was much more successful in its August 1994 exploration of the volcanic crater of Mount Spurr in Alaska after the 1992 eruption, and determined that the volcano would be safe for humans. However, its descent was marred by falling rocks, mud and snow, prior to which its dish antenna had been chewed on by a bear. It survived a power loss, a dead transmitter, and a moisture-induced short in its power-communication tether. However, its ascent was stopped when one of its octopods failed. A helicopter hoist failed when its tether snapped -- perhaps wrapped around a very sizable boulder. It was finally rescued with human intervention, although with injuries to one graduate student and six of Dante II's legs.68
Five-million-dollar bug? A Tokyo University research team is implanting electrodes in cockroaches to see if their movements can be remotely controlled. However, the controls themselves still have bugs.69
Programmed tunnel-digging robot runs a-muck. A tunnel-digging robot "mole" uses programmed directional coordinates to chew through 70 feet of soil a day. Sewer diggers in Seattle were surprised when the mole did not reappear at its expected exit point, and Anthony Catania was suspicious when his restaurant-supply store began to shake. The misprogrammed mole left a 700-foot hole that had to be filled with concrete, at a cost of $600,000. (The 18-foot-long mole costs $475,000.)70
Electrocauterizer EMI alters pacemaker. Carl Maniscalco reported that an acquaintance had received emergency care in a hospital after accidentally pulling out her dialysis shunt. The attending physician had been informed that she had a pacemaker, but used an electrocautery device in an attempt to stop her bleeding. The electromagnetic interference from the device apparently corrupted the software in the pacemaker. When the problem was finally detected, the manufacturer was able to reprogram the pacemaker, using data transmitted to the still implanted unit as audio tones via a transducer.71
RF EMI turns into pacemaker life-saver. In contrast with the above cases of harmful EMI effects on pacemakers, here is a beneficial one. A 42-year-old man of The Hague in The Netherlands collapsed in front of a swimming pool when his pacemaker failed. A police officer in the vicinity radioed for help - upon which the pacemaker started working again, because of the radio-frequency interference. The officer was able to keep the man alive by using his transceiver until an ambulance arrived.72
More on RFI effects on medical equipment. Radio-frequency interference generated by radios and cellphones has also been known to mess up sensitive medical equipment such as heart defibrillators, diagnostic equipment, and even electric wheelchairs. There is a report of an electric wheelchair "zapped by radio waves" that sent its passenger over a cliff.
A 72-year-old man died in an ambulance when the heart defibrillator device he was on failed due to RFI from the ambulance two-way radio. The ambulance manufacturer had replaced the steel roof with a fiberglass dome, and put the antenna on top.
In a case in which diagnostic equipment indicated a man needed a pacemaker, it was later discovered (after the operation) that the diagnosis had been in error because of RFI from a television set in the same room.
A cellphone used by a mother in the front seat of a car affected the ventilator her child was using in the back seat.
In a hospital ward, various ventilator alarms were triggered whenever the handyman keyed his transceiver.73
Harvard Pilgrim HMO scheduling system creates chaos. The scheduling computers of the Harvard Pilgrim Health Maintenance Organization "broke down" on March 4, 1996, and were unavailable for several days. Nonemergency patients needing to make appointments had to wait until the computers were again available. The medical records system was also down for seven hours on March 6. Harvard Pilgrim indicated this was a "standard database problem." (Terrific standard!)74
Millstone 2 safety risks. Northeast Utilities reported that it had failed to follow proper safety procedures on two occasions in April 1994 at its Millstone 2 plant in Waterford. On April 23, an indicator showed that some of the control rods were stuck. The crew concluded that the problem must have been with the indicator and left for the day. When the new crew arrived, they discovered the rods were indeed stuck, but they failed to shut down the reactor as quickly as they should have and underclassified the seriousness of the event. Operators subsequently failed a Northeast Utilities test on reactor theory and were removed from duty for training. The utility's report blamed the problem in part on the operators' failure to understand reactor theory and a failure of plant management to "fully appreciate the implications" of the safety-related event and to provide sufficient oversight.
The second incident involved a coolant leak from the plant's reactor. In this case, the operators again underclassified the seriousness of the event. Notification of federal authorities was delayed by 16 hours.75
Xerox machine caused nuclear-power plant emergency halt. One of the Swedish nuclear reactors, Ringhals 4, was automatically shut down during a routine safety check. When the computer safety system noticed that the instructions were incomplete (because a page had been truncated when copied), it shut down the reactor.76
Western U.S. power blackouts. More than a dozen states, including California, Oregon, Washington, Utah, Nevada, Wyoming, and Arizona, reported power outages on July 2, 1996. At least 11 separate power plants "inexplicably were knocked off line". Later in the day, plants in Rock Springs, Wyoming, and along the Colorado River also went off line.77 On the following day, parts of Idaho were again blacked out.78
It took until July 20, 1996 -- 18 days later -- for the official cause of the July 2 outage to be announced: an Idaho transmission line that short-circuited when electricity jumped from a low-hanging wire to a tree that had grown too close. The tree, since removed, caused a flash-over in an area about 100 miles east of the Kinport substation in southeastern Idaho. The line carried 345 kilovolts.79 Reportedly, the indication of the initial outage was detected but not relayed on to the appropriate authorities -- because the operator could not find the correct phone number.
On August 10, 1996, further major outages affected 8 million accounts in 8 states and parts of Canada and Baja, with propagating effects that included air traffic.
On August 13, 1996, electricity for the city of Palo Alto was shut down due to an erroneous signal sent by a neighboring power company in mistaken anticipation of a power surge.80
Misdirected phone call shuts down local power. Mike Winkelman reported that power went out for an hour and one-half for about half of his town of 38,000 when an apparently automated phone call to shut down a power station was directed to the wrong substation.81
Effects of another San Francisco power outage (R 20 11)
At 8:15 a.m. on 8 Dec 1998, a power surge resulted from an attempt to reconnect a San Mateo power substation to the grid after maintenance. Unfortunately, the temporary grounding had not been removed, providing a massive short. This knocked at least two other power plants off line, and affected about 1 million customers in the San Francisco Bay Area - many for two or three hours, some for up to 8 hours. The blackout took down the SFO Airport, the Pacific Stock Exchange, rapid transit, and ATMs, as well as homes, offices, and hospitals. There were reports of people stuck in elevators and problems with home medical equipment. SFO was back up by 9:45 a.m. with emergency generators. SRI International experienced only a power blip, but it was enough to wipe out a bunch of servers throughout the institute; our lab's computers were down for more than two hours. [See a well-informed follow-up discussion by Cathy Horiuchi (R 20 12).]
The widespread consequences of this local outage give us one more reminder (if we need any) of the importance of routine preparedness for foreseeable but not adequately foreseen events. Natural causes tend to surprise us; the possibility of Y2K-related outages should no longer be a surprise.
How a fuse caused a hospital to disconnect from the power grid (Joan Grove Brewer) (R 20 11)
In April 1998, the Valley Medical Center in Renton, Washington, attempted to cut over to its new power cogeneration plant, independent of the local utility's power grid. The staff was apparently not adequately prepared, because it had assumed the cutover would be seamless. Initially, the hospital indeed ran smoothly, but then lights began to flicker, ventilation fans cut out, alarms beeped, and computer screens blinked on and off. [Source: How a $5.9 million power plant brought a hospital to its knees, by Byron Acohido, Seattle Times staff reporter, The Seattle Times, 6 Dec 1998, PGN Abstracting]
Power outage leaves hospitals in the dark (Dave Weingart) (R 20 25)
On 10 Mar 1999, two of the three hospitals that make up Long Island Jewish Medical Center in Long Island, NY were without power for a period of 47 minutes, starting at 5:58pm. Patient care was apparently not impacted, although 2 operations were completed by battery-operated lights, and bags of ice were hauled from the cafeteria to the blood bank to keep things cold. Life-support equipment has an internal battery backup and kept functioning during the outage. Investigations are underway to determine why none of the four backup generators worked.
Kids, let this be a lesson. It's not enough to have a backup system in place; you need to make sure it will work when needed.
Rats take down Stanford and Net connections. Stanford University was without power on October 10-11, 1996, because of a-gnawing rats and a subsequent explosion. This outage also disrupted the BBN Planet Internet hub, affecting Net connectivity for many Silicon Valley companies, and the Websites of the Los Angeles Times and San Francisco Chronicle.82
Rat brings down U.C. Berkeley campus. The entire campus of the University of California at Berkeley was blacked out for almost 6 hours on August 12, 1994, when a rat bridged a power connector. Backup facilities were able to provide limited emergency power during that period.83
Squirrels of the World, Unlight. SRI International experienced its fourth recorded squirrelcide -- which brought down the entire institute on October 12, 1994, for something like eight hours, and created all sorts of internal power surges, despite the isolation supposedly provided by our cogeneration plant hookup. My monitor was fried.84 A 5th squirrelcide at SRI subsequently caused an 18.5-hour institute outage, knocking out both utility and cogeneration power (R 19 96).
Another squirrel tail in Washington State. A squirrel shorted itself between 69,000-volt and 12,000-volt lines on December 13, 1995, and brought down the "high-tech financial hub of Southeast Washington" - affecting 4000 downtown customers, and causing an explosion and fire inside Pacific Power's central substation.85
Squirrels bring down Nasdaq. Nasdaq trading was shut down by an energetic squirrel who apparently chomped on a power line near the stock market's computer center in Trumbull, Connecticut on August 1, 1994. The system failed to perform the automatic switchover to the temporary backup power supply (designed to last until the backup system in Rockville, Maryland, could be brought up), and consequently the market was down for 34 minutes. A similar problem occurred in December 1987.86
Snail causes Liechtenstein's cable TV system to fail. Soccer fans in Liechtenstein were unable to watch the final minutes of a soccer match between the French team of Auxerre and Switzerland's Zürich Grasshoppers when a snail crawled into a socket. The resulting short-circuit caused the entire cable TV network of Liechtenstein to fail in October 1996.87
"Buffer overload" crashes network bridge. Jeff Anderson-Lee reported on the custodians at Berkeley during the summer of 1996 plugging in their heavy-duty floor buffers, which tended to blow the archaic circuit wiring. Instead of resetting the breaker, they kept trying other outlets. As a result, the network bridge on that circuit was put out, and the two halves of net were cut off from each other half. The custodian who had been trained not to do this was on sick-leave.88
$25m Australian power system runs amok. Failure of an automated system for a Queensland power station (requiring twice as many engineers as the previous system) caused more than $1.5 million damage to machinery at the Swanbank station near Ipswich when the system failed to prevent a trip (shutdown) cutting oil flow to a turbine, which resulted in a bent shaft and left the turbine with reduced generating capacity. An automatic alarm system failed, almost two years after it was installed. Improper testing and waiving of commissioning and acceptance testing were implicated.89
Power outage in Russian missile site. The plug was reportedly pulled at a major Russian missile site, because their electric bill had not been paid.90
As the year 2000 approaches, the risks of calendar-clock problems loom large whenever two-digit year fields are used. The number 99 is larger than 00, not smaller, and we can expect all sorts of computer calendar date-time arithmetic to fail whenever the relative order of dates is considered. For example, COBOL programs use a two-digit year field, and COBOL programmers are increasingly scarce. Consequently, some folks are in panic, whereas others have a while longer to plan ahead (MS-DOS bellies up on 2048 Jan 01 and the programming language Ada has a time_of year field that is exhausted after 2099). Some folks believe they are really immune - such as users of Java, which runs out of dates in the year 292271023. As noted in Section , some systems have already run out or will soon (Tandem CLX; Apollo workstations exhaust their time fields on November 2, 1997; the Global Positioning System (GPS) on August 21, 1999; Ed Ravin noted the Fujitsu model SRS-1050 ISDN display phones had their clocks stop at 1994 Sep 30 11:59 PM.91), some later. Pundits are creating estimates of how much it will cost to fix all of the software that is expected to die, beginning at the midnight transition from 1999 Dec 31 to 2000 Jan 1. A figure of $300 to $600 billion (thousand million) has often been quoted as the estimated worldwide cost. $30 billion was cited as the cost to the U.S. Government, with the prognostication that 30% of the systems would not be fixed in time. Consumers Power Co. in Michigan estimated that their upgrade (begun in 1993) would cost up to $45 million. The average Fortune 500 company was expected to spend $100 million.92
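The ordering problem can be seen in a few lines. The following Python sketch is purely illustrative: the function names and the pivot value of 50 are my assumptions and are not taken from any of the systems mentioned above. It shows that a comparison on raw two-digit year fields puts 00 before 99, and how one frequently discussed repair (a fixed "window" that assigns each two-digit year to a century) restores the expected order over a limited range.

```python
def is_later_two_digit(yy_a, yy_b):
    """Naive comparison on two-digit year fields, exactly as stored."""
    return yy_a > yy_b

def expand_year(yy, pivot=50):
    """Hypothetical windowing rule: 00..pivot-1 -> 20xx, pivot..99 -> 19xx."""
    if not 0 <= yy <= 99:
        raise ValueError("two-digit year expected")
    return 2000 + yy if yy < pivot else 1900 + yy

def is_later_windowed(yy_a, yy_b, pivot=50):
    """Compare after expanding both fields to four-digit years."""
    return expand_year(yy_a, pivot) > expand_year(yy_b, pivot)

# 2000 ("00") should come after 1999 ("99"), but the naive test disagrees.
naive = is_later_two_digit(0, 99)      # False
windowed = is_later_windowed(0, 99)    # True

# An interval computed on the raw fields also goes negative.
raw_interval = 0 - 99                  # -99 "years" instead of 1
```

Note that windowing only relocates the problem: with a pivot of 50, the same misordering reappears between 2049 and what the rule takes to be 1950.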
Some effects were already being felt at the end of the 1990s, as systems were unable to handle expiration dates into the 2000s. Scot E. Wilcoxon noted that a Minneapolis newspaper pointed out that five-year planning programs were already at risk in 1995. John Cavanaugh recalled seeing a Computerworld article in 1975, when some programs that did projections 25 years ahead started failing.93 In the United Kingdom, the Department of Social Security in 1996 postponed the ability of divorcing couples to split their pensions until the year 2000, because of the effects on the computer databases.94
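The arithmetic behind these early failures is simple: a program that looks N years ahead first touches the year 2000 in the year 2000 - N. A minimal Python sketch (the function names are mine, for illustration only) reproduces the two cases just cited.

```python
def first_year_at_risk(horizon_years, rollover_year=2000):
    """Earliest calendar year in which a look-ahead of horizon_years reaches the rollover."""
    return rollover_year - horizon_years

def projection_crosses_rollover(current_year, horizon_years, rollover_year=2000):
    """True if a projection made in current_year reaches or passes the rollover year."""
    return current_year + horizon_years >= rollover_year

# 25-year projections reach 2000 in 1975; five-year plans reach it in 1995.
long_range = first_year_at_risk(25)    # 1975
five_year = first_year_at_risk(5)      # 1995
```

The same calculation applies to credit-card expiration dates, pension schedules, and any other field that stores a date in the future.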
Some lawyers are drooling over the expected lawsuits. Some hucksters are selling easy solutions. There is even a report of a Year-2000 Shark who was scamming businesses by offering to fix credit-card systems that allegedly would not work on a card with a year-2000 expiration date.95
*** ADD TO LEAP-YEAR SECTION ***
Leap-day 1996 in New Zealand. A computer glitch at the New Zealand Aluminium Smelter plant at Tiwai Point in New Zealand (South Island) at midnight on New Year's Eve 1996 left a repair bill of more than NZ$1 million. Production in all the smelting potlines ground to a halt at midnight, when the computers unexpectedly all shut down. General manager David Brewer said the failure was traced to a faulty computer software program that failed to account for 1996 being a leap year: the computer was not programmed to handle the 366th day of the year. The same problem occurred two hours later at Comalco's Bell Bay smelter, in Tasmania, Australia. (New Zealand is two hours ahead of Tasmania.) Both smelters use the same program, which was written by Comalco computer staff. Before the Tiwai problem could be fixed that afternoon, five cells had over-heated and were damaged beyond repair. Mr. Brewer estimated the replacement cost at more than NZ$1 million.96
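The report says only that the program could not handle the 366th day of the year; the smelter code itself is not available to me. As a hedged illustration of how such a failure can arise, the Python sketch below assumes a hypothetical per-day table sized for exactly 365 entries, and contrasts it with one sized from the actual length of the year. All names are invented for the example.

```python
import calendar
from datetime import date

def days_in_year(year):
    """365, or 366 in a leap year."""
    return 366 if calendar.isleap(year) else 365

def day_of_year(d):
    """1-based ordinal day within the year (1..365 or 1..366)."""
    return d.timetuple().tm_yday

def slot_for_fixed_365(d):
    """Hypothetical flawed lookup: assumes every year has exactly 365 days."""
    table = list(range(365))
    return table[day_of_year(d) - 1]    # raises IndexError on day 366

def slot_for_actual_length(d):
    """Same lookup, with the table sized for the year in question."""
    table = list(range(days_in_year(d.year)))
    return table[day_of_year(d) - 1]

last_day_1996 = date(1996, 12, 31)      # the 366th day of 1996
```

Calling `slot_for_fixed_365(last_day_1996)` raises an IndexError, whereas the same call for 31 Dec 1995 succeeds, which is why such a flaw can lie dormant for years and then surface at midnight.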
All the News That Fits We Print: No-Op-Ed. On July 10, 1995, Simson Garfinkel gave me a copy of The New York Times Op-Ed page from that day's National Edition. The page was mostly blank, with a nicely black-boxed obit-like message: "TO OUR READERS, Because of a computer breakdown, some copies of The Times were printed without the Op-Ed page."97
Logic flaws. There was lengthy discussion in the on-line RISKS relating to the Pentium floating-divide chip flaw that resulted from a table incorrectly copied from the Intel 486 chip design.98 A flaw in the Intel Orion 82450 chipset (an auxiliary to the Pentium Pro) was also discovered, although it affected performance and not correctness.99 Jim Haynes recalled earlier floating-point flaws in the early VAX 11/780s and the General Electric 635.100 Chris Phoenix noted a software flaw in the built-in BASIC on the TI 99/4A computer.101
Microsoft mathematics bugs: Calculator and Excel. For several years a mathematics flaw existed in the Calculator applet that came bundled with Microsoft Windows. This remained uncorrected in several releases over a considerable period of time. A new flaw surfaced in Microsoft's Excel spreadsheet: type or paste 1.40737488355328 into a cell and you will be rewarded, not with the number you expect but with 0.64. If you perform arithmetic with this, it will act as if 0.64 had been entered -- so it is not simply a display error. When the number is used as part of a formula, the error is not apparent.102
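One property of the offending Excel value can be checked independently of any spreadsheet: its digits are exactly those of 2 to the 47th power, with the decimal point moved 14 places. The Python sketch below verifies this with exact decimal arithmetic. It does not reproduce Excel's behavior, and whether this relationship has anything to do with the cause of the flaw is not established by the report cited above.

```python
from decimal import Decimal

value = Decimal("1.40737488355328")

# Shift the decimal point 14 places to the right (exact, no rounding).
scaled = value.scaleb(14)

# Compare with 2**47 computed as an exact integer.
matches_power_of_two = scaled == 2 ** 47    # True
```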
*** NEW SECTION, COMBINING WHAT IS CURRENTLY IN SECTION 5.7 on accidental financial losses. ***
Social Security Administration problems. The SSA botched a software upgrade in 1978 that resulted in almost 700,000 people being underpaid an estimated $850 million overall, as a result of cutting over from quarterly to annual reporting.103 Subsequently, the SSA discovered that its computer systems do not properly handle non-Anglo-Saxon surnames (for example, with spaces as in de la Rosa, or that do not appear at the end, as in Park Chong Kyu) and married women who change their names. This glitch affected the accumulated wages of $234 billion for 100,000 people, some going back to 1937.104
Glitch causes 4 billion euro overdraft (Monty Solomon) (R 20 30)
Although the January switch to the single European currency was smooth at most European banks, a prominent German discount bank and its customers this week were acutely aware that not all possible euro-caused glitches have been found. Customers of Bank 24, a discount bank owned by Deutsche Bank AG, were astonished [on 6 Apr 1999?] to find that their securities accounts appeared to be overdrawn to the tune of 4 billion euro ($4.32 billion). An oversight connected to the change to the euro was responsible for the error, affecting 55,000 customers. (Source: Mary Lisbeth D'Amico, IDG, 12 Apr 1999)
Nasdaq Computers Crash. The U.S. automated over-the-counter Nasdaq marketplace went down for 2.5 hours on the morning of July 15, 1994, when the computer system died. (It was finally restored just before N.Y. lunchtime.) The problem was traced to an upgrading to new communications software. One new feature was added each morning, beginning on Monday. Thursday's fourth new feature resulted in some glitches, but the systems folks decided to go ahead with the fifth feature on Friday morning anyway -- which overloaded the mainframes (in Connecticut). Unfortunately, the backup system (in Rockville, MD) was also being upgraded, in order to ensure real-time compatibility. The backup died as well. The backup system is "really for natural disasters, power failures, hardware problems that sort of thing," said Joseph R. Hardiman, Pres and CEO of Nasdaq. "When you're dealing with operating software or communication software, it really doesn't help you." Volume on the day was cut by about one third, down from a typical 300 million shares. The effects were noted elsewhere as well, including several stock indexes, spreading to the Chicago options pits, trading desks, and the media. That in turn affected the large stock-index mutual funds.105 (Squirrel-caused Nasdaq outages are noted in Section 2.10.2.)
NY Stock Exchange halted for one hour. The New York Stock Exchange opened an hour late on December 18, 1995, after a weekend spent upgrading the system software. At 9:15 a.m. on Monday, it was discovered that there were serious communications problems in the software between the central computing facility and the specialists' displays. The problem was diagnosed and fixed by 10:00 a.m., and the market reopened at 10:30 a.m. It was the first time since December 27, 1990, that the exchange had to shut down. The Chicago Mercantile Exchange, Boston Stock Exchange, and Philadelphia Stock Exchange all waited until the NYSE opened as well. (The monster snow storm on January 8, 1996 subsequently caused a late start and an early close.)106
Alberta Stock Exchange shutdowns. For the second time in six sessions and the third time in 1997, the Alberta Stock Exchange lost a day of trading on March 11, 1997, because of system problems. Fixing the software took all day and night. Previous software errors had stopped trading all day on March 4, and earlier, in May 1996 and January 1997.107
Johannesburg Stock Exchange computer failures. The Johannesburg Stock Exchange's automated trading system (JET) began fully automated trading on June 10, 1996. A failure on July 1 was attributed to "human error." On July 22, 1996, the system and the backup system both failed after only forty minutes of trading and were down for the rest of the day.108
Washington Post runs old stock prices; file-name confusion. The Washington Post printed a full page of old stock prices in their business section in late December 1994, because a space was left out of a file name.109
NASD loses records on 20,000 brokers. The National Association of Securities Dealers (NASD) is the self-regulatory organization that oversees broker-dealers and their employees in the United States. It maintains a database of brokers and any disciplinary actions taken against them. Unfortunately, 20,000 records were accidentally purged from their files, and there was no backup file.110
Computer glitch gives Schwab investors instant loss of balance. A program error caused Schwab's computer systems to omit a significant number of mutual funds when investors used Telebroker to track holdings by phone, leading some of them to believe themselves broke. The problem existed for two days, scaring scores of investors. Janus, Putnam, and Schwab's own funds were among those omitted from net asset calculations.111
Rough days on the stock markets (PGN)
With the huge fluctuations in stock prices on 27-28 Oct 1997, the NYSE and Nasdaq each handled over a billion shares for the first time ever on 28 October 1997, with the NYSE at 175% of the previous blockbuster day. The bad news is that those folks who relied on the Internet to do their panic trading were in for a rough time. There were huge numbers of e-trades already queued up before opening, causing an early traffic jam. Joseph Konen of AmeriTrade Holding blamed some of the delays on limitations of its firewall technology. Many would-be Internet buyers and sellers simply could not get access, in part because their Internet service providers were saturated. Many customers were blocked out because others were tying up lines just to monitor the market. (Illustrating the extent to which Internet trading has become a part of the markets, Schwab normally does 35 percent of its trading on-line; yesterday's trading of more than 300,000 on-line transactions more than doubled their Monday load and tripled their typical day.) Conventional trades were also affected. [Steve Bellovin, Frank Carey, and Nick Bender gave lots of details, including Nick noting the effects on Nasdaq of a sequence-number overflow from 999,999 to 000,000 (R 19 44).]
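The sequence-number overflow mentioned by Nick Bender is a close cousin of the two-digit year problem. The source gives only the wrap from 999,999 to 000,000; the Python sketch below is my own illustration of what a six-digit counter does at that point, and of why a "larger means newer" test then misfires. The wrap-aware comparison is one possible remedy, offered as an assumption rather than as anything Nasdaq did, and it is valid only when the two values being compared are less than half the number space apart.

```python
MODULUS = 1_000_000     # six decimal digits: 000000..999999

def next_seq(n):
    """Advance a fixed-width six-digit sequence number, wrapping at the top."""
    return (n + 1) % MODULUS

def naive_is_newer(a, b):
    """Assumes sequence numbers only ever grow."""
    return a > b

def wrap_aware_is_newer(a, b):
    """a is newer than b if it is ahead by less than half the number space."""
    return a != b and (a - b) % MODULUS < MODULUS // 2

last = 999_999
first_after_wrap = next_seq(last)                          # 0

naive = naive_is_newer(first_after_wrap, last)             # False: looks older
wrap_aware = wrap_aware_is_newer(first_after_wrap, last)   # True
```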
Chemical Bank's ATMs go down after snafu. Chemical Bank's ATMs were out of commission for more than five hours, beginning at 6:45 a.m. on July 20, 1994. A routine file update was botched, overloading the computer system. This came six months after Chemical systematically charged its customers twice for cash withdrawals.112
Patched software threatens $26 billion federal retirement fund. Inadequate configuration control often presents serious risks. "An audit of the $26 billion federal employees' Thrift Savings Plan found that ineffective control of software development has left the plan vulnerable to processing interruptions and may have compromised its data integrity." The audit found that between 1990 and 1993, more than 800 changes were made annually to the software; about 85 percent of 1993 updates, mandated or emergency changes, bypassed upfront quality assurance database testing; comprehensive quality assurance testing was rarely performed; six programmers (17 percent) accounted for more than 40 percent of all 1992 and 1993 TSP software changes, for which there was little documentation.113
Fidelity Brokerage computer problems. Fidelity Brokerage Services (a discount stock brokerage in London) rushed a new system into operation in April 1996 without adequate testing. As a result, they had more than 50 people working 14-hour days to sort through and manually correct three months of records ("late bookings of dividends and other problems"). British authorities forced FBS to stop taking new customers until the problems were solved.114
Interac. On November 30, 1996, the Canadian Imperial Bank of Commerce Interac service was halted when an attempted software upgrade failed, affecting about half of all would-be transactions across eastern Canada.115
German Railway payroll software glitch. The German Bundesbahn failed to meet its payroll correctly for four months running, because of software problems resulting from new pay adjustments in the privatized rail system. Some people didn't get paid at all.116
Bank goof creates millionaire. Howard Jenkins was a multimillionaire for about half a day, when an ATM gave his balance in the tens of millions. Apparently, an error had resulted from his requesting a hold on his account after he lost his checkbook. He withdrew $4 million in cashier's checks and cash, but returned later with his lawyer to return the money. The bank blamed a computer error.117
ATM problems in Canada. Toronto-Dominion Bank's automated teller machines crashed for most of a weekend in October 1996, affecting 2000 ATMs. Their debit payment system was also down.118
Chase Manhattan computer glitch affects thousands. As a result of a few missing keystrokes that would have properly defined the recipient list, about 11,000 out of 13,000 of Chase Manhattan Corp.'s secured credit-card customers received a letter intended for just 89 customers -- informing them that their accounts were in default and could not be converted to unsecured accounts. The screwup was blamed on an outside firm that administers the secured card program.119
Barclays Bank banks big-bang bump-up.
In one of the rare success stories that can be found in this book (primarily
because there are so few to report), Barclays Bank shut down its main
customer systems for a weekend, and seamlessly cut over to a new distributed
system accommodating 25 million customer accounts on Monday. This system
replaced three incompatible systems. It is rumored that Barclays spent at
least £100 million on the upgrade.120
Cases of accidental financial losses are given in Section , following intentional financial frauds in Section .
Intuit tax glitches. Flaws were reported in the PC and Mac versions of TurboTax and MacInTax 1040 for 1994. These flaws were triggered when transferring tax data to the tax package from other software, such as Quicken. Intuit Inc. estimated that the flaws would affect only about 1% of the users.121 Intuit also had a security flaw that could have enabled one user to download another taxpayer's returns, because the password for the Intuit master computer was embedded in a visible debug file.122
Tax preparation programs. PC Magazine did a comparison of twenty different tax-return packages and discovered that each one computed a different total tax due for the identical input data.123
Microsoft and Lotus spreadsheet errors. Microsoft and Lotus Development have admitted that their spreadsheet products may produce inaccurate results because of an inherent problem with the design of all computers (base conversion, rounding, etc.). Mistakes can occur in precision calculations, of the kind required by engineers and users in the scientific, banking and finance sectors.124 Steve Bellovin recalled Fred Brooks describing a 1950s program for billing by petroleum usage, where the billing was legally constrained to conform to certain tables - which were incorrect. The solution was to compute another table defining the differences between the computed values and the legally required ones.125
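The base-conversion and rounding problem mentioned above is easy to reproduce. The following is a minimal sketch in Python, not a reconstruction of the specific spreadsheet errors: it shows that binary floating point cannot represent most decimal fractions exactly, and that decimal arithmetic (here the standard `decimal` module) avoids that particular surprise. The amounts used are made up for illustration.

```python
from decimal import Decimal

# Binary floating point stores 0.1 and 0.2 only approximately,
# so their sum is not exactly the stored value of 0.3.
binary_sum = 0.1 + 0.2
print(binary_sum == 0.3)                # False

# Decimal values built from strings keep the decimal digits exact.
decimal_sum = Decimal("0.1") + Decimal("0.2")
print(decimal_sum == Decimal("0.3"))    # True

# The same effect accumulates: add one cent 1,000 times (illustrative data).
float_total = sum(0.01 for _ in range(1000))
decimal_total = sum(Decimal("0.01") for _ in range(1000))
print(repr(float_total), decimal_total)
```

Decimal arithmetic is not a cure-all (division and rounding rules still need care), but it removes the base-conversion error for quantities that are decimal by definition, such as currency.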
Maryland Lottery Computer Glitch. A software error was blamed for two of the six winning numbers being reported incorrectly to 3,800 lottery outlets. Many people threw out their tickets thinking they lost, while others thought they had won.126
An unlosable casino game. Erling Kristiansen noted that the Dutch radio station Radio 538 set up a "Virtual Casino" on their web server, as a protest against legislation-in-the-making against Internet gambling. Playing is free of charge, and you can win real prizes, presumably paid by the sponsors whose company logos appear prominently. However, if you lose in a turn of the game, you just click on back on your Web browser, and you undo your loss! This reminded Hal Lockhart of computer-based gambling games that forget to check for negative bets: you make a negative bet and lose on purpose, and the game subtracts your bet from your winnings - that is, adds the absolute value of the bet!127
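The negative-bet flaw that Hal Lockhart recalled can be sketched in a few lines. This is a hypothetical illustration, not code from any actual game; the function names and the rule that a bet may not exceed the balance are assumptions.

```python
def settle_bet_unchecked(balance, bet, player_wins):
    """Flawed settlement: accepts any bet, including a negative one."""
    return balance + bet if player_wins else balance - bet


def settle_bet_checked(balance, bet, player_wins):
    """Same settlement, but rejects non-positive or unaffordable bets."""
    if bet <= 0 or bet > balance:
        raise ValueError("bet must be positive and no larger than the balance")
    return balance + bet if player_wins else balance - bet


# Losing a bet of -50 subtracts -50, which adds 50 to the balance.
print(settle_bet_unchecked(100, -50, player_wins=False))   # 150

try:
    settle_bet_checked(100, -50, player_wins=False)
except ValueError as err:
    print("rejected:", err)
```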
California lottery glitch. The California Lottery started issuing tickets for the following lottery 3 hours early on May 14, 1995, causing anger and confusion. An employee of Sacramento's GTECH, which runs the lottery computer, was conducting routine maintenance when he mistakenly entered a command that closed the draw pool. Lottery officials wisely decided that tickets in that 3-hour window would be eligible for both lottery drawings.128
U.K. lottery terminals downed by satellite network breakdown. National Lottery computer outlets crashed for 15 minutes throughout the United Kingdom on June 10, 1995, when part of the satellite network broke down before noon, ahead of the evening's expected record £20 million jackpot prize-draw.129
Ben & Jerry's first-ever loss. Ben & Jerry's Homemade Inc. reported its first quarterly loss (Q4 1994) since it went public -- due in part to recurring problems in their computerized handling system that delayed the opening of a modernized plant in St. Albans, Vermont.130
*** FOLLOWING EARLIER CASE, p.191. *** Enormous water bills - GIGO strikes again. James M. Politte of Warrensburg, Missouri, reported receiving a water bill for $4,704.88. The water meter had been replaced with a new one, and being new, it read "000000". The previous month's reading was "017060". The computer of course assumed that numbers on a water meter only go up, and thus assumed that "000000" was caused by the meter rolling over after reaching "999999".131
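The rollover assumption can be written out as modular arithmetic on a six-digit meter. The sketch below is an illustration only: the billing program's actual logic is not known, and the plausibility threshold (`max_plausible`) is a hypothetical safeguard, not something the utility is reported to have used.

```python
METER_MODULUS = 1_000_000   # a six-digit meter reads 000000 through 999999


def usage_assuming_rollover(previous, current):
    """Usage if readings only ever increase and wrap past 999999."""
    return (current - previous) % METER_MODULUS


def usage_with_sanity_check(previous, current, max_plausible):
    """Same computation, but refuses an implausibly large result."""
    usage = (current - previous) % METER_MODULUS
    if usage > max_plausible:
        raise ValueError(f"implausible usage {usage}; was the meter replaced?")
    return usage


# Readings from the case above: old meter at 017060, new meter at 000000.
print(usage_assuming_rollover(17060, 0))   # 982940 units billed

try:
    usage_with_sanity_check(17060, 0, max_plausible=50_000)
except ValueError as err:
    print("flagged for review:", err)
```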
The Absence of Good Software-Engineering Bites Security Again
Steve Bellovin noted at several recent meetings that 8 out of the 13 CERT
Advisories issued during 1998 involved security vulnerabilities caused by
buffer overflows. That alarming ratio deserves greater attention. CERT
Advisory CA-99-03 on FTP-Buffer-Overflows continues that tradition: "Remote
buffer overflows in various FTP servers leads to potential root compromise"
from Netect, Inc.
http://www.cert.org/advisories/CA-99-03-FTP-Buffer-Overflows.html .
Gee whiz, folks, buffer and stack overflows have been with us for years.
For example, Robert Morris's Internet Worm exploited one in 1988. "When
will they ever learn?"
Electromagnetic interference
Electromagnetic interference on defense systems. Patriot defenses and Predator unmanned aerial vehicles reportedly cannot work properly in certain foreign countries (Germany, Japan, South Korea and Bahrain are particular instances) because of frequency clashes. For example, Patriot missile system radios, radars, and data-link terminals clash with Korean cellular phones; U.S. force pagers clash with Japanese aeronautical systems; crib monitors used on U.S. bases clash with German telephone service. In Bahrain, SPS-40 and SPS-49 radars are unusable because of interference from the national telecommunications services. (See the Defense Week issue that came out on 26 Oct 1998.) ["At least 89 telecommunications systems ... were deployed within the European, Pacific and Southwest Asian theaters without the proper frequency certification and host-nation approval," as noted by Roy Rodenstein, who reminds us of the HDTV interference with Baylor hospital equipment (R 19 62), and points out that quasi-ad-hoc spectrum use must be stemmed in the light of ever increasing uses of the spectrum.]
Add to the end of Section 3.7, Classical Security Vulnerabilities
Limitations of cryptographic algorithms. Cryptographic systems are sometimes broken because of inadequate strength of algorithms or flaws therein. For example, the Netscape Commerce Server software uses 40-bit RC4 crypto to encrypt customer transaction data. Two efforts -- Damien Doligez (a French student) and a British team -- were independently able to crack the crypto, over the same period of time. It took the French student 8 days using 120 workstations and two parallel supercomputers to search exhaustively for the key - about what is predicted as 64 MIPS-years of processing.132 Subsequently, an MIT undergraduate, Andrew Twyman, used a single $83,000 ICE graphics computer to exhaustively attack Netscape's 40-bit encryption. It was reported that the cost to crack Netscape's exportable crypto thereby falls from $10,000 to $584 a pop.133
In response to a series of cryptography challenges announced by RSA Data Security, Ian Goldberg cracked 40-bit RC5 in 3.5 hours, using 250 machines to exhaust 100 billion would-be keys per hour. Germano Caronni cracked the 48-bit RC5 in 312 hours, using 3,500 computers to search 1.5 trillion keys per hour. The 56-bit DES challenge was broken after 4 months, by exhaustive search.134
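The arithmetic behind exhaustive search is simple enough to sketch. The helper below (a hypothetical name, `hours_to_exhaust`) computes the worst-case time to try every key at a fixed rate, using the search rates quoted above. Those rates are presumably approximate or peak figures, and a key is found on average after about half the keyspace, so the results are not expected to match the reported elapsed times exactly.

```python
def hours_to_exhaust(key_bits, keys_per_hour):
    """Hours needed to try every key of the given length at a fixed rate."""
    return 2 ** key_bits / keys_per_hour


# Search rates quoted in the text above.
rc5_40_hours = hours_to_exhaust(40, 100e9)     # 100 billion keys per hour
rc5_48_hours = hours_to_exhaust(48, 1.5e12)    # 1.5 trillion keys per hour

# Each added key bit doubles the worst-case work.
growth_40_to_56 = 2 ** 56 // 2 ** 40           # 65536

print(f"40-bit keyspace: {rc5_40_hours:.1f} hours worst case")
print(f"48-bit keyspace: {rc5_48_hours:.1f} hours worst case")
print(f"56-bit keyspace is {growth_40_to_56} times larger than 40-bit")
```

The point of the exercise is the growth factor: moving from 40 to 56 bits multiplies the work by 65,536, which is why modest increases in key length matter so much against brute force.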
Flaws in cryptographic implementations and embeddings. In most cases, it is easier to subvert a cryptographic system without breaking the algorithm or resorting to exhaustive search - typically because of weaknesses in the implementation or in the underlying operating systems. For example, two Berkeley computer-science graduate students identified a security flaw in the Netscape browsing software, exploiting the predictability with which a pseudorandom-number generator created the crypto seed (a unique offset). Knowledge of this weakness enables the key to be reverse-engineered with significantly less than exhaustive effort.135 A similar problem was discovered in Kerberos Version 4.136
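The general weakness is that a key is only as unpredictable as the seed from which it is derived. The following sketch is not the Netscape algorithm; it assumes a hypothetical key generator (`make_key`) whose seed is a guessable, timestamp-like integer, and shows that an attacker who can bound the seed needs only a few thousand trials rather than 2^40.

```python
import random


def make_key(seed):
    """Hypothetical 40-bit session key drawn from a PRNG seeded with `seed`."""
    return random.Random(seed).getrandbits(40)


def recover_seed(observed_key, earliest, latest):
    """Try every candidate seed in a guessable window; return the first match."""
    for candidate in range(earliest, latest + 1):
        if make_key(candidate) == observed_key:
            return candidate
    return None


# The victim's seed stands in for something like a clock value (assumed).
secret_seed = 909_090_123
key = make_key(secret_seed)

# The attacker knows the seed only to within +/- 5000 of its true value.
found = recover_seed(key, secret_seed - 5000, secret_seed + 5000)
print("recovered a working seed:", found is not None and make_key(found) == key)
```

At most 10,001 candidates are examined here, compared with roughly a trillion for a blind search of a 40-bit keyspace.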
Paul C. Kocher [20] described an attack that exploited the timing behavior of various cryptographic implementations including RSA, Diffie-Hellman, and the Digital Signature Standard (DSS), from which secret keys can be derived. This is a truly fascinating piece of work.137
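Why timing can leak key material at all is visible in the textbook square-and-multiply method for modular exponentiation. The sketch below is not Kocher's attack, which relies on statistical analysis of many timing measurements; it merely counts operations to show that the amount of work depends on the bits of the secret exponent. The function name and the small numbers are illustrative assumptions.

```python
def modexp_counting(base, exponent, modulus):
    """Left-to-right square-and-multiply, counting the operations performed."""
    result = 1
    squares = 0
    multiplies = 0
    for bit in bin(exponent)[2:]:
        result = (result * result) % modulus
        squares += 1
        if bit == "1":
            # This extra multiplication happens only for 1 bits of the key.
            result = (result * base) % modulus
            multiplies += 1
    return result, squares, multiplies


# Two 8-bit exponents of equal length but different numbers of 1 bits.
sparse = modexp_counting(7, 0b10000001, 1009)
dense = modexp_counting(7, 0b11111111, 1009)
print("sparse exponent multiplications:", sparse[2])   # 2
print("dense exponent multiplications:", dense[2])     # 8
```

An implementation whose running time tracks the multiplication count therefore reveals something about the exponent; constant-time implementations and blinding are the usual countermeasures.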
Attacks on cryptographic implementations were also described by others, particularly involving smart-cards. Boneh, DeMillo, and Lipton explored the effects of introducing random faults through electromagnetic interference, and discovered they could determine private keys of public-key cryptosystems. Ross Anderson described similar attacks on smart-cards. Biham and Shamir found effective differential fault-induced analyses of symmetric cryptographic systems, including DES, triple DES, RC4, and IDEA.138 In addition, there are some rather efficient potential man-in-the-middle attacks on a variety of well-known authentication protocols (Sarvar Patel).
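A fault attack of the kind attributed to Boneh, DeMillo, and Lipton can be illustrated on RSA signatures computed with the Chinese Remainder Theorem. The sketch below uses toy parameters (p = 61, q = 53) and simulates the fault in software by corrupting one half of the computation; the function name and the way the fault is injected are assumptions for illustration, and real keys are vastly larger. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
import math

# Toy RSA key (far too small for real use).
p, q = 61, 53
n = p * q            # 3233
e, d = 17, 2753      # e * d = 1 mod (p - 1)(q - 1)


def sign_crt(message, inject_fault=False):
    """RSA signature via CRT; optionally corrupt the mod-q half."""
    sp = pow(message, d % (p - 1), p)
    sq = pow(message, d % (q - 1), q)
    if inject_fault:
        sq = (sq + 1) % q            # simulated transient hardware fault
    q_inv = pow(q, -1, p)            # modular inverse of q mod p
    h = (q_inv * (sp - sq)) % p
    return sq + q * h                # Garner recombination


message = 65
good = sign_crt(message)
bad = sign_crt(message, inject_fault=True)

# Both signatures agree mod p but differ mod q, so their difference
# shares exactly the factor p with the public modulus n.
leaked_factor = math.gcd(good - bad, n)
print("signature verifies:", pow(good, e, n) == message)
print("factor recovered from one faulty signature:", leaked_factor)   # 61
```

One correct and one faulty signature on the same message suffice to factor the modulus here, which is why implementations commonly verify a signature before releasing it.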
Risks in cryptographic key management. A 1996 National Research Council study report [14] Cryptography's Role In Securing the Information Society (a.k.a. the CRISIS report) presents a comprehensive review of U.S. cryptographic policy and an analysis of the risks associated with bad crypto and good crypto. A subsequent report authored by 11 cryptographers and computer scientists (Hal Abelson, Ross Anderson, Steve Bellovin, Josh Benaloh, Matt Blaze, Whit Diffie, John Gilmore, Peter Neumann, Ron Rivest, Jeff Schiller, and Bruce Schneier), The Risks of Key Recovery, Key Escrow and Trusted Third-Party Encryption [], is also an important document.
RSA's RC5-56 challenge cracked by Bovine Cooperative (David McNett, R 19 43)
"It is a great privilege and we are excited to announce that at 13:25 GMT on 19-Oct-1997, we found the correct solution for RSA Labs' RC5-32/12/7 56-bit secret-key challenge. Confirmed by RSA Labs, the key 0x532B744CC20999 presented us with the plaintext message for which we have been searching these past 250 days.
The unknown message is: It's time to move to a longer key length
In undeniably the largest distributed-computing effort ever, the Bovine RC5 Cooperative (http://www.distributed.net/), under the leadership of distributed.net, managed to evaluate 47% of the keyspace, or 34 quadrillion keys, before finding the winning key. At the close of this contest our 4000 active teams were processing over 7 billion keys each second at an aggregate computing power equivalent to more than 26 thousand Pentium 200s or over 11 thousand PowerPC 604e/200s. Over the course of the project, we received block submissions from over 500,000 unique IP addresses. [...] Adam L. Beberg - Client design and overall visionary; Jeff Lawson - keymaster/server network design and morale booster; David McNett - stats development and general busybody"
Commerce Secretary calls U.S. encryption policy a failure (Edupage)
Distancing the Commerce Department from the position held by the Federal Bureau of Investigation, Commerce Secretary William M. Daley says that the Clinton Administration's controls on encryption technology are hurting America's ability to compete with other countries. "There are solutions out there. Solutions that would meet some of law enforcement's needs without compromising the concerns of the privacy and business communities. But I fear our search has thus far been more symbolic than sincere... The cost of our failure will be high. The ultimate result will be foreign dominance of the market. This means a loss of jobs here, and products that do not meet either our law enforcement or national security needs." (The New York Times, 16 Apr 1998; Edupage, 16 April 1998)
Ron Rivest's nonencryptive Chaffing and Winnowing (Mich Kabay)
Ronald Rivest has posted an interesting new model for maintaining
confidentiality without using encryption:
Ronald L. Rivest, Chaffing and Winnowing: Confidentiality without Encryption,
MIT Lab for Computer Science, 22 Mar 1998.
See <http://theory.lcs.mit.edu/~rivest/chaffing.txt> for full details.
The method has the following key points: Sender and receiver desiring confidential communications agree on a basis for computing message authentication codes (MACs). Sender breaks message up into packets and authenticates each packet using the agreed-upon MAC algorithm. Sender introduces plausible "chaff" text, compar