Safety Messages

Lessons From Challenger

 

This Jan. 28 marks the 35th anniversary of the Challenger accident. The loss of the crew was a tragedy felt by their families, friends and coworkers at the agency, as well as people throughout the world.

The Challenger accident taught us tough lessons and brought forward what have become recognizable phrases: normalization of deviance, organizational silence and silent safety program. Sadly, we learned these lessons again in 2003 with the loss of Columbia and her crew. This shows how vital it is that we pause to revisit these lessons and never let them be forgotten. We cannot become complacent. 

In this month's Safety Message, Harmony Myers, director of the NASA Safety Center, discusses the Challenger accident and the lessons it continues to teach us today.

Reminders to Keep You Safe

Welcome to the Office of Safety and Mission Assurance Safety Message archive. This page contains Safety Message presentations and related media. While some of these presentations are not NASA-related, all of them contain aspects that are applicable to NASA. I encourage you to disseminate these to your organizations to promote discussion of these issues and possible solutions.

—W. Russ DeLoach, Chief, Safety and Mission Assurance

Lost in Translation

The Mars Climate Orbiter Mishap

August 01, 2009

More than any other mishap we have studied recently, the loss of the Mars Climate Orbiter highlights the need for comprehensive verification and validation. The Mars Climate Orbiter team did not ensure that the software matched its requirements. Because of this oversight, the team used software that reported trajectory data in English rather than metric units, a discrepancy that rigorous verification and validation should have caught. This problem was compounded by miscommunication, invalid assumptions and rushed decisions. On its journey to Mars, the spacecraft drifted away from the flight path its navigators were following. When the Mars Climate Orbiter reached its destination, it entered the Martian atmosphere well below its intended altitude and disappeared. As we review the Mars Climate Orbiter this month, consider the progress we have made since this mission was lost in 1999, but also look for parallel situations in the programs and projects you are working on today.
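The core failure here was a unit mismatch at a software interface. As a thought exercise only (this is not the actual Mars Climate Orbiter ground software, and the names and unit handling below are a hypothetical illustration), the short Python sketch that follows shows how tagging values with explicit units at an interface boundary turns an English/metric mismatch into a loud, testable failure rather than a silent navigation error.

# Hypothetical sketch: explicit unit tagging at a software interface.
# Not the actual MCO ground software; names and structure are invented.
from dataclasses import dataclass

LBF_S_TO_N_S = 4.448222  # one pound-force second expressed in newton seconds

@dataclass(frozen=True)
class Impulse:
    value: float
    unit: str  # "N*s" (metric) or "lbf*s" (English)

    def to_newton_seconds(self) -> float:
        """Normalize to SI before any downstream navigation use."""
        if self.unit == "N*s":
            return self.value
        if self.unit == "lbf*s":
            return self.value * LBF_S_TO_N_S
        raise ValueError(f"Unknown impulse unit: {self.unit}")

def ingest_thruster_report(value: float, unit: str) -> float:
    """Interface check: convert explicitly, reject anything unexpected."""
    return Impulse(value, unit).to_newton_seconds()

if __name__ == "__main__":
    print(ingest_thruster_report(120.0, "N*s"))    # metric passes through unchanged
    print(ingest_thruster_report(120.0, "lbf*s"))  # English is converted, not assumed metric
    try:
        ingest_thruster_report(120.0, "unlabeled")
    except ValueError as err:
        print("rejected at the interface:", err)   # unlabeled data never reaches navigation

The mechanism matters less than the practice: when a requirement about units becomes an executable check at every interface, verification and validation has something concrete to exercise.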

Triple Threat

Honeywell Chemical Releases

June 01, 2009

They say bad luck comes in threes. During the summer of 2003, that adage appeared to be true for one chemical plant in Baton Rouge, Louisiana. The plant, which makes refrigerant, inadvertently released three hazardous chemicals in one month. The incidents injured eight people, killed one person and exposed the surrounding community to chlorine gas. Despite the adage, more than bad luck produced this difficult month at the plant; a web of organizational issues contributed to the incidents. Although we do not manufacture refrigerants at NASA, we look to this story this month for findings that apply to any pressure system containing hazardous liquids or gases. These incidents remind us that one unforeseen event can cascade into multiple unintended consequences. A key to combating this situation is thorough planning that examines process steps in the context of system knowledge. Carefully conducted hazard analyses, training for non-routine situations and respect for written operating procedures are all lessons we can take from these chemical release incidents.

Shuttle Software Anomaly

STS-126

May 01, 2009

This month we're looking at a recent close call that you're probably not familiar with unless you're part of the shuttle program: a software anomaly that surfaced on Endeavour's mission last November (STS-126, 11/2008). The anomaly did not endanger the mission or the astronauts aboard, but it caught my attention because the software assurance process for the shuttle is so rigorous that we almost never experience software problems in flight. The root of the problem we're looking at lies in the evolution of software development conventions and practices. The way software programmers code has evolved over the last twenty years, and a recent software change caused old code that depended on old conventions to fail. This incident points to the dangers of using heritage resources and highlights the key activities that must accompany any modification to heritage hardware or software. Thorough verification and validation, well-developed processes backed by careful training and obsessive anomaly investigation will help us successfully continue to use the resources we have developed over the last fifty years.
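To make that lesson concrete, here is a purely illustrative Python sketch (it is not the shuttle flight software, and the record layouts and field names are invented). It shows heritage code that silently depends on an old data-layout convention, plus a known-answer regression test that flags a seemingly harmless change to that convention before it can cause trouble in flight.

# Hypothetical illustration: heritage code relying on an old layout convention.
# Not the actual STS-126 software; formats and values are invented.
import struct

# Legacy convention: two 16-bit counters followed by a 32-bit timestamp,
# packed big-endian with no padding.
LEGACY_FORMAT = ">HHI"

def parse_legacy_record(raw: bytes) -> tuple[int, int, int]:
    """Heritage parser: silently assumes the old layout."""
    return struct.unpack(LEGACY_FORMAT, raw)

# A later "modernization" reorders the fields (timestamp first). The record is
# still 8 bytes, so the heritage parser runs without error; it just returns
# garbage instead of failing loudly.
NEW_FORMAT = ">IHH"

def regression_check() -> None:
    """Known-answer test that locks in the legacy behavior."""
    expected = (1, 2, 1_000_000)                       # counters, then timestamp
    record = struct.pack(NEW_FORMAT, 1_000_000, 1, 2)  # produced under the new convention
    parsed = parse_legacy_record(record)
    assert parsed == expected, f"layout convention changed: expected {expected}, got {parsed}"

if __name__ == "__main__":
    regression_check()  # raises AssertionError, surfacing the silent mismatch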

Red Light

Train Collision at Ladbroke Grove

April 01, 2009

With the development of machines and automation to manage nearly everything in our lives, reliance on human initiative and decision-making is quickly becoming a thing of the past. However, as a result of this change, there is an ever-growing need for humans to interface effectively with machines. This month's mishap addresses the importance of prudent consideration in the design of the human-machine interface. During morning rush hour in London on Oct. 5, 1999, a commuter train passed a red signal into the path of an oncoming high-speed train at Ladbroke Grove Junction, killing 31 people and injuring many others. The mishap investigation pointed to several problems related to how the driver and signalers interfaced with the equipment and displays around them. At NASA, when we rely on human action, we must be careful to design for human capabilities and limitations. We must design systems that consider human expectations and logic. To ensure success, we must supplement these designs with effective training and sufficient experience to increase the likelihood that the proper actions are taken.

Cover Blown

WIRE Spacecraft Mishap

February 01, 2009

Times of transition often carry additional risk. Spacecraft launch, mission phase transition, system startup: these can be tense moments for NASA projects. This month's mishap illustrates the importance of considering every sequence of mission activities during the design and review process. Just moments after NASA's Wide-Field Infrared Explorer (WIRE) powered on to begin its infrared survey of the sky, a transient signal from one of its components compromised the mission. The mishap investigation concluded that the team could have anticipated the signal if the review had thoroughly considered the start-up characteristics of WIRE's components. Instead, the design did not account for the components' variable start-up times or their dependence on how long the components had been powered off. The problem was compounded by a low-fidelity test setup that led the team to dismiss anomalies during start-up. Testing and design focused on the mission objective but neglected WIRE's crucial transition from powered off to fully operational. At NASA, we need to focus closely on these moments of transition.

The Unknown Known

USAF B-2 Mishap

January 01, 2009

In NASA's climate of daring enterprise and unparalleled innovation, our efforts can be foiled by the challenge of simple communication. In this month's case study, the Department of Defense lost a $1.4 billion aircraft because one maintenance technician working on the aircraft was not aware of a workaround developed in the field. The technique had been communicated only informally among local personnel and was never incorporated into standard procedures. The Air Force investigation concluded that if personnel had had a better understanding of how critical specific systems were to the overall performance of the aircraft, they would have insisted on formally communicating the technique. The only people who truly understood the system interfaces were former B-2 engineers who had designed the aircraft ten years prior to the mishap. The operating organization lacked profound systems knowledge when a new environment required it. At NASA, we must continue to improve our strategies for capturing and transferring knowledge from personnel before they retire or move on to a new project. We must aggressively document changes and workarounds developed in the field. Lastly, we must strive to develop a broader understanding of the systems and programs we work with so we can recognize and share critical information. Even brilliant engineering and design will fail without effective communication in the face of change.

Under Pressure

Sonat Explosion

December 01, 2008

In any team with years of experience on long-term projects, complacency can slowly undermine critical task execution. This month's case study illustrates how group acceptance of a system that lacked design documents precluded hazard identification and elimination. Further, informal operational processes transformed hidden design flaws into deadly high-energy conditions. A team comfortable with work processes that rely on tribal knowledge and verbal instruction will foster errors of omission and commission. Our objective is to encourage use of every opportunity to identify risks, hazards and other elements that could impact safety and quality. We need to develop designs and processes that meet specifications and mission requirements without compromising safety. Hazards must be relentlessly evaluated and measured not only for their likelihood, but also for their impact on the mission, should exposure to that hazard occur. We must hold each other accountable for effective and continuous communication through our documented processes. The risk of failure increases when we grow comfortable that a mission is routine and conclude that expensive controls or compliance can be stripped away because "we have done this before." Rigor in following formal processes, including reviews, audits and evaluations that identify hazards, is a proven individual and team behavior that increases a system's margin of safety.

The Million Mile Rescue

SOHO

November 01, 2008

The Solar and Heliospheric Observatory (SOHO) spacecraft completed its primary mission in 1997 and was such a success that its mission has been extended multiple times, currently through 2009. But during its first extension in 1998, we almost lost SOHO due to errors in code modifications that were intended to prolong the lifetime of its attitude control gyroscopes. The command sequence to deactivate a gyroscope did not contain the code to reactivate it. Due to a prioritization of science tasking, an aggressive schedule and limited staffing, these code modifications were not thoroughly tested before implementation. When SOHO started experiencing complications, standard operating procedures were circumvented in order to return SOHO to operational status as quickly as possible. The results were the failure to detect that one of the gyros was inactive, the progressive destabilization of the spacecraft's attitude and the complete loss of communications with SOHO. It took us three months of labor-intensive collaboration with the European Space Agency to miraculously recover SOHO. This month's case study shows us how ignored review processes and circumvented operating procedures can severely jeopardize mission success.
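One contributing error, a deactivation command with no matching reactivation, is exactly the kind of defect a simple pre-uplink consistency check can surface. The sketch below is hypothetical (the command mnemonics are invented, and this is not the actual SOHO ground system); it walks a planned command sequence and reports any resource left disabled at the end.

# Hypothetical sketch: pre-uplink check for unmatched disable commands.
# Not the actual SOHO ground software; mnemonics are invented.

DISABLE_CMDS = {"GYRO_A_OFF": "GYRO_A", "GYRO_B_OFF": "GYRO_B"}
ENABLE_CMDS = {"GYRO_A_ON": "GYRO_A", "GYRO_B_ON": "GYRO_B"}

def unmatched_disables(sequence: list[str]) -> set[str]:
    """Return the resources still disabled after the full sequence runs."""
    disabled: set[str] = set()
    for cmd in sequence:
        if cmd in DISABLE_CMDS:
            disabled.add(DISABLE_CMDS[cmd])
        elif cmd in ENABLE_CMDS:
            disabled.discard(ENABLE_CMDS[cmd])
    return disabled

if __name__ == "__main__":
    # A sequence that spins gyro A down for a calibration but never spins it back up.
    plan = ["GYRO_A_OFF", "RUN_CALIBRATION", "GYRO_B_OFF", "GYRO_B_ON"]
    leftover = unmatched_disables(plan)
    if leftover:
        print("Review before uplink; resources left disabled:", sorted(leftover))

A check like this does not replace peer review or operating procedures, but it gives reviewers one more chance to notice the omission before the sequence reaches the spacecraft.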

That Sinking Feeling

Loss of Petrobras P-36

October 01, 2008

Budget cuts and downsizing are a reality in every industry, but it is critical to maintain the integrity of safe operations through these times, especially in human spaceflight. While cost-cutting can drive many innovative solutions, a blind focus on financial performance can compromise safety considerations, forfeit thorough testing and result in poor decisions with catastrophic outcomes. Petrobras management had openly taken a clear stance to eliminate many standard engineering practices, redefine safety requirements and reduce inspections, all with the goal of improving financial performance. As part of an effort to save cost and space on its Platform 36, the Emergency Drain Tanks (EDTs) were placed inside the two aft support columns adjacent to the seawater service pipes. When one of the EDTs burst from over-pressurization, it set off a violent chain of events, including the rupture of the seawater pipe, massive flooding of the column and an explosion on the upper levels, ultimately resulting in the total loss of the world's largest offshore oil production platform and the lives of 11 crew members. The investigation found that no design testing or analysis had ever been performed on the EDT configuration. At NASA, we must apply a consistently high level of rigor to all of our testing plans and treat even common reconfigurations the same as we would a new design.

Fender Bender

DART’s Automated Collision

September 01, 2008

In April 2005, multiple errors in the navigation software code, overlooked during rushed testing phases, caused the Demonstration of Autonomous Rendezvous Technology (DART) spacecraft to crash into the target satellite with which it was attempting to rendezvous. The spacecraft was operating entirely on pre-programmed software with no real-time human intervention. The same navigational errors resulted in the premature expenditure of fuel and an early end to the mission, without the $110 million program having completed any of its close-range technical objectives. The DART team did not adequately validate flight-critical software requirements, including late changes to the code that proved critical in this mishap. The program used an inappropriate heritage software platform and a team that lacked the experience needed to operate with so little oversight. As NASA continues to push the envelope with new, cutting-edge autonomous technologies, we must keep in mind the basic principles that make any technology program successful. Validation, verification and peer review cannot be sacrificed for schedule, and we must fully utilize our past experiences and expertise on current and new projects. We must be careful to ensure that we are not simply automating failure.
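One way to keep validation from being quietly traded away for schedule is to make requirements-to-test traceability something a machine checks on every build. The Python sketch below is hypothetical (it is not DART's actual process; the requirement IDs and build numbers are invented). It flags flight-critical requirements that were never validated, or whose implementing code changed after their last validation run, which is exactly the gap that late, unvalidated changes create.

# Hypothetical sketch: requirements-to-test traceability check.
# Not DART's actual process; identifiers and build numbers are invented.
from dataclasses import dataclass

@dataclass
class Requirement:
    req_id: str
    flight_critical: bool
    last_code_change: int            # build number of the last change to implementing code
    last_validation_run: int | None  # build number of the last passing validation, or None

def validation_gaps(requirements: list[Requirement]) -> list[str]:
    """Return findings a review board should see before accepting the build."""
    findings = []
    for req in requirements:
        if not req.flight_critical:
            continue
        if req.last_validation_run is None:
            findings.append(f"{req.req_id}: flight-critical but never validated")
        elif req.last_code_change > req.last_validation_run:
            findings.append(
                f"{req.req_id}: code changed at build {req.last_code_change}, "
                f"last validated at build {req.last_validation_run}"
            )
    return findings

if __name__ == "__main__":
    reqs = [
        Requirement("NAV-001", True, last_code_change=42, last_validation_run=42),
        Requirement("NAV-017", True, last_code_change=45, last_validation_run=40),
        Requirement("GNC-003", True, last_code_change=30, last_validation_run=None),
    ]
    for finding in validation_gaps(reqs):
        print(finding)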
