Alarm management is the application of human factors (or ‘ergonomics’), instrumentation engineering and systems thinking to the design of an alarm system in order to increase its usability. Most often the major usability problem is that too many alarms are annunciated during a plant upset, commonly referred to as an alarm flood (similar to an interrupt storm), by analogy with a flood caused by excessive rainfall meeting an essentially fixed drainage capacity.
However, there can also be other problems with an alarm system such as poorly designed alarms, improperly set alarm points, ineffective annunciation, unclear alarm messages, etc. Poor alarm management is one of the leading causes of unplanned downtime, contributing to over $20B in lost production every year, and of major industrial incidents such as the one in Texas City. Developing good alarm management practices is not a discrete activity, but more of a continuous process (i.e., it is more of a journey than a destination).
Alarm problem history
From their conception, large chemical, refining, power generation, and other processing plants required the use of a control system to keep the process operating successfully and producing products. Due to the fragility of the components as compared to the process, these control systems often required a control room to protect them from the elements and process conditions. In the early days of control rooms, they used what were referred to as “panel boards” which were loaded with control instruments and indicators. These were tied to sensors located in the process streams and on the outside of process equipment. The sensors relayed their information to the control instruments via analogue signals, such as a 4-20 mA current loop in the form of twisted pair wiring. At first these systems merely yielded information, and a well-trained operator was required to make adjustments either by changing flow rates, or altering energy inputs to keep the process within its designed limits.
Alarms were added to alert the operator to a condition that was about to exceed a design limit, or had already exceeded one. Additionally, Emergency Shut Down (ESD) systems were employed to halt a process that was in danger of exceeding safety, environmental or monetarily acceptable process limits. Alarms were indicated to the operator by annunciator horns and lights of different colours. (For instance, green lights meant OK, yellow meant not OK, and red meant BAD.) Panel boards were usually laid out in a manner that replicated the process flow in the plant, so instrumentation for operating units within the plant was grouped together for ease of recognition and problem solving. It was a simple matter to look at the entire panel board and discern whether any section of the plant was running poorly. This was due to both the design of the instruments and the implementation of the alarms associated with them. Instrumentation companies put a lot of effort into the design and individual layout of the instruments they manufactured, employing behavioural psychology practices that revealed how much information a human being could collect in a quick glance. More complex plants had more complex panel boards, and therefore often more human operators or controllers.
Thus, in the early days of panel board systems, alarms were regulated by both size and cost. In essence, they were limited by the amount of available board space, and the cost of running wiring, and hooking up an annunciator (horn), indicator (light) and switches to flip to acknowledge, and clear a resolved alarm. It was often the case that if a new alarm was needed, an old one had to be given up.
As technology developed, control systems and control methods were expected to deliver a higher degree of plant automation with each passing year. Highly complex material processing called for highly complex control methodologies. Global competition also pushed manufacturing operations to increase production while using less energy and producing less waste. In the days of the panel boards, a special kind of engineer was required, one who understood the electronic equipment associated with process measurement and control, the control algorithms necessary to control the process (PID basics), and the actual process used to make the products. Around the mid-1980s, the digital revolution arrived. Distributed control systems (DCS) were a boon to the industry. The engineer could now control the process without having to understand the equipment necessary to perform the control functions. Panel boards were no longer required, because all of the information that once came across analogue instruments could be digitised, stuffed into a computer and manipulated to achieve the same control actions once performed with amplifiers and potentiometers.
As a side effect, alarms also became easy and cheap to configure and deploy. You simply typed in a location and a value to alarm on, and set the alarm to active. The unintended result was that soon people alarmed everything. Initial installers set an alarm at 80% and 20% of the operating range of any variable just as a habit. The integration of programmable logic controllers, safety instrumented systems, and packaged equipment controllers has been accompanied by an overwhelming increase in associated alarms. One other unfortunate part of the digital revolution was that what once covered several square yards of panel space now had to fit into a 17-inch computer monitor. Multiple pages of information were thus employed to replicate the information on the replaced panel board. Alarms were used to tell an operator to go and look at a page he was not viewing. Alarms were used to tell an operator that a tank was filling. Every mistake made in operations usually resulted in a new alarm. With the implementation of the OSHA 1910 regulations, HAZOP studies usually requested several new alarms. Alarms were everywhere. Incidents began to accrue as too much data collided with too little useful information.
Alarm management history
Recognizing that alarms were becoming a problem, industrial control system users banded together and formed the Alarm Management Task Force, which was a customer advisory board led by Honeywell in 1990. The AMTF included participants from chemical, petrochemical, and refining operations. They gathered and wrote a document on the issues associated with alarm management. This group quickly realised that alarm problems were simply a subset of a larger problem, and formed the Abnormal Situation Management Consortium (ASM is a registered trademark of Honeywell). The ASM Consortium developed a research proposal and was granted funding from the National Institute of Standards and Technology (NIST) in 1994. The focus of this work was addressing the complex human-system interaction and factors that influence successful performance for process operators. Automation solutions have often been developed without consideration of the human that needs to interact with the solution. In particular, alarms are intended to improve situation awareness for the control room operator, but a poorly configured alarm system does not achieve this goal.
The ASM Consortium has produced documents on best practices in alarm management, as well as operator situation awareness, operator effectiveness, and other operator-oriented issues. These documents were originally for ASM Consortium members only, but the ASMC has recently offered these documents publicly.
The ASM Consortium also participated in the development of an alarm management guideline published by the Engineering Equipment & Materials Users’ Association (EEMUA) in the UK. The ASM Consortium provided data from its member companies and contributed to the editing of the guideline. The result is EEMUA 191, “Alarm Systems – A Guide to Design, Management and Procurement”.
Several institutions and societies are producing standards on alarm management to assist their members in the best-practice use of alarms in industrial manufacturing systems. Among them are the ISA (ISA 18.2), API (API 1167) and NAMUR (NAMUR NA 102). Several companies also offer software packages to assist users in dealing with alarm management issues. Among them are DCS manufacturing companies and third-party vendors who offer add-on systems.
The fundamental purpose of alarm annunciation is to alert the operator to deviations from normal operating conditions, i.e. abnormal operating situations. The ultimate objective is to prevent, or at least minimise, physical and economic loss through operator intervention in response to the condition that was alarmed. For most digital control system users, losses can result from situations that threaten environmental safety, personnel safety, equipment integrity, economy of operation, and product quality control as well as plant throughput. A key factor in operator response effectiveness is the speed and accuracy with which the operator can identify the alarms that require immediate action.
By default, the assignment of alarm trip points and alarm priorities constitutes basic alarm management. Each individual alarm is designed to provide an alert when that process indication deviates from normal. The main problem with basic alarm management is that these features are static. The resultant alarm annunciation does not respond to changes in the mode of operation or the operating conditions.
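A minimal sketch of such a static alarm is given below. The tag name, the `StaticAlarm` class and the `default_alarm` helper are hypothetical; the 20%/80% defaults mirror the installer habit described earlier and are not a recommendation. The point is only that the trip points and priority are fixed at configuration time and know nothing about the plant's operating mode.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StaticAlarm:
    tag: str           # hypothetical instrument tag
    low_trip: float    # fixed low trip point
    high_trip: float   # fixed high trip point
    priority: int      # fixed priority, 1 = highest (assumed convention)

    def evaluate(self, value: float) -> Optional[str]:
        """Return "LOW", "HIGH" or None, looking only at the current value."""
        if value <= self.low_trip:
            return "LOW"
        if value >= self.high_trip:
            return "HIGH"
        return None


def default_alarm(tag: str, range_min: float, range_max: float,
                  priority: int = 3) -> StaticAlarm:
    """Configure trip points at 20% and 80% of the operating range."""
    span = range_max - range_min
    return StaticAlarm(tag, range_min + 0.2 * span, range_min + 0.8 * span, priority)


flow = default_alarm("FI-101", 0.0, 200.0)
print(flow.evaluate(100.0))  # None
print(flow.evaluate(5.0))    # LOW, whether or not the pump was stopped on purpose
```

Because `evaluate` receives nothing but the measured value, the same alarm fires during normal running, start-up and a deliberate shutdown alike.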
When a major piece of process equipment like a charge pump, compressor, or fired heater shuts down, many alarms become unnecessary. These alarms are no longer independent exceptions from normal operation. They indicate, in that situation, secondary, non-critical effects and no longer provide the operator with important information. Similarly, during start-up or shutdown of a process unit, many alarms are not meaningful. This is often the case because the static alarm conditions conflict with the required operating criteria for start-up and shutdown.
In all cases of major equipment failure, start-ups, and shutdowns, the operator must search alarm annunciation displays and analyse which alarms are significant. This wastes valuable time when the operator needs to make important operating decisions and take swift action. If the resultant flood of alarms becomes too great for the operator to comprehend, then the basic alarm management system has failed as a system that allows the operator to respond quickly and accurately to the alarms that require immediate action. In such cases, the operator has virtually no chance to minimise, let alone prevent, a significant loss.
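Whether an alarm flood is occurring can be estimated from alarm timestamps alone. The sketch below counts alarms in a trailing time window; the window of 10 minutes and the threshold of 10 alarms are assumptions based on a commonly quoted rule of thumb, and the function name is hypothetical. A site would substitute the figures from its own alarm philosophy.

```python
from collections import deque


def flood_times(timestamps, window_s=600, threshold=10):
    """Return the alarm times at which the trailing window holds more than
    `threshold` alarms. Timestamps are in seconds."""
    window = deque()
    flooded = []
    for t in sorted(timestamps):
        window.append(t)
        # Drop alarms that have fallen out of the trailing window.
        while window[0] <= t - window_s:
            window.popleft()
        if len(window) > threshold:
            flooded.append(t)
    return flooded


upset = [i * 30 for i in range(12)]    # one alarm every 30 seconds
steady = [i * 120 for i in range(20)]  # one alarm every 2 minutes
print(flood_times(upset))   # [300, 330]
print(flood_times(steady))  # []
```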
In short, one needs to extend the objectives of alarm management beyond the basic level. It is not sufficient to utilise multiple priority levels because priority itself is often dynamic. Likewise, disabling alarms based on unit association or suppressing audible annunciation based on priority does not provide dynamic, selective alarm annunciation. The solution must be an alarm management system that can dynamically filter the process alarms based on the current plant operation and conditions so that only the currently significant alarms are annunciated.
The fundamental purpose of dynamic alarm annunciation is to alert the operator to relevant abnormal operating situations. These include situations that have a necessary or possible operator response to ensure:
- Personnel and Environmental Safety,
- Equipment Integrity,
- Product Quality Control.
The ultimate objectives are no different from the previous basic alarm annunciation management objectives. Dynamic alarm annunciation management focuses the operator’s attention by eliminating extraneous alarms, providing better recognition of critical problems, and ensuring swifter, more accurate operator response.
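One simple way to picture dynamic filtering is to record, for each alarm, the plant states in which it is meaningful, and to annunciate only those active alarms that match the current state. This is an illustrative sketch, not a description of any vendor's product; the state names, tags and the `annunciate` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    tag: str
    message: str
    relevant_states: frozenset  # plant states in which the alarm is meaningful


def annunciate(active_alarms, plant_state):
    """Keep only the active alarms that are significant in the current state."""
    return [a for a in active_alarms if plant_state in a.relevant_states]


active = [
    Alarm("FI-101", "Feed flow low", frozenset({"RUNNING"})),
    Alarm("TI-205", "Heater outlet temperature high",
          frozenset({"RUNNING", "STARTUP", "SHUTDOWN"})),
]

print([a.tag for a in annunciate(active, "RUNNING")])   # ['FI-101', 'TI-205']
print([a.tag for a in annunciate(active, "SHUTDOWN")])  # ['TI-205']
```

The low feed flow alarm is expected during a shutdown and is filtered out, while the temperature alarm remains significant in every state and is always shown.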
The need for alarm management
Alarm management is usually necessary in a process manufacturing environment that is controlled by an operator using a supervisory control system, such as a DCS, a SCADA system or a programmable logic controller (PLC). Such a system may have hundreds of individual alarms that, until very recently, have probably been designed with only limited consideration of other alarms in the system. Since humans can only do one thing at a time and can pay attention to a limited number of things at a time, there needs to be a way to ensure that alarms are presented at a rate that can be assimilated by a human operator, particularly when the plant is upset or in an unusual condition. Alarms also need to be capable of directing the operator’s attention to the most important problem that he or she needs to act upon, using a priority to indicate degree of importance or rank, for instance. Ensuring continuous production, seamless service and consistent quality at any time of day or night also requires an organisation in which several teams of people handle, one after the other, the events that occur.
This is commonly called on-call management. On-call management relies on a team of one or more persons (site manager, maintenance staff) or on an external organisation (guards, telesurveillance centre). To avoid dedicating a full-time person to monitoring a single process or level, information and events must be transmitted to the on-call staff. This transmission allows on-call staff to be more mobile and more efficient, and to perform other tasks at the same time.
Some improvement methods
The techniques for achieving rate reduction range from the extremely simple ones of reducing nuisance and low value alarms to redesigning the alarm system in a holistic way that considers the relationships among individual alarms.
A first step is to document the methodology or philosophy of how to design alarms. This can include what to alarm, standards for alarm annunciation and text messages, and how the operator will interact with the alarms.
Rationalization and Documentation
This phase is a detailed review of all alarms to document their design purpose, and to ensure that they are selected and set properly and meet the design criteria. Ideally this stage will result in a reduction of alarms, but doesn’t always.
The above steps will often still fail to prevent an alarm flood in an operational upset, so advanced methods such as alarm suppression under certain circumstances are then necessary. As an example, shutting down a pump will always cause a low flow alarm on the pump outlet flow. The low flow alarm may therefore be suppressed when the pump has been shut down, since it adds no value for an operator who already knows the cause. This technique can of course get very complicated and requires considerable care in design. In the above case, for instance, it can be argued that the low flow alarm does add value, as it confirms to the operator that the pump has indeed stopped. Process boundaries (Boundary Management) must also be taken into account.
Alarm management becomes more and more necessary as the complexity and size of manufacturing systems increase. Much of the need for alarm management also arises because alarms can be configured on a DCS at nearly zero incremental cost. In the past, on physical control panel systems that consisted of individual pneumatic or electronic analogue instruments, each alarm required expenditure and control panel area, so more thought usually went into the need for an alarm. Numerous disasters such as Three Mile Island, the Chernobyl accident and Deepwater Horizon have established a clear need for alarm management.
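The pump example can be written as a small decision rule. The sketch below is only illustrative: the function name, the returned action strings and the `keep_as_confirmation` flag are hypothetical. In particular, the "journal" action (record the event without sounding the horn) is an assumed compromise between the two views above, not something prescribed by the text.

```python
def low_flow_action(pump_running, flow, low_limit, keep_as_confirmation=False):
    """Decide what to do with the pump outlet low flow alarm.

    Returns "none", "annunciate", "suppress" or "journal".
    """
    if flow >= low_limit:
        return "none"        # flow is normal, nothing to report
    if pump_running:
        return "annunciate"  # unexpected low flow: the operator must act
    # The pump is stopped, so low flow is an expected consequence.
    return "journal" if keep_as_confirmation else "suppress"


print(low_flow_action(pump_running=True, flow=2.0, low_limit=10.0))   # annunciate
print(low_flow_action(pump_running=False, flow=0.0, low_limit=10.0))  # suppress
print(low_flow_action(pump_running=False, flow=0.0, low_limit=10.0,
                      keep_as_confirmation=True))                     # journal
```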
The seven steps to alarm management
Step 1: Create and adopt an alarm philosophy
A comprehensive design and guideline document is produced which defines a plant standard employing a best-practice alarm management methodology.
Step 2: Alarm performance benchmarking
Analyze the alarm system to determine its strengths and deficiencies, and effectively map out a practical solution to improve it.
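As a rough illustration of benchmarking, the sketch below computes two simple indicators from an alarm log: the average alarm rate and the share of alarms at each priority. The log format, the function name and the choice of indicators are assumptions made for the example; real benchmarks normally compare many more metrics against targets taken from the site's alarm philosophy.

```python
from collections import Counter


def benchmark(events, period_hours):
    """events: list of (timestamp_s, tag, priority) tuples from an alarm log."""
    total = len(events)
    per_priority = Counter(priority for _, _, priority in events)
    return {
        "alarms_per_hour": total / period_hours,
        "priority_share": {p: n / total for p, n in per_priority.items()},
    }


log = [
    (10, "LI-12", 3), (70, "LI-12", 3), (400, "PI-7", 2),
    (900, "TI-3", 1), (2000, "LI-12", 3), (5000, "FI-9", 3),
]
result = benchmark(log, period_hours=2)
print(result["alarms_per_hour"])  # 3.0
```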
Step 3: “Bad actor” alarm resolution
From experience, it is known that around half of the entire alarm load usually comes from relatively few alarms. The methods for making them work properly are documented, and can be applied with minimum effort and maximum performance improvement.
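Bad actors can be found with a simple frequency count. The sketch below, with a hypothetical function name and made-up tags, ranks alarms by how often they occurred and returns the smallest set of most frequent tags that together account for at least a given share of the load (half, by default, to match the observation above).

```python
from collections import Counter


def bad_actors(tags, share=0.5):
    """Return (tag, count) pairs, most frequent first, that together
    cover at least `share` of all alarm occurrences."""
    counts = Counter(tags)
    total = sum(counts.values())
    result, running = [], 0
    for tag, n in counts.most_common():
        if running >= share * total:
            break
        result.append((tag, n))
        running += n
    return result


occurrences = ["LI-12"] * 40 + ["PI-7"] * 25 + ["TI-3"] * 20 + ["FI-9"] * 15
print(bad_actors(occurrences))  # [('LI-12', 40), ('PI-7', 25)]
```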
Step 4: Alarm documentation and rationalisation (D&R)
A full overhaul of the alarm system to ensure that each alarm complies with the alarm philosophy and the principles of good alarm management.
Step 5: Alarm system audit and enforcement
DCS alarm systems are notoriously easy to change and generally lack proper security. Methods are needed to ensure that the alarm system does not drift from its rationalised state.
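One way to detect drift is to compare the live alarm configuration against the rationalised master record at regular intervals. The sketch below assumes both are available as dictionaries keyed by tag; that representation, the setting names and the `audit` function are hypothetical, and how the live settings are read from a particular DCS is outside its scope.

```python
def audit(master, live):
    """Compare live alarm settings against the rationalised master record.

    Both arguments map tag -> dict of settings.
    Returns a list of (tag, item, expected, actual) findings.
    """
    findings = []
    for tag in sorted(set(master) | set(live)):
        expected, actual = master.get(tag), live.get(tag)
        if expected is None:
            findings.append((tag, "unauthorised alarm", None, actual))
        elif actual is None:
            findings.append((tag, "missing alarm", expected, None))
        else:
            for key in sorted(expected):
                if actual.get(key) != expected[key]:
                    findings.append((tag, key, expected[key], actual.get(key)))
    return findings


master = {"FI-101": {"high": 160.0, "priority": 2},
          "TI-205": {"high": 400.0, "priority": 1}}
live = {"FI-101": {"high": 180.0, "priority": 2},
        "LI-300": {"high": 90.0, "priority": 3}}

for finding in audit(master, live):
    print(finding)
```

Here the audit would report a changed trip point on FI-101, an alarm (LI-300) that was never rationalised, and a rationalised alarm (TI-205) that has disappeared from the live system.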
Step 6: Real-time alarm management
More advanced alarm management techniques are often needed to ensure that the alarm system properly supports, rather than hinders, the operator in all operating scenarios. These include Alarm Shelving, State-Based Alarming, and Alarm Flood Suppression technologies.
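Alarm shelving, for instance, temporarily hides a nuisance alarm while guaranteeing that it comes back. The sketch below is one possible shape for such a mechanism; the `Shelf` class, its method names and the one-hour default cap are assumptions made for illustration. The `now` argument exists only so the behaviour can be shown without waiting.

```python
import time


class Shelf:
    """Time-limited alarm shelving with an upper bound on the duration."""

    def __init__(self, max_duration_s=3600):
        self.max_duration_s = max_duration_s
        self._expiry = {}  # tag -> time at which the alarm is unshelved

    def shelve(self, tag, duration_s, now=None):
        now = time.time() if now is None else now
        # Never shelve for longer than the configured maximum.
        self._expiry[tag] = now + min(duration_s, self.max_duration_s)

    def is_shelved(self, tag, now=None):
        now = time.time() if now is None else now
        expiry = self._expiry.get(tag)
        if expiry is None:
            return False
        if now >= expiry:
            del self._expiry[tag]  # shelving has expired; alarm is live again
            return False
        return True


shelf = Shelf(max_duration_s=600)
shelf.shelve("LI-12", duration_s=7200, now=0)  # request is capped at 600 s
print(shelf.is_shelved("LI-12", now=599))  # True
print(shelf.is_shelved("LI-12", now=600))  # False
```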
Step 7: Control and maintain alarm system performance
Proper management of change and longer term analysis and KPI monitoring are needed, to ensure that the gains that have been achieved from performing the steps above do not dwindle away over time. Otherwise they will; the principle of “entropy” definitely applies to an alarm system.
- SSM InfoTech Solutions Pvt. Ltd. – Alarm Management System
- EPRI (2005) Advanced Control Room Alarm System: Requirements and Implementation Guidance. Palo Alto, CA. EPRI report 1010076.
- EEMUA 191 Alarm Systems – A Guide to Design, Management and Procurement – Edition 3 (2013) ISBN 978-0-85931-192-2
- PAS – The Alarm Management Handbook – Second Edition (2010) ISBN 0-9778969-2-7
- ASM Consortium (2009) – Effective Alarm Management Practices ISBN 978-1-4421-8425-1
- ANSI/ISA–18.2–2009 – Management of Alarm Systems for the Process Industries
- IEC 62682 Management of alarms systems for the process industries
- Ako-Tec AG – Description of a modern Alarm Management System
- Alarm Management and ISA-18 A Journey Not a Destination
- RFC 8632 – A YANG Data Model for Alarm Management