Fault management (FM) is usually mentioned as the first concern in network management. Its main role is to ensure high availability of a network. Hence, it involves procedures to automatically detect faults, notify operators of their occurrence, and isolate their root cause (root cause analysis, RCA).
The diagram below depicts network operations with and without integrated FM. With an automated FM system, we can integrate and monitor multiple technologies from multiple vendors with limited human resources.
FM is the process of locating problems or faults on the network.
It involves the following steps:
- Discover/Detect the faults
- Isolate the faults
- Fix/Notify/Report the faults
The diagram below shows FM functionality as part of the FCAPS model.
The main functions of FM are:
- Event/Alarm Discovery
- Event/Alarm Filtering
- Event/Alarm Correlation (RCA/SIA)
- Alarm Forwarding/Notification
- Alarm Reporting/Analysis
- Third Party Integration
The functions of FM can be broadly divided into three parts:
- Event Collection
- Event/Alarm Processing
- Generating Info/Reports
Event Collection: Connecting to the various network elements and collecting events/alarms from them, suppressing unnecessary events/alarms, and managing the retention of events/alarms.
Event/Alarm Processing: Event/alarm filtering, thresholding, enrichment, correlation and forwarding, plus Root Cause Analysis (RCA) and Service Impact Analysis (SIA).
Generating Info/Reports: Event/alarm reporting and analysis, integration with other OSS systems to generate further information, and information forwarding.
What is a Fault?
A fault is a software or hardware defect in a system that disrupts communication or degrades performance.
What is an Event?
An event is a distinct incident that occurs at a specific point in time. Any happening that has an impact on network performance can be called an event. It can be an informational message, a cleared event, a warning, a sign of trouble or even a critical fault.
All the faults in the system/network are notified as events. Events are the source of information for all the management happenings that take place within the FM system.
Typically an event is associated with the managed object in which it occurs (e.g. ME, PTP, router, switch), together with a specific event type, a specific layer rate and so on. This combination can be called an alarmKey, i.e. all the events associated with the same fault will have the same alarmKey.
Events also have an associated severity. The common severities are Critical, Major, Minor, Warning and Clear.
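The event model described above can be sketched in a few lines of Python. This is only an illustration: the field names, the example values and the numeric ranking of the severities are assumptions, not part of any particular FM product.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Severities named in the text; the numeric ranking is an assumption."""
    CLEAR = 0
    WARNING = 1
    MINOR = 2
    MAJOR = 3
    CRITICAL = 4


@dataclass(frozen=True)
class Event:
    source: str         # managed object name, e.g. a port or a router
    event_type: str     # e.g. "LOS" (loss of signal)
    layer_rate: str     # e.g. "OTU2"
    severity: Severity
    timestamp: float    # seconds since the epoch

    @property
    def alarm_key(self) -> tuple:
        # Severity and time are deliberately left out, so the raise and
        # clear events of one fault map to the same key.
        return (self.source, self.event_type, self.layer_rate)


down = Event("router1/port3", "LOS", "OTU2", Severity.CRITICAL, 100.0)
up = Event("router1/port3", "LOS", "OTU2", Severity.CLEAR, 160.0)
print(down.alarm_key == up.alarm_key)  # True: both belong to the same fault
```

Keeping severity out of the key is the design choice that lets a later clear event find the alarm it belongs to.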
Examples of fault/events include:
- Port status change
- Connectivity loss/Fiber Cut
- Device reset/Equipment failures
- Device becoming unreachable by the EMS
What is an Alarm?
The life cycle of a fault scenario is called an alarm. An alarm is characterized by a sequence of related events (having the same alarmKey), such as port-down and port-up. The last event in the sequence determines the severity and state of the alarm. An alarm that ends with an event that has a severity of cleared is called a cleared alarm.
One ManagedObject can have many different alarms with different alarmKeys.
For example, a port-down event with critical severity results in a Critical alarm. A subsequent port-up event arrives with cleared severity and moves the Critical alarm to a Cleared alarm.
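The alarm life cycle can be sketched as a small table keyed by alarmKey. The class and method names here are hypothetical, and the sketch assumes events arrive in time order.

```python
from dataclasses import dataclass, field


@dataclass
class Alarm:
    alarm_key: tuple
    events: list = field(default_factory=list)  # (timestamp, severity) pairs

    @property
    def severity(self) -> str:
        # The last event in the sequence determines the alarm's severity.
        return self.events[-1][1]

    @property
    def is_cleared(self) -> bool:
        return self.severity == "Clear"


class AlarmTable:
    """Keeps one alarm per alarmKey and appends related events to it."""

    def __init__(self):
        self.alarms = {}

    def on_event(self, alarm_key, timestamp, severity):
        alarm = self.alarms.setdefault(alarm_key, Alarm(alarm_key))
        alarm.events.append((timestamp, severity))
        return alarm


table = AlarmTable()
key = ("router1/port3", "port-status")
table.on_event(key, 100.0, "Critical")        # port-down raises the alarm
alarm = table.on_event(key, 160.0, "Clear")   # port-up clears the same alarm
print(alarm.severity, alarm.is_cleared)       # Clear True
```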
Flapping is a flood of event notifications with toggling severity that relate to the same alarm (having the same alarmKey). Flapping can occur when a fault is unstable and causes repeated event notifications. It can indicate configuration problems or real network problems.
A flapping example is illustrated in the diagram below.
A sequence of events is identified as flapping if:
- All events share the same alarmKey
- The time interval between consecutive events is less than a configured value.
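The two conditions above can be checked with a short function. This is a sketch: it assumes the caller has already grouped the events by alarmKey and sorted them by time, and the `min_events` threshold is an added assumption so that a single down/up pair is not reported as flapping.

```python
def is_flapping(timestamps, max_gap_seconds, min_events=3):
    """Return True if the events of one alarmKey look like flapping.

    timestamps: event times in seconds for a single alarmKey, sorted ascending.
    max_gap_seconds: the configured maximum interval between consecutive events.
    min_events: assumed minimum sequence length before calling it flapping.
    """
    if len(timestamps) < min_events:
        return False
    gaps = (later - earlier for earlier, later in zip(timestamps, timestamps[1:]))
    return all(gap < max_gap_seconds for gap in gaps)


print(is_flapping([0, 5, 9, 14], max_gap_seconds=10))  # True
print(is_flapping([0, 5, 60], max_gap_seconds=10))     # False: 55 s gap
```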
Normally the management systems (EMS/NMS) notify interested parties of events through SNMP, CORBA or TCP mechanisms. Events can also be generated by external systems, for example threshold events.
The event processor listens for event notification messages, parses them to extract more information about each event, and maintains the event information for further processing.
Some of the event properties are:
- Event Source – Associated ManagedObject name.
- Event Functionality Type – Alarm (Fault Event), TCA (Performance Event)
- Event Type – ITU-T X.733 alarm type (e.g. Communications Alarm, Quality of Service Alarm)
- Event description – Indicates event message
- Event Severity – Severity of the event
Event enrichment is the process of populating additional information about the generated event. This process may need to contact third-party systems to get the information. The enriched information can be useful during fault resolution.
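As a rough illustration of enrichment, the sketch below merges inventory data into an event. The `INVENTORY` dictionary stands in for a third-party system, and all of its keys and values are made up for the example.

```python
# Hypothetical inventory data standing in for a third-party system.
INVENTORY = {
    "router1": {"site": "Site-A", "vendor": "VendorX", "owner": "core-team"},
}


def enrich(event, inventory=INVENTORY):
    """Return a copy of the event with inventory fields added."""
    extra = inventory.get(event["source"], {})
    # Original event fields win if a key appears in both dictionaries.
    return {**extra, **event, "enriched": bool(extra)}


event = {"source": "router1", "severity": "Major", "description": "Fan failure"}
print(enrich(event)["site"])  # Site-A
```

An event whose source is unknown to the inventory is passed through unchanged apart from `enriched` being set to `False`, so later stages can tell the lookup failed.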
Event Correlation and Alarms
Event correlation is the process of establishing relationships between network events. Its goals are to:
- Filter out redundant and spurious events.
- Identify the root cause of faults in the network (RCA).
One important aspect of FM is filtering and prioritizing incoming events to identify the serious ones. Based on the event information, the FM system can determine whether an event continues to be processed or is dropped. All unwanted or duplicated events can be dropped at the event collection or event processing stage.
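A minimal filtering sketch follows. The severity threshold, the de-duplication window and the dictionary layout of an event are all assumptions chosen for illustration; real systems usually make these rules configurable.

```python
SEVERITY_RANK = {"Clear": 0, "Warning": 1, "Minor": 2, "Major": 3, "Critical": 4}


def filter_events(events, min_severity="Minor", dedup_window=5.0):
    """Drop low-severity events and repeats of the same event.

    events: dicts with "alarm_key", "severity" and "timestamp", sorted by time.
    """
    last_kept = {}  # (alarm_key, severity) -> timestamp of the last kept event
    kept = []
    for event in events:
        severity = event["severity"]
        # Clear events are always kept so that open alarms can be closed.
        if severity != "Clear" and SEVERITY_RANK[severity] < SEVERITY_RANK[min_severity]:
            continue
        key = (event["alarm_key"], severity)
        previous = last_kept.get(key)
        if previous is not None and event["timestamp"] - previous < dedup_window:
            continue  # duplicate inside the window
        last_kept[key] = event["timestamp"]
        kept.append(event)
    return kept


events = [
    {"alarm_key": "A", "severity": "Critical", "timestamp": 0.0},
    {"alarm_key": "A", "severity": "Critical", "timestamp": 2.0},  # duplicate
    {"alarm_key": "B", "severity": "Warning", "timestamp": 3.0},   # too minor
    {"alarm_key": "A", "severity": "Clear", "timestamp": 10.0},
]
print([e["timestamp"] for e in filter_events(events)])  # [0.0, 10.0]
```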
When an NE on the network is faulty, the management system (EMS/NMS) reports the network events to the FM system. Each fault may trigger multiple events/alarms. Events triggered by the same fault are associated with each other, so the alarm correlation function can analyze them and generate a single alarm for multiple events.
A failure situation on the network usually generates multiple events, because a failure condition on one device may render other devices inaccessible. The events generated indicate that all of the devices are inaccessible.
Network operators use root cause analysis (RCA) to investigate the root cause of events. They can determine which events are root causes and which are results of that root cause (symptom events), which enables them to focus quickly on the events that are causing network problems.
Normally the RCA process uses knowledge of the network topology to establish a point of failure and identify symptom events.
RCA algorithms can be rule-based, predictive or model-based.
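To illustrate the topology-based idea, here is one very simple rule: an unreachable device is a symptom if some device on its path toward the management system is also unreachable, and a root cause otherwise. It is a sketch only. It assumes a tree-shaped topology described by a hypothetical `upstream` map, and the device names are invented.

```python
def classify(upstream, unreachable):
    """Split unreachable devices into root causes and symptoms.

    upstream: maps each device to its next hop toward the management
              system (None for the topmost device); assumed to be a tree.
    unreachable: set of devices currently reported as unreachable.
    Returns (root_causes, symptoms), where symptoms maps device -> root cause.
    """
    root_causes, symptoms = set(), {}
    for device in unreachable:
        cause = None
        hop = upstream.get(device)
        # Walk toward the management system; the last failed device seen
        # on the way up is the topmost one, i.e. the point of failure.
        while hop is not None:
            if hop in unreachable:
                cause = hop
            hop = upstream.get(hop)
        if cause is None:
            root_causes.add(device)
        else:
            symptoms[device] = cause
    return root_causes, symptoms


upstream = {"core": None, "agg1": "core", "agg2": "core",
            "acc1": "agg1", "acc2": "agg1"}
roots, symptoms = classify(upstream, {"agg1", "acc1", "acc2"})
print(roots)  # {'agg1'}
```

Here the failure of `agg1` explains why `acc1` and `acc2` are unreachable, so the operator sees one root cause instead of three independent alarms.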
If a device fails, the immediate questions are "which business services did it impact?" and "what is the cost to my business?". This kind of analysis is called Service Impact Analysis (SIA). SIA uses RCA information to find the impacted services and customers.
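A sketch of SIA under simple assumptions: each service lists the devices it depends on, and a service counts as impacted if any of those devices is a root cause reported by RCA. The `SERVICES` catalogue and its service, customer and device names are invented for the example.

```python
# Hypothetical service catalogue: which devices each service depends on.
SERVICES = {
    "vpn-acme": {"customer": "Acme", "devices": {"agg1", "acc1"}},
    "iptv-metro": {"customer": "MetroTV", "devices": {"agg2"}},
    "voice-acme": {"customer": "Acme", "devices": {"agg1", "acc2"}},
}


def impacted_services(root_causes, services=SERVICES):
    """Names of services that depend on at least one failed device."""
    failed = set(root_causes)
    return sorted(name for name, svc in services.items() if svc["devices"] & failed)


def impacted_customers(root_causes, services=SERVICES):
    """Customers owning at least one impacted service."""
    names = impacted_services(root_causes, services)
    return sorted({services[name]["customer"] for name in names})


print(impacted_services({"agg1"}))   # ['voice-acme', 'vpn-acme']
print(impacted_customers({"agg1"}))  # ['Acme']
```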
The main aim of a network operator is to shorten the fault resolution time, so basic fault detection and reporting features may not be sufficient for efficient fault monitoring. FM systems should integrate with third-party systems such as trouble ticketing systems and performance management systems. These third-party integrations help operators speed up the fault resolution process.
Some images are taken from Mr. Riswan's presentation on SlideShare.