Introduction to Network Fault Management

Introduction

Fault management (FM) is usually mentioned as the first concern in network management. Its main role is to ensure high availability of the network. To that end, it involves procedures to automatically detect a fault, notify operators of its occurrence, and isolate its root cause (root cause analysis, RCA).

The diagram below depicts network operations with and without integrated FM. With an automated FM system, we can integrate and monitor multiple technologies from multiple vendors with limited human resources.

FMIntro

Fault Management

FM is the process of locating problems or faults on the network. 

It involves the following steps:

  1.  Discover/Detect the faults
  2.  Isolate the faults
  3.  Fix/Notify/Report the faults

FMSteps

The diagram below shows FM functionality as part of the FCAPS model.

FMFunctions

The main functions of FM are:

  • Event/Alarm Discovery
  • Event/Alarm Filtering
  • Event/Alarm Correlation (RCA/SIA)
  • Alarm Forwarding/Notification
  • Alarm Reporting/Analysis
  • Third Party Integration

Concepts

The functions of FM can be broadly divided into three parts:

  • Event Collection
  • Event/Alarm Processing
  • Generating Info/Reports

FMProcess

Event Collection: Connecting to the various network elements and collecting their events/alarms, suppressing unnecessary events/alarms, and managing the retention of events/alarms.

Event/Alarm Processing: Event/alarm filtering, thresholding, enrichment, correlation, and forwarding, plus Root Cause Analysis (RCA) and Service Impact Analysis (SIA).

Generating Info/Reports: Event/alarm reporting and analysis, integration with other OSS systems to generate further information, and information forwarding.

Event/Alarm Management

What is a Fault?

A fault is a software or hardware defect in a system that disrupts communication or degrades performance.

What is an Event?

An event is a distinct incident that occurs at a specific point in time. Anything that happens and has an impact on network performance can be called an event. It can be informational in nature, a cleared event, a warning message, a trouble sign, or even a critical fault.

All the faults in the system/network are notified as events. Events are the source of information for all the management happenings that take place within the FM system.

Typically an event is associated with the managed object in which it occurs (e.g. ME, PTP, router, switch), together with a specific event type, a specific layer rate, and so on. This combination can be called an alarmKey, i.e. all the events associated with the same fault have the same alarmKey.

Events also have an associated severity. The common severities are Critical, Major, Minor, Warning, and Clear.
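As a rough illustration, an event and its alarmKey could be modeled in Python as below. The field names, the tuple form of the key, and the sample object name are assumptions made for this sketch, not something a particular FM product prescribes.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRITICAL = "Critical"
    MAJOR = "Major"
    MINOR = "Minor"
    WARNING = "Warning"
    CLEAR = "Clear"


@dataclass(frozen=True)
class Event:
    managed_object: str   # hypothetical naming, e.g. "router-1/port-3"
    event_type: str       # hypothetical type name, e.g. "LinkStatus"
    layer_rate: str       # e.g. "Ethernet"
    severity: Severity
    timestamp: float      # seconds since some reference point

    @property
    def alarm_key(self) -> tuple:
        # Severity and timestamp are left out on purpose: a port-down and
        # the matching port-up must map to the same key.
        return (self.managed_object, self.event_type, self.layer_rate)


down = Event("router-1/port-3", "LinkStatus", "Ethernet", Severity.CRITICAL, 100.0)
up = Event("router-1/port-3", "LinkStatus", "Ethernet", Severity.CLEAR, 160.0)
```

Here `down` and `up` are different events, but they produce the same `alarm_key`, so an FM system can treat them as belonging to one fault.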

 BasicEvent

Examples of faults/events include:

  • Port status change
  • Connectivity loss/Fiber Cut
  • Device reset/Equipment failures
  • Device becoming unreachable by the EMS

What is an Alarm?

The life cycle of a fault scenario is called an alarm. An alarm is characterized by a sequence of related events (having the same alarmKey), such as port-down and port-up. The last event in the sequence determines the severity and state of the alarm. An alarm that ends with an event of severity Clear is called a cleared alarm.

One ManagedObject can have many different alarms with different alarmKeys.

Example:

A port-down event with Critical severity results in a Critical alarm. A later port-up event arrives with Clear severity and moves the Critical alarm to a cleared alarm.
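The port-down/port-up example can be sketched in a few lines of Python. The `Alarm` class and the `apply_event` helper are hypothetical names used only for illustration; the one rule taken from the text is that the last event decides the alarm's severity.

```python
from dataclasses import dataclass, field


@dataclass
class Alarm:
    alarm_key: tuple
    severity: str = "Clear"
    history: list = field(default_factory=list)

    @property
    def is_cleared(self) -> bool:
        return self.severity == "Clear"


def apply_event(alarms: dict, alarm_key: tuple, severity: str) -> Alarm:
    """Fold one event into the alarm table; the last event wins."""
    alarm = alarms.setdefault(alarm_key, Alarm(alarm_key))
    alarm.history.append(severity)
    alarm.severity = severity
    return alarm


alarms = {}
key = ("router-1/port-3", "LinkStatus")

apply_event(alarms, key, "Critical")        # port down
state_after_down = alarms[key].severity

apply_event(alarms, key, "Clear")           # port up
state_after_up = alarms[key].severity
```

Both events share one alarmKey, so the table holds a single alarm whose history records the full life cycle.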

BasicAlarm

Flapping Events

Flapping is a flood of event notifications with toggling severity that all relate to the same alarm (having the same alarmKey). Flapping can occur when a fault is unstable and causes repeated event notifications. It can indicate configuration problems or real network problems.

A flapping example is illustrated in the diagram below.

BasicEventFlap

A sequence of events is identified as flapping if:

  • All events share the same alarmKey
  • The time interval between consecutive events is less than a configured value.
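The two conditions above translate fairly directly into code. The sketch below is one possible reading; the `min_events` parameter is an extra assumption, added so that a single down/up pair is not reported as flapping.

```python
def is_flapping(events, max_gap_seconds, min_events=3):
    """Check a time-ordered list of (alarm_key, timestamp) tuples.

    Returns True only if every event shares one alarmKey and each gap
    between consecutive events is below max_gap_seconds.
    """
    if len(events) < min_events:
        return False
    keys = {key for key, _ in events}
    if len(keys) != 1:
        return False
    gaps = (later[1] - earlier[1] for earlier, later in zip(events, events[1:]))
    return all(gap < max_gap_seconds for gap in gaps)


port = ("router-1/port-3", "LinkStatus")
burst = [(port, 0.0), (port, 2.0), (port, 4.0), (port, 5.0)]
slow = [(port, 0.0), (port, 2.0), (port, 60.0)]
```

With a threshold of 5 seconds, `burst` counts as flapping and `slow` does not, because its last gap is 58 seconds.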

Event Discovery/Identification

Normally the management systems (EMS/NMS) notify interested parties of events through SNMP, CORBA, or TCP mechanisms. Threshold events can also be generated by external systems.

 800px-Layerednms

The event processor listens for event notification messages, parses them to extract more information about each event, and maintains the event information for further processing.

Some of the event properties are:

  • Event Source –  Associated  ManagedObject name.
  • Event Functionality Type – Alarm (Fault Event), TCA (Performance Event)
  • Event Type – ITU-T X.733 alarm type (e.g. Communications Alarm, Quality of Service Alarm)
  • Event description  – Indicates event message
  • Event Severity – Severity of the event

Event enrichment

Event enrichment is the process of populating additional information about a generated event. This process may need to contact third-party systems to get the information. The enriched information can be useful during fault resolution.
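A minimal enrichment sketch is shown below. The in-memory `inventory` dictionary stands in for whatever third-party system holds the extra data, and all field names (`source`, `site`, `owner`) are made up for this example.

```python
def enrich(event: dict, inventory: dict) -> dict:
    """Return a copy of the event with inventory fields added."""
    extra = inventory.get(event["source"], {})
    enriched = dict(event)
    for name, value in extra.items():
        # Never overwrite a field the original event already carries.
        enriched.setdefault(name, value)
    return enriched


# Hypothetical inventory data keyed by managed object name.
inventory = {
    "router-1": {"site": "Datacenter A", "owner": "core-team"},
}

raw = {"source": "router-1", "severity": "Critical", "description": "Link down"}
enriched = enrich(raw, inventory)
```

The original event is left untouched, so the raw record can still be retained as it was received.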

Event Correlation and Alarms

Event correlation is the process of establishing relationships between network events.

Main Functionality:

  1.  Filter out redundant and spurious events.
  2.  Identify the root cause of faults in the network (RCA).

Event Filtering

One important aspect of FM is filtering and prioritizing incoming events to identify the serious ones. Based on the event information, the FM system can determine whether an event continues to be processed or is dropped. All unwanted or duplicated events can be dropped at the event collection or event processing stage.
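One simple way to express such a filter is sketched below. The numeric severity ranking, the default threshold, and the rule that Clear events always pass (so that alarms can still be cleared) are assumptions for this example.

```python
# Assumed ordering of the common severities, lowest to highest.
SEVERITY_RANK = {"Clear": 0, "Warning": 1, "Minor": 2, "Major": 3, "Critical": 4}


def filter_events(events, min_severity="Minor"):
    """Drop low-severity events and consecutive duplicates per alarmKey."""
    threshold = SEVERITY_RANK[min_severity]
    last_kept = {}   # alarm_key -> severity of the last event we kept
    kept = []
    for event in events:
        key, severity = event["alarm_key"], event["severity"]
        below_threshold = severity != "Clear" and SEVERITY_RANK[severity] < threshold
        duplicate = last_kept.get(key) == severity
        if below_threshold or duplicate:
            continue
        last_kept[key] = severity
        kept.append(event)
    return kept


incoming = [
    {"alarm_key": "A", "severity": "Critical"},
    {"alarm_key": "A", "severity": "Critical"},   # duplicate, dropped
    {"alarm_key": "B", "severity": "Warning"},    # below threshold, dropped
    {"alarm_key": "A", "severity": "Clear"},
]
kept = filter_events(incoming)
```

Of the four incoming events, only the first Critical event and the final Clear event for alarm "A" survive.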

FM_EventFiltering

Example:

When an NE on the network is faulty, the management system (EMS/NMS) reports the network events to the FM system. Each fault may trigger multiple events/alarms. Events triggered by the same fault are associated with each other, so the alarm correlation function can analyze them and generate a single alarm for multiple events.

RCA/SIA

A failure situation on the network usually generates multiple events, because a failure condition on one device may render other devices inaccessible. The events generated indicate that all of the devices are inaccessible.

Network operators use root cause analysis (RCA) to investigate the root cause of events. They can determine which events are root causes and which events are results of a root cause (symptom events), which enables them to quickly focus on the events that are causing network problems.

Normally the RCA process uses knowledge of the network topology to establish a point of failure and identify symptom events.

RCA algorithms can be rule-based, predictive, or model-based.
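To make the topology idea concrete, here is a very small rule-based sketch. It assumes a tree-shaped topology in which every device is reached through exactly one upstream device, and it applies a single rule: an unreachable device is a root cause if the device it is reached through is still reachable. Real RCA engines are far more elaborate; the function and device names are hypothetical.

```python
def find_root_causes(upstream: dict, unreachable: set) -> set:
    """Pick root-cause devices out of a set of unreachable devices.

    upstream maps each device to the device it is reached through
    (None for the device closest to the management station).
    """
    roots = set()
    for device in unreachable:
        parent = upstream.get(device)
        if parent is None or parent not in unreachable:
            roots.add(device)
    return roots


# Hypothetical topology: core -> agg1 -> access1/access2, core -> agg2
upstream = {
    "core": None,
    "agg1": "core",
    "agg2": "core",
    "access1": "agg1",
    "access2": "agg1",
}
unreachable = {"agg1", "access1", "access2"}

roots = find_root_causes(upstream, unreachable)
symptoms = unreachable - roots
```

In this topology `agg1` is reported as the root cause, and the two access devices behind it are classified as symptoms.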

If a device fails, the immediate questions that need to be answered are “which business services did it impact?” and “what is the cost to the business?”. This kind of analysis is called Service Impact Analysis (SIA). SIA uses RCA information to find the impacted services/customers.
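A minimal SIA lookup could look like the sketch below. It assumes each service is described by the list of devices it traverses; the service and device names are invented for the example.

```python
def impacted_services(root_causes: set, service_paths: dict) -> list:
    """Return the names of services whose path includes a root-cause device."""
    return sorted(
        name
        for name, devices in service_paths.items()
        if root_causes & set(devices)
    )


# Hypothetical mapping of services to the devices they depend on.
service_paths = {
    "vpn-acme": ["core", "agg1", "access1"],
    "iptv": ["core", "agg2"],
    "voice-west": ["core", "agg1", "access2"],
}

impacted = impacted_services({"agg1"}, service_paths)
```

With `agg1` as the root cause, the two services routed through it are reported as impacted, while `iptv` is not.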

Third-Party Integration

The main aim of a network operator is to shorten the fault resolution time, so basic fault detection and reporting features may not be sufficient for efficient fault monitoring. FM systems should integrate with third-party systems such as trouble ticketing systems and performance management systems. These integrations help operators speed up the fault resolution process.
