Event Management: Leverage Alert Correlation and Grouping for Noise Reduction

Import · May 30, 2020 · article

Introduction:

In this Event Management series we are looking at the different aspects of event management shown below. Previous articles covered event rules and CI binding; if you missed them, please go through them first for a proper understanding of correlation, because CI binding plays a vital role in it.

image

In this article we will look at the next aspect of event management, "Alert Correlation", highlighted in sky blue above. Before I go in depth on this topic, I would like to thank a few people for their efforts and blogs, which I took help from and want to credit. See the article below for more information:

I am going to show a use case for each correlation type and explain how it works, so we will see how the golden circle works in alert correlation.

image

What is alert correlation?

This is the process of grouping alerts logically and classifying them as primary and secondary alerts. Alert correlation also helps us sort these alerts into different groups. For example, if there is an alert from SCOM on CI "ABC" and another alert from Splunk on the same CI "ABC", an automated alert group is created based on the CI.

How do alert correlation and grouping happen?

There are multiple ways in which a new or reopened alert is correlated with an existing alert. ServiceNow does alert grouping and correlation in RAMC order, which we will see with some use cases below. Before we go to the use cases, see below what RAMC stands for. This is really well explained by Aleck in his article.

R - Rule Based (We can configure alert rules as per requirement to decide the primary and secondary alerts using filters, scripts and relationships)

A - Automated (This is the automated OOB alert correlation mechanism, which works based on CI or Node name)

M - Manual (Self-explanatory: an engineer correlates alerts manually by putting them into groups)

C - CMDB (Based on your CMDB CI relationships)
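To make the RAMC order concrete, here is a minimal sketch of a "first mechanism that applies wins" check. It is only an illustration of the ordering described above, not how ServiceNow implements it: the alert is a plain dict, and keys such as `matched_rule`, `aggregation_enabled`, `parent` and `related_cis` are hypothetical names I made up for the example.

```python
from typing import Optional

# Each check stands in for one grouping mechanism. The dict keys used here
# are hypothetical and are not real ServiceNow field names.

def rule_based(alert: dict) -> bool:
    # R: a configured correlation rule matched this alert
    return alert.get("matched_rule") is not None

def automated(alert: dict) -> bool:
    # A: aggregation is enabled and the alert has a CI or a Node to group on
    has_key = bool(alert.get("ci") or alert.get("node"))
    return bool(alert.get("aggregation_enabled")) and has_key

def manual(alert: dict) -> bool:
    # M: an operator set a parent alert by hand
    return alert.get("parent") is not None

def cmdb(alert: dict) -> bool:
    # C: the alert's CI is related to other CIs that have alerts
    return bool(alert.get("related_cis"))

# Mechanisms listed in RAMC order.
RAMC = [("R", rule_based), ("A", automated), ("M", manual), ("C", cmdb)]

def pick_group_type(alert: dict) -> Optional[str]:
    """Return the code of the first mechanism that applies, or None."""
    for code, applies in RAMC:
        if applies(alert):
            return code
    return None
```

For example, an alert that matches a rule and also has a CI gets "R", because the rule based check comes first and nothing after it is evaluated.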

Let's look at these one by one, with a few real-world examples.

1. Rule Based Grouping:

As stated above, this should be configured by a developer as per your own requirements. This can be done under the Alert Correlation section in the left navigation.

Use Case:

I have a unique case regarding Splunk search heads to show: multiple events come from different search heads for the same node, which results in multiple alerts and hence increases noise. In turn, we need to go and close each alert, which is an overhead. Now let's see how to correlate this kind of alert.

Before Correlation Rule Creation:

Below you will see that before creating a correlation rule the alerts were independent even though the CI/Node is the same. So in practice the support team needs to work on each alert, acknowledging and resolving them one by one, which creates more noise and overhead.

image

After Correlation Rule Creation:

We will create an alert correlation rule as shown below, where the primary alert is the alert coming from search head one with Splunk as the instance, and the secondary alerts are the alerts from the other search heads for the same CI and Node, as selected in the relationship type, with an interval of 60 min. This means only alerts created in the last hour are correlated.

image

(Alert Correlation Rule)
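The matching logic of this rule can be sketched in a few lines. This is a simplified illustration under my own assumptions, not the platform's implementation: alerts are plain dicts, the field names are hypothetical, and the search head names other than Splunk-sh_01 and all sample values are made up.

```python
from datetime import datetime, timedelta

PRIMARY_SOURCE = "Splunk-sh_01"   # search head one is the primary
WINDOW = timedelta(minutes=60)    # the rule's 60 minute interval

def correlate(alerts):
    """Map each primary alert number to its secondary alert numbers."""
    groups = {}
    for primary in alerts:
        if primary["source"] != PRIMARY_SOURCE:
            continue
        groups[primary["number"]] = [
            other["number"]
            for other in alerts
            if other["source"] != PRIMARY_SOURCE           # another search head
            and other["ci"] == primary["ci"]               # same CI
            and other["node"] == primary["node"]           # same Node
            and abs(other["created"] - primary["created"]) <= WINDOW
        ]
    return groups

# Made-up sample data for illustration only.
t0 = datetime(2020, 5, 30, 9, 0)
sample = [
    {"number": "A1", "source": "Splunk-sh_01", "ci": "web01", "node": "web01",
     "created": t0},
    {"number": "A2", "source": "Splunk-sh_02", "ci": "web01", "node": "web01",
     "created": t0 + timedelta(minutes=10)},
    {"number": "A3", "source": "Splunk-sh_03", "ci": "web01", "node": "web01",
     "created": t0 + timedelta(minutes=90)},   # outside the 60 minute window
    {"number": "A4", "source": "Splunk-sh_02", "ci": "db01", "node": "db01",
     "created": t0 + timedelta(minutes=5)},    # different CI and Node
]
print(correlate(sample))  # {'A1': ['A2']}
```

Only A2 becomes secondary to A1: A3 arrives too late and A4 is on a different CI, so both stay independent.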

Now in this section we will see how the events were created from different sources, with their respective nodes and alerts.

image

(Events created with Source and Source Instance)

Once the alerts are created and CI binding is done, alert correlation triggers. It processes rule based correlation first, in the order specified on the correlation rules; if there is a match, that rule is applied and no other rule or grouping mechanism is evaluated, the same way assignment rules and event rules behave. Below you can see that Alert0010144 is the primary alert because its source is Splunk-sh_01; its group is RULE BASED and its role in the group is Primary. Whatever happens to this alert, such as a change of state or group feedback, is cascaded to the alerts below it, i.e. the secondary alerts. I have highlighted the secondary alerts as well in the screenshot below, where the group field clearly shows that they belong to the secondary group of alerts.

image

(Rule Based Alert Correlation)

The reason for showing the screenshot below is to explain that whenever a rule is applied, the group column shows "R". Similarly, we have "A", "M" and "C" for the other groups.

image

2. Automated Grouping:

This grouping and correlation mechanism has second priority and runs only when the property "Enable alert aggregation (sa_analytics.aggregation_enabled)" is true. RCA and alert aggregation help us automatically group alerts based on the CI or Node field on the alerts. If the CI is empty, the Node field is used to aggregate those alerts. One important point to note is that it creates a virtual alert as the primary alert and adds all other alerts as secondary alerts in that group. This type is a unique case of Service Analytics, which incorporates machine learning to group alerts.

Use Case:

Group alerts coming from different sources for the same Node/CI created in the last 1 hr.
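The core idea of this use case (group on CI, fall back to Node when the CI is empty, and put a virtual alert on top) can be sketched as below. Please treat it as a toy model: the real feature relies on Service Analytics and machine learning, while this sketch only does plain key matching. The dict keys and the `aggregate` function are hypothetical names for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def grouping_key(alert):
    # Use the CI when it is set, otherwise fall back to the Node.
    return alert.get("ci") or alert.get("node")

def aggregate(alerts, now, window=timedelta(hours=1)):
    """Return one virtual primary alert per key that has 2+ recent alerts."""
    buckets = defaultdict(list)
    for alert in alerts:
        key = grouping_key(alert)
        if key and now - alert["created"] <= window:
            buckets[key].append(alert["number"])
    return [
        # The virtual alert is the primary; the real alerts become secondary.
        {"source": "Group Alert", "key": key, "secondaries": numbers}
        for key, numbers in buckets.items()
        if len(numbers) > 1
    ]
```

Alerts older than the window, or alerts with neither CI nor Node, are simply left ungrouped in this sketch.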

Event, Alert and Grouped Alert:

image

(Event)

Below you can see that we have 3 secondary alerts and one virtual alert whose source is Group Alert; it is created automatically. This is alert intelligence at work, and you can see it in Agent Workspace under the list as shown below. You will only see the primary alert there, not the secondary alerts.

image

(Alert and Grouped Alert)

image

3. Manual Grouping:

As the name suggests, this is the way an operator can add alerts to a group manually by assigning a parent to an alert, meaning you can put an alert into the parent field of other alerts as below. Alert0010056 is made secondary by setting Alert0010049 as its parent. Once you do this, the alerts are automatically grouped into a manual group, highlighted with "M" as shown below. ServiceNow also makes a note of this change, and the next time it groups alerts it makes use of the pattern that was used here.

image

(Alert form)

image

(Group Alert)

4. CMDB Grouping:

This type of grouping makes use of CI relationships in the suggested CI relationships table. A few system properties should be true to allow CMDB alert grouping, so please go through this link to see whether the properties are enabled: https://docs.servicenow.com/bundle/orlando-it-operations-management/page/product/event-management/co.... One very important thing to note is that this grouping is applied only if neither automated nor rule based grouping has been applied to the alert. The service map makes it easy to see why the alerts are grouped.

Use case:

Create alerts for different CIs on the same application service. They should be automatically grouped into a CMDB group based on CI relationships. We will create alerts on the highlighted CIs below:

image
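Roughly, the idea behind this use case is that alerts on CIs belonging to the same application service end up together. The sketch below is a deliberately simplified stand-in: it replaces real CI relationships with a hard-coded, hypothetical service membership map (`SERVICE_CIS`), and all CI names and alert numbers are invented for illustration.

```python
from collections import defaultdict

# Hypothetical membership of CIs in application services.
SERVICE_CIS = {
    "online-shop": {"lb01", "app01", "db01"},
    "payroll": {"pay-app01"},
}

def services_for(ci):
    """Return the names of the services that contain the given CI."""
    return sorted(name for name, cis in SERVICE_CIS.items() if ci in cis)

def group_by_service(alerts):
    """Group ungrouped alerts whose CIs belong to the same service."""
    groups = defaultdict(list)
    for alert in alerts:
        # Skip alerts already grouped by rule based or automated grouping.
        if alert.get("group"):
            continue
        for service in services_for(alert["ci"]):
            groups[service].append(alert["number"])
    # A group needs at least two alerts to be worth creating.
    return {svc: nums for svc, nums in groups.items() if len(nums) > 1}
```

Note the early `continue`: it mirrors the point above that CMDB grouping is considered only for alerts that rule based and automated grouping did not already pick up.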

Event, Alert and Grouped Alert:

See the screenshots below, which show how the events are created with different alerts: Alert0010152, Alert0010153, Alert0010154 and Alert0010156. Once the alerts are created, the aggregation engine runs and tries to group them under one virtual alert using the grouping mechanism. So Alert0010155 is treated as the parent virtual alert for all these alerts, as shown below.

image

(Events)

image

(Virtual Grouped Alert)

image

(Grouped Alert on operator workspace)

Why is it required?

After looking at the examples above, there are a few obvious reasons why these correlations are required. In short, they are:

  • Reduces noise.
  • Helps operators solve alerts in bulk.
  • Reduces overhead.
  • Helps provide feedback in bulk for grouped alerts.

Concluding Notes:

We saw how to group alerts and create alert correlation rules, how they are used and when they are applied. A few important takeaways are:

  • At any given time, an alert should be part of only one group.
  • A correlation rule is applied only when a new alert is created or the alert state changes from Closed / Flapping to Open or Reopen.
  • Closing the primary alert is cascaded to the secondary alerts.
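Two of these takeaways (one group per alert, and closing cascades from primary to secondary) are easy to picture with a small model. This is just a sketch under my own assumptions; `AlertGroup`, its methods and the dict keys are hypothetical and do not correspond to ServiceNow classes or fields.

```python
class AlertGroup:
    """Toy model of a primary alert with its secondary alerts."""

    def __init__(self, primary):
        self.primary = primary
        self.secondaries = []
        primary["group"] = self

    def add_secondary(self, alert):
        # An alert may be part of only one group at any given time.
        if alert.get("group") is not None:
            raise ValueError(f"{alert['number']} already belongs to a group")
        alert["group"] = self
        self.secondaries.append(alert)

    def close_primary(self):
        # Closing the primary is cascaded to every secondary alert.
        self.primary["state"] = "Closed"
        for alert in self.secondaries:
            alert["state"] = "Closed"
```

Trying to add an alert that already sits in another group raises an error here, which is one simple way to enforce the "only one group" takeaway.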

YouTube Video : https://youtu.be/P1OB48PZxLw

Please comment and suggest anything that needs to be improved.

Please don't forget to mark this helpful, bookmark this article and subscribe to my YouTube channel.

Thanks and Regards,
Ashutosh Munot

ServiceNow MVP 2019/2020

View original source

https://www.servicenow.com/community/itom-articles/event-management-leverage-alert-correlation-and-grouping-for/ta-p/2320977