AIOps provides more value when combined with better experience

Import · Mar 07, 2019 · article

The Now Platform not only acts as a single source of truth that we can populate automatically and keep up to date in real time, but also acts as an automation engine that lets us add new capabilities without requiring any process-level integrations. This way, whenever we add a new machine learning capability, it automatically starts working on that always normalized, always up-to-date data as a natural cell of the platform. That gives us the visibility and automation required to achieve the agility our business needs. This kind of strong foundation is essential for AIOps as well as for every other process and workflow we have in IT.

Ok, now we have a strong base; so what’s next?

#2: Put Everything in Business Context:

Alright, let’s remember our goal.

We want to give our users a big red “make it easy” button, or something similar, so that they can answer their questions with the push of a button. Questions such as “What is the root cause of that slow transaction in my most critical application?”

Well, the bad news is that we haven’t developed that magic button yet, but the good news is that we believe we have the next best thing for you, which actually looks quite similar to a big red button.

Here is what we’ve done:

  1. We received all your events, normalized them and grouped them using machine learning.
  2. We did the same for metrics: we learned their behavior and generated anomaly alerts, again using machine learning.
  3. And finally, we bound those alert groups to the right CIs, which were already associated with the business services supporting our business.
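To make the three steps above a bit more concrete, here is a minimal sketch in Python. It is not how the platform implements them: a plain key-based grouping and a simple z-score check stand in for the machine learning, and every name in it (`CI_TO_SERVICE`, `group_events`, `metric_anomalies`, `bind_to_services`, the event field names) is hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical CI-to-business-service map. In the article this relationship
# already lives in the shared data model; here it is a hard-coded assumption.
CI_TO_SERVICE = {
    "db-01": "Online Banking",
    "web-01": "Online Banking",
    "mail-01": "Corporate Email",
}


def normalize(event):
    """Step 1a: map a raw event onto common field names and casing."""
    return {
        "ci": event["host"].strip().lower(),
        "kind": event["type"].strip().lower(),
        "message": event["msg"].strip(),
    }


def group_events(events):
    """Step 1b: group normalized events by (CI, kind).

    A plain key-based grouping stands in for the machine-learning grouping.
    """
    groups = defaultdict(list)
    for event in map(normalize, events):
        groups[(event["ci"], event["kind"])].append(event["message"])
    return dict(groups)


def metric_anomalies(baseline, samples, threshold=3.0):
    """Step 2: flag samples that sit far away from the learned baseline.

    "Learning the behavior" is reduced to the mean and standard deviation
    of a baseline window; a sample is anomalous when its z-score exceeds
    the threshold.
    """
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return []
    return [x for x in samples if abs(x - mu) / sigma > threshold]


def bind_to_services(groups):
    """Step 3: attach each alert group to the business service of its CI."""
    impacted = defaultdict(list)
    for ci, kind in groups:
        service = CI_TO_SERVICE.get(ci, "Unmapped")
        impacted[service].append((ci, kind))
    return dict(impacted)
```

Feeding a handful of raw events through `group_events` and then `bind_to_services` yields a per-service list of alert groups, which is roughly the information the big red box summarizes.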

I know, that sounds like a lot of work, but you don’t really need to worry about it: most of it is taken care of automatically in the background without you noticing, thanks to the shared data model and shared intelligence.

So now all you need to do is click the big red box representing your high-priority impacted business service or application:

image

Then drill into its details to detect the problematic components and their associated alerts:

image

#3: Answer Your Questions:

But some questions are harder than others, and we might need more information to answer them. For example, in our case, even though we managed to isolate the impacted CIs and the impacting alerts in step 2 (which, BTW, reduces MTTR significantly), we still need to dig deeper to find the real root cause and fix it for good before it turns into a “problem”:

image

And for that, the Now Platform gives us a specialized unified interface called “Agent Workspace”, where we not only find every bit of information required (alert, event, CI, business service details; associated incidents, changes, problems, knowledge articles, tasks, remediation actions, etc.) to solve our issues as fast as possible, but can also:

  • Detect similar cases (alerts, incidents, problems, changes) via machine learning
  • Attach accurate knowledge articles automatically, again via machine learning:

image

  • Collaborate with our colleagues in real-time:

image

  • And trigger remediation actions:

image

And that’s how it looks when we put them all together:

image

You have most probably noticed that I didn’t put a screenshot under one of the capabilities above: “Detect similar cases (alerts, incidents, problems, changes) via machine learning”, a.k.a. the Similarity Framework. That is because I want to touch on one last, very important point.

The Agent Workspace significantly facilitates the root cause analysis process and helps us reduce MTTR by up to 90%. It does a great job of solving issues in real time. But what about the mid- and long-term issues we’re having, the things we can consider “problems”? How are we going to know whether the alert or the associated incident we’re working on has happened before? How are we going to know whether we keep getting “similar” types of issues impacting our business services and applications?

Well, remember the “shared data model”: we already have the data required to answer those questions. The only excuse for not using that data may be that it is simply too big to scan through manually. But then we have the “shared intelligence” to help us automate these kinds of scenarios. And the “Similarity Framework” is the capability under our shared intelligence that can answer those questions. It uses natural language processing techniques and CI-based analysis to identify similar records (alerts, incidents, problems, changes) automatically and puts them together under alerts. To access it, all you need to do is click the “Insight” button:

image

And there you go:

image

image

Now we can automatically detect our recurring issues and then deepen our analysis with data from other processes to fix them for good.
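For the curious, the idea of combining text analysis with CI-based analysis can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Similarity Framework itself: a simple word-overlap (Jaccard) score stands in for the natural language processing, and the record fields, the `ci_weight` parameter and the `min_score` cut-off are all made up for the example.

```python
import re


def tokens(text):
    """Lowercase word tokens of a short description."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a, b, ci_weight=0.3):
    """Score two records in [0, 1]: word overlap plus a bonus for a shared CI.

    Jaccard overlap of the word sets is a stand-in for real NLP techniques.
    """
    ta, tb = tokens(a["text"]), tokens(b["text"])
    union = ta | tb
    text_score = len(ta & tb) / len(union) if union else 0.0
    ci_score = 1.0 if a["ci"] == b["ci"] else 0.0
    return (1 - ci_weight) * text_score + ci_weight * ci_score


def similar_records(target, history, min_score=0.5):
    """Return past records scoring at least min_score, best match first."""
    scored = [(similarity(target, rec), rec) for rec in history]
    scored = [pair for pair in scored if pair[0] >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [rec for _, rec in scored]
```

Given a new alert and a list of past incidents, problems and changes, `similar_records` returns the ones that describe roughly the same thing on the same CI, which is the kind of list you would expect to see behind the “Insight” button.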

Keep it simple and carry on…

In summary, that is indeed what we tried to do: we took all the technical complexity and boxed it into 3 logical steps so that our users can answer their questions with only a few clicks:

image

#1: The Platform > to provide a shared data model and a surrounding shared intelligence to support digital transformation initiatives including AIOps.

#2: The Easy Button > to detect issues and diagnose the root cause automatically.

#3: The Solution > to dive deeper and fix the issues permanently.

And that kind of simplified automation has helped organizations reduce Mean Time To Repair and increase Mean Time Between Failures, leading them to better-quality services and happy users.
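If those two metrics are new to you, here is a small Python sketch of how they are commonly calculated, assuming we know how long each repair took within an observation period. The function names and the sample figures in the usage note are hypothetical; they are not measurements from any real environment.

```python
def mttr(repair_hours):
    """Mean Time To Repair: average hours spent restoring service per failure."""
    return sum(repair_hours) / len(repair_hours)


def mtbf(period_hours, repair_hours):
    """Mean Time Between Failures: operating hours divided by the number of failures."""
    uptime = period_hours - sum(repair_hours)
    return uptime / len(repair_hours)
```

For example, three outages of 4, 2 and 6 hours in a 720-hour month give `mttr([4, 2, 6])` of 4 hours and `mtbf(720, [4, 2, 6])` of 236 hours. Shorter repairs bring the first number down, and fewer recurring issues push the second one up.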

View original source

https://www.servicenow.com/community/in-other-news/aiops-provides-more-value-when-combined-with-better-experience/ba-p/2272870