Demo - ITOM - Eliminate Outages with AIOps
When problems are detected, operations teams need to pinpoint root causes and take action quickly, taking advantage of captured knowledge and automation. When there's an outage or service degradation, it's a herculean task to efficiently detect issues, determine the root cause, and resolve problems quickly. For instance, assume all VPN access is down for the West Coast. A critical alert is generated, with thousands of events flooding in from siloed monitoring systems for the routers, servers, and applications related to this outage. Your IT operations teams managing these siloed monitoring systems then begin the manual process of sorting through thousands of events from these many systems to pinpoint the root cause, much like finding a needle in a haystack.

In this demo, you'll see how ServiceNow provides a holistic approach to AIOps by combining key capabilities from its IT Operations Management (ITOM) Health and IT Service Management solutions, delivered on a single platform where data is seamlessly shared. ServiceNow reduces the noise of events originating from users, services, and applications in the cloud by using event correlation algorithms, improving the operator experience. Rather than being barraged by every single event, operators receive a proactive notification, relevant knowledge articles, similar issues that have happened in the past, and a suggested remediation; in some cases the remediation is automated. Operators see a description and timeline of the event, the services impacted, and remediation task recommendations. This makes life much simpler for operators and improves their performance and experience. With the AIOps capabilities of ServiceNow, organizations can achieve optimum value in IT operations by predicting issues before they impact users and the business, detecting, prioritizing, and assigning issues when they happen, and diagnosing and fixing issues for good.

As we start our demo, you'll assume the role of our IT operations analyst, Naomi. You're starting your day and log in to Operator Workspace, which monitors services in your organization's event management environment. Here you can review the status of your services and view those at risk of not functioning optimally, enabling you to address those issues quickly. As an operator, you notice in your Operator Workspace that a few services are reporting alerts. You can group, sort, and filter the data in many ways. You see that three services have a critical (red) severity and one service has a minor (yellow) severity. You're ready to look into the alerts, which are listed on the basis of alert priority and urgency. Clicking on an alert highlights the specific service or services on which further remediation action could be taken. If you wanted to go directly into the alert, you could click the icon on the alert record in the filter to the right.

For now, let's see the details of the service that's impacted. Clicking on the service opens the service preview tile, where basic info can be seen. Then clicking on Service Details opens the service details form in a new tab within your workspace. On this form, you can verify service details such as entry points, which reveal that this service runs on AWS. To better understand the infrastructure that supports this service, let's open the service map right in the workspace. This gives us a quick, basic view of our Order Status application service. We can get an even more robust view by opening the full map in a new tab. In the full view, you can see the topology of the service, a timeline of what happened, the related alerts, and the single alert that they're all correlated to. We see that five uncorrelated alerts on this service come from several different data sources. You can slide the Correlated Alert switch to show the single alert that these items are now grouped under; note that the source now reads Group Alert. This is a great example of how ServiceNow reduces the noise for IT operators.

If you select the CI that's red in this service, you'll see that the CI is an Oracle database. You then see only the uncorrelated and correlated alerts associated with that CI. Service Mapping offers functionality to compare the changes and attributes of a CI over a given period of time. You want to do a quick comparison to understand which attributes changed on that CI; this information will help you take informed remediation actions. You click the Compare button on the timeline, selecting the timestamps for point 1 and point 2. Once the compared view of the map opens, you click the CI for which changes need to be seen, in this case that Oracle database. We'll click the highlighted configuration file of the CI to see the comparison. In the comparison, you see that in the configuration file, SQL trace is set to TRUE, which means it could log a huge amount of information, potentially filling up the disk space.

Once you've viewed the correlated alerts under the service map and understand the impacted CIs and related components, you can assign the alert to yourself. Back on the alert details page, just click Assign to Me, and the alert is assigned to your queue. This enables tracking ownership of alerts across your teams and individuals. Now that the alert is assigned, you can drill down into its details. First, we'll gather some background information on it, then determine the root cause, and finally take action to resolve it. At the top, you can see the alert's severity, priority, and the affected CI. On the Details tab, you can review how the alert priority was calculated from a variety of factors.

Let's understand how the alerts were grouped by checking out the activity log. Here we see how Event Management set the primary and secondary alerts. First, the activity log states that the alert was set as primary because it was determined to be the root cause. Next, Event Management grouped the other alerts based on an identified pattern. Finally, in the activity log we see that automated incident creation occurred because of an associated rule; this can also be found under the Alert Executions tab of the alert record. The impacted services show you the services affected by this alert, giving another sense of its scope. Secondary alerts were correlated under this group alert, again demonstrating how ServiceNow reduces noise while maintaining the key information needed to determine root cause. Alert executions show actions already taken even before you directly viewed the alert. In this case, this type of alert was configured to automatically create an incident, as indicated by the INC number, with past actions taken including attaching a log file to the incident, so you don't have to log in to the server itself and download the file. This was done automatically right when the alert was created, which again saves time and gets you to resolution faster.

Now that you've got background information on the alert, it's time to determine the root cause. First, you can review the alert timeline to get an idea of how the various alerts occurred. As we display the legend, you see that the first alert became the root cause and the others were secondary, each with its own severity. Another tool you have is Agent Assist, which leverages the platform's machine learning capability to find the most relevant knowledge base articles for this alert. Typically, a knowledge base article provides details on similar alerts and suggests how the issue was resolved. This is a significant capability that captures knowledge over time so that it's easily retrieved and helps operators figure out how to resolve the issue more quickly. You find an appropriate article, the third one, and add it as an attachment to the alert; it then shows in the activity list.

Another significant capability of Event Management is automatically fetching other alerts, incidents, changes, and problems related to this alert using platform machine learning capabilities. We'll click the Insights button and see an alert that has been determined to be similar. You can drill into it, and from there you can view the incident that was created and check out how that incident was resolved. We'll scroll down to the resolution information section, where we see that the workaround is to clear the log files from the temporary directory.

So, in just a few minutes, we've gathered general knowledge of the alert and have a pretty good idea of the impacted service, what the problem is, and how to resolve it. Our operator decides to expand the disk space; another option would be to remove the log files to free up additional disk space. You click the Actions button and select Expand Disk Space to start the flow. Once the flow run is complete, you can check the activity log of the alert, which says the disk was expanded. The alert state is also updated to Closed after the disk size is increased, and the service returns to green. We can update the incident and close it. As we head back to Operator Workspace and look at the status of our Order Status service, we notice it's not green but actually yellow, with a minor alert. This is because the CMDB group alert still affects this service. You now know what your next task is, and you can begin to assess the CMDB group alert.

This demo illustrated how ServiceNow ITOM Health helps IT operations reduce mean time to resolution by reducing noise, enabling root cause determination, and automating manual tasks. All of this helps IT operations predict issues before they impact users and the business, detect, prioritize, and assign issues when they happen, and diagnose and fix issues for good. For more information, check out our ITOM Health product pages on servicenow.com, as well as the ServiceNow product documentation site for ITOM Health.
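To make the noise-reduction idea concrete: the demo showed five alerts from different data sources being grouped under a single primary alert. ServiceNow's actual correlation uses its own pattern-based algorithms, so the following is only an illustrative sketch of the general technique, assuming a simplified rule that groups events whose CIs belong to the same service and that arrive within a short time window; the `Event`, `GroupAlert`, and `correlate` names are hypothetical, not ServiceNow APIs.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Event:
    ci: str            # configuration item that raised the event
    source: str        # monitoring tool the event came from
    timestamp: float   # seconds since some epoch
    message: str

@dataclass
class GroupAlert:
    primary: Event                       # earliest event, candidate root cause
    secondary: List[Event] = field(default_factory=list)

def correlate(events: List[Event], ci_service: Dict[str, str],
              window: float = 300.0) -> List[GroupAlert]:
    """Group events whose CIs belong to the same service and that occur
    within `window` seconds of the group's first event. The earliest
    event becomes the primary alert; the rest become secondary."""
    groups: List[GroupAlert] = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        for g in groups:
            same_service = ci_service.get(ev.ci) == ci_service.get(g.primary.ci)
            if same_service and ev.timestamp - g.primary.timestamp <= window:
                g.secondary.append(ev)
                break
        else:
            groups.append(GroupAlert(primary=ev))
    return groups
```

With a topology mapping the Oracle database, app, and web CIs to the same Order Status service, a burst of events on those CIs collapses into one group alert whose primary is the earliest (database) event, while events on unrelated services remain separate.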
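The workaround surfaced in the similar incident, clearing log files from the temporary directory, is the kind of step a remediation flow can automate. ServiceNow builds such flows in Flow Designer rather than in script, so this is only a hedged sketch of the underlying cleanup logic, assuming files matching a pattern and older than `max_age_days` are safe to delete; the `clear_old_logs` helper and its parameters are illustrative, not a ServiceNow interface.

```python
import time
from pathlib import Path
from typing import List

def clear_old_logs(temp_dir: str, max_age_days: float = 7.0,
                   pattern: str = "*.log", dry_run: bool = True) -> List[str]:
    """Delete (or, with dry_run=True, merely report) log files in
    `temp_dir` whose modification time is older than `max_age_days`.
    Returns the sorted paths that were (or would be) removed."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for path in Path(temp_dir).glob(pattern):
        if path.is_file() and path.stat().st_mtime < cutoff:
            if not dry_run:
                path.unlink()
            removed.append(str(path))
    return sorted(removed)
```

Defaulting to a dry run is a deliberate safety choice for automated remediation: the flow can first record what it would delete in the alert's activity log, then rerun with `dry_run=False` once the action is approved.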
https://www.youtube.com/watch?v=O-uudQR802M