Creating an incident prevention workflow with AIOps
Usman: Hello everyone, and welcome to our session, "Creating an Incident Prevention Workflow with AIOps." This is a recap of the Knowledge 24 session we did. Hi Darius, how are you?

Darius: Fantastic. Good to see you again, Usman.

Usman: Before we dive in, I want to give a quick preamble for our listeners and ask you a question: do you really know where your incidents come from? We see that about 40% of incidents come from user requests: typical incidents, software requests, outages, and a lot more. But what about the other 60%? Those actually come from machines. And I'm not talking about Skynet, but mostly monitoring systems. You have a lot of monitoring systems giving you metrics, events, and logs, and they create a lot of complexity. The interesting thing is that this is growing rapidly, 10 to 20%, which is significant when you consider that customers often have six to twelve monitoring tools out there, and some have even more. So how do we wrap our heads around this, take the noise out of the system, and focus on what we really want to do?

What we're going to talk about today is a forward-looking approach to incident prevention, focused on alert-based concepts, and we're going to show you some forward-looking material, so consider this our safe harbor statement. Before Darius walks you through a workflow and shows you a demo, I want to talk about the three things that keep you busy with complexity and keep you from getting ahead.

First, there are a lot of unplanned changes in your environment, and they do cause outages. The main issue, though, is that you don't know what you don't see: there is a lack of visibility and observability into your systems. Second, you have a lot of noise. Yes, you get advanced metrics and logs, but that eventually leads to false positives; you can't focus, and you typically can't get to root causes quickly. And finally, a lack of automation is
the number one issue that keeps you slow in responding to issues. A lot of the time MTTRs are high and mean time to detect is slow because you don't have automation. So let's talk about how to get you out of this chaos and on the right track. Darius, I'm going to turn it over to you to walk our listeners through a step-by-step approach that they can use with AIOps and modern operations.

Darius: Perfect, perfect. I think that's a great landscape of three big issues, and we'll see how we can solve them with some of the solutions we've been investing in from an IT operations management perspective here at ServiceNow. So how do we get it right? How can you make more sense of your data, get better visibility into the health of your systems, and automate the opportunities around remediation?

Let's talk through a journey, if you will, of proactive and self-healing operations. As we highlighted, every journey of understanding how your systems and machines are operating begins with the data related to those machines' health. Today you're probably getting that data from numerous application performance monitoring, network monitoring, and other monitoring and observability solutions that give you insight into a given service's performance. That means monitoring dimensions like latency, error rate, and saturation: your RED signals, your golden signals. And it's generating numerous events, logs, and metrics that, at scale, fundamentally represent the noise we were talking about.

What we want to do, and what we will see, is provide you and your team on the operations side with a concise, consolidated summary through alert automation that performs time-based grouping, text- and tag-based grouping, and CMDB- and topology-based grouping to compress and group those different events and alerts together, and also to highlight the relevant metrics and logs that occurred around the same time as those events, so that you can easily get down to
identifying the root cause of why that alert storm just happened. What else was going on around the same time? Was there a change deployed on the service? Fundamentally, that is all context feeding through the system, so that at the end of the workflow, when the human is in the loop, they can take advantage of it all: for collaboration between teams; for on-call response, immediately paging someone if needed; for triggering automated remediation, or even manual remediation that an operator can opt into, such as running a reboot or a script to gather troubleshooting data; and fundamentally for reporting that gives the business high-level insight into the actual reliability and technical performance of its services.

So let's see how this actually comes together. This is a great workflow, but how does it manifest in the product, and how does it deliver the three outcomes of improved service availability, improved customer satisfaction, and, importantly, improved agility to deploy more innovation to your customers and employees with reduced risk?

Usman: A quick one here, Darius. I think it's also important to know that customers are using this today. Customers see upwards of 90% event reduction and, as you said, a tremendous 30% MTTR reduction. There are actual implementations going on, and we're happy to share more later as well.

Darius: Fantastic, fantastic. So this is not only the art of the possible; this is a reality for a lot of our customers. They all started in scenarios similar to many of yours, looking for faster time to resolve and better context, and by integrating these systems and using these new applications they've been able to achieve that. And speaking of new applications, let's take a look at modern operations and incident prevention using the suite of capabilities we've invested in across the ITOM portfolio.
We want to start the conversation with anomaly detection, which surfaces in our Service Operations Workspace, our UI of the future, where we're making these investments to give your teams the interfaces to see what's going wrong. In this case I want to start with a log anomaly example. We have Health Log Analytics, and here we can see that the volume of logs from an Oracle database is above normal: we've had a steady trend of logs and we've detected a spike, an anomaly relative to the common operating levels. As an operator I can come into the system and pull context from this alert: what are these logs, what other logs happened around the same time, when did it happen, and what metric data shows me the change over time? And I can confirm that this is actually an issue, so that the ML model gets more intelligent. I can see here there was a 358% increase, an anomaly compared to past behavior around the same time for the same service.

What I'm going to do next is troubleshoot and figure out why that anomaly happened. To do that, I'm going to use alert correlation and analysis, along with the insights we get working in this modern express list interface and some of the proactive recommendations coming out of Now Assist, our generative AI application of this new technology. We know there was an anomaly, the logs were spiking, but why did it happen? As an operator I immediately dive into my express list and look at all the alerts coming in, specifically around the volume of logs, and I can see that I have three other alerts all grouped together. If I read the alert analysis generated by Now Assist, I can see there are a lot of log events, specifically on this RabbitMQ instance related to an order status service, and
I'm getting a lot of status-code waits on a server. The idea is that it may not be three alerts: it may be 30 or 100 alerts that you're ingesting into the system, and your operators aren't going to be able to read them all. So we group the similar alerts together using that automated grouping logic, and then apply generative AI to describe what the commonality is in that group of alerts. In this case, again, it was specific to anomaly-related alerts for this Oracle database. And if I zoom in, I can highlight that this analysis is fully auto-generated by our generative AI technology here at ServiceNow.

Now, as we continue the journey, let's figure out how to determine root cause. We're aware there's an anomaly related to the logs, we understand there's a grouping of numerous alerts, and we understand the context of that grouping from generative AI. But what is actually causing it? Why did that log count spike, and what else is going on in the system related to this checkout service and this given RabbitMQ server? Next we want to determine the cause, taking advantage of the additional context from your logs, your metrics, and your changes. If I open one of these alerts, I have a tab called Metrics where I can see, for the impacted server, the disk free percentage over time and the disk usage. Clearly there's been a spike, and we're running low on space. In addition, I can change the time frame to look for other patterns throughout the day: was this unique to this device right now, or is it a repetitive, flapping behavior that constantly happens on this server? So the first idea is to give you the right metrics in the context of the alert, to help you identify what is going wrong with this given CI. In addition to that metric data, we can help
you understand probable root causes coming through changes, and this includes DevOps change events that we see happening in your CI/CD pipeline, perhaps on an application service running on that given server. The idea is that we give you visibility into the health and the performance over time when something went wrong, but also into the potential changes that could be indicative of the cause: of what changed in your environment that resulted in this spike of logs. I can see here that on the order status service there was a recent DevOps task for a prod deploy, which could be the cause. This is a fantastic area of the product, with our DevOps Change and DevOps Insights, for understanding what is going on in my pipeline and what these pipeline events are. As we highlighted, we get context on what the root cause may be due to: that change in the prod deploy. And if I click into that prod deploy, I get the full context of what specifically changed in the codebase and what the deploy was for. Was it a new feature, or a set of quality enhancements? This is a fully automated change request that was generated for us; you don't have to work with a big team to constantly document changes, because we pull these automatically from your CI/CD pipeline and the changes we see going through your release management. So I get that context, and I can check with the teams what the actual payload was and what code changed against this order status service.

Now that I have a sense of what went wrong, I want to collaborate with my team: bring in the application owners, the team that pushed that recent fix and code. To know who to talk to, I'm going to take advantage of native on-call scheduling on the platform. For this given application, who is the team that supports it, and who is on call right now for that team? And then, importantly, I can utilize a kind
of ChatOps concept to receive notifications directly on a chat channel, kick off additional communication with additional stakeholders on that channel, and really discuss how we're going to remediate this spike and the performance degradation on that server, and whether it was impacted by that recent change pushed on the order service.

While we're doing that remediation and collaborating with our team members, that's where the automation we've been talking about comes in: structured playbook automation directly on these alerts. We just talked about metrics for context; well, you also get playbooks for remediation. Here you can define actions like removing the log files to free up space, expanding the disk if it's a resizable asset, creating service degradations to inform your customers and users, or creating a major incident to get all hands on deck, swarm the problem, create customer communications, and update a status page. Fundamentally, we want to bring in contextual playbooks that help users both diagnose and remediate alerts, based on the type of alert and the context we know about it. And whether it's a human in the loop running the playbook or the system taking automated actions is up to you and your definitions. So it brings it all together. In this case we can remove those files, and at the end of the day we can ask: did we resolve the issue, and how is the service performing now that we did?

To help the business answer that question, how this checkout service and this Oracle server are performing, whether this is a one-off issue or the service constantly runs into trouble and we need to work with the underlying team and improve the infrastructure, that's where we get this new concept of service reliability management and SLI/SLO management. So here, in the Service
Operations Workspace, as a leadership team, once all those alerts and incidents are closed, you can take a high-level view of your operating environment and ask: for those services, how much error budget is remaining? Because for teams that burn through their error budget, maybe you want to prevent them from making additional changes and pushes to production. It's a great new capability where you can report on and see the SLO performance of your services over time, whether it's availability, latency, error rate, or saturation, with count-based or duration-based SLOs. The idea is that you're now weighing the business objective for performance against the actual technical performance. And we do this, again, to proactively get ahead of incidents: we want to identify the culprits, the low-performing services, and proactively invest in them to prevent incidents from happening again.

So with that: it's a lot of great context and a lot of great functionality and capability throughout our Service Operations Workspace, surfacing logs, metric data, playbooks, the new SLI/SLO information, and those generative AI insights. I'll turn it back to you, Usman, to recap what we saw and how it all comes together in our portfolio with ServiceNow ITOM.

Usman: Thank you, Darius, this was amazing. I do want to make a couple of plugs, of course. Most of the products and applications you saw in the demo are generally available, and the service performance indicators, the SLI/SLO capabilities, are becoming available sometime in August, right, Darius? So they will be generally available for everyone. Other than that, everything else is generally available for everyone to take action on now.

I also want to say that during Knowledge 24 we had multiple customer sessions that touched on this topic. I remember our own Digital Technology team had a session on AIOps, and they
also talked about how they're using generative AI, Now Assist for ITOM, so take a look at that. We also had a session from National Grid, who talked about Metric Intelligence and how they're using it, just like Darius was showing you, so take a look at that as well.

Think about three things here. First is correlation: you have a lot of noise, so think about how alert correlation and aggregation, and generative AI, can help; that will give you a tremendous noise reduction. Second, think about how to inject more reliability and service resilience using AIOps workflows; that will get you to self-healing. That's also where you start to think about playbook automation and driving further MTTD and MTTR reductions. And we have other customers who are actively doing all of this. So thank you for listening, thank you, Darius, for the demo, and we look forward to talking to you. Reach out to us and we'll be happy to have a discussion. Thank you.

Darius: Thanks, Usman.
https://www.youtube.com/watch?v=3SONbwTb_wk