Demo - Predict failures before they happen with AIOps
Hi, this is Ben Yukich, Principal Architect with ServiceNow, covering our IT solutions. Today I want to highlight what we mean when we talk about resilient IT, and AIOps specifically. I want to talk about how you can get ahead of traditional operations by proactively detecting and preventing issues before your end users notice them. I'm also going to highlight some of the latest techniques that organizations are using to keep even their most dynamic, modern service architectures accountable and operating flawlessly.

First, let's make sure we're all on the same page about where ServiceNow fits into your overall IT strategy. As you probably know, ServiceNow is often viewed as a platform of platforms, providing the connective tissue needed to make work flow effortlessly in many of the world's leading enterprises. Today we're going to focus on how ServiceNow fits in as the platform for IT. Our goal is to provide a great experience with the platform, whether you're an employee or a customer, and we want IT to be viewed as a partner and enabler of these experiences.

Toward that end, ServiceNow's IT workflows help you simplify and automate the entire IT software lifecycle, taking advantage of our single platform, data model, and architecture to get more visibility and scale from your existing technology investments. It's our core platform components, like our powerful workflow and integration capabilities, our best-of-breed machine learning and analytics, and our intuitive user experience, that enable IT to better serve your organization's strategic goals.

Going a layer deeper, I like to think of this as helping IT build a digital economy of scale for their organization. By executing any one initiative within IT, say IT Service Management, our platform opens up a unique advantage when you need to pursue other initiatives. Specifically, the foundational data that you build in any one process discipline can be leveraged to accelerate your time to value in nearly any other process. Once you've
successfully deployed our operations management capabilities, you have all the data you need to rapidly succeed in asset management. Similarly, managing vulnerability response or enterprise risk, or getting your application portfolio in order, also becomes a breeze. This is because there's no data to duplicate, no complex integrations to build and maintain, and of course no new infrastructure to manage. This is why IT organizations that adopt ServiceNow see an unfair competitive advantage compared to their peers.

So let's get into our first deeper topic: AIOps and its role in ensuring operational resilience. Many organizations today are focused on a mobile-first strategy. This typically means designing your online experiences for mobile before prioritizing any other device. In light of the pandemic, we've seen an even stronger push to make investments like this, centered on how your customer community consumes your products most effectively. As an example, many financial institutions are encouraging their customers to engage via mobile whenever possible instead of coming into a branch.

However, one frequent problem with mobile apps is that if the user experience is suffering, perhaps just due to intermittent slowness rather than an outright failure, the app user is unlikely to contact customer service to report the issue. Even in fairly important failure cases, say your mobile deposit isn't working, it's most likely that you're just going to try again later, or maybe stop by an ATM. I like to think of these as silent failures: while there might not be an immediate or catastrophic impact, they diminish your brand and may lead your customers to look elsewhere.

Thankfully, ServiceNow's predictive AIOps capabilities can help. Oftentimes subtle operational challenges like these are difficult to spot, because you need intimate awareness of how your app can fail in order to properly instrument your monitoring. However, our AI-driven raw log
analysis allows you to surface anomalous signals that you didn't know you should be looking for. In our previous example, this allows you to understand when an end-user experience may be suffering before it leads to any catastrophic impact or an outright outage of the service. This is a fantastic complement to the monitoring you probably already have in production, as it gives additional context and insight into telemetry signals that might otherwise be ignored. And this is by no means limited to financial services: we also have an American multinational e-commerce corporation that is reducing its MTTR by 53% using our predictive AIOps capabilities. Again, this is all about complementing and augmenting the telemetry data you're already collecting, but with a focus on predicting failure from leading indicators rather than just responding to the lagging performance indicators already in your environment.

So let's take a look at this in action. I'm starting here in our Operator Workspace, which gives me a high-level view of the services I'm responsible for, as well as the health of those services. Each tile represents an individual service, or even just a set of infrastructure, and I can add a bit more detail to understand what alarms are currently active in this environment. I can use this data to filter down to just those services that are impacted by any given alarm, and from here, this is my launch pad to get more data behind the scenes.

If I pick a critical service here, like e-banking, I can open up the service map and see that there might be multiple issues happening against multiple pieces of my operational infrastructure. But if I view the alarms that are present and actionable here, all the alarms that may be impacting multiple nodes of this service topology are distilled down into one piece of work for me to look at. If I go ahead and open up this alarm, I can see
that we're consolidating nine secondary alerts against the underlying infrastructure, occurring across this timeline over a span of about ten minutes in this example. One of the really interesting things here is that, as I scroll through this list of secondary alerts, some of them are traditional alarms, the things your ops team is probably used to ignoring more often than not: how much memory is free, how high the CPU load is. These are some of the lagging indicators I mentioned earlier; if you're making operational decisions based on this type of data, you're usually doing it a little too late.

To complement this and make it worth looking at, we also have our predictive AIOps capabilities proactively looking through the raw logs and uncovering signals that traditional monitoring approaches would likely never have uncovered without a massive amount of work. In this case, I can see that the primary alert has to do with errors found in our HAProxy config file. Some of these log-based errors can be fairly cryptic, so this is where we're able to leverage natural language understanding to quickly tie this back to human insights, giving you a human-readable understanding of what's going on. In this particular situation, I can see the top result breaks down that there's an error in this load balancer due to a configuration file change, and it gives me a link directly to the documentation on how to configure this correctly. This data can be sourced from within your organization, as well as from our own knowledge base, which draws on a number of different public sources of known errors.

This is interesting, but I want to dig a little deeper to see what work might be going on alongside this. For that, I click into my insights, and I can see at a glance, again using our natural language understanding, that there has been another
similar alert in recent history, and this may give me the insight I need to figure out how to get this back up and operational. I can also see that there are some open incidents. There haven't been any recent problems, but there is a change currently underway against both our current CI and a related piece of configuration. If we look at the change against that piece of configuration, I can see that it is indeed a change to the config of the load balancer, and we also have it logged against the service.

I want to dig into this and make sure the team is aware that they may have broken something here. So I go ahead and open up this change and communicate directly with my ITSM team, leveraging the insights I was just able to surface about this operational issue. In this case, I want to share with them that same knowledge article, the same link to documentation that I thought would be useful to configure this properly. I can add my own take on it, including my own typos, attach the article, and now the ITSM team is kept up to date on what our operations team is finding. This is going to help us resolve the issue more effectively and get our e-banking service up and operational before it leads to an end-user-impacting problem.

That wraps up our exploration of how predictive AIOps can increase your operational resilience. Now let's dig into the second topic: how leading organizations are managing their increasingly complex, modern technology architectures in ServiceNow, so they can take the exact same proactive approach to operations that we just stepped through, even when the service is distributed and ephemeral. Most organizations struggle to maintain visibility and compliance in modern service architectures because, with the prevalence of microservice approaches, the technology you manage has become increasingly complex, distributed, and dynamic. To simplify the deployment of complex, agile applications, many great technologies
have been adopted, Kubernetes being a prime example. One of our customers had over 3,000 microservices deployed in Kubernetes and needed to ensure that each application service owner could account for the resources their services were using, so that the appropriate compliance and financial accountability could be maintained. They accomplished this using ServiceNow Discovery and the CMDB. But I don't want you to get the wrong idea here: I'm not talking about the legacy discovery approaches that might immediately come to mind. If you're running discovery, say, once a day or once every two days, let alone longer, there's simply no way you can keep up with this kind of environment. A modern approach is to dramatically increase the cadence of discovery in a more targeted fashion, in this case polling Kubernetes and updating the CMDB in close to real time to ensure the data is accurate and actionable. This approach allowed them to track how their environment is trending, so they can stay ahead of any looming capacity constraints or wasteful resources and give appropriate visibility back to the business.

We also see more extreme examples of this very regularly, even in seemingly simple environments. Take your own public cloud consumption: you probably have a mix of infrastructure-as-a-service as well as serverless resources deployed. This poses a number of challenges. Things change rapidly, and often there's no operating system present, or you just don't have access to it, and that can be a hindrance when it comes to understanding the deep configuration details. For these modern environments, we take an event-driven approach to consuming cloud data and keeping you up to date in near real time. This also allows a dramatically simplified understanding of service structure, simply by leveraging whatever tagging strategy you already have in place within Kubernetes or your public cloud providers.

So let's take a look at what this actually means in the platform. I'm going to go back to my
Operator Workspace, simply as an easy way to navigate between some of these services. Let's pick a slightly more complex example than e-banking: our order status application. I'm going to open up the service map, and this is a fairly complex topology with a hybrid service structure. I can see that I have dependencies on other outright services, and I have some application components that are fairly traditional, for example an Oracle database instance running on a server that I manage. But I also have things like AWS cloud gateway instances present here, along with API gateway triggers that are routing requests to different components of this application service structure. This is where Discovery's understanding of serverless components, and its ability to stay up to date in close to real time, are absolutely key to managing a complex service like this effectively.

But we don't always have the good fortune of this level of visibility, or this level of awareness of how the service is structured. So let's look at some other examples, say the various instances of our rewards processing pipeline. I've got a dev instance and a QA instance. Let's open up the dev instance and look at the service map. For this particular service, the map was actually generated based on our tagging strategy in our cloud environment: I had this flagged as our rewards application, specifically the development environment, and so Discovery was able to automatically harvest these details and assemble them into a service map. In this case, I'm also doing deep discovery of the underlying servers, the guest OS running on these virtual machines out in our public cloud, so I was able to get the individual application components at a great level of detail.

But this isn't always attainable, so let's look at a slightly more restricted example: rewards processing QA. If I open up this service map,
we're going to see what happens when I don't have that level of access. For one particular application instance, I've got deep inspection capabilities for the host and the individual applications running on it, but for the other branch of my application service structure, all I have visibility into is the virtual machine instances that happen to be running in this cloud environment, or the serverless components helping fulfill requests for this application. This approach allows for incredibly rapid time to value, even with limited access to your cloud environment, because really all I have to do is get read-only access to understand what the resources are and how they're tagged, and I can use that to perform bulk mapping operations that tie the configuration back to the business context.

I hope that gives you some insight into how you can quickly provide that business context for even your most dynamic and hybrid services, and how this opens up the possibility of leveraging the same predictive AIOps approaches to help you increase service resilience for your organization. Thanks so much for your time today, and please feel free to connect with me online. Talk to you next time.
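The alert consolidation walked through in the e-banking demo above, many secondary alerts rolled up under one primary alert into a single piece of work, can be sketched in a few lines of Python. This is only an illustration of the general grouping idea, not ServiceNow's actual correlation logic: the alert fields, the grouping rule (same service, alerts landing within one time window of the group's first alert), and the data are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    ci: str       # configuration item the alert fired against
    service: str  # business service that CI belongs to
    ts: float     # timestamp in seconds
    message: str

def consolidate(alerts, window=600.0):
    """Group alerts for the same service that fire within `window`
    seconds of the group's earliest alert; that earliest alert plays
    the role of the primary, the rest are secondary."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a.ts):
        for group in groups:
            primary = group[0]
            if alert.service == primary.service and alert.ts - primary.ts <= window:
                group.append(alert)
                break
        else:
            groups.append([alert])  # start a new group with this alert as primary
    return groups
```

With one primary and nine secondary alerts on the same service inside a ten-minute window, this yields a single group of ten, i.e. one piece of work for the operator, while alerts on unrelated services stay separate.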
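The tag-based bulk mapping idea from the rewards processing examples, deriving candidate service maps from nothing more than a read-only listing of resources and their tags, can be sketched the same way. Again this is a hedged illustration: the tag keys (`app`, `env`) and the resource dictionaries stand in for whatever tagging strategy and cloud inventory format your environment actually uses.

```python
from collections import defaultdict

def bulk_map(resources, app_tag="app", env_tag="env"):
    """Group discovered cloud resources into candidate service maps
    keyed by (application, environment) tag values; resources missing
    either tag are set aside for manual triage."""
    services = defaultdict(list)
    untagged = []
    for res in resources:
        tags = res.get("tags", {})
        if app_tag in tags and env_tag in tags:
            services[(tags[app_tag], tags[env_tag])].append(res["id"])
        else:
            untagged.append(res["id"])
    return dict(services), untagged
```

Run over a flat inventory, this is enough to separate a "rewards / dev" map from a "rewards / qa" map and to surface untagged resources, which is the essence of tying read-only configuration data back to business context.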
https://www.youtube.com/watch?v=O7C7o97NhWs