AIOps Continual Improvement
Hello everyone, welcome to a new episode. We're going to talk more about AI. In this episode we sit down with Jason Smith, who is a director of outbound product management covering the IT Operations Management product, and we're going to talk about AIOps and continual improvement around AIOps. So let's have a listen.

How are you, Jason?

Hey, thank you very much, I'm doing great. I'm the director for outbound product management, focused on AIOps and observability.

That's great. We're glad to have you, and I know you're joining at a late hour from Sweden, so I really appreciate it. Hopefully you'll get some sleep after this. All right, cool, so let's kick the gears right in. I know you spend a lot of time talking about AIOps as an offering, as a service, as a product, and that fits really well into the discussion we want to have today. We want to explore a couple of things right away: what are the leading indicators of AIOps success, and what are the right metrics for success? Maybe you want to start by talking a little bit about what the AIOps framework looks like, and then we can dive into more conversation from there.

Sounds good, thank you. What we're looking at here is how you could model the AIOps service itself. This model is not shipped out of the box, but you can go into the CMDB and create the business service and the related service offerings. Generally, AIOps is in fact a shared service for the organization, and you could consider that there are a few related service offerings that that business service powers. One of them, probably the most popular and the most well known, is catch and resolve automation. That's where we're using AI to proactively drive alert management: literally dispatch, notification, and response automation. Then you have management analytics, where you go in and do your KPI reports, look at your dashboards, look at reporting, identify teams that are
doing really well, and replicate that throughout the organization. And then there's a more technical aspect to it, which is monitoring and observability, where we're literally looking at the telemetry coming from events, metrics, and logs, whether that is on-prem or in public cloud. So let's go to the next.

Yeah, I want to stay on this slide as well, actually. Is this something out of the box, or are we talking about service creation, just like for any customer? In the ServiceNow world people think in terms of services, so how would this come about in the platform? Is it something out of the box, or do they have to create some services around this?

So at the very bottom you'll see the ServiceNow Event Management application service. That contains all of the components that you've purchased and that you're using for ITOM Health, essentially: Event Management, Metric Intelligence, Health Log Analytics. Those components that you use, even things like the MID Servers and connections to various monitoring tools, are modeled automatically as an application service in the CMDB. Today we don't ship the related service offerings, and we don't ship the overarching business service, so this is some modeling that probably needs to take place within the organization. What we have here is a pretty high-level overview. In reality the customer may have a more complex environment that they need to deal with. If we take something simple like catch and resolve automation, we know that we want to get really good at proactive alert management, and we know we need to address dispatch, notification, and response automation. But the fact is there may be many departments and many teams within the IT organization. There's also the potential for multiple managed service providers that are assisting the enterprise. So there's definitely room here to have more than one service
offering, even just related to catch and resolve automation.

I think one thing you mentioned is that it could also be based on the teams and the applications they're managing for this specific offering. So let's say Event Management or AIOps from ServiceNow is managing a specific set of applications; they can create an offering around that, right?

That's correct. And the offering would include things like what your commitments are. That could be the availability that you need throughout the year, so maybe you're committing to 99.99% availability. It's up to the implementer to decide what you're actually committing to. It could be something like, I'm promising to have five people with Java programming skills in the service desk. So you need to come up with the commitments that are relevant for your organization. Then there's also a varying degree of who the subscribers are. A lot of the services that we end up monitoring, and that we're doing this catch and resolve automation with, could be other shared services internally, like expense reporting, or it could be something for consumers, like a mobile banking application.

I like this, because I think we start to think about AIOps as a product, right? That's really what you're doing here. If you're monitoring it as a product, you're going to report on it, and you're going to show us that a little bit later. It is continuous product management, in a way.

That's actually one thing that all the customers that are really solid on AIOps talk about. You mentioned the word continuous. For all the AIOps customers that we have, that's one common theme: they're really talking about continuous improvement. So that's what we're going to get into a little bit today: what are some of the things you're going to look at to really drive continuous improvement?

Okay. There you go, I queued you up. Go ahead and take
us to the next level.

Very good. So we're looking at a product called Digital Portfolio Management. I've modeled the AIOps service, and what we're looking at is the availability of the service itself. The AIOps service can be quite complex: you may have many underlying monitoring tools, and you may have many different integrations that are implemented, so all of that needs to be monitored and managed as well. It's kind of a question of who's watching the watchdog. Then there are related KPIs for that overarching service. It's things like mean time to resolve, or how many requests you are getting to do additional things with that service. Maybe somebody wants an additional type of notification, maybe somebody wants new policies for monitoring, or to install agents, and things like that. So there are many different types of tasks associated with the services and service offerings. We can't see it on this screenshot, but one thing that's interesting here is that the tasks or the KPIs related to the underlying service offerings bubble up to the overarching business service. So you can really manage in detail all the different requests, for example, that you have related to the service offerings.

And just for our listeners: Digital Portfolio Management, which we sometimes call DPM, is part of the IT Service Management portfolio, right?

That's right, it's part of ITSM. You can see on the top left there, under AIOps, it's got Plan, Build, and Run. When you're looking at these KPIs, you will often realize that there's additional demand. Within this interface you can go in and create a new demand, and that's the pipeline to your projects. You go to the demand workbench and you decide which demands you're going to act on and what profile is going to be able to act on them. Is this
something that we need a programmer for, for a really advanced remediation? So you can really start to manage the whole project and the whole life cycle of the AIOps service itself.

Okay, sounds good. So now that they have created this, created a demand, let's say, and created the hierarchy of the AIOps product, I think the next step is that we start to look at how to start reading the improvements. What does the reporting look like? What kind of performance indicators do we start to ingest over time?

So we have the benefit of this common data model. Inside the common data model we've got all kinds of organizational information: different teams, hierarchies, reporting; all the assets and configuration items that are available in the on-prem and public data centers; all the tasks related to those objects; and then the other telemetry in the form of metrics, events, and logs. We're able to use that data from the common data model to create really any kind of dashboard that you could imagine. But the things that we're looking for most often are the North Star KPIs for AIOps and ITOM, and then making sure that we're doing those really well. Then there are more detailed KPIs in addition that could be interesting. We ship many different KPIs with the product. When you log on to the Service Operations Workspace, we do have reporting that's available right there. It's really nice: you can work your incidents, you can work your alerts, you can get some reporting. In this case we're looking at some real North Star KPIs for AIOps. So, average MTTR for incidents created by Event Management: that number is 509 here, just some basic demo data that I have in a system. You're looking at things like the most critical services impacted, in hours: how many hours were these services that you're
monitoring having a critical impact? And then also noise reduction. Noise reduction is really the case for Event Management: we're getting huge numbers of events, then we're getting far fewer alerts, and we're grouping those and doing probable root cause analysis, and then we're identifying an actionable alert. If we identify that actionable alert, and it's important enough and the conditions are correct, then we can go ahead and automatically open up that proactive incident. But we're really good at making sure that it's not noisy. So this noise reduction is literally that whole pipeline between event and incident.

One thing that I meant to ask before is about customers who are still on their CMDB journey. They may feel like they don't have a fully ready CMDB to start doing AIOps or to start collecting some of these metrics. Are they still able to do some of it as a starting point, or do they need a good CMDB record to drive this result?

That's a very good question, and really the answer is you don't have to wait for your CMDB to be fully populated before you get tremendous value out of AIOps. We're very good at organizing the data with things called alert tags. We've got automated correlation and a lot of machine learning that we're applying to these different flows of telemetry, so we're still able to bubble up the correct actionable alerts and proactively create just a small number of incidents. Typically, what customers would do is start with Event Management. Having said that, as you start to add information to your CMDB, we'll take advantage of that. If you, for instance, make a service map, we've got the top-down topology: we see all the components that are used to make that service and how they're related to each other, if they're
upstream or downstream, if there are multiple components, if it's in Kubernetes or wherever it is. We'll take advantage of that information also, and we'll use it in our probable root cause analysis routines, which are really always running. So there's additional value that you'll get with AIOps as you add information into the CMDB.

Awesome. And that reminds me that some of that we already covered in the second episode, where we talked with Door and Caras about the architecture and showed you some of the product as well, including Now Assist for ITOM, the alert analysis tool driven by GenAI.

Yes. So with the alert analysis for ITOM, if you don't have any information in the CMDB, we don't have it there to send to the prompt, but we're still able to get, from the alert descriptions, a human-readable, easy-to-understand response from generative AI, and we're also able to go ahead and give a recommendation for a remediation. Now, if we had the configuration item there, then we could deploy an automated workflow, because we probably have the credentials, and then we can go ahead and complete the loop there and do automated remediation.

Nice. Here are some of the AIOps KPIs that come in the AIOps Experience app, right? Do you want to talk quickly about these?

Yes. There is a surprisingly large number of automated KPIs that we ship with the product, and many of them have to do with alert management. So if you're trying to fine-tune your AIOps and you're driving those improvements, you can easily use these KPIs that we ship out of the box, and there's a large number. There are also KPIs for ITSM, and Digital Portfolio Management ships them too. Many of the ServiceNow products, I would say all the workflow products, ship these KPIs. This is within Performance Analytics, and it's a great place to go to really identify opportunities for improvement.

Awesome. And then finally,
here's something else, some more indicators you wanted to talk about.

Yeah, I did want to talk about what you could potentially see when you drill in. This is, I think, a big advantage of being on the ServiceNow platform: these KPIs can ultimately be built with varying types of data within the common data model, and that allows you to really drill in and identify opportunities for improvement. Typically you would want to look at teams, for instance: which teams are performing well, compared against each other? Which services are performing well, the technical services and application services? You're trying to identify what's working well so that you can replicate it, but you're also able to identify what's not working well so that you can act on it. In this screenshot we can see that there's a number of things going on. We're looking at the average time to resolve, which in this case happens to be for incidents; there are many KPIs available. Within Performance Analytics you can do things like click to show the trend: whether there's an emerging trend, whether it's trending up or down, whether that's good or bad. You can put targets in there. Say I've identified an opportunity for improvement and I'm trying to act on it. For instance, I would like my target for average time to resolve to be 0.005 days. It'll show up on the report there, and then I can see whether I'm trending towards or away from that target. So it's pretty interesting. There are nice forecasting capabilities also, so I can take a look into the future. I can look at multiple services at the same time, or I can drill into a very specific service. It's very powerful.

Yeah, I think for a lot of our listeners this might be new, right? Both DPM and the AIOps Experience are going to be new. So what
would you say to them? What next steps can they take, and how can they learn more and engage more? What are your recommendations?

So for existing customers who want to try this out: first of all, you should just upgrade. This is in production, and it has been out for quite a while now. The AIOps Experience includes multiple things beyond the dashboards. We did look at some of the dashboards. The dashboards that I showed you are out of the box; you can easily take those, press a button to duplicate them, and then add, remove, and modify in whatever way you see fit. They're also good templates. Beyond the dashboarding for different types of measurements, including things like VMware, Azure, AWS, or Google Cloud, we have additional functionality here in the form of something called Express List. It's a live, pausable list of alerts. It's the most modern way to go in if you're really interested in working with alerts: you can stay in that one interface without opening up many different tabs to get your work done, and everything that you need is right there. Then there's something called Launchpad, which is really a huge improvement over the way that we onboard third-party data. All of this is happening within the Service Operations Workspace, so AIOps Experience is literally an additional upgrade to the Service Operations Workspace. It's in production already, so you're ready to go: you can run it in production, and if you want to test it out first, run it in sub-prod.

Awesome. All right, I think this is a great recap, and I think this is a lot for people to start thinking about on the continual improvement topic of AIOps. Thanks, Jason, for talking about this. This is a great, succinct way of getting the information across. I guess this is the last part of this AIOps series, but we're going to have a lot more topics being discussed, and we
are actually going to get Jason back here for more discussion. So thanks, everyone, for listening in. Thanks, Jason. Bye for now.

Thank you.

Okay, goodbye.
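
As a rough illustration of the KPI arithmetic mentioned in this episode, the Python sketch below converts the 99.99% availability commitment into a yearly downtime budget, converts the 0.005-day time-to-resolve target into minutes, and computes a noise reduction percentage across the event-to-incident pipeline. The function names and the sample event and incident counts are hypothetical, the noise reduction formula (one minus incidents divided by events) is an assumption about how such a figure could be derived rather than the product's definition, and nothing here calls a ServiceNow API.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_pct: float, days_in_year: int = 365) -> float:
    """Yearly downtime budget, in minutes, implied by an availability commitment."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return (1 - availability_pct / 100) * days_in_year * MINUTES_PER_DAY


def days_to_minutes(days: float) -> float:
    """Convert a KPI target expressed in days into minutes."""
    return days * MINUTES_PER_DAY


def noise_reduction_pct(events: int, incidents: int) -> float:
    """Assumed formula: share of raw events that did NOT become an incident."""
    if events <= 0:
        raise ValueError("events must be positive")
    if not 0 <= incidents <= events:
        raise ValueError("incidents must be between 0 and events")
    return 100.0 * (1 - incidents / events)


if __name__ == "__main__":
    # 99.99% availability over a 365-day year
    print(round(allowed_downtime_minutes(99.99), 2))   # 52.56
    # Target of 0.005 days for average time to resolve
    print(round(days_to_minutes(0.005), 2))            # 7.2
    # Hypothetical counts: 10,000 events reduced to 25 proactive incidents
    print(round(noise_reduction_pct(10_000, 25), 2))   # 99.75
```

Seen this way, a 0.005-day target is a bit over seven minutes, which makes it easier to judge whether a trend line is realistically heading toward the target.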
https://www.youtube.com/watch?v=_dpZuQKVAv4