
Now on Now: Predictive Escalations with Machine Learning

Unknown source · May 12, 2024 · video
- All right, (passerby chattering) welcome to this Now on Now session, which is Predictive Escalations with Machine Learning. If you haven't been to a Now on Now session yet, what is that? It is ServiceNow internal teams using our own platform and products to serve our customers. Safe Harbor Notice, please read fast. There's a quiz at the end. And we'll get to that in a few minutes. But if you all had a crystal ball, and there may or may not be one under your chair today, yeah, there isn't, they wouldn't fit in the envelopes, sorry (JP laughing), but if you had a crystal ball and could gaze into it and get out ahead of problems and challenges before they actually became escalations, that's what we set out to do, and that's what we have actually implemented today and are using to help prevent escalations with our customers. So we'll go through that journey of how we got there. But first, my name is Brian Wilson and I lead the escalation management team at ServiceNow. And that is composed of major incident managers, account escalation managers, and account escalation engineers.

- All right, thanks Brian. My name's JP Renaud. I run our customer support, so our technical support team. So if anybody has ever called in or opened a case because you needed some assistance with the ServiceNow product, that's my team. So along with Brian, one of the things that he was talking about in terms of the crystal ball is: what is a problem that you would like to get ahead of? What is a problem you would like to see before it happens so you can prevent it, either for your own company or for your customers? So we're gonna be talking about our three predictive models and how we've started to tackle this problem here at ServiceNow. So what does a traditional support experience look like? I'm the guy who has to present this and say, "Hey, the traditional support experience is not necessarily the ideal experience." When you talk to our support engineers, we want you to have an amazing experience, but the reality is probably most of you would've never had the problem in the first place if you had your choice. So, you know, you've experienced an issue on your instance, maybe it's a performance issue, hopefully it's not an outage. Your users complain, you all investigate. If you can't figure it out, then you reach out to ServiceNow. We investigate. Hopefully this transpires very quickly. Unfortunately it doesn't always. What if it never had to happen? So what we tried to target, and what we were challenged with, is: let's look very specifically at some of our nastiest P1 performance issues or outages. And what if we could get ahead of that? What if we could see early signs and determine something is about to happen? Maybe we can solve it ourselves, or maybe we can reach out to our customers and work with them quickly to solve it before the instance has a major issue. So Benjamin Franklin's famous quote after the Philadelphia fires was, "An ounce of prevention is worth a pound of cure." We wanted a little more than an ounce, but we definitely wanted the pound of cure. So we built three models. I'm gonna talk about the first two. The first one is what we call predictive performance alerting. The second is top transaction trending. In our industry we have to have acronyms for everything. We'll try not to use them. (Brian laughing) But the key to these two models is we use domain experts, right?
So our subject matter experts in the area, people who dealt with performance issues and dealt with outages all the time, they have a really good idea of what causes these. What are the early indicators, what are those signs? And if you go back to Brian's crystal ball question, hopefully you're thinking of a problem for your company that you may want to get ahead of. It's probably different from ours. Maybe it is the same. You have people in your company who are really good at resolving those. Those are probably the same people who would be really good at saying, "Hey, if you look for this, this, and this, those are the early indicators, and you can go and potentially get ahead of those issues." So one of the things that we did early on that paid huge benefits is we pulled in our domain experts and we said, "What are the early indicators?" So for predictive performance alerting: at ServiceNow, as you probably know, we alert on hundreds of different things across our cloud. And for our instances, they narrowed that down to about 40 specific alerts that, when you look at them in aggregate on a single customer instance, so you think about it vertically as opposed to horizontally across the farm, can give us some pretty good signs of a potential issue coming. And then top transaction trending, very different. This looks at the top 10 transactions in a customer instance, and we basically monitor whatever the norm is, 'cause the norm for one of our customers may be very different from the norm for another, and we're looking for a negative deviation off of the normal transaction time. Both of those two things will alert. Brian will talk about our machine learning model that came from that.

- Yeah. So in 2019, our out-of-the-box ML capability became available for us to start utilizing, which made it really exciting for us to see how we could take the models we had already made for PPA and TTT and feed them into machine learning. So how did we really do that? We took those alerts, we took those thousands of potential features or input metrics, and we divided them into three different categories and kind of made our way to a learning model for our machine learning. So first we took those alerts that JP talked about and we looked at proximity to their peak hours of usage. We looked at the weight of that alert, meaning how long did it last? So that duration, and how many times did it actually trigger in the timeframe or window that we're looking at? And then from there we split it into five groups. So we had these alerts, we got these metrics, we put 'em in groups of potential impact. Our domain experts, who used to do this very reactively and work with our customers on these critical business impacts, knew exactly the things that they used to see when they went and investigated, so they knew the types of alerts that carried much higher weight as far as potential impact, meaning an instance would be having a poor performance experience or just a poor end user experience. We aggregated that down into, like, the count, the mean, the median, and other measures such as linear regression and decay average. So think of it this way: a monitoring alert or something happens, it spikes and quickly goes away, versus something that actually spikes up, stays there, and then slowly goes away. There's much more impact from that type of event.
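For readers who want to see what this kind of per-instance aggregation can look like in practice, here is a minimal sketch, assuming a simple alert stream with made-up column names, a five-level impact group assigned by domain experts, and an arbitrary decay factor. None of this is the actual PEML feature set; it only illustrates counting triggers, weighting by duration, measuring proximity to peak hours, and computing a decay average.

```python
# Illustrative per-instance feature engineering over an alert stream.
# Column names, the impact groups, and the decay factor are assumptions,
# not the actual PEML feature definitions.
import numpy as np
import pandas as pd

# Example alert stream: one row per alert that fired on a customer instance.
alerts = pd.DataFrame({
    "instance":     ["acme_prod", "acme_prod", "acme_prod", "initech_prod"],
    "fired_at":     pd.to_datetime(["2024-05-10 13:05", "2024-05-10 13:40",
                                    "2024-05-10 02:10", "2024-05-10 14:20"]),
    "duration_min": [45, 5, 12, 90],   # how long the alert stayed active
    "impact_group": [5, 2, 3, 5],      # 1 (low) .. 5 (high), assigned by domain experts
})

PEAK_HOURS = list(range(9, 18))  # assumed peak-usage window for the customer
DECAY = 0.9                      # assumed per-step decay for the decay average

def instance_features(df: pd.DataFrame) -> pd.Series:
    """Aggregate one instance's alerts in the scoring window into ML features."""
    df = df.sort_values("fired_at")
    in_peak = df["fired_at"].dt.hour.isin(PEAK_HOURS)
    # Decay average: recent, long-lasting alerts keep more weight than brief,
    # older spikes that fired and quickly went away.
    weights = DECAY ** np.arange(len(df) - 1, -1, -1)
    decay_avg = float(np.average(df["duration_min"], weights=weights))
    return pd.Series({
        "trigger_count":   len(df),                   # how many times it triggered
        "mean_duration":   df["duration_min"].mean(),
        "median_duration": df["duration_min"].median(),
        "peak_hour_share": in_peak.mean(),            # proximity to peak hours of usage
        "max_impact":      df["impact_group"].max(),  # worst expert-assigned group
        "decay_avg":       decay_avg,
    })

features = alerts.groupby("instance").apply(instance_features)
print(features)
```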
So once we had these features, these inputs, and put them together, we worked with our machine learning team and fed them through a supervised model and then through an unsupervised model. And for us, the unsupervised model gave the best outcome. However, the domain experts that you'll continue to hear us talk about throughout the presentation, they're key. Because not only do they know the right metrics to look at and the historical situations they have worked on, but when you look at the output of your ML, is it hallucinating? Are there false positives? Are there false negatives? They're the ones that are gonna be able to quickly identify that and then work with the machine learning team or a data scientist team to better understand how to feed that back through your model and potentially either retrain it or turn it into a train-the-trainer type situation. So now that we have the predictive machine learning model, how is it actually hosted in our Now Platform? Well, we have our monitoring system that's out there listening across our data centers, whether it's the hardware or the software layer, and that's funneled into our event management system. Those alerts from the event management system then feed into our big data center, which is going to aggregate them, do trending on them, and group them together. From there, that comes into our predictive nomination engine and predictive escalations. This is where we can actually go in and set the thresholds and fine-tune those. An example of that might be the day after Thanksgiving, right? Very high load for our retail customers and our FinTech customers. So there are times that we will make adjustments based on just the nature of our customers. So as JP said, instead of looking across the farm, we're able to really dive in deep, whether it's a sector or a particular customer. From there it goes into our nomination engine, and this is the likelihood of an escalation, where our engineers are actually looking at it. And then it goes on to: yes, we do need to engage. This is a real issue that the customer may feel if we don't get out ahead of it. So the account escalations dashboard and application is really tracking that engagement, from the first time that we reach out to the customer to when we solve the problem. So if you are ever proactively contacted by an engineer from ServiceNow, there's probably a pretty good reason: we're trying to get out ahead of an issue that you don't see today. Think of it as you're driving down a straight road. You don't see that there's a 90-degree curve. You don't see it on your map, but they do, and they're trying to reach out to you to get out ahead of it and provide a better experience, not that poor--

- This is our shameless plug.

- Yes.

- All of our customers tell us, "We want you to help us prevent a major issue." But I'll just say, more often than you would believe, when we do reach out, the answer is, "Yeah, yeah, we're busy on really important stuff right now. That hasn't tipped over yet."

- Help us help you keep it from tipping over. (laughs)

- "Yeah, no one's complained about that yet. We don't see that as an issue." Well, we're trying to get ahead of that for you. So why use only one model when you have three, right? So this is the part where we created two great models with TTT and PPA. We funnel 'em into PEML, but we still use all three today. And the reason is because there are times where TTT is looking at something very specific and we're able to adjust it on the fly. And the same with PPA; they funnel into PEML, but it helps us retrain our models as well.
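As a rough illustration of the unsupervised route Brian describes, the sketch below scores per-instance feature rows with scikit-learn's IsolationForest and rank-orders the instances so that a tunable threshold decides which ones reach the nomination queue. The algorithm choice, the threshold, and the synthetic data are assumptions; the session does not name the specific model or thresholds ServiceNow uses.

```python
# Minimal sketch: unsupervised scoring of per-instance features plus a tunable
# nomination threshold. IsolationForest is an assumption; the session does not
# say which unsupervised algorithm PEML actually uses.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Stand-in for the aggregated features produced upstream (see earlier sketch):
# one row per customer instance per scoring window.
features = pd.DataFrame(
    rng.normal(size=(200, 4)),
    columns=["trigger_count", "peak_hour_share", "max_impact", "decay_avg"],
)
features.iloc[:3] += 4  # a few instances drifting away from their norm

model = IsolationForest(n_estimators=200, random_state=0).fit(features)

# Higher score = more anomalous, so the ranked list reads top-down.
scored = features.assign(risk_score=-model.score_samples(features))
ranked = scored.sort_values("risk_score", ascending=False)

# The threshold is tunable per sector or season (e.g. retail after Thanksgiving).
NOMINATION_THRESHOLD = ranked["risk_score"].quantile(0.97)
nominees = ranked[ranked["risk_score"] >= NOMINATION_THRESHOLD]
print(f"{len(nominees)} instances nominated for human review")
```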
So just because you start down that journey with something that might be a little bit more of a manual process, not using AI or machine learning, it doesn't mean that those actually need to go away. They can help train it going forward. So how are we doing? What exactly are our results? Back in 2019 is when we first started implementing this, and you can see from our escalation team, only 11% of our engagements were preventative, getting out in front of things for our customers. But if you fast forward to last year, 62% of our engagements are actually proactive in nature versus reactive. This obviously has provided a better experience for our customers, preventing escalations, but it has also reduced load on a lot of our cross-functional peer groups, whether it's support or the site reliability engineers who are watching monitoring alerts. So we're trying to get out in front of this more and more as we go. JP, would you maybe, you know--

- Again, I know that our specific case may not be the same as all of our customers' and those in the audience that are, you know, listening to us today. But the journey of how we got there, how we looked at our KPIs and made sure we got our domain experts...

- Yep.

- I guess wrap us up with some of the key takeaways.

- Yeah, so, thanks. So I think, like Brian said, your problem may be different, and at the beginning we struggled a lot. My boss is in the audience here, and one of the things he said, after about the seventh or eighth call where we were coming up with a lot of great reasons why you can't prevent this, if it was easy, everybody would be doing it, he said, "Okay, calls going forward, we're only gonna talk about what we can do." And then very quickly we came to some conclusions of, "Okay, well let's start here, right? Let's start here." 62% sounds really good. In a lot of scenarios 2% is really good. That's still a major benefit. So in hindsight we can hopefully share a few things. So, define the problem and the goal. Your problem may be something that you're trying to solve because the quantity of times it happens is very high, or it may be a low quantity but a very significant impact. We probably could have gone after a different subset of cases and hit a much higher volume, but we were trying to solve for as many P1s, as many performance issues and outages as we could keep our customers from having to have; that was an impact issue. So that's what we were trying to solve. Identify your domain expertise. I cannot emphasize this enough. The people in your company who know where the bodies are buried in terms of the problem you're trying to solve are the people that are going to get you started the quickest, and what we found out when we started engaging the machine learning experts and the data scientists is they were like, "Sometimes what they come up with is better than what the machine learning can come up with." Not always. Machine learning can often take that as a leaping-off point and progress even further, and that's what we found. But definitely identify your domain experts and bring them into the mix. And it's probably not just from one team. Collect, explore, and preprocess the data. There's a lot that goes into this, but once you start pulling in data scientists or people who really understand how to analyze the data, you'll figure out pretty quickly there are some outliers.
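A small sketch of the kind of exploratory screening that step implies, assuming the same sort of per-instance feature frame as above; the robust z-score and the 3.5 cut-off are illustrative choices, not something prescribed in the session.

```python
# Illustrative data-exploration step: profile each feature column and flag
# outlier rows with a robust (median/MAD) z-score. The 3.5 cut-off is an
# assumption; tune it to your own data.
import numpy as np
import pandas as pd

def robust_z(col: pd.Series) -> pd.Series:
    """Median/MAD-based z-score, less distorted by the outliers we are hunting."""
    mad = (col - col.median()).abs().median()
    return (col - col.median()) / (1.4826 * mad + 1e-9)

def screen(features: pd.DataFrame, cutoff: float = 3.5) -> pd.DataFrame:
    print(features.describe())  # quick profile of each feature
    z = features.apply(robust_z)
    outliers = features[(z.abs() > cutoff).any(axis=1)]
    print(f"{len(outliers)} of {len(features)} rows look like outliers")
    return outliers

# Example usage with a stand-in feature frame:
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.normal(size=(500, 3)),
                    columns=["trigger_count", "peak_hour_share", "decay_avg"])
demo.iloc[0] = [40.0, 12.0, 30.0]  # an obviously broken row
screen(demo)
```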
What is the right machine learning algorithm? It's not one-size-fits-all. Like Brian said, we did a supervised and an unsupervised model, and then there's a lot of training. But once that gets going, all of a sudden it starts telling you things where even your domain experts, when they first heard it, were like, "Oh, that's interesting. That does make sense, though. I wouldn't have thought about that." So something where maybe you were like, "This always happens," machine learning may come back and say, "Well, it does, but never without this. So you don't need both," right? And you start really quickly identifying false positives and false negatives and tuning your model, 'cause everybody who's ever dealt with alerting knows you can alert on a lot, and you can also create a lot of work that may not be necessary. And then build and start evaluating that model. And that's where I was talking about training and continuing to retrain. So we talked fast, hopefully everybody listened fast, but we wanted to leave a few minutes for questions. So if you have one, if you can just project loudly up here, we'll repeat your question and do our best to answer, and then they're gonna kick us off stage in 5 minutes and 23 seconds. But you can find one of us over here if you want to ask more questions. So anybody have a specific question? Yes.

Starting from this, can it create an incident or a task?

- Yeah, what's the outcome? Okay, could you jump back a few slides? So basically, if one of the models, we had three, and like Brian said, all three kind of pick up different things at different times. It actually, yeah, that was the right slide. It actually opens what would be the equivalent of a case. But we have a nomination request in our escalation system, and that particular piece is built on ITSM. So it's opening an incident in essence, but you could open a case, you could open whatever, and that gets assigned to an engineer for a human review of what the model said. The human review validates it quickly, and then that human is the one engaging with the customer. Great question. Others?

I got one. You guys have a health page, by any chance? Like something that tells you the health of your systems or anything like that?

- Oh, oh yes. Yeah. So the question was, do we have a health page? Do you wanna talk about, like, some of what our GCS team looks at all the time?

- Yeah, so internally, yes, we do have a health page. We have, you know, dashboards and metrics that we're always looking at. But are you asking from a customer standpoint, what are you able to look at?

So I'm thinking of: a customer looks at the page and can say, "Oh, I've got an outage," and then looks at your health page and it can say, like, "Healthy, dead, healthy, dead." You know what I'm saying? Maybe it's a light that's red or green.

- Right, yeah. So we actually have telemetry that we are always looking at from an instance standpoint, a data infrastructure standpoint. So to answer your question, yes, we do. Now from a customer standpoint, there are things that you can look at when it comes to your transactions.

- Instance Observer is what comes to mind for me.

- Instance Observer, which I think there is a, it's over at the Impact booth.

- Yeah, so go by the Impact booth and they've got a demo of Instance Observer.

- Which allows you to see a lot of similar metrics to what we see on the back end.

- Great question. Yes.
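Connecting back to the earlier answer about the nomination opening "the equivalent of a case": in a generic setup outside ServiceNow's internal ITSM app, the same idea could be sketched against the public Table API, as below. The instance URL, credentials, and field values are placeholders, and this is not the speakers' actual implementation.

```python
# Minimal sketch: file an incident for human review when a nomination clears
# the threshold, using ServiceNow's Table API. Instance URL, credentials, and
# field values are placeholders; the speakers' internal flow runs on their own
# ITSM-based escalation app rather than this generic call.
import requests

def open_nomination_incident(instance_url: str, user: str, password: str,
                             customer_instance: str, risk_score: float) -> str:
    payload = {
        "short_description": f"Predictive escalation nomination: {customer_instance}",
        "description": (f"Model risk score {risk_score:.2f} exceeded the nomination "
                        "threshold; needs engineer review before customer outreach."),
        "urgency": "2",
        "category": "Performance",
    }
    resp = requests.post(
        f"{instance_url}/api/now/table/incident",
        auth=(user, password),
        headers={"Accept": "application/json"},
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["result"]["number"]  # e.g. "INC0012345"

# Example call with placeholder values:
# number = open_nomination_incident("https://example.service-now.com",
#                                   "integration.user", "****",
#                                   "acme_prod", 0.91)
```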
You're meeting your quota too, though. Can I do capacity management in some way or the other through the output of this?

- In our model it doesn't have anything to do with capacity, but I don't see why you couldn't.

- Right? Yeah, absolutely. I think that would be a great use case for this.

Okay, sorry, I just wanted to ask about, when you get the quality of the data, are you getting it all from one place, or are you getting it from multiple systems brought into a big data store and then reprocessing it into the system, to train everything and then reuse it as the data is processed?

- The second thing you said.

- So, multiple points coming in through event management and kind of our big data model. And then we also use our platform to then apply weighting. Correct?

- Correct. Yes.

- So the weighting determines kind of a score. Like, I'm getting a little, you know, under the covers here a bit, but the weighting determines the score, and then it's kind of a rack and stack of, like, this is the instance that's probably most likely, and then it goes down from there, right? And then we can build a threshold wherever we need; like Brian mentioned, we move the threshold sometimes. It's a two-part process.

- Yep, yes.

- Correct.

- Two-part process. Correct.

- And, you know, ServiceNow has event management and alerting that do trigger and go straight to cases, and technical support engineers work on those. What we are tasked with is the stuff that's slowly getting to the point where we think it's going to tip over. That was harder to predict versus the things that happen just like that. And, you know, obviously if it's a network outage or something like that, you can't predict it. What we were looking at is that slow progression from, like, "I'm having a poor user experience" to "now it's critical business impacting," and that's what we're trying to predict and get out ahead of. Yeah. Thank you.

It's really good. Good session. Are you satisfied with sort of the continuous improvement and retraining of the models, particularly from an automated perspective, versus having people have to go in and tweak the model because the data's forcing you to do so?

- Yeah, yeah, I think we generally have only had to retrain it once or twice per year. And that's usually because of new features or something new that we're able to get from a telemetry standpoint to help feed it. But for the most part, yes, we're very happy with it. We're so happy with it that I would say we have to turn it down, because we only have so many resources that we can actually go and engage, or the customer's not willing to engage with us. So that was that shameless plug: if you hear from us... (laughs)

- But I do think, yeah, 100%. I also think we've talked about, like, adding additional parameters as well.

- Because like Brian said, there are things that are kind of slow burns that we can get ahead of a little easier if there's more time. What if it's a little bit of a shorter burn? What are our options there? What are some of those late metrics, right? And then our bigger challenge is how do we get a human involved with another human to go solve the problem? Because, you know, we may see something, a query that's just about to shut down an instance, and we're not gonna just go off and turn off somebody's query, so how quickly can we engage? So I think in those scenarios we will retrain again, but that's generally when we add additional parameters.

- Yep. Well, hey, we're outta time, JP.

- All right, we're outta time. Brian will be over here. I have to run, but great questions and thank you all for coming.

- Yes, thank you. (audience applauding)
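For completeness, here is a minimal sketch of the two-part process described in that Q&A answer: weighting first, then the rack-and-stack with a movable threshold. The weights, metric values, and threshold are all illustrative assumptions, not ServiceNow's actual configuration.

```python
# Minimal sketch of the two-part process: expert-assigned weights turn aggregated
# metrics into a score, then instances are rank-ordered ("rack and stack") and a
# movable threshold decides who gets nominated. All numbers are illustrative.
import pandas as pd

# Assumed weights per aggregated metric.
WEIGHTS = {"trigger_count": 0.2, "peak_hour_share": 0.3,
           "max_impact": 0.3, "decay_avg": 0.2}

metrics = pd.DataFrame(
    {"trigger_count": [12, 3, 25], "peak_hour_share": [0.9, 0.1, 0.6],
     "max_impact": [5, 2, 4], "decay_avg": [38.0, 4.0, 20.0]},
    index=["acme_prod", "initech_prod", "globex_prod"],
)

# Part one: the weighting determines the score (min-max normalize, weighted sum).
normalized = (metrics - metrics.min()) / (metrics.max() - metrics.min() + 1e-9)
score = sum(normalized[col] * w for col, w in WEIGHTS.items())

# Part two: rack and stack, most likely escalation first, with a movable threshold
# (adjusted by sector or season, e.g. retail customers after Thanksgiving).
THRESHOLD = 0.6
ranked = score.sort_values(ascending=False)
print(ranked[ranked >= THRESHOLD])
```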
View original source

https://players.brightcove.net/5703385908001/zKNjJ2k2DM_default/index.html?videoId=ref:SES1396-K24