Share The Wealth: How To Use Event Management in ServiceNow
Hello everyone, thank you for taking some time away from your busy day to learn about the exciting world of event management. This is the document, and I'll share it out afterwards. It covers everything that you see here. I'm not going to cover everything, obviously, but: how event management affects the organization, recommendations for periodic service review, how to help the customer accept the adoption of event management, some of the integrations and interactions, and how one would actually implement event management.

Event management itself: events are not actually collected by ServiceNow. By that I mean ServiceNow is not a tool, and when I say tool I'm using air quotes. It's not a tool to go out and actually run an agent on your endpoints or your network infrastructure. It's meant to ingest data from the endpoint tools that a customer already has in place. So when talking about event management, a person will say, "What do I do with all the existing tools that I already have? I invested money in them." You still use the 800 different tools that the customer has. You have to integrate them, and there are out-of-box integrations, custom integrations via API, and storefronts that have integrations, to give the end user, which would be your client, the ability to take those events and bring them into the instance.

As you see here, all of these logos have integrations. There are always additional integrations out there, every day, every couple of days. If you go out to the community site, you can see a lot of customers will just create their own homegrown way to integrate, because all you're doing is actually just writing to the event management table. That's it. The instance takes care of all of the intelligence, which I'll touch on later with the event management piece. So integrations are not that difficult.

When actually doing the implementation, and I'll share this later, it's ITOM. I don't know how many people have actually done ITOM implementations. They can go really easy or
just extremely painful. It's one or the exact opposite.

So here's a little flow chart which I like to use: event management and what it touches. Event management touches a lot of the different components. There's a lot of overlap, ITOM to ITSM, so event management touches anything from service mapping to the analytics, orchestration, security, anything writing to the CMDB. Event management comes in and updates the CMDB based on events that happen, so it references and updates data sets in all these different modules.

Event management has what they call a service matrix. Out of the box there are different severities, and the question is how they map to what the end-user client has already defined: what does critical mean? This is a good chart to help relate events to what is currently deployed and how that workflow would look.

Here's an example of how an event comes in. An event source can be anything: any generic JSON file, SNMP, it could actually even be email, anything that would write to the table. If there's a way to get the data in and write to the em_event table, that's how an event gets in. Once anything gets written there, a process, a rule engine, will run and will create the event. Events come in, and an event is not an alert. An event comes in, and an event would look like something to the effect of this. So this right here is a lookup on the em_event
table. Event management will ingest anything having to do with, obviously, events and alerts, as well as metrics. Some customers will collect metrics, which could be important to them, especially when it comes to enterprise, Oracle-type applications or something that's mission-critical. They might be collecting metrics on Windows and Linux boxes. Event management has the ability to take in the metrics, take a look at the metrics, take a look at the events that come in, and, based on the events and the metrics, take a look at the date and time stamp.

There's what they call OI, or operational intelligence, built in, which will take a look and say: all right, every Tuesday, I'm using this as an example, every Tuesday at 3 p.m. an event comes in for this Apache cluster. And we know, based on things that we see from an event perspective as well as the metrics, that at 2:45 a certain script runs, CPU pegs out at 90%, and at 3 o'clock it goes down. Once it starts seeing these anomalies, it will actually do predictive analytics and will be able to tell you that there's a certain percentage chance that X will happen when Y occurs at this point in time. So the combination of the metrics along with event management gives you some insight into predictive analytics.

So here's an example. SolarWinds: if the event would come in, the event actually looks something like this. This is an application called Postman, used to post things by API. I have a little thing right here that I'm going to post to my instance. It has a source, node, type, resource, how severe the event is, and a description. So if I write this to, let's see, I believe it was my other instance. As you can see here, this one says SolarWinds, node, Linux type, what happened, resource, metric name, and I have something similar here. So I'm going to go to my dashboard, and this dashboard is a rollup of all the events that are going on and how they would map to business
services that have been defined. So if I wanted to emulate something like this and send in another alert, pretending that this specific CI, this server (it looks like it's a Windows server), is having some issues with CPU, I could do that here.

Node, and I'll touch on this in a couple of minutes, is the equivalent of what would be the CI name or fully qualified domain name of a server. If it is a server, that's what you would put there in that field for node, and then the CI will get looked up in the CMDB. That's how it does the correlation: with whatever is in node. But I'll post it anyway and let's see what happens. No, wait, description. All right, so it just sent the event. This is the one I just sent in. It didn't group it.

So a good example here: what a lot of customers like about event management is rolling everything up into a single dashboard. Event management has the ability to do correlation. It will correlate if it sees a lot of very similar things going on or affecting a CI in a short period of time. Even if it's from different sources, if it affects the same CI or the same business service, it will actually group them together so you can get to the root cause a lot quicker.

When a CI event comes in and you need to alert somebody, there are rules put in place which get run. When an event comes in, it's going to take a look and see what CI is affected. Once it sees which CI is affected, based on information that it has from your CMDB and its relationships (so discovery or service mapping), it knows what services are connected, what they're running on, what the application is talking to, basically what everything relies on. So if your services are mapped, or even if you have a manual business service that is mapped, so you have manually defined relationships, the
other relationships defined, it's going to show you what is affected. If you have a service map and you click on that service map, you'll see whether any component that makes up the business service has been affected by an event: alerts, outages, changes, so any type of changes that are going on, and root cause. If there's actually an anomaly or issue going on, you could click on the CI, and if there is an impact to anything beneath it, it will show in the root cause CI. This tells you that there's a very good chance that what's causing this issue is that specific configuration item.

That is the service map view, exactly what you see. You could see the alerts, impact, changes. This is nice, especially when it comes to event management, because you can actually go back in time across the top, from left to right, on any one of these maps. This is like TiVo. It's that timeline. I have it set to months, so this is a month's worth of time. Red is obviously bad, green is good, and of course here I didn't reset the date and time either. But what you can do is go back in time and see if anything is changing in the environment. If an event had just occurred, you could rewind and see what had happened. You could go back and see: all right, an alert happened here, but it was green at a different point in time. You could do a comparison (I don't have anything green right now, so that would be great) and say, what's different between these two points in time? It will do a compare on the date and time stamps of that CI, compare that to what was written in the events table, and then show you exactly what had happened: here are the differences, here's what changed.

It's very good whether your change management is good or not good. This gives you some insight. If something had happened over the weekend and nobody documented it, you could see: all
right, well, last week everything was fine, for the last month, and all of a sudden, starting on Sunday, this server farm started acting wonky. Why is that? You can actually go back and see, so you're comparing the CIs themselves. What I would do is compare this business service, Customer Management, and take a look and say, what alerts or changes to every one of these CIs have happened recently or are planned at this point in time, just by date and time stamp, and compare that to the state of those CIs on whenever it was, on the 11th of last month for example. Basically you're comparing two points in time, and it's showing you the alerts that happened during that time frame as well as any change records.

But it's not actually looking at, like, the discovery results of that particular CI to compare?

No. It's not going as deep as taking a look at the discovery log and seeing what has changed. It's just the state of active alerts and incidents.

It just helps you narrow down: hey, we had an approved change at this period of time that might have caused this problem.

Yeah. And if you have metrics, which might be calculated or collected, you could take a look and apply the metrics here too. You could say, all right, I'm going to take a look at this date and time, something happened, and you start dragging in the metrics that are being captured for those actual CIs. It gives you more insight into what's going on underneath the covers. And you could tie that even further into discovery: you could take a look and see what processes were running. When discovery was running, or when this CI was discovered, here are the running processes. So you can see the processes, correlate that to CPU, correlate that to events, and you could pinpoint what happened pretty quick.

The trick is getting there and helping the customer get to the point where they can make sense of all the
different tools that they have in place. Usually the best way to do that is to start ingesting all of the alerts, or events. Event rules are exactly that: the rules that are applied to events when they come in. When it's a net new deployment, what you want to do is ingest: work with the customer and figure out which ones potentially could be important. Once they start getting ingested, it will actually start making recommended rules based on the information that's coming in. As you can see here, it will say, hey, take a look at this, we see twenty-nine of these that look like this, maybe you should make a rule. So you can say, all right, let's do that, let's make a rule.

The rules are really easy to make. What do you want to name it? We will call it "test". Perfect. Source: what is the source going to be, and what order do you want to run the rule in? You could do nested rules if you would like, which are very helpful.

Event filter: these are populated based on the recommendation that the instance gave me. It said, all right, description will match a regex. Over here is a really nice regex GUI, so you don't even need to know regex. You could just highlight, click on whatever you want that selection to be, name it as a variable, and then it'll do the regex for you. It says: this string has to match this, but in this section I want to capture the name of the server. You just highlight it and it'll write the regex for you. So you just create the filter: it has to match these components. We obviously don't need it to be this granular: resource is MID servers and metric name is this.

If we want to transform it, this is where you do your transformation. Especially when you're doing your own custom or third-party integration, you're going to want to match the fields, and it can be done here. A lot of these are pre-populated because
again, it just uses the existing rules that it's recommending.

Then this is where you would put your threshold. This is where you would apply your filters to filter out a bunch of your junk. You could say, I want to filter out this one endpoint from this one source, because 99% of the time it's just flapping and throwing me junk. So I'm going to filter out the junk and make this active, but once it hits that hundredth time in two minutes, then I want it to alert me.

Then this is the binding, which I alluded to earlier when I was referring to the node. That field called node is what it uses to map to the DNS name of the server or endpoint. This is where you would define how it figures out what you're talking about in the CMDB. This is binding. Then you would hit save, and there's my rule.

What I recommend, if they don't have anything put in place, is threshold alerts for CPU, memory, network connections down. Sometimes customers will have agents which do monitoring of their web portals, to ensure that the web portal is responding in a timely fashion, along with latency. So I help them create those rules which use those threshold triggers: all right, if I get these types of responses in this many minutes with this count, whether it's page refreshes over 20 milliseconds, for example. Those basic ones are usually the ones I help them set up when you initially go out to deploy events, working with the customer just to figure out what's noise. That goes back to the whole threshold piece: just help them mitigate a lot of the noise. And work with them on the hardware events, anything that happens frequently and that they have a process for. Implement those first, because that gives you low-hanging fruit, some quick wins.
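The threshold idea described above (stay quiet on a noisy source until the same event has shown up 100 times in two minutes) can be sketched in a few lines of Python. This is only an illustration of the counting logic, not how the instance implements it. The class name `ThresholdFilter`, its parameters, and the choice to key on source and node are all assumptions made for this example.

```python
from collections import defaultdict, deque


class ThresholdFilter:
    """Hypothetical sliding-window counter: report True only once `count`
    occurrences from the same (source, node) fall inside `window_seconds`."""

    def __init__(self, count=100, window_seconds=120):
        self.count = count
        self.window_seconds = window_seconds
        # (source, node) -> timestamps of recent occurrences, oldest first
        self._seen = defaultdict(deque)

    def should_alert(self, source, node, timestamp):
        times = self._seen[(source, node)]
        times.append(timestamp)
        # Forget occurrences that have fallen out of the window.
        while times and timestamp - times[0] > self.window_seconds:
            times.popleft()
        # Keeps returning True while the window stays at or above the count.
        return len(times) >= self.count


# Small numbers so the behaviour is easy to follow: 3 hits within 120 seconds.
flapping = ThresholdFilter(count=3, window_seconds=120)
results = [flapping.should_alert("SolarWinds", "web01", t) for t in (0, 10, 20)]
```

With the defaults, `ThresholdFilter()` models the "hundredth time in two minutes" case from the talk. Timestamps are plain seconds here so the sketch stays independent of any clock.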
Deploy those, and they learn how to action upon them, they know how to use the system. It teaches them how to use the instance. Then move forward with whether or not nodes are up or down, and then go into those metrics with the performance thresholds.

Alert correlation rules: you could actually create alert correlation rules that are not out of the box, but a lot of the time, if you take a look at all of the alerts, the instance itself will group them based on severity. Severity has to be the same, sources will be different, and node or application would be the same, depending on what it is. If it's an application, it really doesn't have a CI mapped to it but it has components of a CI underneath it, so the business service in that case would be the same. What it will do is actually draw the correlation.

Oh yeah, here's another thing too. This is where it gets a little confusing, and to be honest with you I still have to completely figure this out: root cause analysis configuration. This is where it does all the intelligence and tries to determine root cause with some algorithm behind the scenes. It basically crunches all this information, comes up with the root cause, goes into the historical analytics, and will make recommendations. Like I said, to be honest with you, I don't know a ton about it, but that's what RCA, the root cause analysis, does. Other than that, I don't know exactly how it's configured behind the scenes.

Event rules are really just about translating data coming in from the outside into the record that gets put into the actual event table. Yes. And maybe weeding out: if your bot is spamming you, here's a thousand of the exact same event over the course of two minutes, well, we only need one, so only put one event in the table. What
handles whether or not those events across multiple sources become one alert or more than one alert is a separate type of rule, and that's the alert correlation rule.

Correct. If you take a look at the events, here are events that came in. An event comes in and gets processed, and the process itself will tell you how that event came in and what it went through, to see if there were any rules applied to it. So here: node will be resolved, CI type is empty. You could actually create an event sample or check the processing. If you're trying to figure out, all right, I'm trying to create this rule and I can't figure out why it's not binding the CI, you could see here why it's not processing, create a new rule, create a sample.

But the whole point of showing you this is: events come in, they get run through a couple of different engines, which try to figure out what to do with them. This is the actual flow of what happens when an event comes in. Is there a rule? If there's no rule, it's just going to write the event. It transforms it, figures out the threshold, maps it, tries to figure out what CI, and creates a net new alert or updates an existing one. That's it.

I believe I showed you guys the alerts console. Earlier I showed you the view which is all the alerts that came in, and the node. Think about the alerts console as the all-alerts view, but rolled up into something a little bit more intelligent. It will show you things that have been correlated or grouped together, kind of like what, Jody, you and I were just talking about. It will take these and group them, as you see here: group alert. If you expand it, it will show you different sources, SCOM and Nagios, but still it's all under one alert, because it sees an issue with CPU, memory, storage capacity. Something's going on, so it just grouped it up and said, hey, you've got something going on here.
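As a rough sketch of that flow (bind the node to a CI, create a net new alert or update an existing one, then roll alerts from different sources up under the same CI), here is a small Python model. Everything in it is hypothetical: the dictionary standing in for the CMDB, the alert key built from source, node, type and resource, and the function names. The real instance applies event rules and alert correlation rules that this does not attempt to reproduce.

```python
from collections import defaultdict


def process_event(event, cmdb, alerts):
    """Bind an event to a CI by its node, then create or update an alert."""
    # Binding: look the node up in the (stand-in) CMDB; None means unbound.
    ci = cmdb.get(event["node"].lower())
    # Assumed alert key for this sketch: one alert per source/node/type/resource.
    key = (event["source"], event["node"], event["type"], event["resource"])
    if key in alerts:
        alerts[key]["severity"] = event["severity"]
        alerts[key]["event_count"] += 1
        return "updated"
    alerts[key] = {
        "ci": ci,
        "severity": event["severity"],
        "description": event["description"],
        "event_count": 1,
    }
    return "created"


def group_by_ci(alerts):
    """Roll alerts up per CI, whatever source they came from."""
    groups = defaultdict(list)
    for key, alert in alerts.items():
        if alert["ci"] is not None:
            groups[alert["ci"]].append(key)
    return dict(groups)


# Made-up sample data using the fields shown in the Postman demo.
cmdb = {"web01.example.com": "ci-001"}
alerts = {}
cpu_event = {"source": "SolarWinds", "node": "web01.example.com", "type": "High CPU",
             "resource": "CPU", "severity": 2, "description": "CPU at 90%"}
mem_event = {"source": "Nagios", "node": "web01.example.com", "type": "Low Memory",
             "resource": "Memory", "severity": 2, "description": "Memory low"}
outcomes = [process_event(e, cmdb, alerts) for e in (cpu_event, cpu_event, mem_event)]
groups = group_by_ci(alerts)
```

The repeated SolarWinds event updates the existing alert rather than creating a second one, and the Nagios alert lands in the same group because both bind to the same CI.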
This is a rollup. It'll show you the group alerts and whether or not they're correlated, and you could see a time frame. Unfortunately I don't have a lot of data in here, but this would show you how the instance keeps track of everything. You see the different alerts, even if they're from different sources, to the same CI in a specific time frame. Once you start seeing all those patterns here, that's where it gives you some insight into some of the intelligence that's going on behind the scenes to make recommendations.

Well, that's all I had, and I appreciate the time. Thanks, everyone.
https://www.youtube.com/watch?v=OCj32vk_cUQ