NJP

Beers With Cloud Engineers - Episode 24 - Integrating Cloud-Native Apps, Featuring CSDM as Code

Import · Apr 26, 2024 · video

All right, fabulous. Well, everybody, welcome to Beers with Engineers, session 24. Very excited. We officially celebrated the two-year birthday on St. Patty's Day earlier this year, but this is kind of the two-year mark of sessions. Really nice to be here with all of you. I tell everybody this is like hanging out with my friends at work every month, so it's really good to be here. Thanks for joining us today. We're really excited about what we have to talk about. You want to kick us off to the next slide, Will, and we'll talk through process?

And so of course we always have to start off with the Safe Harbor notice. We always talk about things in here that may not be fully developed, and they may not come to fruition exactly the way we talk about them, because they might not be GA yet. So don't make any buying decisions based on anything you hear here. It's really purely about how we enable you guys.

Moving on to the agenda: we always keep it very casual and informal here, and we'll dig through these as we go. Why are we here, who we are, then we'll hand it over to Matt Morris with RapDev for a tech deep dive that I'm very excited about today, and then as always we'll wrap it up with a Q&A. Certainly, anybody who has questions throughout, feel free to come off mute if you have something to contribute or to ask.

Yep, 100%. It's always casual and informal here. And so why did we do this? We built this because we really felt like there's not enough community communication around the intersection of ServiceNow and cloud-native functionality and capabilities. So the goal is to build a community around what the capabilities are and what things are doing, to talk through on a very deep technical level how things are operating and working, and to enable you guys to also have a community to talk
through with each other. So, really happy that so many people have joined us on this journey.

Yeah, so who we are. I'm Mike Gager. I'm the manager of the Enterprise Applications team at DRW, a principal trading firm based out of Chicago. I am a tech head from way back, and I love pretty much anything nerdy. IT operations management and Kubernetes are my two favorite things to play with on a regular basis. I know it seems crazy. I am big on solving business problems intelligently and efficiently with tech. I am a jiu-jitsu nut job, I love it dearly, and I play board games with my family on a very regular basis. Lots and lots of fun. Actually, I think I'm going to train some jiu-jitsu while I'm out in Vegas for Knowledge this year, so anybody else who's interested, hit me up. I am not going to drink a beer today, again because of the jiu-jitsu thing, so I'm sticking to water today. Will, over to you.

Thanks, Mike. Hey everybody, Will Ham, ITOM architect here at ServiceNow. I've been working in various parts of IT for the better part of three decades, currently focusing on IT operations management and a lot of automation, specifically around cloud-native technologies. I just love to find repetitive tasks and automate them so that nobody has to worry about doing them by hand ever again. In my spare time I like hanging with my family, playing some pickup hockey, and video games. Today I'll be enjoying a single order hazy IPA from Community Beer Works.

Before we get into our main program, I just want to give a quick shout-out. It's something that we did last year, and folks were asking about it in the first couple of sessions this year: we are going to have a live Beers with Engineers session, thanks in no small part to our friends at RapDev. The information is on the slide, and here's a handy QR code which will take you to the signup page. Space is limited, so if you are
going to be at Knowledge, please don't delay if your schedule can accommodate it. Really excited to be doing another live event with actual live beverages. It's going to be pretty cool.

Super excited. And definitely share, right? If you have other folks on your team who are going to be out in Vegas and who maybe don't make it in for these webinars, definitely let them know so that they can come join us.

Yeah, 100%. I'm told the room accommodates 150 people, so we've definitely got some space for some folks. Spread the word if there are folks with similar interests who are going to be at Knowledge. With that, we're going to hand it over to our esteemed guest speaker, Matt Morris from RapDev. Matt, want to do a quick intro, and then the floor is yours.

Appreciate it. Yeah, happy to be here. I used to work with both Will and Mike when I was at ServiceNow, and I'm excited to talk to you guys today about tying together a lot of different parts of the platform. We don't tend to see really good end-to-end demos of these things, and since we're having this conversation a lot at RapDev with our customers and prospective customers, it was something that I've shown to some people at ServiceNow, with a lot of excitement from different kinds of audiences. Like I said, I think it is a unique demo in that sense, and it really helps show what's possible when you pull together a lot of different pieces of the platform.

It is a live system we're going to be demoing today, so I'll show you a little bit about that as we go. It ties in a lot of different pieces, including OpenTelemetry and DevOps Change Velocity for change registration from CI/CD pipelines. It ties in a solution that we have, which is patented and is on the Store, called CSDM as Code. It also ties into Event Management, Major Incident Management, some ChatOps
stuff and some automated remediation, so we'll show that end-to-end story.

Just quickly, who RapDev is. I'm not going to bore you guys with slides, but we are a ServiceNow Elite partner. We have a lot of different customers in all different industries that we work with, and we have quite a few custom plugins and integrations that are on the Store. Our specialty is in the Ops space. A lot of us came from Cloud Ops, cloud engineering, and different types of Ops roles like SRE, and some of us were engineers before this. So that's the specialty we bring to our customers, and really making ServiceNow sing as part of that whole ecosystem is what we do. We've done a lot of work with a lot of different customers. You can see some of the numbers here as far as the different types of situations we're in. We do a lot of Kubernetes discovery; I know Kubernetes is close to Mike's heart. We do a lot of automation, which seems to be close to Will's heart based on his intro. These are the things that we work on every day with our customers.

You can see some of our customers here. We've worked with a lot of customers and have some cool success stories that we'll be talking about. We have several sessions at Knowledge, and we'll be with you guys at Beers with Engineers at Knowledge as well, so looking forward to it. Part of the way that we're successful is we bring a fresh approach to deployments. We don't do these big-bang deployments that take six months before you finally have something. We deliver value along the way. So that's the little spiel about who RapDev is.

Today what I'm going to do is start with a working system. It's a website that sells astronomy gear, and it's built on OpenTelemetry, so we'll talk a little bit about OpenTelemetry as we go. I'm going to make a commit and break something, and then we're going to show how
does that change get registered, and how does the deployment then finish once the change is approved by DevOps policies. We're going to show how we register services in a declarative way. As all of that's happening and we're understanding the connections between our CI/CD toolchain and ServiceNow, we're also showing how we interact with ServiceNow without even being in ServiceNow. Engineers tend not to love hanging out in the ServiceNow UI all the time, so we want to meet them in tools they're comfortable with, and we do that.

Once this gets into prod, it's going to get discovered by a few different tools, including ServiceNow Cloud Observability, which was previously known as Lightstep. We'll have generative AI tied in along the way to help with summaries, troubleshooting, and validations of groupings and things that I'll show you. We're going to work on that incident through Teams. Since it's a major incident, we have some ChatOps stuff we can do. Then as we get through and figure out what the problem is, we'll actually do automated remediation against that issue, and then show how this all ties back to DORA metrics and things like that. So what's the value of this whole system in terms of bringing it all together into one place?

If there are any questions, feel free to drop them in the chat as we go and I'll try to keep an eye on that. Will and Mike might be able to jump on some of those as well. This is the overview of what we're going to run through. I think it helps ground us a little to look at a slide, since there are several things we're going to move through.

What I'm going to do now is show you the website and how I'm going to break it, and then I'll do the commit to break it. The pipeline takes about three minutes to run, so we'll talk through a couple of other things while it's running. This is the website. It's essentially
an e-commerce website. It has this front page with products that are popular right now. We have a bunch of traffic constantly being generated against this website, and that is what the customers are doing: they come in, they look at a product, they might add it to the cart, and then they can go through, fill out the form, and submit it. That goes through shipping, payment, tracking, fraud detection, everything like that.

When we're on a product page, you might have noticed that if I scroll down a little bit there's a section called "You May Also Like" that has products I might be interested in, based on the fact that I'm looking at this one. This is a heavily microserviced architecture that we're looking at here. I think there are 13-ish services and probably 10-ish languages. The microservice that delivers this particular little widget, with the "You May Also Like" products, is pulling from the product catalog, but it's actually called the recommendation service, and it's written in Python. That's what we're going to break right now.

That's probably enough said about the website. I don't want to spend too much time on that, and I want to get into the commit so that we can see how it flows down once we break something. I'm just going to edit this right in GitHub, right in the browser, for simplicity's sake and to make it easier to share. I have a section in here that, if I uncomment it, is going to break this service. I'm not going to explain anything about why I'm doing this or what I'm doing exactly when I commit it. The only thing I'm going to add to the commit message, besides what GitHub autogenerated for me, is a reference to an ADO story that I'm working on right now. That story itself is actually super vague too, but it will show how, in ServiceNow, this gets connected to the change to show the related story.
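As a rough illustration of how a story reference in a commit message can be picked up by tooling, here is a minimal Python sketch. It assumes the `AB#<id>` mention convention that Azure Boards uses for linking GitHub commits to work items. The session does not show the exact text Matt typed or how the ServiceNow DevOps plugin parses it, so the pattern, the function name `extract_work_item_ids`, and the sample message are all illustrative assumptions.

```python
import re

# Assumption: work items are referenced as "AB#<digits>", the mention syntax
# used by the Azure Boards integration for GitHub. The real demo may differ.
WORK_ITEM_PATTERN = re.compile(r"\bAB#(\d+)\b")


def extract_work_item_ids(commit_message: str) -> list[int]:
    """Return the unique work item IDs mentioned in a commit message, in order."""
    found: list[int] = []
    for match in WORK_ITEM_PATTERN.finditer(commit_message):
        work_item_id = int(match.group(1))
        if work_item_id not in found:  # skip repeated mentions of the same item
            found.append(work_item_id)
    return found


if __name__ == "__main__":
    # Hypothetical commit message: autogenerated subject plus a story reference.
    message = "Update recommendation_server.py AB#1234"
    print(extract_work_item_ids(message))
```

A change-registration step could use IDs like these to look up the related story and attach it to the change record, which is also what makes a "commits without stories" policy check possible.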
So that's one little nugget about how DevOps works, the DevOps plugin in ServiceNow. But I'm not going to explain anything about what I'm doing or why I'm doing it, and the reason is that we're taking all that work and all that reliance off of human data as much as possible in this kind of implementation. Instead, we're using tie-ins to generative AI to validate certain things, to figure out what's been changed, and so on. I'll go ahead and commit that change.

While it's running, I think I saw there was at least one question that popped up. Nope, it's just saying to register for Beers at Knowledge. Definitely do that. I heard that there's going to be unlimited beer for free, is that right? No, I'm just kidding. But definitely register for that.

So that pipeline's running. While that's running, I want to highlight a couple more quick things about how this works under the covers, because any second we're going to get the ping that says, hey, some stuff's broken, and then we'll jump into figuring out what's broken and how to fix it.

While that's happening, I want to show you a little bit about CSDM as Code. Again, this is a patented plugin that we have on the ServiceNow Store. It was designed, actually, by employee number one at RapDev. Essentially, what this does is allow us to gather really important information, which can often be hard to get, about the services that we're building: number one, from the people who are in the best position to provide that information, and number two, in tools that they're already comfortable using, without trying to teach them some other tool.

What I mean by that is, a lot of times, to get this kind of information (what service are you building, who owns it, what are the tags, and things like that), we would normally have to get the answers from somebody in the platform team
that works on ServiceNow trying to go out, find the people who work on the service, ask them questions, and write it down, and then either bring that back to ServiceNow or tell those engineers, hey, come and figure out how to use a new UI in ServiceNow where you have to work out how to register this whole thing. Our experience with teams that are moving fast and, honestly, breaking stuff in healthy ways, is that they don't want to get out of this tool. They don't want to get out of their flow. So what we're looking at is how we can make that friction minimal and still get what we need to drive the outcomes we care about, which are keeping services running, keeping our customers happy, and making sure that in the process the friction and the cost are low for our organization. We're not paying engineers to enter service details into ServiceNow. We're paying them to do the development work.

So that's all we do here. We provide a schema, and the schema is tailorable. We're going to talk more about this at Knowledge, so I'm not going to spend a ton of time on it today, but this schema can be tailored in the implementation, and you can even set up mappings and things like that. All we're doing here is defining that we need some attributes about the services that are being built in a given repo. We drop this in, and the developers who are actually working on that repo fill it out. They give names for the general business application that this supports, who owns it, who the tech owner should be, and, if something breaks or a change is logged against it, which group that should be routed to. Then we talk about the actual instantiated microservice itself, which here is the example of the accounting service. We have one as well for the recommendation service, and we can see here it's basically the same, just different names for the services and
different tag values. So we're saying this is the service, this is who it's owned by, and this is who should deal with it if it breaks. These are normally hard details to track down, so the fact that we can get this really easily, in a place that's comfy and easy for developers to fill out, is a game changer in that sense. That's definitely what we've found as we've deployed this for our customers too.

So Matt, just to make sure I understand this right: if the team that maintains this app decides, oh, we want the tech owner for this to change, all they have to do is change this code and put a PR in, and then it goes automatically through their same release process, and once that PR gets merged it'll update the CMDB. Is that right?

Yep, yep. We can do this completely separate from DevOps. In this case it is linked up with DevOps, but it's basically just a webhook that fires when a pipeline is run. When that pipeline is run, we look at the files that were involved in the pipeline run, and if they match the naming convention that we're using here, which in this case the default is now.
yaml as the end of the file name, we reach back out to the Git tool and ask, what are the contents of this file? We bring that back and parse it into ServiceNow, right into it, CSDM compliant. It matches all the recommendations as far as where data should go, so that the magic happens. That's what we're doing. And for this layer here, this object, the microservice, we're actually building out the tags for it as a tag-based service, so that automatically creates a tag-based service in ServiceNow. I'll show you in a minute, but as resources come in from whatever sources, whether it's Cloud Discovery, Service Graph Connectors, VMware discovery, or whatever, and they have those tags, they will get associated automatically to the services that were created through here. I'll pause there and see if there are any questions so far.

Yeah, I wasn't sure if it's okay to ask a question. Can you hear me okay? So you mentioned taking care of a new value for that attribute in the CMDB on our side. What's going to happen on the other side, in Kubernetes or whatever? Are the old values going to stay there, or do we have the ability to update them?

Are you saying, in ServiceNow, will this affect any objects for other CI classes?

No, no. My concern is if we're updating those values in the CMDB only and the source system has the original values, then the next time we source, or somebody goes into the table in ServiceNow and modifies them, and then somebody...

So like dueling updates?

Yeah. I would imagine you wouldn't want to do that. We sometimes set flags on these, so for some customers that we've worked with, we've set flags on these where we can make them read-only. That's an option. Or you can put some flag on them that notes they're from this source. But it's also not that huge of a deal, like if someone
were to come in and change one of the attributes, whenever it's updated again it'll overwrite those changes. In an implementation, those are usually the kinds of questions you have to answer: how do we resolve conflicts? In this case these are largely services that you wouldn't be using any automated discovery for, so it would have to be a person, and so that comes down to what the person process is for that.

Yeah, this is a data governance question for sure. There's also one other question that popped up in the Q&A: does CSDM as Code use the CMDB CI SDLC Component class as part of the Build domain in CSDM?

Yes. The main classes that this touches, in terms of how we've built it out of the box for the Store, are business applications, SDLC components, and tag-based application services. And like I said, it is extensible. Definitely a plug for Beers with Engineers at Knowledge, because my coworker, whose name is the first name on the patent, is going to be doing a session about how to make this extensible and how to use it for other types of use cases. It is completely extensible in that sense: you can map to other classes, you can bring in other attributes, you can do transforms and lookups and all that kind of stuff.

Oh, Matt, is it something where we can establish the CMDB CI relationships as well, from this file to other applications?

Yes. This does the relationships among the objects that we list here, and it also sets up a tag-based application service, and tag-based application services do create literal relationships between the parent and the children based on the tags.

Good questions, guys. Please keep those coming.

So this is the outcome. We get to a place where we have all of our declared services that came out of the YAML files. And in this case there's a plugin called the OpenTelemetry Service Graph Connector, and
we're actually connecting to that here. If we open one of our services that we know has a pod all the time, like our payment service, and look at its service map: we're not doing any Kubernetes discovery in this environment, either through another Service Graph Connector or through Kubernetes discovery itself. This pod was actually created by the OpenTelemetry Service Graph Connector, which basically takes trace data and converts it into objects, using different approaches for looking at Kubernetes attributes and things like that to create the different Kubernetes objects. And here, because they have those tags from Kubernetes, they're mapping up automatically to my tag-based application service that's created from the payment service.

So this is where we get down to a service map. Where this is helpful is that I may get an alert from a monitoring tool that says, hey, there's a problem with this pod, you need to go check that out. That will map up and say, oh, well, that affects the payment service, and it also affects the top-level astronomy shop that we're dealing with.

So if I'm an engineer, a couple of quick things. After I did that deploy, I probably wouldn't have been looking at all that other stuff. I probably would have just come over here and said, hey, what's going on in Teams? What's the latest chatter? Who's going to be going to Beers with Engineers at Knowledge? I'm checking all that stuff out. I come over here and I see, hey, my change was created automatically, it looks like, for the deploy that I did. I can see it's my OTel recommendation service pipeline, and I can see it was automatically set to approved. So honestly, from my perspective, I'm good. I'm not really worried about anything else. I know that that change was deployed. But had I been here a little sooner I
might have seen it pop up. We do see there's a new channel that showed up here, and this new channel is telling me that there is an incident that's come in against my team, because it's creating a channel here and automatically assigning me as one of the people to respond to it. Based on all the setup that's gone before, I know that it must be bringing me into it because it's about a service that I deal with. I can see it is my OTel demo group of services, and it's telling me there's a problem with the front-end service, which is the service that does a lot of the UI and stuff like that.

So I'm going to go ahead and start responding to this. I have options here where I can use ChatOps commands, and I can just say, let me add a work note. This is something we've deployed for several customers too. There are some out-of-the-box capabilities for this with ServiceNow that you can build out, and we have some specialized things that we usually deploy for customers too. I'll say, hey, I'll take this one, and that'll drop a work note on the incident in question. We'll go look at that incident in a second.

Before we go there, I might also want to know a little bit about the history of this CI, so I can run a command. We call it "get CI 30." It basically tells me about the last 30-day history of this particular CI. It looks like it's mostly just had changes against it, so this is pointing in the direction that it's probably the deploy that I did, and I'm going to want to figure out how to undo that as fast as possible.

The last thing I can do here: since we know there's probably a code fix that needs to go in, just off the top I'm going to go ahead and add a problem to research this further while I'm doing the best I can to stop the bleeding with the incident. So that problem task, and the problem itself, will be out there to make sure that this
issue gets resolved permanently. So let's hop in and look at what this looks like on the ServiceNow side for the incident that I got. I'll pause here and see if there are any other questions so far.

Cool. Awesome. So we'll see here where the Teams bot came in and gave us a note. We actually got a little bit more detail in here too, so this is our first example of a gen AI tie-in. What we're doing here is pushing out a lot of content. This is a ServiceNow feature called Alert Assist, where basically you push out a bunch of content about the alerts and you get back a summary of what's going on. We've extended that a little bit and said, hey, also give me ideas about the top three most likely causes for this and how I might go about starting to resolve each one. Because of the level of detail that we're able to push out about the logging alerts we've gotten here, and any other alerts that came in, it's telling us the first most likely thing is that an infinite loop has been introduced, and so it says, hey, go remove that. It's a good tip.

At this point we might want to dig further into where this is coming from. What was the source of this incident? It tells us over here the origin was a group of alerts, so we can open that group of alerts to learn more about what's happening. We see immediately there are six alerts in this group. The most recent symptom is that our front end is not responding sometimes. That's not good. It's probably because it's getting overloaded from things retrying those services. We can see the main affected services are the front end and that recommendation service, and then the top-level service that's impacted is our astronomy shop that we have for production.

So if we start to dig into this and look at our details, we get a similar summary of what's been happening so far. We can see our
incident that was created and everything like that, but I want to look at the related records to understand more about the underlying causes here. First of all, I can see the alerts in the group. I have several, and I'm going to look through them a little bit, but if I sort these by generation time I can see the first one that happened was actually this log error from log analytics. This is a product that ServiceNow calls HLA. If you're not too familiar with it, it basically ingests all your log data and looks for anomalies. So it found that anomaly first, before any of these other tools started to notice that something was wrong, and it said, hey, there's this "stuck in loop" error that's getting repeated. Then zic started to notice some latency...

And I want to jump in there really quick. Sorry, didn't mean to interrupt, but the log analytics functionality is actually something we'll also be deep diving on at Knowledge, so come to Beers live at Knowledge and learn more about that as well.

Yeah, totally. We should have made it a drinking game: you've got to take a drink every time somebody plugs the Beers live. I'm doing it every time somebody asks a question, so if you guys want to make this more fun, ask more questions.

All right, we'll do our best.

That's right. Well, since you asked: the gen AI piece that you just showed, it sounds like that's a value-add that you put on there, right? You're making a call out to what, ChatGPT or something?

Yeah, we have a couple of different LLMs that we use. We do use some OpenAI stuff, I use some Anthropic stuff, and I've just started to mess around more with Llama 3, so there are some pretty cool options for those. We've done several different versions of that for our customers in terms of solutions that they can pick from. It could be that you're dealing with just public GPT. It depends on what you're pushing out and how comfortable you are with it going out to the
public model. Of course, you can always tell the model not to train on it. A private model would be an option as well, or you can use the Now LLM, which is another option. So there are several options depending on your use case. Which one you pick is going to depend on those factors, and there are also cost factors involved and all that kind of stuff. So that's how we look at it. Pretty much everything I'm going to show with gen AI is solutions that we're building, but we're using the building blocks that ServiceNow offers as well. And that's part of what's cool about ServiceNow, as always: you can extend it.

We are a big Lego set.

Absolutely. So we can see that log analytics alert. We can see a couple of other alerts that came in from ServiceNow Cloud Observability. Selenium started to notice some synthetic tests failing, and then Cloud Observability dropped in one more because it was a slightly different version of the error rate. Part of the power of Event Management is bringing all these alerts together, and part of it is helping with root cause analysis.

Just a couple of quick things while we're here. If we were to open up one of these alerts, like this log analytics alert, we have a little button that I added here where you can open the specific logs that caused this in ServiceNow Cloud Observability and say, hey, what's going on with these logs? That allows me to really see that there was nothing happening at all, and then all of a sudden I have a deluge of logs that match this phrase that HLA found. So this again gets back to how we pull all this together. I'm going to show a couple more things like that.

So we've got our latency for the recommendation service. Something that can be hard to catch sometimes in observability can be things
that are broken, but not blatantly broken. What we're doing here is... it looks like there was one really, really bad one that's compressing the chart, so let's look at a shorter time frame. There's one really awful one in there. Let's do 30 minutes, and maybe that'll give us a nicer view. If we zoom out a little, we can see we were doing fine, then everything fell off a cliff, and now we have this super low latency.

This is a good practice as far as monitoring: have some policies in place that look for weird output signals like latency, both too high and too low, because too low means something totally different. It means the service returned pretty much immediately, which means it probably didn't do its job, and that's what we're dealing with here. Because we have some of these other alerts, we can suss that out even further than just saying that it's obviously not working. But it's not broken in the usual sense. By all accounts, if you looked at it in Kubernetes, Kubernetes thinks it's working fine, and in many other cases it would look like it's working fine. It's only when we have those catch-all kinds of checks that we catch the unknown-unknowns type of situation.

I also want to look at the probable root causes, and we have another tie-in here, so there are two generative AI tie-ins on this piece. We have probable root causes coming in here, and we're actually doing gen AI validation on them to see which ones are the most likely to be causes connected to the issue that I'm dealing with. I can also see that, of those, the most recent change is probably the one I want to look at. That change is the one that was created by DevOps. I know that because it's telling me it's automated. We
have one more thing here. Under related records, let me just highlight this real quick: in the change request itself we get a bunch of related records attached. One of the things that gets attached in a DevOps change is the commit details. With that, we reach out and push the actual code changes to an LLM and say, "Explain to me what this did before and what it does now," and that's how we get this bottom section here. As you remember, in my commit I didn't explain why I did this, what the point was, or what I was actually changing. So this is where the code changes get highlighted. You can see it's saying there was a while loop and a logger message that were commented out, but then the infinite loop has been "activated (uncommented)". It points out that these changes will keep the function in a constant state of attempting to retrieve the product catalogs, and that it's going to be logging that error message continuously, which is exactly right. To me this is one of the powerful applications of generative AI. In this case we're not putting any decisions in its hands, and we're not putting any super sensitive information out there. We're just using it as a way to inform us better about what's happening. Any questions on any of that part?

So the payload that you're sending it: are you creating the diff and sending it, or are you sending it the two generations of the file and telling it to compare them?

Yeah, it's a good question. You could do either one, and in some cases I think sending the whole thing would be super useful, but there would be a little more architecture to that, because DevOps Change Velocity doesn't store the code files; it just stores the diff. So I'm just going and getting the related diff by, you know,
walking out and finding the right record, then bringing the diff back and putting it into a longer prompt that gives a bunch of instructions, and then getting that explanation back and putting it in the change.

So the diff actually gets generated by Change Velocity?

Yeah, because it pulls it from GitHub or whatever tool you're using for code.

And I think that answers Mahesh's question in the Q&A as well, around whether or not this uses DevOps Change Velocity. So it does; that's how this change was registered.

Yep. There's a step in our pipeline that actually does the change registration. If we look at the actions, we can see the one that just ran and the steps it goes through. In the build process it's doing a setup job, and then in the deploy it's doing the change as one of the first steps, and then it waits on that approval before it proceeds. For the approval, in this case there's a policy that we can see in the change itself. If we scroll down a little, it passed all these checks. We have some tests that are done automatically. Amount of code changes: we have a policy there; this is policy-as-code stuff. Commits without stories: we didn't have any, because I did attach the story. No problem tasks, no incidents, no P1s. This is DPR. If you haven't heard of DPR, it's a newer product that ServiceNow recently launched, and so we're actually doing a release cadence here too, where you can reference a release and check if there are any release tasks pending. In this case we didn't relate it to a release, so there were none, but if we did, it would check that release to see if there were any.

So all of that leads us back to a place where you say, yeah, this is obviously being caused by the deploy. We're over-explaining; we've spent a while explaining this. If I were looking at this as an engineer, I would have looked at it for a couple of minutes and been like, yeah, it's obviously caused by the deploy, so I'm just going to go right in and start
rolling this back, right? That's the first thing I need to do. Now, because ServiceNow offers this playbook functionality, we could build a flow like the one we're about to use, called Rollback Kubernetes Deployment. It takes this issue and, based on filters and rules and things like that, maps up the most likely actions I might want to use to solve this problem. I can choose to let these fire automatically, so it could have tried to do this automatically. In this case I've left it up to an engineer to make the decision, so I'm just going to click that button and do the rollback. What this is doing is reaching out to the cluster, which is in Google Cloud GKE, and saying, "Hey, this deployment, the recommendation service, you need to go back to the last version." It does that and executes successfully. We didn't really look at it much so far, but we would have seen before that our "You May Also Like" section had completely gone away. I forgot to show that little part, but it was gone, and now that it's restarted, the service is actually back.

This is where the value comes in, because I can do all of these pieces without ever having to figure out what all the steps are that I need to go through, how many different tools I have to jump between, and how many pieces of context I have to manually pull together. At that point I know I've resolved this issue, the service looks like it's working fine again, and I can go in and do the incident resolution here too. This will bring me over to a chat window. It sends me a quick message in chat and says, "Hey, you need to fill out some more details about what you're doing to resolve this incident and what the steps are." I think Teams might have actually just frozen. Let's see. Using Teams in the browser is not the best experience; it just
makes it easier for demos.

True.

Yeah, here we go. So we say workaround provided, and I'm going to say here, "PTASK code fix, and manually rolled back to restore service," and submit that. As we submit that, we can see on the ServiceNow side, here's our alert and our incident. The alert group got pulled together by all those CI relationships that we helped with. It automatically resolves the incident, sets the resolution code, and as a result of that relationship it closes my entire group of alerts and marks all of this as cleaned up. So on pages like the one we have here, where we can see all of our services, my astronomy shop would have been a different color before. There are a few different UIs; we tend to do a lot of work out of ChatOps, but there is also this piece where we have our astronomy shop, and we can see it's back to green now. When we have alerts affecting it, this shows up as blue, yellow, orange, or red depending on how bad it is. It would have been red, since we did have some critical alerts, and now we're back to green.

So all this leads us to a place where it becomes much easier to manage the end-to-end story, but it also helps us provide visibility that previously was impossible. Think about the number of questions that usually come up in a scenario like this, where a customer calls a VP or something, and that person picks up the phone and the customer's like, "Hey, the recommendation service on your website seems like it's not working at all. What's up with that?" Then that person has to go talk to somebody else, who has to go talk to somebody else, who's like, "Oh yeah, there is a problem going on right now. We're working on it, here's the latest, and here's when it's going to be fixed." And they're rolling that back up. It just becomes a really painful game of telephone, and everyone spends a lot more time trying to
communicate than is really necessary. Beyond that, you have the visibility piece. DevOps is super helpful for those changes and for when pipelines are run, but I think it leaves off a little bit. The fact that we have a code diff there is helpful for some people, but lots of people who might be in ServiceNow looking at changes would look at that and say, "I don't really know what this change is. I don't really know what it does." You can go one of two ways with that. You can push that to the developers and say, "You have to fill out a ton of detail in all your commits and all your PRs and explain every little thing you're doing in a way that someone who's not as technical, or whose job is not engineering, would understand." Or we can just use generative AI, like we did here, where we get that explanation back and put it in the change. So I think there's a lot of value here in terms of ServiceNow products, and then in how we can build on those with even more useful pieces for specific situations, like ChatOps, like keeping developers in the tools they're used to using and only exchanging the actual data we need to have.

So that's the story we're telling here, and it continues to resonate with a lot of customers that we work with. We're implementing this in a lot of situations, and customers are seeing really cool value out of it. We've had a few customers who talked at Knowledge in prior years about great experiences they had with implementations of solutions along these lines, or using pieces of this. So this end-to-end story can really make a big difference for an organization.

Hey, just to jump in, there's one question in the Q&A around Site Reliability Management, and whether or not tracking error budget and error rate are part of this process. Is that something you guys are thinking about?

Yeah, we've had some discussions about
Site Reliability Management and the newly launched version of all of that, which was done in March. I think there are definitely some connection points. For a lot of the customers we work with, I would say it's not usually the first thing on their mind; they're usually trying to figure out a lot of other things to solve before they get to that point. But I think there are definitely some tie-ins there. We do have some of those things installed here, and we've used some of them in other situations. It's more that this demo already gets to a certain point, so we're not really highlighting a lot of that. But I think there's a lot of value in it, and when the customer's maturity and process make it a good fit, it's definitely a good option.

How would you expect this setup to respond if, immediately after somebody did that commit, somebody else on their team looked over their shoulder and was like, "What are you doing? You just uncommented an infinite loop," and they immediately go back in there and fix it on the Git side? How would you envision this catching up? Would it leave a dangling alert? Would it clear automatically but never really fill in the blanks as to how it happened in the first place? Or something in the middle, where if somebody looked on the ServiceNow side they could see, oh, there was a change, and then things went wrong, and then there was another change, and now they're fine?

Yeah, I think because of the history that we can easily see on the CIs. This is part of why CIs are a core piece of this. We don't leave CIs completely behind, and even in cloud-native scenarios there are ways to make CIs useful without getting to a point where it's just unmanageable with very ephemeral stuff. So that's where I think that comes in,
but what I would expect to happen in that situation is that most of the alerts and the incident would probably never get created, because if they caught it that fast, in all likelihood the monitoring tools would never have picked it up. Even if they did, most monitoring tools that we deal with send a close or clear status. Event Management picks that up and clears the alerts, and that would resolve the incident. So in that sense it would probably auto-close, if the monitoring tools even caught the issue before it was fixed.

Right. Yep. So here's another good one. Mahesh asked how the microservices are mapped. Do you want to dive into those service maps again?

Yeah, sure. Basically, the way this is done in the YAML, you have a few different objects that you can tie this to. The easiest way to see this is in the CI mappings. When you set up the plugin, there are some default ones provided, which are based on the most common use cases we see, and you can set up more. Somebody mentioned something about using SDLC components; we do have a default one for that here, and that's what we're actually creating for the microservice object type. The application services that sit under those are tag-based application services, and then we're also using business applications. Those are the top CSDM, or rather CMDB, classes that we're using. You can set up more mappings here for whatever other classes you might want to use, and then, based on that reference in the YAML, that's how we're creating those objects. If we open one of these, let's look at the application service one, you can see the support group is getting mapped from support group, the environment is getting mapped
from environment, and the tags are getting set into the metadata field of the target class. So this is how you can build it. You can build it so you have lookups and everything like that. That's essentially what that looks like, and you can extend it as well to meet other use cases.

Yeah, and again, everybody take a drink: we're going to talk about that a lot at Knowledge, live at Knowledge. But Mahesh, does that answer your question?

Yes, it does. Thank you.

Perfect, excellent.

Sorry, Matt, you also mentioned that you've used the observability piece in this as well?

Yes.

Can you...

A bunch of those alerts were from ServiceNow Cloud Observability. Everything we're looking at here is instrumented with OpenTelemetry, so we piped all of those traces, logs, and metrics into ServiceNow Cloud Observability. That's what I showed when I opened up the HLA alert into Cloud Observability on the log page, and it automatically built the search for us with that button. You can use it like that for log search. I think I also showed opening one of the alerts into the alert itself in ServiceNow Cloud Observability, so you can see the performance. Once you have data in SCO, as we call it, you can set up dashboards, alert policies, and everything like that. Then what we do is route those alerts to Event Management so that they're consumed as part of the event data. In this case, like we showed, we had six alerts: three of them were from SCO, one from HLA, and then we have some Zabbix and Selenium stuff mixed in there too. That leads to this alert group that allows us to deal with that as a whole, rather than having six different incidents, as we would if we were integrating at the incident level.

Is the APM component of the observability mandatory for this whole exercise, or can we do it with the visibility piece of the
observability and the Service Graph Connector of the observability?

Yeah, I think for the outcomes we're showing here, we're showing an ideal state of being able to pull all these pieces together. It's definitely a puzzle that you put together a few pieces at a time, and you get some value along the way. So I think what you're saying is: what if we have OpenTelemetry data but we're not completely set up for APM yet, can we still do the Service Graph Connector? The answer is yes, you can, because it's not based on anything that you've configured in, I wanted to say Lightstep, in ServiceNow Cloud Observability. Nothing that's configured there has anything to do with how the service map gets built. It's a little bit of magic that happens behind the scenes, where it basically condenses those traces into detected service dependencies, and then it has a back-end API that pulls out the service dependencies and brings them in through IntegrationHub ETL. It's the kind of Service Graph Connector approach that you see elsewhere; for example, there's a Dynatrace connector that does a similar thing.

Thank you, guys. I've asked enough questions; I think I've run out. Thank you.

Great. And I apologize: normally Will and I can hang around for 30 minutes or so afterwards, but today I do have a hard stop. So while we're here, are there any other questions before we wrap?

Q&A is clear.

Just good to see Matt again.

Who said that? I didn't see. Oh, Mike, Michael Hunt. It's good to see you, man. How are you doing?

Good, man. Good job.

Yeah, I see a lot of familiar faces on here. Thanks for taking some time to hang out with us.

Yeah, great demo for sure. Matt, you highlighted that at the Knowledge conference you're also showing a demo around Kubernetes with a similar approach, or does this one also have Kubernetes?

This is all Kubernetes under the covers. I think we're going to do a few different sessions
at Knowledge. I'm not personally doing any sessions at Knowledge, but we have some others on our team who are, so come by and check us out at the booth, and definitely check out the sessions. I don't know if there are still spaces, but we're happy to talk through any of it ad hoc as well. So, looking forward to seeing everybody.

And then another question came in on the Q&A: is there data validation on the YAML side before or after that data is sent to ServiceNow?

Yes. The way we usually set this up for customers is that we'll have a Slack channel for feedback. If you do something that didn't work, or if something's configured in a way we didn't expect, it'll spit out the error into the Slack channel and explain what went wrong. It spits out successes too. So we definitely do validation along the way, and we have several checks that happen in the process of mapping those into the CMDB.

Awesome, nice. I've got to tell you, you guys are always the best audience. Everybody comes with questions and brings good, solid, pointed, intelligent, interesting questions, so thank you very much. I love it.

Yes, thank you guys. It's always great to get to talk to this crew, and I'm looking forward to seeing everybody soon.

Perfect. So I think with that we'll wrap the recording, and then we can open it up for other questions and conversation.
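
A note on the monitoring practice mentioned during the demo, alerting on latency that is too low as well as too high: the idea can be sketched roughly as below. This is only an illustrative Python sketch, not the alert policy used in the demo. The function name `classify_latency`, the baseline value, and the `low_ratio` and `high_ratio` thresholds are all hypothetical choices made for the example.

```python
from statistics import median


def classify_latency(samples_ms, baseline_ms, low_ratio=0.1, high_ratio=3.0):
    """Classify a window of latency samples against a known-good baseline.

    All thresholds are hypothetical: a window counts as "too_low" when its
    median falls below baseline * low_ratio (the call probably returned
    without doing its job) and "too_high" when it exceeds
    baseline * high_ratio.
    """
    if not samples_ms:
        return "no_data"
    current = median(samples_ms)
    if current < baseline_ms * low_ratio:
        return "too_low"
    if current > baseline_ms * high_ratio:
        return "too_high"
    return "ok"


if __name__ == "__main__":
    # Assumed baseline of 120 ms for a healthy service.
    print(classify_latency([2, 3, 2, 4], baseline_ms=120))
    print(classify_latency([110, 130, 125], baseline_ms=120))
    print(classify_latency([900, 1100, 950], baseline_ms=120))
```

A two-sided check like this is what catches the case from the demo, where the orchestrator still reports the service as healthy but requests return almost immediately without doing any work.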

View original source

https://www.youtube.com/watch?v=Jvi8QTUpZE8