Day Two Cloud 140: Troubleshooting Cloud Outages With End-To-End Visibility (Sponsored)

Episode 140

On today’s sponsored Day Two Cloud episode with Cisco ThousandEyes, we discuss how to monitor what’s broken in public cloud services. With the right information, you can offer a nuanced, knowledgeable answer when executives want to know when the company’s crucial customer-facing app hosted on a bunch of cloud services is coming back online.

There’s a big difference between saying, “We don’t know for sure what’s going on, but there’s a problem in AWS…hopefully they fix it soon…” and saying, “We know that for the last half an hour, the AWS API gateway in the US-EAST region is responding slowly to some requests. Some of our customers are working fine, but some aren’t. We will recommend a plan of action ASAP.”

To give an informed answer, you need data that clearly shows you what the problem is, and that’s where Cisco ThousandEyes comes in.

On this episode, we’re going to dissect a couple of recent public cloud outages the ThousandEyes global monitoring system tracked, and talk about the newest product added to the ThousandEyes Internet Insights product line, called App Outages.

Our guests are Barry Wayne, Leader, Technical Marketing; and Chris Villemez, Technical Leader, Engineering, Technical Marketing, from the Cisco ThousandEyes team.

Show Links:

AWS Outage Analysis: December 7, 2021 – ThousandEyes

Azure AD Outage Analysis: December 15, 2021 – ThousandEyes

Announcing Internet Insights: Application Outages – ThousandEyes

Blog.thousandeyes.com

@ThousandEyes – ThousandEyes on Twitter

ThousandEyes on LinkedIn

Transcript:

[00:00:04.390] – Ethan
Welcome to Day Two Cloud. On today’s sponsored episode with Cisco ThousandEyes, we discuss how to monitor what’s broken in public cloud services. That way, when the executives want to know when the company’s crucial customer-facing app hosted on a bunch of cloud services is coming back online, you can offer a nuanced, knowledgeable answer. Here’s what I’m getting at. There’s a big difference between saying, we don’t know for sure what’s going on, but there’s a problem in AWS and hopefully they fix it soon, and saying, we know that for the last half an hour, the AWS API gateway in the US-EAST region is responding really slowly to some requests, so some of our customers are working fine and some aren’t, and we’re going to recommend a plan of action ASAP. To give that second, confident answer, you need data that clearly shows you what the problem is. And that’s where Cisco ThousandEyes comes in. On this episode, we’re going to dissect a couple of recent public cloud outages the ThousandEyes global monitoring system tracked, and talk about the newest product added to the ThousandEyes Internet Insights product line, called App Outages.

[00:01:06.310] – Ethan
And our guests today are Barry Wayne and Chris Villemez from the Cisco ThousandEyes team. Barry, I want to hand this first question off to you. Okay, I’m setting you up with a softball here, but I think it’s a fun topic to hit: public cloud outages. They’re getting more nuanced, right? A simplistic understanding, kind of like I hit on in the intro, is that AWS is down, but for IT operators, that’s actually not even a true statement. AWS is never completely down. That’s not a thing, and it’s not even helpful to diagnose a situation like that. And so the obvious way to get a more nuanced answer and an understanding of what’s really broken is you just monitor the cloud service provider status pages, right, Barry?

[00:01:47.150] – Barry
Not exactly. They certainly have their place, right? And I think ultimately it’s a good source of truth if you think about, like, a post mortem. But for these types of outages, and just like you kind of hit on, it’s not as simple as the cloud is down. It could be a DNS issue. It could be something upstream for them that’s preventing your users from reaching the app that’s in the cloud, whether that be AWS, Azure, GCP, whomever. And from a timing perspective, you’re already impacted. And you’ve been impacted, you’ve been spinning your wheels, by the time the status page is updated. So if that’s your first indicator of what’s going on, then you’re going to have a bad time, because you’re spending a lot of time doing work that may not be useful, trying to figure it out.

[00:02:32.320] – Chris
Yeah. Just to add to that, I mean, when we say, can I just know if AWS is down, that’s just like saying, hey, is the network happy? It’s such a generic thing. I think Amazon Web Services, for example, they list something like a robust 200-plus different services as part of their cloud offering. So they have a huge number of these individual kind of service offerings that intertwine and talk together and ultimately provide all of the things you need to have compute, storage, network, security, database, and provide these applications. So when we say AWS is down, what are we talking about exactly? I think that’s the crux of the issue.

[00:03:15.500] – Ethan
It’s not different from what we did when we used to run everything on-prem, right? Because as a network engineer, very typically in one of my IT roles, someone might call over the wall and go, is the network down?

[00:03:25.920] – Chris
And it’s like, no.

[00:03:26.990] – Ethan
The network is not down, dude. Don’t ask me if the network is down; tell me what the problem is, and we’ll diagnose from there. And then I can look at all the monitoring systems monitoring all the things, and kind of drill in from there and figure out exactly what the issue is based on a symptom. If you’re just asking, is the network down, you’re not even asking the right question. And so it’s kind of the same thing with public cloud here. You need a more nuanced understanding of what’s broken to then figure out impact and so on and so forth. So it’s not different from what we did on-prem, although it is different, isn’t it?

[00:03:57.800] – Chris
I think back about 20 years ago when I was doing network engineering, I mean, we were control freaks about data, if you think about it. We had inline taps, even for our T1 interfaces, our DS3s. We built our own SD-WAN. We had monitoring at the actual CSU/DSU going to the T1. We had robust syslog and SNMP and flow data and inline taps and all of this stuff. We had an overload of data, right?

[00:04:26.890] – Chris
So I had huge confidence. In fact, we had so much data that they started having to develop tools just to correlate all the data to make some sense of it, because you had so much of it. Right.

[00:04:37.400] – Chris
So it was data overload, all you could give me. And I had the utmost confidence that I could find, somewhere within this haystack, that needle I was looking for, right? I had an overabundance. And now we go to cloud, and I have the opposite scenario, right? Not only do I not have direct control, at that level, over all this infrastructure I don’t control, but now I’m backhauling across the Internet as kind of my new WAN.

[00:05:08.360] – Barry
Right.

[00:05:09.040] – Chris
So it’s a different animal today.

[00:05:14.250] – Barry
I’d say the angle of your troubleshooting is even, like, slightly different, because when you control everything, you really need to get to root cause so you can remediate it. But for this, it’s really like: one, you’ve got to isolate it down to just some sort of fault domain. Two, then you determine if it’s something you can directly impact and remediate. And three, if it’s not, you need to come with your case prepared and your information vetted when you go to your vendor to get the help. Because I think we kind of assume that all vendors are created equal and they have all of the information. But depending on the monitoring stack that you’re using, you might actually have more information about the issue than what they have readily available. So it’s just kind of a shift in mindset, and the flow is a little bit different.

[00:06:00.870] – Ethan
It is different.

[00:06:01.580] – Chris
You’re right. And I guess you brought up a good point. So ultimately, when I’m trying to resolve some type of performance issue or some type of impact, it’s the two things you always care about. I’ve got to isolate that fault domain, find root cause. And then here’s the other part. I have to find the service owner of that piece of infrastructure, or that service rather, that spot where it broke. So whether that be my Internet provider, whether that be the guy that runs the firewall, or whether it’s someone that runs the back-end database, it’s all still the same stuff. It’s just that we’ve now extended this into an area where getting the fault domain isolated is, quite honestly, much more challenging. We can dig into this throughout the show, but it’s considerably more challenging to identify the service owner, to find the person or the group or the team or the corporation or whatever it is that can address this issue or mitigate it or work around it.

[00:07:01.230] – Ned
Chris, in the world of public cloud, we have this idea of the shared responsibility model, which means you’re responsible for a portion of the application being functional, and they’re responsible for the services that they’re providing to also be functional, but they’re going to assume that you screwed up first. So you kind of have to prove to them it’s not me, it’s you. Here’s the data I have. Go fix your thing because I don’t have access to that layer.

[00:07:27.270] – Chris
Yeah, it’s the same thing. I remember being on call 15 years ago, and the NOC would see some error about a database. It’s literally just a database error; there’s not an IP address or anything in there. And they’re calling me at 3:00 a.m. The first stop, we go to literally anything that pops up on our dashboard. And it’s the same thing today.

[00:07:46.320] – Ned
Right.

[00:07:46.620] – Chris
So now I may no longer be in network engineering. Maybe now I’m a service delivery engineer trying to interface with all this stuff. But it’s still the same thing, right? It’s still on you as the operator of this application and this service that you’re providing to your customers and your partners, it’s still on you to determine where that fault is, ultimately.

[00:08:09.810] – Ned
Right. And those outages do happen on the major public cloud providers, and we need to figure out how that impacts us as operators. One example that we’ve talked about before, and I think you have some really good information on, is an outage that AWS experienced. It was on December 7, 2021. So can you give us the 10,000-foot view? What was the outage, and how did you measure that outage from a ThousandEyes perspective?

[00:08:39.510] – Chris
Yeah, no, this was an interesting one. So again, keeping in mind our goal as operators of services housed in cloud, or wherever it is in IT ops that I’m responsible for, I’m trying to determine: Is it my ISP connection? Is it something collective on the Internet? Is it something specific within AWS? Is it something specific within my environment? And so what we saw in this one, it was a little confusing at first, right? Because the first thing that started getting impacted was the EC2 service. And in fact, what was happening is, if you had an existing EC2 virtual instance within Amazon, you were probably fine. But if you were trying to do any kind of modifications to EC2, you’d start to run into these issues. And we started seeing kind of increased API error rates. I think, if I recall, it was actually located in their US-EAST-1 region, over there in Ashburn, Virginia, and it actually started impacting multiple other services. So the first impact we saw at ThousandEyes was EC2, or one of the first impacts we saw was EC2 having some issues, and some other APIs that we’re monitoring, like S3.

[00:09:56.990] – Chris
We started to see some other issues, and this ended up cascading. I think it ended up impacting DynamoDB, Connect, and various other services within Amazon. But the interesting thing we saw with the tool is that we were able to pretty quickly isolate this to, number one, being an issue on the internal Amazon side. So the way we do that with ThousandEyes is we get that end-to-end, layered visibility, right?

[00:10:23.280] – Chris
So we look at things from just the basic transport layer, just from A to B. Do I have anything that’s impacting my IP packets: loss, latency, jitter, throughput, any of these sorts of conditions? And we saw nothing. Everything looked healthy and green. We had good latency, no packet loss between the various vantage points around the Internet and Amazon’s EC2 and various other API services. But what we did see is, when we take it up a layer and we start trying to talk to the API itself, that’s when we would reveal the issue. And so this is one of those kind of interesting things that just reveals how much these dependencies weave together, right? I think, if I recall, the main issue was something related to their API gateway, but this in turn impacts almost everything behind that API gateway. So like I said, it impacted EC2, DynamoDB, S3 the storage service, all these other additional services.
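
To make that layered-visibility point concrete, here is a minimal sketch in Python (illustrative only, not ThousandEyes code) that checks a single endpoint at two layers: a TCP connect for basic reachability and latency, and an HTTPS request to see whether the service actually answers. The hostname and path are placeholders.

import socket
import time
import urllib.error
import urllib.request

HOST = "api.example.com"   # placeholder endpoint, not a real AWS service URL
PORT = 443

def tcp_check(host, port, timeout=3.0):
    """Transport-layer view: can we open a TCP connection, and how long does it take?"""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, (time.monotonic() - start) * 1000  # latency in milliseconds
    except OSError:
        return False, None

def http_check(url, timeout=5.0):
    """Application-layer view: does the API answer, and with what status code?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code   # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None       # no usable answer at all

reachable, latency_ms = tcp_check(HOST, PORT)
status = http_check(f"https://{HOST}/health")
print(f"TCP reachable={reachable}, latency={latency_ms} ms, HTTP status={status}")

In the December 7 scenario Chris describes, the transport-layer check would have looked green while the application-layer check returned errors, which is exactly the gap a single-layer monitor misses.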

[00:11:30.150] – Ethan
Chris, if I remember right, that outage against the API gateway was a gray failure. That is, it wasn’t like the API was completely unresponsive. It was more like it was slow, and sometimes transactions would go through fine and sometimes they would time out, just because the API seemed busy or overloaded or something. It was that sort of a failure, was it not?

[00:11:53.590] – Chris
Yeah, you’re right. It was something like, I think we might have seen like 60% or so success rates for certain of our tests, and then 40% failing. So it’s one of these sorts of things. And actually, that’s a good point. I mean, the idea of these gray failures, or what I used to call soft failures many years ago, they can be some of the most challenging things to solve, because then it becomes so difficult to know: Is this me? Is it only me experiencing this outage? Are other people experiencing this outage? When is it happening? Is it happening all the time? Are there certain combinations of things that allow maybe my customers to be impacted but not other customers? It becomes a very challenging thing to try to ascertain. Now, if you get enough of a data set behind it, like something with ThousandEyes visibility, you can make some more conclusive assessments. We could start saying, okay, this is, if not 100% of the time, still consistently, intermittently failing to respond when we query Amazon’s API gateway service, for example. And so it was pretty clear that that service, and that’s kind of one of their main front-end services, was having an issue of some kind. Not a 100% failure, as you know, but some type of issue that was greatly degrading performance.

[00:13:19.290] – Chris
Right.
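
A gray failure like this only shows up statistically, so a single probe is not enough. A rough sketch of the idea (the URL, attempt count, and pause are invented for illustration) is to repeat the same request many times and report a success rate instead of a binary up/down verdict.

import time
import urllib.error
import urllib.request

URL = "https://api.example.com/health"   # hypothetical endpoint to probe
ATTEMPTS = 50

def probe_success_rate(url, attempts, timeout=5.0, pause=1.0):
    """Repeat the same request and report the fraction that succeed with a 2xx status."""
    successes = 0
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if 200 <= resp.status < 300:
                    successes += 1
        except (urllib.error.URLError, OSError):
            pass  # timeouts and connection errors count as failures
        time.sleep(pause)
    return successes / attempts

rate = probe_success_rate(URL, ATTEMPTS)
print(f"Success rate over {ATTEMPTS} attempts: {rate:.0%}")
# 100% looks healthy, 0% is hard down; something around 60% is the gray zone Chris describes.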

[00:13:19.790] – Ned
If I’m the individual user and I just go to spin up an EC2 instance and it fails, I might try a few times, and eventually it will succeed because I’ll hit that, like, 60% threshold you’re talking about. Or if I’m using a Terraform script to build out a whole environment, I might have to run it a few times, or it has some internal back-off logic to try a few times before it gives up and throws me an error. What you’re talking about is, no, you’re just constantly pinging those APIs and sending requests with an expected output. How do you figure out what to test for against the APIs that AWS has?
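
The back-off logic Ned mentions is a standard pattern: many SDKs and infrastructure-as-code tools retry transient API errors with exponential backoff and jitter. A minimal, generic sketch of that pattern follows; the flaky_api function is a stand-in for illustration, not a real AWS SDK call.

import random
import time

def call_with_backoff(call_api, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Retry a flaky call with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call_api()
        except Exception as err:   # real code would catch specific throttling or 5xx errors
            if attempt == max_attempts:
                raise              # out of retries, surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay *= random.uniform(0.5, 1.0)   # jitter so clients do not retry in lockstep
            print(f"Attempt {attempt} failed ({err}); retrying in {delay:.1f}s")
            time.sleep(delay)

def flaky_api():
    """Pretend API that fails about 40% of the time, like the gray failure above."""
    if random.random() < 0.4:
        raise RuntimeError("503 Service Unavailable")
    return "instance launched"

print(call_with_backoff(flaky_api))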

[00:13:54.270] – Chris
There’s going to be some responsibility on the architect. Let’s say I’m a cloud-native architect. I’m building my stuff up in Amazon. It’s going to be on me to understand what services I’m using, what APIs I’m hitting, and so forth. And ultimately I would have some kind of application architecture that I can provide to IT ops or the people responsible for monitoring the service, hopefully, yes.

[00:14:29.070] – Ethan
The point is, someone that understands the application would say, hey, these are the things that you should be testing, here’s a good list. This way we can have robust monitoring, because from the results that we’re getting back, we kind of know not just that it’s up or down, but that the app is responding in a timely way and with the result set that we expect back when it’s working, right?

[00:14:50.550] – Chris
Yeah, exactly. So, I mean, if I’m building some application and I know we’re using, for example, AWS’s, what is it, identity and access management, their IAM service. Well, then that’s probably something I should note to my folks and say, hey, this is a critical service that needs to be monitored; maybe of the dozen or so API services I’m interacting with, here’s one of them, and this one is going to be particularly critical because it pertains to authentication, right? But it’s still going to be on me to know at least the list of things. Just like if I’m going back 20 years and I’m building something in a data center. Well, hey, I’ve got a fleet of these Solaris boxes and they’re running this, and I’ve got an Oracle database on the back end, and it’s at this IP. I was still responsible for providing that application architecture. So it’s the same thing today. I don’t think this is any different. It’s just that we maybe can get a little sloppy or lazy about it, potentially, because it seems so seamless. But it’s the same in concept.
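
One lightweight way to capture the list Chris is describing is a small, version-controlled manifest of the services an application depends on, which the architect hands to whoever builds the monitoring. The shape below is purely illustrative; the application name, endpoints, and criticality flags are invented.

# Hypothetical dependency manifest an architect might hand to the monitoring team.
APP_DEPENDENCIES = {
    "checkout-service": [
        {"name": "identity / auth", "endpoint": "https://iam.example-cloud.test",     "critical": True},
        {"name": "object storage",  "endpoint": "https://storage.example-cloud.test", "critical": True},
        {"name": "payments SaaS",   "endpoint": "https://payments.example-saas.test", "critical": True},
        {"name": "video CDN",       "endpoint": "https://video.example-saas.test",    "critical": False},
    ],
}

def monitoring_targets(app):
    """Return the endpoints that should get synthetic tests, critical ones first."""
    deps = sorted(APP_DEPENDENCIES[app], key=lambda dep: not dep["critical"])
    return [(dep["name"], dep["endpoint"]) for dep in deps]

for name, endpoint in monitoring_targets("checkout-service"):
    print(f"monitor {name}: {endpoint}")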

[00:15:58.890] – Barry
And I would say that you’re definitely right, Chris. It’s a lot easier to kind of ignore some of it now, or wait, but it comes up, right? All it takes is some sort of issue, or a migration, or rolling out an additional microservice, and something borks. And then it’s like, was that normal? Is that the latency we normally see between these two points? Is this normally this slow? Does this normally happen like this? And then you get to the root of it: we don’t have any normalized data set, so how would we know, other than gut feel? And it’s kind of like working in reverse, right? So I view it like you set up unit tests, you set up regression testing, if you’re on the SRE team or anything else. This is just another component of that, trying to be more prepared for the experience.

[00:16:47.790] – Chris
Yeah, this is definitely something SRE would greatly benefit from. The reality is, think about what we’ve done with cloud. Basically, we’ve separated all the pieces of infrastructure and software into these discrete specialized pieces that handle not just things like storage, compute, database, security, but network, web load balancing, everything else. So in other words, we’ve got a bigger solution split into these various individual software pieces. And this is maybe the same as it was before, but now these individual software pieces are going to have different locations, different IP addresses. They’re going to require APIs to talk to each other and stitch everything together into the seamless thing. So it becomes, where you could once monitor aspects of, let’s say, three systems, now maybe you’re monitoring aspects of 20 systems. And it can get even worse: with my cloud apps, I’m talking to various SaaS things, like maybe I interact with PayPal for payment services or Vimeo for video streaming or whatever the thing is, right? So now not only are all my pieces within cloud chopped up and accessed and ultimately monitored by various APIs, but I may be talking to stuff outside even the AWS environment that also has separate service ownership and, again, additional infrastructure that’s still not mine.

[00:18:19.110] – Chris
So it can be an unwieldy thing.

[00:18:25.270] – Ethan
Well, Chris, there’s a piece here that we can help the audience visualize if they’re using ThousandEyes to do some of this monitoring: where ThousandEyes just kind of gives it to you out of the box and maybe you click something and magic happens, versus how much time you’re spending writing tests that are unique to your environment. I was a ThousandEyes customer, but way back in the early days of the product, going back to like 2015 or something; I used it on a wide area network to do lots of monitoring and testing. Everyone’s on the cloud these days, and people were doing that then, but it was more or less a new thing. So if I were to fire up ThousandEyes today and I want to validate that a variety of different AWS services are up and running, am I checking boxes to enable tests and then magic happens? Or is it like, here are a series of tests you can enable, here’s one, you take it and then, I don’t know, fill out a form to do a bunch of customization of how that test is going to interact with the service that I want to validate?

[00:19:32.170] – Chris
Yeah, within the product itself, it’s not a matter of checkboxes in the sense that, hey, we understand all of the service delivery offerings of Amazon and we’re just going to put them in our GUI and you can check them off. I think it would be hard for us to keep up with that, just because they’re continually adding new things and modifying. But here’s the thing. It’s still almost that easy, but I would still have to know what those service endpoints are. And of course, as we know, everything’s referenced by a web URL now. Thankfully, Amazon has fantastic documentation, right? I can go on and look at whatever it is. They’re huge. They’ve got like a 500-page document of all their services and all those service endpoint URLs and everything else you could possibly want to know about each of these services. So I have all of that information available to me as a cloud developer. It’s going to be a little more involved than just checking a box, but it’s not much more involved. I have my list of service endpoints by URL, and I can enter them into the product, and I can readily start.

[00:20:41.840] – Barry
One thing that has changed, though, Ethan, to Chris’s point that it’s not as easy as checking a box, and maybe it’s not as fine-tuned as a very specific test to a specific back-end service: you kind of alluded to this in the intro, but we have Internet Insights and Application Outages. And the reason I bring that up is it lets you take advantage of composite data from all of the tests being executed in the platform, not just your own. And what’s cool about that is, if you’re interested in AWS, you’re interested in Azure, you’re interested in any other service, whether that’s where the majority of your app is hosted or just a microservice you use, you can actually come in, select from a list of services, select from a list of providers, really anything, and see a rolling twelve months of outage data, whether that’s network related or application related, and it updates in real time. So not only is it, hey, let me look back and see if this is a healthy service and one that I should use when I’m architecting something, but also, is the world impacted?

[00:21:51.230] – Barry
Is this region impacted? It gets you a place to start, right?

[00:21:54.910] – Ethan
So again, just restating, that is App Outages. That’s an add-on, a bolt-on for the ThousandEyes product. I license that, and then I have access to that data set. And the data set is, again, not just my data from interacting with those services and running my own tests. It’s some kind of anonymized data set from everyone within the ThousandEyes world, all the ThousandEyes customers that are, I was going to say, pinging that service, but it’s going to be a more robust test than a simple ping, right? You know what I’m saying?

[00:22:24.040] – Chris
Because if you think about it, Barry might know the exact number, but we’ve got millions of tests going at any given time. So the idea here is that not only do you benefit from your own tests that you purposely set up, right, but now you can benefit with this bolt-on to see information about aggregated application and network outages, gleaned from telemetry from vantage points all over the globe. So we’ve got that collective, cumulative, anonymized data from essentially millions of other customer tests to give these bigger insights into the things that are happening in terms of service provider and SaaS provider impacts.

[00:23:06.250] – Ned
A key thing that you mentioned a little while ago was sort of the multi-cloud nature of it, the fact that it’s not just monitoring AWS, but it’s also monitoring Azure and other public clouds you might be using. And I think that’s really, really important. I just finished reading one of the State of the Cloud reports, the one for 2022, and it said 89% of organizations are pursuing a multi-cloud strategy. They’re going to be interested in more than just AWS and what’s going on there. To give an example, you also brought up, as we were discussing what to talk about, an outage in Azure that happened a week after the AWS outage. So same question there: what happened, and what did you see from ThousandEyes?

[00:23:49.410] – Barry
Yeah. So this one is actually a really good one around, I don’t necessarily like “soft down,” but really around the prevalence of distributed cloud services, to make that sound really eloquent. But the idea with that one, or what we saw, was if you were just monitoring some endpoints in Azure, like, hey, I have this hosted in this VPC, everything would have probably looked fine. Where things started to look funky was when you went to authenticate against the service that you were using, where you were using Azure AD as your identity provider. And that’s where we started seeing failures, right? So you get a 403 back, or you see something weird happening from an authentication perspective, whether that’s, you know, what I refer to as a white screen of death, whenever the old SAML exchange kicks off and you’re just sitting there spinning indefinitely. It was a painful one, because that was one that your users felt. You couldn’t just explain it away like, hey, this is slow, or hey, we’re aware of this particular portion of the application that’s broken. You literally would have just had to have been one of the lucky few that needed to get a new token, basically, for your session and authenticate.

[00:25:05.210] – Barry
And if you were that winner, then you didn’t get your token. You just got a quick trip to staring at a blank page. So it was resolved relatively quickly. But again, we take this broad, sweeping approach at ThousandEyes where we monitor everything, right? We’re in the business of knowing everything we can know about what’s happening on the Internet or with these broadly used apps. But for people who run ops or design their monitoring approach, this is one of the most oft-forgotten parts. It’s just, hey, is my site up or down, or can I get to the front door? Can I do the other things I need to do?

[00:25:43.850] – Chris
Yeah, I was just going to add, imagine being an SRE in these environments right now, especially in multi-cloud. In terms of being able to get from point A to point B, and performance between those two points, now I have to worry about cloud to cloud, whether it be multi-cloud or inter-region or inter-availability-zone, whatever it is, right? I have to worry about client to cloud, so this is the service delivery stuff, what customers are seeing. I might have to worry about client to a SaaS endpoint for customers or enterprise staff. And even as a cloud dev, now I have to worry about cloud to various SaaS services; I’m trying to get to Jira, GitHub, Docker Hub, these sorts of things. So again, this is SRE stuff, I think, in today’s world, but it’s that cloud-to-cloud monitoring. Again, whose fault is that? I’ve got some piece of something talking to something talking to something. How do I even go about determining where this broke, especially when I own none of this infrastructure, right?

[00:26:54.410] – Ethan
You’re getting at something here. That’s another one of those problems that’s the same as it ever was, but it’s different now; there are some different parameters around it, and that is dependency. So back in the days when we all lived in our little silos and had our own monitoring systems, we’d look at our stuff and have red light, green light. And if it was a network monitoring system, you didn’t necessarily know what apps were impacted by, let’s say, a slow link. You didn’t know what that dependency tree was until you’d been around the org for a while and you’d seen some problems come and go, and you’d kind of built those mental models up. Then we moved into application performance monitoring, and maybe you had some tools that would help you with those dependencies better. Great. Now we’re in the cloud environment. You’re talking about multi-cloud, cloud to cloud, making calls to a variety of services that are part of delivering a particular application transaction. There can be all kinds of complex dependencies. How? That’s the question. How do I figure out what those dependencies are and understand them from a monitoring perspective, so that it’s fairly clear to me the AWS API gateway is having a 60% fail rate, and that’s breaking X application that matters to the business?

[00:28:09.980] – Barry
It’s a composite approach, for sure. I mean, you mentioned a few different technologies that make it a lot easier, whether it’s a combination of an APM implemented directly on top of the service that you’re managing and hosting in the cloud, to sort of help map out some of that back-end stuff, or it’s testing from a ThousandEyes perspective: one, from a vantage point that sits where your users sit, but two, from a vantage point that actually sits within that cloud environment, testing against all the requisite services. We talked a little bit earlier about the need for someone to sit down and really map out these things when they architect the software, or whatever application they’re delivering, or experience they’re delivering is probably the most appropriate way to phrase it. But you can actually get there on your own, or not entirely, but pretty darn close. You kind of take that approach as a user: you access this web-based app, your browser is making all of these calls, and your browser is going to tell you what it’s hitting if you just know where to look for it. And what’s cool is we have testing techniques, just like you mentioned earlier.

[00:29:18.700] – Barry
It’s not as simple as a ping. It’s as complex as a bot running Chrome, accessing stuff and pulling down every single element, and telling me exactly where that element is hosted and all the performance timing around those specifically. What’s cool about that is you get a lot of the picture without a lot of the effort, because all you need to know is where to tell it to go when you configure the test. Say, hey, this is the app I’m interested in, and you can record yourself clicking through a few different things. And what’s awesome about testing that way is you’re accounting for, like, the auth we were just talking about. We’re performing authentication; if it fails, I can’t do any of the other stuff I wanted to do, and it’s going to tell me, hey bud, you can’t log in, you need to go look at the login service, or hey, there’s 100% loss right here, that’s probably not so good. And it gets you there fast, but you don’t have to spend as much time or effort covering things from all aspects.
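
The scripted, browser-based transaction Barry describes can be prototyped with an ordinary browser-automation library. The sketch below uses Playwright purely as an illustration of the concept (ThousandEyes transaction tests have their own recorder and scripting); the URL, selectors, and credentials are placeholders.

import time
from playwright.sync_api import sync_playwright  # assumes: pip install playwright

URL = "https://app.example.com/login"   # placeholder application

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    start = time.monotonic()
    page.goto(URL)                                # step 1: load the login page
    page.fill("#username", "synthetic-user")      # step 2: authenticate (placeholder selectors)
    page.fill("#password", "not-a-real-password")
    page.click("#login-button")
    page.wait_for_selector("#dashboard", timeout=15000)  # step 3: confirm we got past auth

    elapsed = time.monotonic() - start
    print(f"Login transaction completed in {elapsed:.1f}s")
    browser.close()

If the token or SAML exchange breaks the way it did in the Azure AD outage, the wait on step 3 times out, which points straight at the login service rather than a generic "site is slow."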

[00:30:19.010] – Chris
Yeah, I mean, that’s a good point. That’s really the benefit of synthetic traffic. It gives me the ability to put essentially an operational overlay on top of a network that I don’t control but am now using as my own personal network, and then glean actionable insights from underneath it, right? And that’s ultimately what we’re trying to accomplish, and what we accomplished, with ThousandEyes.

[00:30:45.350] – Ned
Something that strikes me is you also need to know what your objectives are for your service levels, because if you know that maybe there’s a 2% failure rate on that API, is that okay? Do I have retry logic somewhere else that’s going to be fine with that? Or is this a situation where 2% of my customers are going to be unable to complete their transaction, which is $20,000 a minute that I’m losing? If that’s the case, I need a different SLO applied to that particular service. So I think it’s really important from the architecture standpoint to first establish what my objectives are, what I am actually trying to deliver, and then have everything else kind of trickle down out of that. I guess that’s the whole SRE approach to application design.
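
Whether a 2% failure rate is acceptable is ultimately arithmetic against the objective. A toy sketch of that check, with invented numbers, looks like this:

# Toy SLO math: compare an observed failure rate against an objective and its error budget.
slo_target = 0.98              # hypothetical objective: 98% of transactions must succeed
observed_failure_rate = 0.02   # the 2% API failure rate from the example above

observed_success = 1 - observed_failure_rate
error_budget = 1 - slo_target            # fraction of requests allowed to fail
budget_consumed = observed_failure_rate / error_budget

print(f"Observed success rate: {observed_success:.1%} (objective {slo_target:.1%})")
print(f"SLO met: {observed_success >= slo_target}")
print(f"Error budget consumed: {budget_consumed:.0%}")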

[00:31:32.320] – Barry
You nailed it, I think. Yeah, you set goals, and then you utilize a normalized data set to validate that those goals are accurate based on what you see, and then create a runbook or alerts and everything off of that. Yeah, you nailed it. You’re 100% right.

[00:31:50.610] – Chris
And that’s kind of how I see what ThousandEyes is trying to accomplish, and what I think it does accomplish. Essentially, we needed a new common operating language to describe Internet performance, SaaS performance, cloud performance, web app and API performance. How do we get that if we try to separate all the pieces out, like, hey, I only want to look at this from an IP-layer perspective of loss, latency, jitter, and so forth? How do I map that to something that a user experiences, right? So it’s more than that. I could throw up a script with iperf or hping or something and try to ascertain something between two points, but that’s a very limited metric that I’m going to have. That’s not going to give me a full, layered view of how that application is performing across all of these different spans of control.

[00:32:46.040] – Ethan
Ultimately, this may not be a fair question to ask you guys, Barry and Chris, but I’m going to ask it anyway. Based on your observations of the kinds of cloud service provider outages that we’ve seen, and we’ve talked about two here, the AWS and Azure outages that happened at the end of 2021, do you have recommendations or thoughts on how I build a resilient application, so that when some major cloud component fails, my app is resilient and it carries on?

[00:33:16.220] – Barry
That’s a really good question. I don’t know why, but when I get something like that in my head, it’s kind of like someone talking about your stock portfolio, right? Diversify, diversify. Not to use buzzwords, but I’m going to: it’s multi-cloud, whether that’s multi-region, whether that’s across different vendors. You really need some sense of sprawl. You just need to think about your strategy for mitigation. And I say it that way because people don’t. You architect the app and you just assume that this is a high-availability environment that I’m hosting it in, and that’s good enough. And sometimes it’s not, right?

[00:34:03.550] – Ethan
It’s super resilient, redundant, blah, it’s going to be great. And then it’s like, yeah, you really should be spreading across availability zones. Yeah, you should really be spreading across regions. Really. And have you thought about anycasting and so on? Your ability to be truly resilient, just within one cloud, is a massive undertaking, with a lot of different components you can light up and pay dearly for to get that resilience. It does become an architecture question, but then it’s also a money question. And then it’s also, all my customers are in this one place, and it doesn’t make sense to stand something up in Europe because I don’t have any customers there. Maybe. So it’s complicated.

[00:34:46.250] – Chris
Yeah. I think that hits it. When you say, what can you do, what can I do as a cloud dev to make my stuff more resilient, or make my stuff more available, I should say, it’s factoring resiliency into the design. I know that sounds simple, but there’s a reason everyone pays double for network firewalls: because they want two of them, right? At least two, right? You have an active and you have a standby. We’re used to factoring resiliency into our designs as infrastructure architects. If money is unlimited, you can have almost everything. You can have so much resilience, it would be almost impossible to bring you down. But of course, we know money is not unlimited. So we go, well, here’s how resilient I can afford to be. And that’s how we all do things. But even within a cloud, we want to factor resiliency into the design, whether it be geo redundancy or availability zone redundancy or other layers of redundancy. And beyond building resiliency into my design, I need to understand what the fault lines are within my architecture. Where can this break? In other words, if I can’t get to the EC2 API, what does that hold up for my service delivery process?

[00:36:03.500] – Chris
Like, what’s going to stop? And how does that impact me, and these sorts of things. I think the only thing I would add is the third thing. In addition to building in resiliency and understanding my fault lines, it’s just having that visibility, right? So having, again, the same thing we’re talking about here today, having visibility and monitoring, and as a result of those two things, getting actionable insights. Not just insights, but actionable insights, right? I mean, I could run a Python script against my infrastructure and find out all the firmwares that are even-numbered, but that’s a stupid insight that doesn’t give me anything. But if I have something that’s beneficial, I’m like, oh hey, this is a path that’s seeing persistent issues, and maybe you can do something with BGP to construct a different path. This is SRE stuff, but something that’s actionable. So it’s really about having that visibility, and visibility over, again, a network that used to be considered a black box. I mean, you go back 20 years, the Internet was a little bit like the Wild West. It was even more Wild West then; it was a use-it-if-you-must kind of thing.

[00:37:08.410] – Chris
But otherwise, let’s get our ATM and multilink frame relay and build really expensive in-house WANs that we have all the control over ourselves. And that’s how we used to do things. But we’ve seen such an improvement in the Internet over the last two decades that we’ve gotten a lot of confidence to use it as our own personal network, right? And not just that, but you have the increased reliability and availability of the Internet coupled with this kind of everything-as-a-service. So now I can trust the Internet a little more, not fully, and I’ve got all of these really lovely services out there. So I don’t have to have all that CapEx on site. I can have reduced operational headaches of managing the infrastructure, and I have all this flexibility and agility, but I lose some visibility and control over that stuff when I gain the agility to deploy across it, right? So I saved costs, but then I’m adding costs in other ways. It’s an interesting time, but I think the takeaway here is having that visibility, having that visibility over the black box of the Internet and all of these services.

[00:38:17.650] – Ethan
Well, actually, Chris, I do have a follow-up question to that one, because if we do have this nuanced knowledge because of all this great monitoring data, and we kind of understand at a deep level even what’s broken, how does that help me reduce my MTTR? Because I don’t own whatever is broken. So what am I supposed to do with all of this amazing data?

[00:38:40.340] – Barry
I think about doctors before the understanding of what germs were or anything else, and you’re trying to fix someone’s disease by throwing leeches on them to get the ghosts out of their blood. That’s kind of what it’s like: if you can’t see what’s there, how could you ever hope to fix it, right? So that was a very long, folksy analogy, but the idea is, without visibility into any of the problems, how are you going to do anything, and how are you going to figure out how to impact it, whether that’s you logging into a box and bouncing a service, or that’s getting on the horn with your vendor and providing them the information to show what you’re seeing? Like, hey, my customers are impacted because I’m getting 500 errors every time this part of my service hits your API; what the heck is going on? That’s super useful. You may not be able to fix that service, but you’ve eliminated the network, you’ve eliminated a regional issue impacting only a subset of customers, and you’ve eliminated like 90% of the back-end services by narrowing it down to this one specific thing.

[00:39:47.770] – Ethan
Going back to our conversation about dependencies, I guess it could poke a hole in your architecture where it’s like, hey, guess what? If this service is down, your whole app is out. And so then maybe you go back and work with the dev team to say, if we re-architect this a little bit, we can eliminate this dependency, because we could provide a redundant service for this in this other way, whatever that might be.

[00:40:09.860] – Barry
Yeah, even from an error-handling perspective, even if the service is such that you can’t build in redundancy, actually letting someone know, with a relevant error message, when something happens goes such a long way. And I think of it just like whenever you call something that you’re a customer of, like you call your ISP and you say, I have a problem. The worst thing they can ever say is, we don’t know, we’re looking into it, right? As the customer on the other end of that customer service rep, that’s horrible. That triggers everyone. And so if you think about it from an application perspective, if that front line, before your customer even reaches out to figure out what is going on, is telling them something helpful, like, hey, we’re aware of it, or a third-party service is down, something like that, which you could get to just based on what’s happening, you’ve gone such a long way in delivering a positive customer experience, even when something isn’t going right. And so I think that’s huge.

[00:41:06.900] – Chris
Well, let me just add to that. First, I’ll say that what you glean from these insights can be both short- and long-term things, right? It depends on the scenario. But maybe there’s some type of scenario where I can do some modification with BGP and take my alternate service provider, or maybe there’s something I can temporarily stand up in AWS if there’s some issue with US-EAST-1, or whatever the issue is. And there are also going to be long-term gains. I mean, maybe I’m seeing a persistent issue with one of my CDN providers for whatever reason, or my video hosting provider, and I need to make some changes, or I need to work something out with that particular provider. So I think, with the complexity of all of these service endpoints scattered across the Internet, not having that visibility is certainly not going to help anything. And having the visibility, you do get these long- and short-term kinds of gains. And to what Barry said, if nothing else in IT ops, one of the key things of working through these issues is to inform stakeholders, set expectations, provide status updates.

[00:42:23.210] – Chris
And like Barry said, if you just say, hey, yes, that critical business service you use, it’s down, we can’t do anything about it and we’ll just wait, that’s not going to go over very well. So I think IT ops process is going to mandate that we have some type of thing to work through, even if we get to the end of it going, hey, here’s the service ownership, here’s who is responsible for this problem, we’re in communication with them, and I can provide status updates on that.

[00:43:00.030] – Ned
Well, Barry, Chris, let’s bring this in for a landing. Do you have some key takeaways for the audience out there?

[00:43:09.070] – Chris
Yeah. So I think what we’ve learned, in just all of this monitoring that we’re doing, is that everything is connected, everything is interdependent. So if I’m responsible for not just service delivery but availability of some application, I have to know what all those pieces are. I have to know not only how they’re connected, but how I can provide service assurance for the total application experience, as well as verify the service level of all these individual pieces. So in the case of Amazon’s outage, for example, we saw that their API gateway was having an issue, and of course, on the back end, that impacted all of these other services that depend on it. We look at the example of all of these additional SaaS services that we interact with even at the application developer level, right?

[00:44:10.480] – Chris
Like, I’m talking to PayPal, I’m talking to Vimeo, I’m talking to GitHub, I’m talking to all these different things. I’m using federated authentication, right? So now maybe I’m relying on Facebook’s authentication service to do something with my application. So it can be a real nightmare if you don’t have a way to glean insights about the performance of all these services, and some way to basically represent a user experience that says, hey, those 20 services on the back end that are stitched together, here’s how it’s looking for our customer when they go through the application experience, and then to provide that kind of layered visibility to tell me, well, hey, when it’s not working well, when it’s not performing well, here’s exactly where it’s breaking down. And I think that just couldn’t be more critical today. I think it’s always been critical to have that level of visibility, but I think it’s even more critical considering we just don’t own the infrastructure that we’re relying on.

[00:45:12.530] – Barry
And I think another thing that we talked about today that’s really important, as we were discussing these outages, is really reminding ourselves that an outage doesn’t necessarily mean the same thing it always has. It doesn’t mean it’s hard down. It means there’s a litany of things that can happen, whether that’s degraded performance, whether that’s really specific workflows being impacted, whether that’s not even being able to get past the front door because authentication is down. And so I think the idea is you have to be deliberate in your actions, both from a design perspective, when you’re building these services, when you’re selecting what microservices to incorporate into the application, but also when you’re planning for day two, when you’re working from an operational frame of mind and building out monitoring, building out your service level objectives, everything. You really just have to think holistically, right? Because it’s not just as simple as hitting the front door. Really, you need a good mapping of the services and you need a good understanding of what normal is, so you can detect deviations and really understand, is this impacting my business? Is this going to cost me, whether that’s from a financial perspective or even a brand reputation perspective, internally or externally?

[00:46:36.300] – Barry
So I think that’s one of the huge takeaways there.

[00:46:39.480] – Chris
Yeah, no, I don’t disagree. I think having a historic data baseline, and continuous monitoring coupled with that historic baseline, is critical just to understand how I’m performing. I would just add another takeaway, like what we just mentioned: if I’m a service desk, or I’m an SRE, or I’m IT operations, the worst thing I can say is, hey, there’s something down and I just don’t have an answer. When we look at our example of App Outages, this is what it was built for: quickly providing that end-to-end view across the entirety of the digital supply chain and getting that insight as to what might be happening. Is it affecting me? Is it affecting other people? Is it a provider, an Internet provider, a cloud provider? Having that visibility, having that answer, is really critical. And with ThousandEyes, that’s what we strive for, right? Giving as much visibility as possible to that point.

[00:47:48.510] – Ethan
Chris, you’re gathering data from a global vantage point. You’re taking measurements from all over the world.

[00:47:54.210] – Barry
Yeah. There are like 400 different ones around the globe that we’re using all the time. And that’s not inclusive of what our customers are doing from their own environments. So it’s just an insanely comprehensive data set that we use to draw these conclusions, right? So it’s not just, is my service down, or, I’m in Brunswick, Ohio, working from home, and I have an ISP issue that’s impacting me. It’s getting to that level of detail, because we’re casting such a broad net, and it’s huge.

[00:48:25.760] – Chris
Yeah.

[00:48:26.040] – Chris
I mean, that was the problem we had to solve, right? It’s just being able to give enough vantage point visibility, to your point, Barry. I mean, we do, we have over 400 service provider network locations, in 195 cities and 61 countries, where we’ve deployed and manage these cloud agents, on top of all the locations where customers are putting the enterprise agents, right? So it really gives us the ability, for any use case you can think of for cloud, whether I’m a cloud-native developer, whether I’m an ecommerce portal manager, whether I want to get the perspective of customers or the perspective of the development team trying to do service delivery, or whatever it is, to get that visibility. That A-to-B measurement, analytics, heuristics visibility to tell me exactly what’s happening at any given time, relative to how it performed in the past as well, with that historic baseline.

[00:49:27.950] – Ned
Very good. Thank you, Chris. Thank you, Barry. If folks want to know more, you’ve piqued their interest. Barry, do you have some links or places folks should go to learn more about Thousand Eyes?

[00:49:38.850] – Barry
Yeah, just head on over to thousandeyes.com. We’ve got links to our blog. We’ve got white papers, state of the cloud reports, videos; pretty much anything you want to know you can get to from our main page.

[00:49:53.030] – Ned
Excellent. And if folks go to thousandeyes.com/packetpushers, they’ll know you came straight from here.

[00:50:00.350] – Barry
Smash that like and subscribe.

[00:50:05.610] – Ned
Well, Barry Wayne and Chris Villemez, thank you so much for appearing today on Day Two Cloud. And our special thanks to Cisco ThousandEyes for sponsoring today’s episode. That is how Ethan and I feed our families every week; they keep asking for more food. Jeez. Virtual high fives to you for tuning in. We really do appreciate it, and we hope it’s time well spent. Ideally, you walk away from a Day Two Cloud podcast being able to do what you do just a little bit better, so you can feed your own families, who are probably also demanding more food. If you have suggestions for future shows, we would love to hear about it. You can hit either of us up on Twitter; we both monitor @Day2CloudShow. Or you could fill out the form on my fancy website, nedinthecloud.com. If you like engineering-oriented shows like this one, visit packetpushers.net/subscribe. All of our podcasts, newsletters, and websites are there. It’s all nerdy content designed for your professional career development. Until then, just remember: cloud is what happens while IT is making other plans.
