
Day Two Cloud 110: Automation’s Unintended Consequences – The Bunny.net Outage Saga

Bunny.net is a CDN provider based in Slovenia with customers and PoPs all over the globe. On June 22nd, an automated code update led to a cascade of failures that crashed the company’s DNS servers, wiped out most of its CDN capacity, and affected 750,000 Web sites.

Despite multiple redundancies built into the system, hidden dependencies thwarted attempts to roll back, reboot, and redeploy. Over two frantic hours, the team struggled to identify the problem, get fixes in place, and get the service up and running. In an effort to be transparent with customers, Bunny.net wrote a detailed postmortem that it shared on the company blog.

On today’s Day Two Cloud, we talk with Bunny.net to understand what happened, what the company learned, and what other infrastructure operators can take away from this experience. Our guest is Dejan Grofelnik Pelzel, founder of Bunny.net.

We discuss:

  • Automation and dependencies
  • Leveraging testing to reveal problems that hadn’t been considered
  • How to recognize single points of failure
  • Considerations around fate sharing
  • More

Sponsor: Zesty

Zesty provides an autonomous cloud experience by leveraging advanced AI technology to manage the cloud for you.
Our AI reacts in real time to capacity changes and enables companies to maximize cloud efficiency and reduce their AWS bill by more than 50%, completely hands-free. It’s cloud on autopilot. Find out how to spend less and do more at zesty.co.

Show Links:

The stack overflow of death. How we lost DNS and what we’re doing to prevent this in the future. – Bunny.net blog

@dejanpelzel – Dejan Pelzel on Twitter

Dejan Pelzel on LinkedIn

Bunny.net

Transcript:

 

[00:00:00.980] – Ned
Zesty provides an autonomous cloud experience by leveraging advanced A.I. technology to manage the cloud for you. Their A.I. reacts in real time to capacity changes and enables companies to maximize cloud efficiency and reduce their AWS bill by more than 50 percent completely hands free. Cloud on autopilot with zesty companies can spend less and do more. Check them out at Zesty.co.

[00:00:33.720] – Ethan
Welcome to Day Two Cloud. And today we have the story of an outage. Yeah, the story of an outage. We love these stories because there are lessons learned for the rest of us. And the story is brought to us by the folks at Bunny.net, not that this is a sponsored show. It isn’t. I just spotted this very transparent recounting of a two-hour outage that Bunny dot net had, and it was up on Hacker News. There was some discussion about it.

[00:00:57.870] – Ethan
And I reached out to them and said, hey, do you want to come on the podcast and tell the story? I just thought it was fascinating, this set of cascading failures that were tied in with automation and DNS. And it was one of those failures that everybody saw. And what did you get out of this conversation, Ned?

[00:01:15.420] – Ned
You know, I got a couple of things out of it. One was that the ultimate test is production, because no matter how much testing you do ahead of time, once that code rolls out into production, now you’re testing the real world.

[00:01:25.740] – Ned
And it’s really hard to test for everything ahead of time, because you just can’t. So that was a big takeaway for me. And the other is when you’re designing systems, you should really try to avoid circular dependencies. But sometimes you don’t see those dependencies until you have a cascading failure.

[00:01:45.990] – Ethan
We get into the details with the founder of Bunny dot net, Dejan Grofelnik Pelzel. Well, Dejan, welcome to Day Two Cloud. And as we said here in the intro, you’re here to tell us a story about bunny dot net and a very bad day that you had, which you were very transparent about in your blog. But we’ve got to start at the beginning here. You’re a founder, or maybe the founder, at Bunny dot net. Would you just give us the high-level overview?

[00:02:11.610] – Ethan
We don’t need, like, tons and tons of detail, but just so we have a big picture idea, what is bunny dot net?

[00:02:16.620] – Dejan
So Bunny dot net actually started as BunnyCDN, and we had a rather humble goal of building an affordable content delivery network. But bunny dot net is actually the evolution of that. So we have a much more ambitious goal of building a faster Internet, or, as we like to say, making the Internet hop faster. So we’re building a set of products on top of the CDN now. It’s really to help developers accelerate, secure, and deliver content, and basically make the Internet faster for everybody.

[00:02:53.070] – Dejan
So in the modern world, you know, every millisecond matters, and it’s kind of our goal to make this global and simple.

[00:03:01.870] – Ethan
OK, so we get the high-level idea roughly stated. It’s a CDN, and now you’re building additional products on the CDN, developer friendly, and you’re making the Internet faster. Got it. There’s a bunch of companies in this space. We know what you’re trying to do at that high level. Now, you had a bad day, and it caught my attention because it made Hacker News, this blog post that you wrote explaining what all went wrong, because there were a couple of hours or so that you guys were off the air pretty much.

[00:03:33.900] – Ethan
Now people can go and read that blog. But still, to set up this conversation, if you could, again, we don’t want to hit every detail, but tell us enough about your bad day so we kind of know what broke, where you ended up so that you could fix it, and then how you got back on the air. And then we’ll drill into what all that means for us once you’ve told that story. So go ahead, man.

[00:03:57.090] – Ethan
Tell us from the beginning.

[00:03:59.440] – Dejan
Yes, so the beginning is what it usually is. You know, we were doing a routine update; you just never sleep with a global product. We deployed an update for the SmartEdge routing system, then a few minutes later it’s just oh sh*t, oh sh*t, oh sh*t, everything’s broken, and it took us a few minutes to realize what was going on.

[00:04:29.380] – Ethan
You said everything was broken, as in literally Bunny was not delivering content out of the CDN? Broken like that or broken in a different way?

[00:04:38.140] – Dejan
So what happened was we dropped a bunch of traffic in just a few seconds. We went from about 250 gigabits, if I remember correctly, to something like 100. So I would say everything was pretty much broken, and, yeah, then a panic for a few seconds and how do we solve it? It turns out we crashed the DNS when there was an update to the database. It ended up crashing all of the DNS servers, which in turn crashed the CDN, obviously.

[00:05:17.370] – Ethan
Now, which DNS servers are these? DNS servers that you use internally?

[00:05:21.780] – Dejan
Yes. So this was our own network. The actual network was designed, you know, with four different redundant clusters. There was unfortunately a bug in one of the software libraries that we use inside of the DNS, and that just ended up exploding everything.

[00:05:47.520] – Ethan
So you ended up corrupting, through this change, all four of your clusters of DNS servers. You had redundant corruption.

[00:05:56.550] – Dejan
Yep. So, you know, usually when you deploy something, you would do it on a small number of servers first. And then if that goes well, you do it on more and more and then you go global. And we really try to design all of our systems around this. But the issue here was that this really happened in one of the libraries that we introduced recently, because we have this SmartEdge engine, which processes all of the data from our global traffic in almost real time.

[00:06:33.970] – Dejan
And then we send that to the DNS to process and route accordingly so we can get good routing. In the past, we used JSON here. And as you know, JSON is not really super efficient for transmitting a huge amount of data. So we had these spikes of CPU usage and garbage collection, and we thought, let’s do this a bit more efficiently. So we switched to a library called BinaryPack.
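To make that tradeoff concrete, here is a rough sketch, in Python with a made-up payload shape (Bunny.net’s actual system isn’t Python and the real data model isn’t public), comparing JSON text against a compact fixed-width binary encoding for the kind of per-PoP routing metrics being shipped to the DNS servers.

```python
import json
import struct

# Hypothetical per-PoP routing sample: (pop_id, load_pct, rtt_ms, cache_hit_pct)
samples = [(i, 37.5, 12.25, 91.0) for i in range(10_000)]

# Text encoding: human-readable, but bulky and garbage-collection-heavy to parse at scale.
as_json = json.dumps(
    [{"pop": p, "load": l, "rtt": r, "hit": h} for p, l, r, h in samples]
).encode()

# Binary encoding: fixed-width records, far smaller and cheaper to decode.
record = struct.Struct("<I3f")  # one uint32 plus three float32s per sample
as_binary = b"".join(record.pack(*s) for s in samples)

print(f"JSON:   {len(as_json):>9,} bytes")
print(f"binary: {len(as_binary):>9,} bytes")  # roughly an order of magnitude smaller
```

The catch, as the rest of the conversation shows, is that the decoder for that compact format becomes part of the critical path.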

[00:07:09.220] – Ethan
Not something you wrote, just a third-party library that would be a more efficient alternative to JSON when you’re ingesting data.

[00:07:15.210] – Dejan
Yeah.

[00:07:15.690] – Dejan
Yeah. So then things were great for a couple of weeks. We had much less CPU usage, we had much less traffic, we had less garbage collection. And then, as I wrote in the post, suddenly we had nothing.

[00:07:34.000] – Ned
The most efficient thing is nothing, isn’t it? Nothing is the most efficient. Well, I think I kind of see what happened here: you did the slow rollout of this new library and it seemed like everything was good. So now you have this library in place globally and you roll out this new update. And unbeknownst to you, there’s a bug in the library that causes corruption in the file.

[00:08:00.000] – Dejan
Yup, so despite the actual code being designed to work around that, there’s no real way to catch and handle a stack overflow exception. So we introduced this single point of failure through, basically, a library that we didn’t test well.
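The failure mode is worth illustrating. A deserializer that recurses over nested structures can be driven arbitrarily deep by corrupted or malicious input, and in many runtimes (including .NET, where a StackOverflowException cannot be caught) that takes the whole process down. The sketch below is a hypothetical toy in Python, not BinaryPack’s actual code; Python at least surfaces a catchable RecursionError, but the lesson carries over: feed the parser garbage before production does.

```python
def decode(payload, depth=0):
    """Toy recursive decoder: every 0xFF byte pretends to open a nested container."""
    if not payload:
        return depth
    if payload[0] == 0xFF:
        return decode(payload[1:], depth + 1)  # recursion depth tracks claimed nesting
    return depth

# A corrupted blob that claims absurdly deep nesting.
garbage = bytes([0xFF]) * 100_000

try:
    decode(garbage)
except RecursionError:
    # Python gives us a catchable error; a .NET StackOverflowException would simply
    # kill the process -- which is what took down every DNS node at once.
    print("decoder blew the stack on garbage input")
```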

[00:08:23.950] – Ethan
So literally the DNS servers were crashing due to a stack overflow and, what, rebooting themselves and just crashing over and over?

[00:08:31.820] – Dejan
Yep. So the fun part there was that because the DNS crashed, we couldn’t deploy the DNS updates anymore.

[00:08:45.640] – Ethan
Because you couldn’t do name resolution to figure out where to actually push your updates to fix the problem.

[00:08:54.580] – Dejan
So basically, how we designed the system is the deployment system takes stuff from the CDN and storage. Right. But then the CDN was dead because the DNS was dead, so suddenly we’re stuck in an endlessly rebooting loop of a hundred servers that just kept pulling that broken file. And, you know, it was a bit chaotic. We tried to roll back; the files were there, but the deploys didn’t go through.
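One general guard against that kind of loop, sketched below under assumptions of my own (hypothetical paths and checks, not how Bunny.net’s deployer actually works), is to validate a newly pulled artifact and fall back to a last-known-good copy kept on local disk after repeated crashes, instead of re-fetching the same broken file forever.

```python
import hashlib
import shutil
from pathlib import Path

MAX_CRASHES = 3
LAST_GOOD = Path("/var/lib/smartedge/routes.last-good.bin")  # hypothetical path

def checksum(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def start_service_with(artifact: Path) -> bool:
    """Placeholder: load the routing file, return False if the process crashes."""
    return True

def deploy(new_artifact: Path, expected_sha256: str) -> None:
    # Refuse artifacts that don't match what the control plane says it published.
    if checksum(new_artifact) != expected_sha256:
        raise ValueError("artifact corrupted in transit; keeping current version")

    for _attempt in range(MAX_CRASHES):
        if start_service_with(new_artifact):
            shutil.copy(new_artifact, LAST_GOOD)  # promote to last known good
            return
    # Crash-looping: stop re-pulling the broken file and run the previous version.
    start_service_with(LAST_GOOD)
```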

[00:09:29.560] – Ned
OK, so you had the older version of the file, but you’re using a deployment system to redeploy the servers. And because the content distribution system relies on DNS to function, you can’t do the distribution to get the DNS servers back up to get your content delivery working. So yeah, it’s just this very difficult loop you’re in. So how do you break out of that? Because I would just throw my hands up in the air and go pour myself some whiskey or something.

[00:10:01.220] – Dejan
Yeah. You know, initially we tried to just roll things back. Then when that failed, we realized, OK, we are in big, big trouble. Right. So how do you get up from that? Well, the DNS crashing crashed the CDN, the CDN crashed the storage, then everything was rebooting and that crashed the API as well. And how we ended up approaching this is we took all of the automation out and we just did a manual deploy to a small number of DNS servers, really to just get something back up.

[00:10:47.680] – Dejan
But then, because the CDN was dead, that corrupted some of the files that were needed by the DNS again, which created more issues. So we ended up just throwing that out, compiling new code just to get things running, throwing out geo-routing entirely. We have a couple of big PoPs, and we just routed all of the traffic to the biggest PoP we had.

[00:11:19.970] – Dejan
And we were just like, OK, let’s just get DNS running and then hopefully we can pull some updates. And that still didn’t work. So we just ended up throwing the whole deployment system into the trash and recoding the deployment system as well. And everything happened in about two hours.

[00:11:44.660] – Ethan
According to the blog, you were back up in about two hours. Oh my word.

[00:11:47.780] – Dejan
Yeah. So we ended up deploying using storage from a third-party service, pulling that to the DNS. Then that slowly revived the CDN. Then the storage was extremely unhappy, because we were pulling hundreds of gigabits from the CDN, which was also trying to pull tens of gigabits from the storage, and probably the hard drives were crying, and, you know, it was kind of a gradual climb back to normal.

[00:12:25.370] – Dejan
And, you know, when stuff is on fire, despite trying to keep your head above water, you make mistakes. So if you want to get all of the details, it’s probably better to just read the blog. But, you know.

[00:12:44.060] – Ethan
Oh, yeah, you mentioned geo-routing. And I know one of the things you mentioned in that blog post was the problem of when you were trying to get some of the DNS back online, some of the geo-routing ended up sending a ton of traffic globally to a really small PoP that wasn’t able to handle the load, for example. Just stuff like that that kept adding to this cascading set of different failures and circumstances.

[00:13:08.070] – Ethan
And you remind me of what happens in a data center if you have a massive power outage and then you’ve got to bring the data center back online. You don’t just turn the power on everywhere. You’ve got to bring everything up a little at a time or you’re just going to be crashing, crashing, crashing, blowing circuits, et cetera. Yeah. So you got it back in about two hours, it sounds like, which is an extraordinary effort.

[00:13:30.350] – Ethan
You ended up writing about it in great detail. And folks, if you’re listening and you want to know the details, we have a link in the show notes at Day Two Cloud dot IO; go to this episode and click through the link there. Just do a quick search for Bunny dot net and outage, and this blog post that is very transparent will pop up, and you can read in even more detail as Dejan covers what all went on.

[00:13:53.990] – Ethan
But this sets a good foundation for us, Dejan. We want to explore. You had a lot of automation there. You had a lot of things where it’s like, yeah, this is the way you’d build a system like this, this big complex monster that’s global, managed by a small team. You’d want a lot of automation. You’d want a lot of systems where the system just takes care of itself. And yet you ran into these challenges with this one unforeseen circumstance.

[00:14:21.290] – Ethan
So let’s walk through, let’s have an architecture discussion, a design discussion, because I know you also mentioned in the blog that you were exploring all the different ways to rearchitect the system so that this kind of thing doesn’t happen again. Let’s start with the issue of dependencies. You ran into a state where one system impacted others in unforeseen ways. Did you take away any particular lessons or have thoughts on that challenge, and can you give some tips on what the rest of us could do with our own systems, what we should be avoiding?

[00:14:50.660] – Dejan
Yeah, sure. I think when we were starting Bunny, right, we tried to do everything right. You avoid internal dependencies, because, you know, maybe you’re starting a project, it’s not super stable yet. OK, I’m not going to use our own storage to deploy things, as it might go down. Let’s use something external. But as soon as the project grows and you’re more confident and you have a stable system that’s been running for years, you know, it’s an easy trap to get into this mindset of, OK, yeah, we built all this cool technology, let’s use it in our own systems.

[00:15:27.900] – Dejan
And that’s probably the biggest takeaway here: to really stick to that original idea that we had initially. Don’t build your own internal systems on top of your internal systems. And I think maybe Amazon was a good example a few years ago. They have a super massive infrastructure and it just all depends on itself. And usually it’s DNS, and that’s why there’s the joke: it’s not DNS. There’s no way it’s DNS. It was DNS. So I think that’s really the biggest issue with dependencies, this trap, which is even more dangerous if you start to build circular dependencies.

[00:16:16.420] – Dejan
And that’s really what happened in our case. Right. We had one system rely on the other system, and it just came crumbling down. So like you mentioned earlier with the electricity, right, I was thinking sometimes it’s better to kill all of it, because the CDN is crashing the storage, right, so you can’t deploy anything from the storage. It’s best to maybe sometimes just kill all of it.

[00:16:46.990] – Dejan
Just bring back a small section, maybe just the internal systems, and gradually heal everything. The DNS is an especially interesting example here, because a lot of the time, I’m not really sure if you’re familiar, but when you’re running a big DNS cluster, you have a lot of resolvers sending queries over and over. And when the DNS crashes or gets overloaded, they start to retry and retry. And that just creates more load.

[00:17:19.900] – Dejan
And, you know, if you’re in this situation, then sometimes it’s just better to kill it, because it’s just cascading over and over, if that’s the right word. But.

[00:17:33.280] – Ned
That’s definitely the right word, and we’ve seen that with, like, power grids, right, where there is a circular dependency of some kind. One small section of the grid gets overloaded, shuts down, and then the cascading failures sweep out to the rest of the power grid and take everything down, to where really the only way to bring it back up successfully is to do it slowly and one piece at a time so those dependencies don’t start overlapping again. So it sounds like that was one of your takeaways: hey, realize what a bad situation we’re in, and bring things down and slowly back up in a controlled way.

[00:18:09.010] – Ned
So you’re not trying to fix the thing while it’s running. It’s like trying to fix a car that’s on fire and still running; maybe turn off the car and put out the fire and then fix the problem. Right.

[00:18:22.690] – Ethan
DNS is an unusual one, as you said, Dejan, in that resolvers are just going to keep trying, because there’s nothing else they can do in that transaction until they resolve that hostname. So they’re just going to keep trying to resolve that hostname until the transaction gives up. But there could be an awful lot of queries that happen there, especially if you’ve got a short time to live on a particular record, so you can’t benefit from caching as much in a hierarchical structure, which it sounds like you were dealing with to some degree. Yikes.
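Resolver retry behavior is dictated by the protocol and by clients you don’t control, but the same amplification problem shows up anywhere one of your own services calls another. A common mitigation, sketched here as a generic helper of my own (an assumption, not something from the blog post), is exponential backoff with jitter so that thousands of clients don’t hammer a struggling backend in lockstep.

```python
import random
import time

def call_with_backoff(fn, attempts=5, base=0.2, cap=10.0):
    """Retry fn() with exponential backoff plus full jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            # Sleep a random amount up to base * 2^attempt (capped), so retries from
            # many clients spread out instead of arriving in synchronized waves.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```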

[00:18:52.570] – Dejan
Well, our DNS is maybe even a bit special, because we do some super complex logic. So it’s not just return a record, that’s really easy; we do a huge amount of calculation in real time.

[00:19:11.380] – Ethan
Yeah, something metrics driven for geo-distribution of queries and responses, that kind of thing, because you’re trying to offload people to different PoPs?

[00:19:21.520] – Dejan
Yeah, we actually do something quite interesting where we look at where every user’s users are going. So if they don’t have a lot of traffic in Australia, for example, it doesn’t make sense to send, you know, one user to the closest PoP, maybe, because that closest PoP might not have the file. If you get like one request per day there, it might make sense to just send them to Sydney, because then you reduce the amount of cache misses. And we try to make a system that kind of monitors all of this in real time and makes sure that the user gets the best performance possible.

[00:20:05.650] – Dejan
And we had some quite good results. But there’s unfortunately very intensive calculation logic behind that, and the amount of data that we transfer to the DNS, that’s kind of what was the root cause of this issue, I guess.
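A toy version of that idea: score candidate PoPs not just by proximity but by how likely they are to already have the object hot in cache given the customer’s traffic in that region. The weights and shape below are entirely invented for illustration; the real SmartEdge logic is far more involved.

```python
def pick_pop(pops):
    """Choose a PoP balancing round-trip time against expected cache misses."""
    def score(pop):
        # A cache miss means a slow trip back to origin, so a tiny nearby PoP that
        # rarely has the file can effectively be "farther" than a big warm one.
        miss_penalty = (1 - pop["expected_hit_rate"]) * 300  # rough origin-fetch cost, ms
        return pop["rtt_ms"] + miss_penalty
    return min(pops, key=score)

pops = [
    {"name": "PER", "rtt_ms": 8,  "expected_hit_rate": 0.40},  # close, but cold cache
    {"name": "SYD", "rtt_ms": 45, "expected_hit_rate": 0.98},  # farther, warm cache
]
print(pick_pop(pops)["name"])  # -> SYD: the warm PoP wins despite higher latency
```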

[00:20:28.380] – Ethan
I was going to ask you why not anycast, but you just answered the question, because the calculation you’re doing has a lot of metrics and your algorithm is more complex than just advertising an IP globally and hoping it’s going to be great. You’re really thinking hard about how to route people and the different conditions upon which you decide where to route people. So you have to use DNS in that case.

[00:20:49.690] – Dejan
Yeah, yeah. We also do anycast as well. In the US, maybe, geo-DNS isn’t that good, so we use anycast there. Somewhere we use latency based, somewhere we use just geo-DNS. We really check each region. So the DNS on our side is basically like the core engine of performance, I would say.

[00:21:17.400] – Ethan
We get what you’re saying. It’s the thing, but it’s also the weak link, the Achilles heel, that happened to bite you in this case, because everything depends on it to get your traffic routed around to where it needs to go. Dejan, another part of this story to me is automation. It feels like there’s a lot that was happening in the background. You kick off some process and a lot of things just happen, which, when it all goes right, is amazing. And that’s what we all want out of our systems, automation.

[00:21:45.750] – Ethan
But it is scary when it does things that you don’t want it to do, which in this case was continuing to load a corrupted file, crash, reboot, and on and on in this endless loop. So how do you add failsafes to the process? What do you have to do to make sure that if things go bad for the automation process, you can maintain control of the system?

[00:22:05.580] – Dejan
Yeah, so first I want to add that I’m a huge fan of automation, but in this case, I think it’s really hard to prevent this. But it is possible. So what we try to do is anything that’s automatable, if that’s a word, everything that’s possible to automate, we try to automate. But anything that’s not possible to reliably automate is maybe not a good idea to automate. So what that means is whatever works 99.99 percent of the time is good.

[00:22:46.260] – Dejan
Everything else is bad. If you can’t have predictable automation, it’s probably doing more bad than good.

[00:22:53.490] – Ned
Right, right. Yeah. The cases where the automation is stable and consistent and reliable, that’s awesome. It just does the thing that you want. But usually what I find is once the logic tree gets too complicated for a piece of automation, there are just too many potential failure scenarios in there, and it’s better just to have a human do it, because at least they can apply some additional logic to it.

[00:23:18.900] – Ned
I thought AI was going to fix all of this, but apparently not.

[00:23:23.300] – Dejan
Well, it might be interesting to see AI automate things until it decides, oh, something’s working really well, let’s shoot everything to Madrid. I think how we try to do this is we just assume that everything will break. Right. So every part of the system should always assume that any other part of the system is dead, always, and just try to do its own thing.

[00:23:55.260] – Dejan
So we really do a lot of, I would say, microservices; they’re not really microservices, they’re just separated services. And yeah, I think it does add complexity, but if done right, it allows a small team such as ours to run hundreds and hundreds of servers. And it just runs; we very rarely touch anything. The system just manages itself, all the load balancing, everything. It’s just, you know, once things do go wrong, it’s really, really dangerous.

[00:24:36.290] – Ned
We pause this Day Two Cloud podcast for an important message from one of our sponsors. Cloud is hard; predicting cloud costs is even harder. What you need is a friend to help out. What you need is Zesty. Zesty uses AI to proactively adapt cloud resources to real-time application needs without human intervention. Now, I know AI is a term that gets thrown around a lot. There’s a lot of hype and a lot of disillusionment. And that is because vendors try to get AI to do everything instead of the thing that AI is actually good at.

[00:25:17.690] – Ned
And that thing is monitoring and optimizing repetitive and identifiable events. Guess what cloud cost optimization is? A problem of monitoring and optimizing repetitive and identifiable events. Zesty is using real-deal AI in the way it was intended. Zesty’s technology leverages AI analysis and autonomous actions based on real-time cloud data streams to automatically purchase and sell AWS commitments. Or, in much plainer English, Zesty looks at the real-time data from your cloud resources and then makes smart purchasing decisions to save you money.

[00:26:00.920] – Ned
And you don’t have to do anything. There are probably some alarm bells going off in your head. You just handed Zesty an unlimited credit card and permission to use it. That’s scary. Fortunately, Zesty offers a buyback guarantee for any overprovisioned commitment. You’re not going to get stuck with a pile of reserved instances you don’t need due to a glitch in The Matrix. That’s because Zesty makes money when you save money. That’s right. Their fee is based on the savings they provide to you.

[00:26:35.780] – Ned
If you’re not saving money, Zesty isn’t making money. That’s what we call, friends, aligned interests. The result is an average savings of fifty percent on EC2 and a mere two minutes to onboard your account. If you’d like a friend who saves you time and money, go to Zesty dot co and book a demo. That’s zesty dot co to book a demo and put your cloud cost optimization on autopilot. Now back to the episode.

[00:27:07.210] – Ethan
You mentioned a bunch of systems here, and they’re distributed systems. You’re dealing with a bunch of servers, systems distributed across those servers in the form of not quite microservices, but, you know, compartmentalized services that do different things. My specialty is networking. And in networking and network design, we talk a lot about fate sharing and avoiding fate sharing and separating your failure domains so that if something breaks, it doesn’t take the other thing with it.

[00:27:36.280] – Ethan
You deal with distributed systems. That’s the world you live in. All right. Talk to us about fate sharing then. You had a system where effectively you ended up with one massive failure state because of all these dependencies across this distributed system. Are you redesigning or rethinking some of your system at this point to improve fate sharing in the distributed system?

[00:28:00.310] – Dejan
Yeah, we actually are. So, you know, as I wrote at the end of the blog post, we’ve kind of dedicated a couple of weeks really to just reviewing everything: what we did, what we’re doing, what we’re going to do, and how to fix this issue. Because, you know, the DNS was kind of our single point of failure here, even though, technically, every system was designed to work by itself.

[00:28:32.110] – Dejan
Right. So it can assume everything’s dead. But if the whole system collapses at once, then the DNS is kind of the weak link. Right. So what we’re doing right now is we have a really nice system in place that allows us to actually cut off DNS entirely. This has been going on for a couple of weeks, where we just go through all of our internal systems and anything that’s very critical.

[00:29:03.730] – Dejan
We are just moving DNS out of it. So, for example, all the CDN nodes in the past used to kind of connect to each other and connect to the optimization system, for example, or the storage through DNS. Right now we’re actually removing all of that and making the nodes kind of independent. They connect to the same API that the DNS does, not to the DNS. And we just have everything everywhere, and every system just knows what to do with that.

[00:29:42.640] – Dejan
And if the other system dies, this system still has everything it needs to continue running.
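A minimal sketch of that pattern, with the endpoint and field names invented for illustration: each node periodically fetches the service catalog from a central API, persists it locally, and keeps serving from the cached copy if the API (or DNS, or anything else) is unreachable.

```python
import json
import urllib.request
from pathlib import Path

CATALOG_URL = "https://api.example.internal/v1/service-catalog"  # hypothetical endpoint
CACHE_FILE = Path("/var/lib/edge/service-catalog.json")          # last known good copy

def refresh_catalog() -> dict:
    """Pull the latest catalog; fall back to the cached copy on any failure."""
    try:
        with urllib.request.urlopen(CATALOG_URL, timeout=5) as resp:
            catalog = json.load(resp)
        CACHE_FILE.write_text(json.dumps(catalog))  # persist for the next bad day
        return catalog
    except OSError:
        # Central API unreachable: keep running on whatever we already know.
        return json.loads(CACHE_FILE.read_text())

catalog = refresh_catalog()
optimizers = catalog.get("optimization_servers", [])  # e.g. [{"ip": "...", "healthy": true}]
```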

[00:29:50.740] – Ethan
Is that something like a HOSTS file with static mappings, or have you moved the name resolution off to a completely separate DNS infrastructure?

[00:29:59.200] – Dejan
The DNS infrastructure remains the same, so it’s not a HOSTS file. The DNS is now just reserved for running the CDN, serving the users, doing the stuff it needs to do. But internally we actually just connect to the API. And since we wrote the majority of our own software, because we have some quite unique infrastructure, I would say we did it in a way where we can just do it in code: get a bunch of information from the API, save it locally, maybe in a database, keep it there just in case, and just load that.

[00:30:44.320] – Dejan
Use that. So, for example, now every edge node knows where all the optimization servers are and they just, you know, they can select one. They know if it’s online and all of that. So no, no DNS anymore.

[00:30:58.450] – Ethan
You don’t have to go through a service discovery process. You’re pre-populating that information. Yeah, there’s a notion of this with software-defined networking in certain situations, where rather than asking routers and switches and so on to learn where remote destinations are, you pre-populate their forwarding tables with that information, because you’ve got some central knowledge store that knows all that information. It just tells the system, this is where you’re going to find the things, and it gives you a bunch of pretty interesting capabilities, too, because you can very granularly control what’s going on.

[00:31:31.360] – Ethan
If you have this kind of brain at the top that is pushing down into the system the way you’re going to move through it. It sounds like you’re doing that: rather than lean on DNS, you’re pre-populating. So you’ve eliminated the dependency on DNS, and it just occurs to me you’ve also given yourself some interesting powers.

[00:31:52.070] – Dejan
Well, I was just going to say that we actually discovered some very interesting possibilities here, and in turn we now have some really quite exciting and unique projects going on thanks to this. So, yeah, you’re right, actually. It’s also now better because the CDN can actually do load balancing, do monitoring, a lot of the stuff there, but that’s just the basics. Maybe some more interesting things we can do: since we know where everything is, we can do retrying as well.

[00:32:35.480] – Dejan
So it’s quite easy to go to one location and check if something’s there. Maybe even the routing to the storage is better, and the performance is better as well because we don’t need to do any lookups. Especially in a dynamic system, that was adding a little bit of latency, and, you know, the faster the better.

[00:32:56.530] – Ned
Yeah, right. Because your DNS system was doing that complicated logic for every lookup request that came in. Now that’s bypassed. It just knows: these are the storage servers I want to connect to, these are the deployment servers. How often is that information refreshed on the edge nodes, and where does that refresh come from?

[00:33:16.870] – Dejan
So the actual refresh right now comes from the central API, which is, I would say, almost a database and an API, and then all of the logic happens on the edge. And how often this happens: because we don’t really need to sync everything, we can just push data in almost real time. So it gives us a lot of opportunity to ignore DNS caching, ignore anything. We just get everything in real time, almost.

[00:33:50.590] – Ned
Right. Right. You’re not relying on a poll refresh cycle. If there’s new information, you can just push it out to the edge nodes where it’s pertinent to those edge nodes, and you have that up-to-date information. You don’t have to wait for a TTL to expire or a cache entry to expire. That’s kind of cool. So it’s almost like the failure provided an opportunity to improve your systems.

[00:34:14.320] – Dejan
Well, yeah. I mean, that’s what you should do, right, when you fail: learn from it. Right. So this for us is very exciting as well, because we do a distributed edge storage which basically hosts files around the world, and now the CDN knows exactly where to connect. It doesn’t need to do a lot of extra stuff. So, yeah, quite cool. I’m not going to say I’m happy that it happened.

[00:34:45.010] – Dejan
We tried to make the best of it, I would say.

[00:34:50.140] – Ethan
Yeah, well, it’s funny, the things that you don’t think of that crop up during that failure scenario. I was dealing with a network that had a management network spread across every switch and router in the data center. We had a broadcast storm happen on that one management network, and we lost the whole data center because of that. It was a very simple design flaw, but it didn’t really occur to any of us. We were all worried about the security and segmentation.

[00:35:14.650] – Ethan
This is the management network, so we manage this. We had all the ACLs in all the right places and everything was perfect as far as that went, except that we had one network common to all of these devices. And in that broadcast storm, the control-plane CPU got clobbered on all these devices. The data center went down. It was horrifying. Things that you learn when the failure happens that you never thought about. Which leads me to a question about testing, Dejan.

[00:35:36.000] – Ethan
What are your thoughts on testing? Because, of course, we don’t want this to happen in production. In theory, if we improve our testing regimen, we find things before they blow up on us in production. So how could we test better?

[00:35:53.200] – Dejan
Yeah. So I guess first I want to say always test as much as possible, right, but only when it makes sense. I think maybe sometimes we’re in danger where we get super confident with testing and then we rush to production. So I would say the first thing is not to get too confident with the testing. Now, it’s important to also understand what you’re testing. Right. So in our case, for example, it was really easy to say, you know, this is stable, this works, but then some garbage data just corrupted everything.

[00:36:32.800] – Dejan
Somebody running automated tests goes, OK, everything’s working fine. Right. So I would say it’s important that you really think about how and what can go wrong, because in the end, the tests will only show things that you already planned for.

[00:36:53.500] – Ethan
Or, you know, for me, with some of the code that I write, one of the things I tend to be lazy about is input sanitization. It’s like, it’s just me.

[00:37:01.600] – Ethan
I’m the one who’s using it. Yeah, I know what I’m putting in there. It’s fine. I can trust myself. And so I tend to get lazy on that stuff. And of course, then the code makes it out into the wild to a wider audience, and you don’t know what’s going in there. You don’t know if someone’s going to start poking at the code for vulnerabilities and so on. But then there’s also that question of getting your head around what could go wrong and then creating tests for that. Sometimes it’s hard to know what could go wrong until it goes wrong and you’ve had that horrifying experience. So to me, that’s just a bit of a tough one.

[00:37:35.710] – Dejan
I think I have a good background here, because when I started programming, I was actually working on mobile apps. That was over 10 years ago. That gives you a really nice perspective where, you know, whatever the user puts in is garbage and it’s broken, and everything’s going to break. So you have this mindset, and I have this mindset even now. Right.

[00:38:10.160] – Dejan
So everything’s broken. So, yeah, I think that’s really important. But around testing, for example, maybe in our case it’s more interesting to talk about how to test the infrastructure. Right. Because testing code is one thing, but testing how that code behaves under load is a different thing. So, you know, at some point you have all the tests, everything’s passing, you’re confident, you have years and years of experience of what could go wrong.

[00:38:43.930] – Dejan
You covered all the cases, you ran stuff on staging, and everything’s working fine. But then you put it into production and somebody breaks it in the first five minutes. So I think an incredibly important thing, which was also kind of pointed out on Hacker News, is to always do canary testing. Always start on a couple of servers, maybe one server, just see how that goes, and then slowly go bigger before you go global.

[00:39:21.040] – Dejan
And I think this is maybe a better approach, a better way of testing things on such a complex system, because stuff will break and it’s important it does not all break at the same time. In our case, we always try to do this; like right now we’re testing a couple of servers with a new update, and we always do one DNS node at a time. So everybody was super upset at us, like, why didn’t you just try to test on NS1 first or NS2?

[00:39:56.390] – Dejan
And I’m reading that and I’m like, but we did. But that was the one dependency that kind of turned out to be a single point of failure. And then again, somebody pointed out, you know, why don’t you, I’m not sure what the phrase is, test the library with garbage data. And again, that brings you back to understanding what you’re testing.

[00:40:26.500] – Dejan
Right. So that’s why it’s important to test on a small scale, because the unit testing and integration testing is only as powerful as what you can think of in the beginning when you’re designing it, and you’re basically catching bugs that you’re trying to avoid anyway.
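As a rough illustration of that staged approach (generic placeholder Python, not Bunny.net’s actual deployment tooling), the idea is to roll out to progressively larger slices of the fleet and stop the moment a health gate fails.

```python
import time

STAGES = [1, 5, 25, 100]  # percent of the fleet; canary first, then widen

def deploy_to(servers, version):
    print(f"deploying {version} to {len(servers)} servers")  # placeholder

def fleet_healthy(servers) -> bool:
    return True  # placeholder: check error rates, crash loops, traffic levels

def staged_rollout(fleet, version, soak_seconds=600):
    done = 0
    for pct in STAGES:
        target = fleet[: max(1, len(fleet) * pct // 100)]
        deploy_to(target[done:], version)   # only the servers not yet updated
        done = len(target)
        time.sleep(soak_seconds)            # let the canary soak under real traffic
        if not fleet_healthy(target):
            raise RuntimeError(f"rollout halted at {pct}%; roll back {done} servers")
```

The catch in this story, of course, is that a staged rollout only protects you if the failure shows up during the canary stage; a latent bug that only triggers on a particular piece of bad data, as with the BinaryPack file, can still hit every node at once later.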

[00:40:51.970] – Ethan
And testing something running on a single server is one thing; that’s like passing a first tier of tests. As soon as you move to distributed and you stick a load balancer in the mix, that’s another thing. Now, if you add DNS to it, you’re testing kind of a whole different thing. There are components you can test and build out, but as the system grows in complexity, capability, and scalability, the number of interdependencies and the number of interesting things that can go wrong grow, and it gets harder and harder to get your head around the ways in which the network could fail, or a slowly failing hard drive could fail.

[00:41:28.960] – Ethan
A hard drive, whatever, et cetera. And so testing is this thing where you’ve tested one scenario that’s small, and then when you expand the system to be distributed and scalable, you’re kind of testing something entirely different.

[00:41:45.300] – Dejan
You know, you mentioned one hard drive dying. Now, imagine having five hundred servers, and then, you know, every few minutes something is potentially dying. And without automation, as we talked about earlier, that would just be a really, really bad situation. Right.

[00:42:07.180] – Ned
I think one of the things that has been pointed out before, this certainly isn’t an original thought on my part, is that anything that makes it through all of your testing and staging and QA ends up in production, and then you’re in effect testing in production. So anybody who says they don’t test in production hasn’t realized that the ultimate test is running it in production. And for me, there are diminishing returns on how much testing you do once you get past a certain point, because of the time and the cost of doing a full end-to-end test; you would have to replicate your existing production system and run it side by side.

[00:42:47.440] – Ned
That’s a tremendous cost. So the benefit would have to carry a corresponding weight to the cost of doing that level of testing. So at a certain point, you say this is enough testing and we’ll just deal with whatever happens in production.

[00:43:03.510] – Dejan
Yeah, I mean, I think that’s why I also mentioned understanding what you’re testing. Because if you’re wasting huge amounts of time on something, yeah, there can really be diminishing returns. And a lot of the time, I would say, issues happen under loads that are really hard to test, especially in something like our case, where we’re pushing edge servers to read from the hard drives at, I don’t know, multiple gigabytes per second.

[00:43:42.210] – Dejan
That’s really hard to actually test reliably, because they’re just being blasted with so much traffic, and sometimes even a customer has, like, a dying origin. And if it’s a big customer, we just get millions of connections. And, you know, you’re not going to test for that, at least not easily. Maybe once you’re confident enough, it’s time to do a small production run and see how things go.

[00:44:16.810] – Ethan
You know, it’s interesting what you were pointing out, Ned, about production really being the final test. It’s not, the tests all succeeded, yeah, we’re ready for production. It’s, OK, we’ve done what we can do, now production is the thing. Dejan, right, the load you experience in production is probably unlike anything you’re going to be able to generate in a test, and that’s going to reveal certain problems. So, all right.

[00:44:39.400] – Ethan
I’ve been in this situation as well, going to the new thing, and we’re ready for production, and management would always say, I want a rollback plan. If things go bad, we’re going to roll back. I’ll write one. But the reality is, and I want to get your take on this, rolling back is very often just impossible. You’ve gone forward. You’ve got to make the thing work now. And rolling back can be pretty tough, depending on what you’re trying to do, particularly with infrastructure upgrades.

[00:45:08.740] – Ethan
Do you have a take on rollback plans, whether they’re worthwhile or not? What are your thoughts?

[00:45:15.690] – Dejan
So I would say in our case, we’re kind of lucky because most of the systems can be rolled back. But I understand that that’s maybe a bit special. You know, if you look at all the big outages, it’s usually a software bug or software update, maybe on a router or something like that. Maybe somebody deployed one wrong character or something like that. Then the whole thing kind of crumbles, and it’s a bit harder to just roll back.

[00:45:45.910] – Dejan
And even in our case, you know, the super solid rollback plan that I described just crumbled, right? I’m a big fan of trying to make sure that you can roll back. So even right now, when we’re deploying, like we’re doing a new kind of set of updates, we actually have a rollback in place that’s just a toggle in the database. So we can just say, OK, go to the new system; OK, something’s broken, go to the old system.

[00:46:18.310] – Dejan
Right. So those kinds of things are super useful. But maybe it’s a bit of a special case, because if you’re doing, say, software updates, then once you switch things over, it’s over. Right. So I would say it depends, I guess.
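That toggle pattern is simple to sketch. The names and storage here are hypothetical (SQLite stands in for whatever database holds the flag); the point is that "roll back" becomes a one-row update rather than a redeploy.

```python
import sqlite3

db = sqlite3.connect("feature_flags.db")
db.execute("CREATE TABLE IF NOT EXISTS flags (name TEXT PRIMARY KEY, enabled INTEGER)")
db.execute("INSERT OR REPLACE INTO flags VALUES ('new_routing_engine', 1)")
db.commit()

def use_new_routing_engine() -> bool:
    row = db.execute(
        "SELECT enabled FROM flags WHERE name = 'new_routing_engine'"
    ).fetchone()
    return bool(row and row[0])

# In the request path: pick the code path per request, so flipping the flag back
# to 0 in the database is the entire rollback -- no redeploy, no DNS involved.
engine = "new" if use_new_routing_engine() else "old"
```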

[00:46:35.540] – Ethan
Yeah. And that’s kind of my take. I’m a big fan of rollback plans as well. It forces you to think through the update that you’re doing and how you would get back to a known good state if you could, but it also reveals the go/no-go point where it’s like, if we hit this button, if we do this step, it’s going to take us longer to roll back than it would to just hammer through and try to make everything go if something goes south.

[00:46:59.590] – Ethan
So it’s more complicated than, yeah, I’ll just roll back, it’ll be fine. Maybe. These complex changes especially can be really tough.

[00:47:11.200] – Dejan
So I think maybe it depends on where you’re coming from. If you’re in, like, a management position and you don’t really understand what the rollback means and where it makes sense, then maybe you’re just pushing for it. But if you’re actually designing the system as well, then you better understand what’s actually rollback-able and what’s not.

[00:47:33.640] – Dejan
And, you know, I’ve definitely been in a situation where I just pressed the button and closed my eyes and hoped everything wouldn’t explode in my face and I wouldn’t hate myself for the next three months. And yeah, thankfully, things went OK. But I understand the position that you described earlier.

[00:47:57.790] – Ethan
Well, there are a couple more things I wanted to get into before we close the show today. One is gray failures. You alluded to this earlier. What if you’ve got hundreds of servers out there? Something’s broken, probably constantly. So how do you, I mean, we can test for things that are hard down. We know how to test: the link is broken, the server crashes. Those things are easy to test for, but most failures aren’t really like that. Something’s slow or kind of dodgy, or a network link’s throwing errors here and there. How do you consider design to accommodate that reality, where you’ve got a partial failure in the system somewhere?

[00:48:31.030] – Dejan
Yeah, I guess I like to joke that, you know, when you have five hundred servers, everything’s just on fire all the time. It really brings me back to automation, right. I think the best way to approach this is just to assume that everything’s broken and make sure that whatever can be broken is monitored in an automated way. So, for example, a disk dies, you get an alert, and the system either shuts it off, makes sure it’s not being routed to, or something like that.

[00:49:10.610] – Dejan
A server dies, you know, take it out. It’s really important to have this kind of automated monitoring, especially when you’re doing something at scale.

[00:49:20.810] – Ethan
And the reaction isn’t simply to send an alert to a console, but to take an action: take that system out of service, don’t send it requests, take this thing out of the pool, as opposed to relying on a human to take it out of the pool.

[00:49:35.150] – Dejan
Yeah. So in our case, it’s maybe an interesting setup, I’m not sure. What we do is, when we have an edge server, we have, I don’t know, 10 disks on it, 16 disks on it. And if one disk dies, the server itself keeps monitoring that, and it tells the server, like the nginx, immediately: look at this, this disk is dead, stop using it.

[00:50:03.370] – Dejan
And then it also sends us an alert, and it’s fine. We can keep that server running. Meanwhile, if we just sent an alert, the server would be broken and spewing errors, and that’s a great way to get depressed, because you’re just fixing everything all the time. And then it kind of goes up a couple of levels as well. So, you know, a disk dies, OK, let’s turn it off.

[00:50:39.130] – Dejan
But if the server dies, OK, let’s turn it off as well. And we do that at the DNS level. So, for example, the DNS monitors the servers. A third-party service also monitors the servers. And if there’s any issue there, we just turn that server off. So the DNS is smart enough to understand what’s working and what’s not. And I think that gives us a really nice way of

[00:51:08.800] – Dejan
not fixing everything all the time, because it just happens automatically. But it’s also important, I think, to mention that it’s maybe important not to trust all of the monitoring sometimes, because you can end up in a situation where you get a false positive and suddenly everything is burning again, just because maybe a third-party service or your own service reports that everything’s dead for some reason. So: test the tests.
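A stripped-down sketch of the local half of that loop (hypothetical paths and alerting; the real agent is obviously more involved): probe each disk, pull failed ones out of the serving pool, and alert, rather than waiting for a human.

```python
import os

DISKS = [f"/mnt/cache{i}" for i in range(16)]  # hypothetical cache disk mounts
active = set(DISKS)

def disk_ok(path: str) -> bool:
    """Cheap health probe: can we write and read back a sentinel file?"""
    try:
        probe = os.path.join(path, ".healthcheck")
        with open(probe, "w") as f:
            f.write("ok")
        with open(probe) as f:
            return f.read() == "ok"
    except OSError:
        return False

def send_alert(msg: str):
    print("ALERT:", msg)  # placeholder for the real paging/alerting hook

for disk in list(active):
    if not disk_ok(disk):
        active.discard(disk)  # stop routing cache reads/writes to it
        send_alert(f"{disk} failed; removed from pool, server keeps serving")
# The node keeps serving from the remaining disks; a dead *server* gets the same
# treatment one level up, where the DNS and an external monitor drop it from rotation.
```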

[00:51:45.510] – Ethan
Oh, now that’s really interesting. You’re reminding me of my days dealing with load balancers and writing sufficiently complex and appropriate tests to get an accurate assessment back of the service that you were testing, which took some doing. You can’t just count, is there a TCP listener there? Yeah? What does that prove to you? Nothing, if you’re trying to monitor a Web service. It proves there’s a listener out there; it doesn’t prove that the servers are delivering data, or that it’s the data that you want, or any of the rest of it. You’ve got to have a much more complicated test to pull that off.
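For example, a health probe along these lines (hypothetical URL and expected marker) checks that the service actually returns the content you expect, instead of merely confirming something is listening on the port.

```python
import urllib.request

def service_healthy(url="http://127.0.0.1:8080/health", expect=b"status: ok") -> bool:
    """True only if the service answers quickly with the content we expect."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200 and expect in resp.read()
    except OSError:
        return False  # connection refused, timeout, DNS failure, and so on
```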

[00:52:18.480] – Ethan
And the false positive one is interesting. The service is dead. No, it isn’t, it’s fine. Right, test the tests. Really interesting.

[00:52:27.180] – Dejan
Yeah, well, it sounds bizarre, but, you know, sometimes people actually use testing the tests as an example of somebody doing something stupid. But sometimes it actually makes sense.

[00:52:40.640] – Ethan
Especially with the more complex tests that you might have. Well, OK, one more question, Dejan, and that goes back to getting your design reviewed, recognizing things like single points of failure. How do you do that? And I ask it in this context.

[00:52:57.560] – Ethan
I’ve done a lot of designs where sometimes you’re really close to it, because for so many days you’ve been writing documents and thinking, and there’s a whiteboard and there are diagrams and there are meetings and you’re tired, and you’re like, this is good, right? It’s good. It’s golden. We thought of all the things, we figured out this design. We nailed it. But then you put it in production and you didn’t nail it, because you forgot something, something you didn’t see.

[00:53:20.990] – Ethan
How do you prevent that from happening, so you see the flaws, the big design concerns? Do you bring in a, you know, a consultant or third party to kind of look at things, or, you know, leave it for a week and come back? Do you have a strategy for that?

[00:53:35.210] – Dejan
I think the most important part is actually the design phase. Even if you try to fix it after the design phase, chances are things might already be in a state where it’s hard to pull back. So I think the planning and designing itself is probably the most important stage here. And I would say the best way to describe it is, I have ten-plus years of experience in breaking things. And, you know, the more you break, the more you learn what can go wrong.

[00:54:14.900] – Dejan
So with years and years of breaking things, you kind of get an intuition, or maybe that’s not the right word, but you just sense where things go wrong. But that’s more about the design phase. Right. So maybe to actually answer your question a bit better, I would say it’s really good to just make something, and before you put it live, like you said, just walk away, maybe for a couple of days, then look at it again.

[00:54:55.200] – Dejan
Magically, usually you will find something that you completely missed, because that’s just how we work; we really get into the zone, everything’s working, everything’s magical. That just happened to me recently. I’m working on a project, and it’s an amazing piece of technology, it’s really amazing, everything works. I tested everything like 50 times, then you turn it on and it just starts smoking.

[00:55:25.320] – Dejan
So it’s really important, I think, to walk away and maybe be patient. Don’t rush. I would say in our case, we’re still a small team, so third parties are maybe a bit too much for us, but if you’re in a bigger company, maybe that makes more sense. I think a third party can bring an extra set of eyes and an extra set of thinking that you might not have.

[00:55:55.770] – Dejan
But if you have a super complex system, maybe like ours, it can take quite some time explaining and understanding how it actually works. So it’s maybe easier to spot stuff internally.

[00:56:11.970] – Ethan
Yeah, there’s a tradeoff there. But even if it’s not an external source, maybe it’s someone else in the company that you work with. It’s like, hey, can you come over for a couple hours? I want to show you something that we’re kicking around, and I want you to shoot holes in it. And, you know, maybe they don’t have your area of expertise exactly, but they know enough to be able to shoot some holes in it, show you the things you’re not seeing. Because that’s one of the things I learned in design: the more people you involve, with their shared pool of experiences, the more likely some of those experiences are going to come out. Like, oh, you’re choosing that? Maybe you want to do it like this, because let me tell you a story. And then all of these little things percolate up, and you distill the design down through all these different people, and you end up with something that’s ultimately better than what you started off with, because you’re leveraging all these other experiences of all these other people.

[00:57:08.580] – Ethan
Sometimes it’s just an ego thing. Some people are like, I don’t want anyone else to look at it because I’m awesome. And, you know, you’ve got to get over that and know that you can just benefit from what other people bring to the table. And that third-party perspective is going to see things you don’t see.

[00:57:23.550] – Dejan
Yep, yep. Yeah, that makes sense. And hopefully what happened to us will actually help somebody avoid a similar fate. You know, I think you’re right. I think maybe even just presenting the idea to somebody and giving a high-level breakdown is already enough for somebody to just go, oh, yeah.

[00:57:48.180] – Ethan
It can be. Yeah, it certainly can be. Well, Dejan, man, we appreciate you coming on and having this chat. These were good lessons learned. You guys were transparent, which, with all the companies that put a spin on what really happened, you kind of don’t know, and you’re like, I don’t know if I can trust that response. You guys were so transparent and clear about what happened at Bunny dot net, and I think that was pretty awesome. So if you’re out there listening to this and you want to read it again, just go to Bunny dot net and look at the blog.

[00:58:18.530] – Ethan
There’s an article called The Stack Overflow of Death, about how they lost DNS, that Dejan, our guest here today, wrote. Have a read for even more detail on what happened, and then send him some virtual hugs out on Twitter, because it was a rough couple of hours for those folks. One of those outages where people noticed; it was a thing that got seen. But then again, the transparency was huge, Dejan.

[00:58:44.030] – Ethan
Dejan, if people want to follow you or bunny dot net on the Internet, how might they do that?

[00:58:48.500] – Dejan
Well, they can either follow me on Twitter; I’m not super active, I’m mostly just focused on bringing the company to the next level, where we have some really exciting projects right now. But maybe follow Bunny dot net on Twitter instead.

[00:59:08.870] – Ethan
Good stuff. Thank you very much, Dejan. We’ll have links to all of that, the article and so on, in the show notes at Day Two Cloud dot IO. And you can also find that at Packet Pushers dot net. So, Dejan, thanks very much for appearing today on Day Two Cloud. And if you’ve been listening all the way through to this deep dive on how sometimes things can go wrong and how to make them better, hey, virtual high fives to you for tuning in.

[00:59:30.800] – Ethan
You are awesome. If you have suggestions for future shows, we do want to hear them, Ned and I do. You can hit us up on Twitter at Day Two Cloud show, or fill out the form on Ned’s fancy and rebooted website, Ned in the Cloud dot com. This show is part of the Packet Pushers podcast network. The Packet Pushers have a weekly newsletter for you, Human Infrastructure Magazine. HIM is loaded with the best stuff we find on the Internet.

[00:59:54.650] – Ethan
We find all kinds of articles that explain things, give concepts, news that would be interesting to engineers, et cetera. And then we write our own feature articles and commentary in there. It’s free, it’s entertaining, it doesn’t suck, we promise, and we respect your privacy. We don’t sell the mailing list to anybody or anything. It’s just for the community; that’s really what it’s all about. Get the next issue at Packet Pushers dot net newsletter.

[01:00:17.480] – Ethan
And until then, just remember, cloud is what happens while IT is making other plans.

Episode 110