Day Two Cloud 101: Closing The Network/Cloud Gap Before You Fall In (Sponsored)

On today’s episode, sponsored by BlueCat Networks, we examine the technology and human challenges that arise when you integrate on-prem and the public cloud. You can’t continue to do things in the cloud with traditional toolsets and processes. You need to update the tech and the people, including how they collaborate.

We also talk about the role of DNS in the public cloud, and discuss a new report from BlueCat that examines the need for, and challenges of, integrating networking and infrastructure/cloud teams. Our guest is Andrew Wertkin, Chief Strategy Officer at BlueCat.

Takeaways:

  1. Plan architectures collaboratively. If you’re a leader, make sure your teams are doing that. If you’re a planner or practitioner, push for it.
  2. Find common ground. DNS, security, and compliance are good areas where cloud and network teams can meet.

Show Links:

BlueCat Networks Research Report

Bluecatnetworks.com/d2c

BlueCat Networks Blog

@bluecatnetworks – BlueCat Networks on Twitter

BlueCat Networks on LinkedIn

@awertkin – Andrew Wertkin on Twitter

Transcript:

 

[00:00:03.960] – Ethan
Welcome to Day Two Cloud. Oh, Ned, we got a good one today. This is with our sponsor, BlueCat Networks, and we talk about some stuff that may be self-evident to some of you listening: going from on-prem to cloud and integrating those two environments. It’s not a straightforward thing. And it’s not just a technology problem, although I guess it is, but it’s also this human problem. And we talk about some of those things.

[00:00:29.430] – Ethan
And then in the second half of the show, we really dove into this as it relates to DNS specifically, one of the several things that BlueCat specializes in. Ned, what was one of your takeaways from the show?

[00:00:40.680] – Ned
Yeah, I think what really stood out for me is Andrew Wertkin, our guest, made a point that you can’t continue to do things in the cloud with traditional tools and traditional processes. You have to update both the tech and the people that are operating that tech, whether it’s a process or the way the two different groups collaborate. So that was a big focus for me throughout the entire conversation: that need for better collaboration and newer processes.

[00:01:09.530] – Ethan
I got to admit, when that point came up, Ned, I had a little bit of a sad as an old school DNS human going back in the day. Plain old DNS doesn’t get the job done anymore in a modern cloud environment. Enjoy this show with Andrew Wertkin of BlueCat Networks.

[00:01:25.190] – Ethan
Andrew Wertkin, welcome to Day Two Cloud. We’ve got a report we want to get to that you guys commissioned, and it has a lot of interesting information for us to talk about.

[00:01:35.580] – Ethan
But before we get to those details, I want to set the show up this way. We know, because Ned and I talk to a lot of different cloud practitioners, people that are hands-on, that this adoption process is difficult. Going to cloud is hard to begin with. You throw in multi-cloud, it’s that much worse. Hybrid cloud, same kind of challenges. Getting all this stuff mashed together is painful. If you could summarize it down to a few main factors that are making this cloud adoption and integration with IT practices difficult, how would you describe that?

[00:02:07.640] – Andrew
Yeah, sure. I mean, I think it’s almost an age-old problem of scale. You know, it’s easy to do something in isolation once. Looks good, looks easy. And then you try to scale, and the complexity just starts compounding, and it starts compounding because you’ve done things in different ways in different places without necessarily having any of the governance and architecture that, you know, companies were sort of born on in the IT world. And so you end up creating chaos quickly.

[00:02:38.450] – Andrew
And I think that’s sort of part and parcel of the issue: complexity.

[00:02:46.070] – Ned
Right. Just to boil that down a little bit, it reminds me of you have that shadow IT thing where our finance department went off and launched the SaaS application, and that was easy for them. But they didn’t think about the difficult bits of getting it integrated into the larger IT ecosphere. Is that kind of what you’re pointing at?

[00:03:03.710] – Andrew
Exactly. Or, you know, whatever, your e-commerce team went and built some new application on AWS without any thought about how that would get integrated as well. And, you know, it’s funny, a local bank here created this whole digital lab to try to do things differently. And the story, which they’ve talked about publicly (I won’t say the bank name in case I’m wrong, but I think I’m right), is that they went and built a new mortgage processing application that was completely focused, like, they’re doing this

[00:03:35.960] – Andrew
correct. They’re doing it with all these new agile processes. They’re going to go interact with customers nonstop. They’re going to figure out where all the problems are in the process, and they’re going to deliver the perfect application. And they did a nice job. And then they were asked, OK, let’s go live with this. And everybody was like, are you nuts? This isn’t going to scale. Like, we haven’t thought of this. We haven’t thought of this.

[00:03:55.550] – Andrew
We haven’t thought of this. We haven’t thought of this. What ends up happening is that stuff gets pushed to production rapidly. Not in this case, but stuff gets pushed to production rapidly. And now you’ve got all these builders, all these problems, all these missing components of the architecture, all of these problems in trying to scale it. And when you’re not actively planning but reacting, oftentimes you just make things worse, because now you’re just band-aiding stuff, you know, as opposed to actually having a methodology for how you’re going to build scalable things in the cloud.

[00:04:25.760] – Ned
Right, so you start with what is the ideal app for the end user, or the ideal application for whatever service, but you’re just trying to get the code right. You’re not necessarily concerned about all the surrounding things, like scaling it, like adding security, like integrating with the other components in your ecosystem. Right?

[00:04:44.210] – Andrew
Right. And you’re using newer technology, technology that there’s no institutional knowledge of or experience with. It hasn’t been used in anger. And so you don’t really understand how it scales or what the best practices are.

[00:04:58.100] – Ned
I like that. Hasn’t been used in anger. Can you expand on that a little bit? Because I’ve heard that phrase before, but I’m curious what context you’re using that in.

[00:05:06.830] – Andrew
You know, I don’t even know where I picked that up from. I’ve been saying it for a long time. I often also say that you don’t really understand something until you’ve failed multiple times, right? And that’s part of what I mean by used in anger. You know, this thing is not being used in the lab anymore. There’s going to be unpredictable usage patterns. There’s going to be people doing things that you weren’t expecting.

[00:05:31.460] – Andrew
You know, you’ve flipped the switch on, and there’s no way you could have predicted every possible fault mode. There’s just no way. And so that “used in anger” is really around that. Like, the floodgates are now open.

[00:05:49.160] – Ethan
You’re going to build a new web server, Andrew, and it’s going to have IPv6 on it. And you’re going to use Certbot and set up Let’s Encrypt, and you’re going to think it’s binding correctly. You’ve tested it. It seems fine. Only you don’t have v6 at your house. And lo and behold, someone else, a security professional, let’s say, testing the website says, hey, your TLS is broken. Things like this happen, and you just don’t know, you know?

[00:06:11.480] – Ned
Yeah, exactly.

[00:06:12.530] – Andrew
But it will never happen to you again.

[00:06:14.390] – Ethan
It will never happen to me again. For those of you listening, I had a bad experience. Anyway.

[00:06:22.340] – Andrew
But that’s part of it, right? You’ve gained wisdom because of a failure. You know, before that happened, you were like the thousands or tens of thousands of people who have used Certbot. It’s trivial, and it has an Nginx plugin. And, yeah, now I have a cert and it’s going to auto-renew. Fantastic. But now you know a little bit more. And now you know.

[00:06:44.990] – Ethan
Now I know it’s a thing, yeah, because I’m using it in anger.

[00:06:48.170] – Ned
I feel like any technical professional who’s been doing this for a while knows that if things go smoothly the first time, you missed something.

[00:06:55.200] – Ethan
Yeah, sadly true.

[00:06:57.980] – Ned
There is just no way you did it all right the first time. If I set up, like, a CI/CD pipeline and it’s all green the first time it runs, no, no, something is absolutely wrong there. And it always is. Once I start picking it apart, I’m like, oh, it just skipped six of the steps. So of course nothing errored out.

[00:07:13.760] – Andrew
Yeah, and I guess that’s what we’re getting to, though, right? Because that’s the experience gained from things failing and understanding how things behave and understanding why green doesn’t always mean good. A lot of this technology is new, so you don’t have the institutional knowledge, and without that experience people make naive errors. Now compound that to something as massive as a cloud migration. And on top of everything, what the cloud providers are offering is changing underneath you.
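
Editor’s sketch, not something shown in the episode: one way to act on “green doesn’t always mean good” is to treat a skipped or missing step as a failure when that step was required. The status strings, the `StepResult` type, and the `pipeline_verdict` function are hypothetical names chosen for illustration; real CI systems expose their own status vocabulary.

```python
from dataclasses import dataclass

# Hypothetical status values; a real CI system will use its own names.
PASSED, FAILED, SKIPPED = "passed", "failed", "skipped"


@dataclass(frozen=True)
class StepResult:
    name: str
    status: str


def pipeline_verdict(steps, required):
    """Return (ok, problems).

    A run counts as OK only if every required step actually ran and passed.
    A skipped or absent required step is reported as a problem, so a run
    that is "all green" because it did nothing does not pass.
    """
    by_name = {step.name: step.status for step in steps}
    problems = []
    for name in required:
        status = by_name.get(name)
        if status is None:
            problems.append(f"{name}: never ran")
        elif status != PASSED:
            problems.append(f"{name}: {status}")
    return (not problems, problems)


if __name__ == "__main__":
    run = [StepResult("build", PASSED), StepResult("test", SKIPPED)]
    ok, problems = pipeline_verdict(run, required=["build", "test", "deploy"])
    print(ok, problems)
```

The design choice here is simply that the list of required steps lives outside the pipeline run itself, so a run cannot pass by silently dropping work.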

[00:07:49.840] – Ethan
You know, is this a technology problem, Andrew, or is this a people problem?

[00:07:54.360] – Andrew
I think, like with everything, it’s a bit of both, right? And I think, look, on the people side of this, you know, there’s the lack of coordination and cooperation historically between cloud teams and traditional infrastructure teams. There’s a people side to it that has nothing to do with technology. In some cases it’s just, you know, this age-old “you guys are too slow, so we’re going to do it, and we’re going to do it better” type of world.

[00:08:20.890] – Andrew
So maybe it’s not all wound together, but usually it is. And from a technology standpoint, this stuff isn’t easy. It looks easy. If you do something trivial or small, you can launch a virtual machine in a couple of clicks, whatever. That seems easy. But taking an application, breaking it down, creating a cloud-native version of it, deploying it with all the different components you need from an application side, networking side, security side, and expecting it’s going to scale.

[00:08:54.220] – Andrew
That’s hard. And there’s a lot of companies out there. You know, it’s funny, like, the cloud was the best thing ever. And now, Ethan, you mentioned before (I think this was before we started recording) a specific article about the cost of cloud over time. And part of that is just, OK, but now I can’t move because I’m so wedded.

[00:09:16.030] – Andrew
I’ve built everything specifically for AWS or Azure or something like that. And then in comes all of this additional technology, like Kubernetes and things like that. And now, OK, so now I can really achieve my goal of loosely coupled systems that can run anywhere. But you’re just continuing to layer on more complexity to get to something that’s going to meet your requirements, and the amount of people and experience and technology you need keeps growing, even though there’s companies out there that will say, no problem, just, you know, magic, it’s going to work. Yeah, it’ll work, you know, the first time, maybe the second time.

[00:09:53.380] – Andrew
But you truly don’t understand it. And the technology is just getting more complex. And I’m not trying to say, like, the world is turning to chaos on the technology side. I just think that this assumption that cloud is easier than on premises is wrong. There is so much promise around cloud. I love the fact that, for any company, you know, it sort of drops the barrier to entry into markets. People can build technology rapidly. I can build brand new security architectures that are way more secure than anything I could have done on premises.

[00:10:27.160] – Andrew
And layering on all this other stuff, you know, there’s so much goodness and so much promise, but there’s just this naive view that these few guys over there can just go deploy stuff at scale.

[00:10:39.940] – Ned
I think there were, like, two big early fallacies of cloud adoption. The first one is that it was going to be cheaper. And the second one is that it was going to be simpler. And, right, neither of those has really borne out. And we’ve had a lot of people on the show giving their anecdotal experience around this, saying, you know, we tried to go through this and these are the roadblocks we hit. But I think what’s really interesting is your company actually commissioned a report around cloud adoption.

[00:11:06.370] – Ned
And this is not just anecdotal, it’s actual, you know, statistics and pretty pictures and all that kind of stuff. So can you tell us a little bit about how the report was composed and some of the key findings that are within that report?

[00:11:20.800] – Andrew
So we worked along with EMA, and really EMA drove all the research, speaking to many, many technology professionals out there about their companies’ cloud adoption. They spoke to people on both sides of it: those helping to drive the adoption, and those that are sort of part of the traditional team that might not have been directly involved in it. They talked about success rates and what problems they might be having, and the numbers were through the roof.

[00:11:49.960] – Andrew
I mean, you know, 72 percent of companies were not meeting their goals. And the reasons were fairly common. And by the way, not meeting their goals was either from cost, or they were having operational issues and outages, or there were security concerns. And in so many of the cases, a lot of that came down to a lack of collaboration between these two different deployment domains,

[00:12:19.300] – Andrew
what I’ll generally call on premises and in cloud. And those that reported some level of success also reported a higher level of collaboration, both organizationally and technically, across those two different domains. So what surprised me was just how many companies were reporting that they weren’t meeting their objectives, that these things were not going well. What didn’t surprise me, because it certainly was my sense and it’s why we commissioned the report, was our hypothesis that collaboration had a lot to do with the failures.

[00:12:52.390] – Andrew
And so we were very careful not to have one of these lead-the-witness research efforts to sort of prove the hypothesis. Instead, we asked a bunch of different questions to try to drive to the root cause. And that just kept coming up over and over and over again.

[00:13:08.540] – Ethan
When you say failed to meet their goals in the cloud, what does that mean? They couldn’t get as many workloads moved? They couldn’t meet their cost objectives? What do we mean, they failed?

[00:13:20.000] – Andrew
So it was either in time; in other words, we had a goal. You know, we had, and we still have, several customers that have, like, an IT goal of 80 percent moved to cloud by 2024, something like that. You know, we’re going to shut down these data centers by this date. So there’s the time dimension of failure. There’s the cost dimension, for sure. You know, our expectation was we would be able to do things cheaper.

[00:13:47.210] – Andrew
It’s more expensive, or it’s not cheaper. And then there’s also security and resilience, you know, reliability-of-the-technology failures as well. And when you look at which types there were, unless my math is wrong, many companies had multiple, because they were all over 50 percent.

[00:14:09.150] – Ned
Wow. We were talking about the technology before, but I think you cited the larger thing, which is, like, the technology is hard. I wouldn’t say the technology is easy, because some people will be like, “the tech is easy, people are difficult.” And, no, the tech is hard. It’s just that people are harder and people are more difficult. That seems to be one of the conclusions of the report: a lack of collaboration between the people in the organization.

[00:14:34.150] – Ned
Did it seem to matter the size of the organization or the structure of the organization?

[00:14:40.330] – Andrew
Size, maybe, maybe not, other than, you know, the complexity is usually higher in larger organizations. But from an organizational standpoint, there were things like, you know, single owners. Oftentimes you look at sort of the classic large IT organization, and the CIO works for the CFO because IT’s about cost and cost containment. And the cloud teams are either working for a business unit or they work for the CTO. And so now you have these two different executives in the company driving completely different agendas with different KPIs and metrics on their success.

[00:15:20.440] – Andrew
And you saw a lot of this between security and broader IT before, where those were completely separate organizations. They both had their, you know, architecture, engineering and operations teams. And you saw a lot of combination at the architecture and engineering side. Ops still might be specialized and separate, but now with sort of like NetOps 2.0 and that sort of stuff, maybe coming back together. But you saw these sort of duplicative organizations that were competing.

[00:15:48.260] – Andrew
And so that almost drives a separate type of approach, which, from a collaboration standpoint, means you’re just starting with the wrong foot forward. I’m not saying that all technology should be under one executive at all, because that doesn’t work. You know, the business units are turning into technology creators. They’re not just waiting around for IT to get stuff done. So I’m not saying that everything needs to come together. I am saying that at the architecture level, and in some cases at the engineering level, there has to be deep collaboration.

[00:16:25.730] – Andrew
And if that requires either virtual organizations or permanent organizations in different specific areas, well, that makes sense.

[00:16:34.370] – Ethan
At the risk of sounding trite, I’m going to ask this question: doesn’t DevOps solve that? And I know that’s trite because DevOps is a buzzword and so on, but what I really am getting at is, if you move to that model of delivering applications, it does change your processes internally. That forces some level of collaboration so that everybody is following this deployment model. Or did I just make that up? Andrew, is that actually a thing that maybe helps some organizations?

[00:17:03.680] – Andrew
I still think I believe this, which is, you know, DevOps exists because of a need for a Band-Aid between two different organizations. And ultimately, you know, those developing the applications should be developing them with CI/CD and everything. And, you know, the DevOps guys end up being dev guys or ops guys in some end state. And maybe I’m wrong there. And, you know, I haven’t discussed this for a while. But outside of that, Ethan, I think the issue, though, is when people think of DevOps, they think more about how are we going to deploy this thing, versus, OK, so we’re going to use 10 different regions of Azure globally.

[00:17:49.340] – Andrew
How are we going to route network traffic, you know, and what do we need to do in order to do that successfully and cheaply and reliably and, you know, have the best consequences? And so I think unless that work has been done ahead of time, those organizations are automating, or I guess optimizing, locally, and nobody’s thinking globally.

[00:18:13.400] – Ethan
Well, there’s two different things there, right? There’s the infrastructure and how everything is interconnected, the platform upon which you’re delivering applications in this complex environment. But then there’s also two different ways to look at DevOps. There’s DevOps as a separate group, this tertiary thing that’s supposedly a bridge between Dev and Ops. And we’ve talked to different people there, and that tends to not work so well. And then there’s DevOps as a mindset and a practice where we are changing how we deliver infrastructure and applications. That is the one that does seem to work but is harder to do.

[00:18:46.640] – Andrew
Correct. Yeah, and when I was talking about DevOps and its value as a Band-Aid, I was talking about the first part there, and I mean to offend zero people when I say DevOps is a Band-Aid. But DevOps as a practice, I’m 100 percent aligned with you, for sure. And yes, you know, if you start thinking that way across all of your infrastructure, anything that needs to change or be orchestrated, then to do it successfully you have to measure things differently.

[00:19:11.480] – Andrew
Right? Start measuring the successful push, you know, the successful changes. And what I mean by that is, you know, like we have customers that would measure, OK, there’s more API calls. Now, our goal is everything needs to be via API and now there’s way more API calls. And you’re like, okay, but are they successful? You know, you’ve now exposed APIs to nonexperts in that area and you’ve given them the keys to make changes.

[00:19:43.580] – Andrew
Are they the right changes, you know? And how much work are you doing embedding your knowledge into the APIs that you’re giving them? So let me give you a real example. Like, in our world, I’m going to deploy an application on a network, and therefore I need a bunch of IP addresses. If you give me an API that allows me to go reserve an IP address, any IP address I want, that’s not particularly helpful, because I don’t know what network I should go to.

[00:20:15.860] – Andrew
I don’t know. You know, I don’t want to have the last IP address in a /24. I might need 20 or 30 or 40. This stuff might be Internet-facing. This stuff might need to communicate with this other security zone over there. I need help. You know, normally I would just fill out some helpdesk ticket or whatever, or have a conversation. Now it’s up to me.

[00:20:37.560] – Andrew
I’m going to go get this IP address. I get the wrong one. That’s potentially problematic. So how do I expose a set of APIs that allow for successful automation? And that’s a lot of what DevOps and those other functions, those people, are building out there: higher-level APIs that allow for success because they embed the requirements that every single technologist on the team shouldn’t have to know, but that you want them all to comply with.

[00:21:13.910] – Ned
Right, you’re trying to bake that institutional knowledge into the APIs and automation that folks are interacting with so they don’t shoot themselves in the foot. There’s protections built in for them. And maybe there’s a way around some of those controls if they really know what they’re doing. But you want to give them sane defaults.
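
Editor’s sketch of the “higher-level API” idea discussed above, not BlueCat’s API or anything shown in the episode: instead of letting a caller reserve any single address, the caller states a security zone and how many hosts are needed, and the planner picks a non-overlapping subnet with growth headroom. The zone names, pool ranges, `AddressPlanner` class, and `growth_factor` default are all assumptions made up for illustration; the sketch is IPv4-only and keeps state in memory.

```python
import ipaddress

# Hypothetical policy table: which address pool serves which security zone.
ZONE_POOLS = {
    "internet-facing": ipaddress.ip_network("10.10.0.0/24"),
    "internal": ipaddress.ip_network("10.20.0.0/24"),
}


class AddressPlanner:
    """Hands out subnets by intent (zone and size) rather than by raw address."""

    def __init__(self, pools=None):
        self._pools = dict(ZONE_POOLS if pools is None else pools)
        self._allocated = {zone: [] for zone in self._pools}

    def request_block(self, zone, hosts_needed, growth_factor=2):
        """Reserve a subnet in `zone` sized for `hosts_needed` plus headroom."""
        if zone not in self._pools:
            raise ValueError(f"unknown security zone: {zone!r}")
        if hosts_needed < 1:
            raise ValueError("hosts_needed must be at least 1")

        # Room for growth, plus the network and broadcast addresses.
        addresses = hosts_needed * growth_factor + 2
        # Smallest power-of-two block that holds that many addresses.
        prefix = 32 - (addresses - 1).bit_length()

        pool = self._pools[zone]
        if prefix < pool.prefixlen:
            raise ValueError(f"request does not fit in the {zone!r} pool")

        for candidate in pool.subnets(new_prefix=prefix):
            if not any(candidate.overlaps(used) for used in self._allocated[zone]):
                self._allocated[zone].append(candidate)
                return candidate
        raise RuntimeError(f"{zone!r} pool is exhausted")


if __name__ == "__main__":
    planner = AddressPlanner()
    print(planner.request_block("internet-facing", hosts_needed=20))
    print(planner.request_block("internet-facing", hosts_needed=5))
```

The caller never chooses the network, so the “last address in a /24” situation is avoided by construction, and an unknown zone fails loudly instead of silently landing somewhere it should not.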

[00:21:32.060] – Andrew
Right, yeah, 100 percent, and not make them be experts in every different area, because that’s not going to happen. You know, it used to be, back in my enterprise software days, we would go try to sell our software. And the goal was get the business to buy in before IT finds out about this, because once IT finds out about this, this is going to slow down big time.

[00:22:01.700] – Andrew
But if the business says we desperately need this, make it happen, that’s the world you want. If IT got involved too early, you got the 83-page specification on network requirements. And now you are scrambling to figure out how you’re going to move, you know, hundreds of gigabytes of data across, like, some ISDN line they have between two points or something, versus the business telling them that’s got to go because we’ve got to move the data.

[00:22:30.590] – Andrew
And my point is, now, as network practitioners or practitioners in any infrastructure area, we have this amazing opportunity to build something that can meet the, you know, unpredictable needs of the business at some predictable cost. And that’s where cloud becomes an amazing opportunity, because you can’t do that if you are in this world of project-based IT where you need to know the end game before it starts. And so I guess my point with that is a lot of people also look at this as, OK, so cloud’s hard.

[00:23:08.330] – Andrew
We talked about, you know, there’s people issues, there’s technology issues. Part of those people issues are sort of the old versus new view of the world. From my perspective, it’s like a rallying call: this is exciting. You know, we are now part of the business strategy. The technology we’re building now has real relevance to our success as a business. What do we need to do differently? And that mindset, I think, is critical.

[00:23:35.720] – Ned
I think one of the things I’ve seen with DevOps is sort of an expansion of the term. There was like DevOps and then there was DevSecOps and there was DevNetSecOps.

[00:23:46.520] – Andrew
Right.

[00:23:47.270] – Ned
And I think what people were trying to get at was that need, like you’re saying, for collaboration between multiple different departments. And by just sticking them all into one big term, it got a little ridiculous, a little out of hand, but I see what they were going for. You need all these groups involved in the early portions of the architecture so that you get it right. And that’s what they were trying to get at, even though the term, I mean, it got a little out of hand.

[00:24:15.950] – Andrew
Yeah, no, 100 percent. And a lot of the solutions you see out there for these teams are necessary, but they’re sort of like, OK, it’s built. Now we’ve got this awesome platform for AIOps or something like that. So now we’re going to tell you if there’s a problem and we’ll potentially automatically fix it for you. But where are the tools for building it correctly in the first place? And, you know, obviously there’s tons of tools for building it.

[00:24:44.210] – Andrew
My point is those are for building one thing or another, you know, versus what my architecture should look like. I think there’s just not enough emphasis on that work up front. And I’m really sounding like a non-agile type person. I built a career driving agile software development. Like, I believe wholeheartedly in that. I just believe that a lot of those processes... because I see, you know, McKinsey, Accenture, those guys are out there selling agile now at the CEO level.

[00:25:16.280] – Andrew
And back when I could go visit customers, I’d see these big, broad signs everywhere about agile transformations. That used to be in the software teams. Now they’re in the business teams. We’re going to be an agile organization, and here’s what that’s going to mean. And I fundamentally believe in agile practices. I just feel like they focus on some part of the overall process. And where they don’t work well, in many cases, is in that upfront architecture.

[00:25:45.530] – Andrew
And that doesn’t mean you need to go spend a year building some architecture. But if you skip through that, and we’re just going to go agile, we’re going to start writing code and we’ll figure it out, because, you know what, we’ll just sort of amass this architecture over time? That only works in simple cases.

[00:26:06.710] – Ethan
Well, Andrew, I want to get into some specific stuff. We’ve really set up the problems here dealing with cloud, multi-cloud, and hybrid cloud, the people challenges, how groups are organized. We’ve talked through that. So let’s dive into some technical examples. You’re with BlueCat Networks. You folks are really good at several things, but DNS is one of them, which happens to be something near and dear to my heart, going back to when I was the DNS hostmaster at an ISP 20 years ago or so.

[00:26:34.370] – Ethan
So I’ve always been keeping up with DNS and its problems, because the meme “it’s always DNS” exists for a reason, right?

[00:26:44.020] – Andrew
It sure does.

[00:26:45.140] – Ethan
So we’ve had for a lot of years a goal of a unified DNS, is how I would think of it as a person primarily responsible for DNS in an organization. And DNS is popping up all over the place increasingly. Now you’ve got cloud with its own DNS, perhaps, and Kubernetes needing CoreDNS for service discovery, and your own internal domains.

[00:27:06.110] – Ethan
And then the marketing folks have bought 50 sound-alike domains for brand protection and so on. So how do you get this under one operational umbrella, one approach to deal with this? I think when we were prepping the show, Andrew, you described it as this refracturing of DNS. How do we get a handle on this?

[00:27:28.040] – Andrew
Yeah, that’s really what it was. I mean, you know, our goal and the goal of our customers were quite aligned in the past: there should be a single control plane for DNS inside my company, because this stuff is critical. And the more islands there are, the more hands in the pot, the more people trying to deploy this, the more likely it is there’s going to be outages. That’s what customers were trying to solve when they bought our technology.

[00:27:51.140] – Andrew
And now, yes, I’m deploying to Azure, AWS, Google, wherever I’m deploying. There’s different flavors of DNS for different tiers, so it’s going to fracture again. And you have people that can just go create zones, you know, and we see it fracture in horrible ways. You know, like, from the cloud I can’t reach that DNS server that’s on premises, so I’m basically going to create an /etc/hosts file. But I won’t do that.

[00:28:19.880] – Andrew
Instead, I’ll just recreate that same zone in Route 53 and hard-code the answers. Or, I know the IP address of a DNS server on premises. I’m going to use that, and nobody knows I’m using that. And that thing’s going out of service, or we’re taking that thing down for service, and I’m bound to just that one IP address, and nobody knows it’s going to affect my application. So it’s the same stuff.

[00:28:42.740] – Andrew
And you’ve got this group of people whose goal, you know, what they’re measured on when there’s a DNS issue, is successful DNS resolution. And so we changed our focus from the goal we shared with our customers, which was put it all in BlueCat, to a different goal, which is that every DNS query that’s supposed to resolve should resolve, quickly. And tools are needed.

[00:29:14.510] – Andrew
Visibility is needed. In order to ensure that, we need to know what’s going on. So how can we create resolvers that can do the magic for our customers and resolve those records wherever they might be, discover where new zones might have been created, and basically create a federated view of DNS and the tools to navigate that federated view? And we’ve been focusing on that very heavily.

[00:29:41.090] – Andrew
So assume the world will be fractured and solve the problem now. And the problem is the same problem that’s always been. I need to be able to resolve these queries. I’ve just made it harder. But I can’t use the old answer. I need new technology to approach that.

[00:30:00.690] – Ethan
Do we mean fractured that there are multiple copies of the same zone with different answers because people went off and did their thing? Do we mean fractured because of just different orgs and people with different layers of responsibility? Maybe it’s both of those things.

[00:30:16.180] – Andrew
Well, both of those things. And sometimes, by the way, the same zone with different answers managed by completely different people is purposeful. You know, for instance, I might be deploying an application in multiple virtual private clouds. That application uses the same names. It’s going to get different answers because I might be using different IP addresses, for instance. So, yeah, I’m going to have different copies of that same internal zone. And that’s fine.

[00:30:47.310] – Andrew
You know, it’s 100 percent fine. And then there’s bad cases of that, like the one I mentioned where I’m just going to recreate a zone because, you know, I convince myself I’m not hard coding it with an /etc/hosts file, but I’ve recreated the zone somewhere. That’s the bad side of that. But still, yeah, by the fracturing I mean there’s multiple people that are permitted to and are deploying authoritative DNS servers inside the organization or utilizing cloud providers’ authoritative services.
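The distinction Andrew draws here, between purposeful per-VPC copies of a zone and a quietly recreated copy with hard-coded answers, can be sketched as a small comparison over a zone inventory. This is an illustrative sketch only, not BlueCat's method: the authority names, the record data, and the idea of a "declared" set of sanctioned copies are all assumptions made up for the example.

```python
from collections import defaultdict

def divergent_names(zone_copies):
    """Return {name: {authority: answer}} for names answered differently across copies."""
    by_name = defaultdict(dict)
    for authority, records in zone_copies.items():
        for name, answer in records.items():
            by_name[name][authority] = answer
    return {
        name: answers
        for name, answers in by_name.items()
        if len(set(answers.values())) > 1
    }

def classify(zone_copies, declared):
    """Split divergent names into expected (all copies sanctioned) and suspicious."""
    expected, suspicious = {}, {}
    for name, answers in divergent_names(zone_copies).items():
        if set(answers) <= declared:
            expected[name] = answers
        else:
            suspicious[name] = answers
    return expected, suspicious

# Hypothetical inventory: authority -> {record name: answer}.
ZONE_COPIES = {
    "onprem-primary": {"db.corp.example": "10.0.0.5", "app.corp.example": "10.0.0.10"},
    "vpc-a": {"app.corp.example": "10.1.0.10"},
    "vpc-b": {"app.corp.example": "10.2.0.10"},
    "shadow-route53": {"db.corp.example": "10.0.0.4"},  # recreated zone, stale answer
}
DECLARED = {"onprem-primary", "vpc-a", "vpc-b"}  # copies someone signed off on

expected, suspicious = classify(ZONE_COPIES, DECLARED)
```

Here `app.corp.example` differs on purpose across sanctioned copies, while `db.corp.example` differs because of an undeclared copy. A copy that currently matches the source would not show up as divergent, so a real audit would probably also list every undeclared authority regardless of its answers.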

[00:31:20.100] – Ethan
And you’re not saying that this fracturing of DNS is necessarily a bad, horrible thing to be avoided. It is an operational reality to be managed.

[00:31:30.060] – Andrew
It should be avoided in the before state because it leads to problems. It always does because, you know, a new zone is launched somewhere and somebody forgets to add a forwarder to some Windows DNS server in Australia and problems ensue. So what we’re not saying is go crazy, because that just makes it harder to maintain. What we’re saying is there are real use cases where the cloud provider’s DNS should be used. I’m using some of their cloud native services that are doing, you know, health-based DNS answers for their services.

[00:32:14.010] – Andrew
So to go tell a team, no, you can’t use that, you know, that’s anti-DNS. I’d be a hypocrite, you know, like, cool, that’s what I want. That’s what I want as a cloud technologist. So I’m going to use those services. That’s the reality. And we think there should be some governance around how they’re used because, you know, again, you can establish some pretty poor practices right away, and so have some governance on how they’re used in each of the different cloud providers.

[00:32:46.640] – Andrew
You know, it all looks alike when you read their web pages, when you look at their APIs, but they have very different capabilities on the DNS side and they change over time, and so have some governance, have some understanding. And then what we’re doing is bringing order to that potential chaos. Even if it’s not chaos, even if it was structured appropriately, you still have these issues. You know, we have the other example, Ethan, of the same zone, which is what Microsoft uses for its private endpoints.

[00:33:21.650] – Andrew
Like you have to use their zones. You can create a CNAME to it, but you have to use, like whatever it is, private dot linked dot database dot Microsoft dot net or something like that. And then we’ll have customers who are trying to resolve those from the data center. And they run into a problem because there’s multiple subscriptions in Azure with different zones that have the same name and you’re like, oh my God, like, that’s not for DNS to solve.

[00:33:51.430] – Andrew
You know, like that’s a bad idea, what you’ve done. Those names should stay in those subscriptions and shouldn’t be used outside the subscriptions, but we’re already using them. So we’re focusing on allowing them, oddly enough, to use those sorts of things internally and using DNS to facilitate finding the right one. But again, to start with, that, as an example, is a very bad practice.

[00:34:14.560] – Ned
Right, so it splits into three things that jump out to me. There’s the initial architecture. So you’re trying to develop an initial architecture that takes into account the fact that you’re going to have this fractured DNS.

[00:34:27.370] – Andrew
Right.

[00:34:28.030] – Ned
Then there’s the day to day management and operations of it and then there’s effectively monitoring it. So you’re actually aware of all the different things that are going on. Maybe we can start with the architecture point and what should you be thinking about from an overall architecture standpoint if you have the chance to actually start designing some of this stuff out instead of just reacting to the environment you have?

[00:34:53.380] – Andrew
The things you want to make sure are part of the architecture: number one, a clear delineation between zones or DNS records that are private to a segmented network, for instance, you know, like this stuff is never going to bleed out, and anything that needs to be shared or bled out. If the line between those two things isn’t clear, if I start using things in one of these segmented networks that really should only be used within it, that’s when you get real operational issues. So at the very highest, at the broadest level, this is all private DNS.

[00:35:34.160] – Andrew
Right? We’re not talking about public DNS on the Internet, but there’s like private private DNS, you know, private to my tenant and then external to my tenant. And I think much like we used to think about firewalls and networking like this, I’m not saying we need to say this IP address is allowed to get to these three DNS records, but there should be some declarative understanding of the DNS dependencies between these different areas so that there’s, you know, an understanding of the broader resolution paths in the organization.

[00:36:06.860] – Andrew
Think about a resolution path: if this cloud tenant needs to resolve records in these zones, how do I make sure that happens efficiently, rapidly and always, and that they’re getting the right answer? And so we should be looking at the DNS dependencies between these different deployment domains as well. Those are the core parts of the architecture.
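One way to picture the "declarative understanding of the DNS dependencies" Andrew describes is a small map of which deployment domain needs which zones, checked against the resolution links that actually exist. The sketch below is an assumption-laden illustration, not a description of any product: the domain names, zone names, and the three dictionaries are invented for the example.

```python
from collections import deque

# Hypothetical declaration: zones each deployment domain must be able to resolve.
DEPENDENCIES = {
    "aws-tenant-a": ["corp.example", "payments.internal.example"],
    "datacenter": ["corp.example", "tenant-a.cloud.example"],
}
# Where each zone is authoritative.
AUTHORITY = {
    "corp.example": "datacenter",
    "payments.internal.example": "azure-sub-1",
    "tenant-a.cloud.example": "aws-tenant-a",
}
# Directed resolution links (delegation, forwarding, peering) between domains.
LINKS = {
    "aws-tenant-a": ["datacenter"],
    "datacenter": ["aws-tenant-a"],
    "azure-sub-1": [],
}

def reachable(src, dst, links):
    """Breadth-first search: can queries from src reach the domain dst?"""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in links.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def unresolved(dependencies, authority, links):
    """List (domain, zone) pairs with no resolution path to the zone's authority."""
    gaps = []
    for domain, zones in dependencies.items():
        for zone in zones:
            home = authority.get(zone)
            if home is None or not reachable(domain, home, links):
                gaps.append((domain, zone))
    return gaps

gaps = unresolved(DEPENDENCIES, AUTHORITY, LINKS)
```

With this data the only gap is the AWS tenant's dependency on a zone that lives in an Azure subscription nobody has linked to. Keeping the declaration separate from the links is the design point: the dependency list says what must work, and the check says whether the current resolution paths deliver it.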

[00:36:29.250] – Ethan
So architecturally with DNS, the way you traditionally would solve this, you’re going to have an authoritative top of the domain hierarchy. You can have some authoritative server that then delegates with NS records off to subdomains. Is that what we’re talking about? And I’m asking it contextually here, Andrew. We said earlier you can have the same domain that depending on where it lives, could be serving up two different answers. And you don’t solve that problem with NS records, you would have to solve that problem with DHCP serving up different name servers for different hosts, depending on where they’re coming from, something like that.

[00:37:03.290] – Ethan
So I’m thinking through this going, OK, how do you actually have a single source of DNS truth?

[00:37:09.530] – Andrew
Right. There are some zones that might live in the cloud that you’re going to delegate off with your NS records. And this is just, you know, cloud company dot com. Right. And cloud company dot com is going to be in that authority over there, DNS was made for that. That’s how it all works. Fantastic. You certainly can’t do that in the case that you mentioned, but also what you end up having in the cloud domains is people just create zones that have no delegation point.

[00:37:37.640] – Andrew
And you can’t in many of the clouds delegate from those somewhere else. So they just simply create the zone. You know, they go in and create cloud company dot com with no delegation at all. And that starts with the band-aids because now people on premises need forwarders and forwarders break. Like, you know, it’s a rule that’s made to fail at some point because something changes and somebody forgets that there’s some forwarding rule somewhere. You’re not using DNS’ delegation anymore.
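The failure mode described here, a forwarding rule that someone forgot to add or that points at something no longer in service, lends itself to a simple audit. The following is a minimal sketch under stated assumptions: the `ForwardRule` record, the server names, and the inventory of live authorities are hypothetical, and a real environment would have to export this data from each DNS server.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ForwardRule:
    server: str     # DNS server that holds the rule
    zone: str       # zone whose queries are forwarded
    target_ip: str  # where those queries are sent

def stale_rules(rules, live_authorities):
    """Rules whose target is not a known live authority for the zone."""
    return [r for r in rules if r.target_ip not in live_authorities.get(r.zone, set())]

def servers_missing_zone(servers, rules, zone):
    """Servers that have no forwarding rule at all for the given zone."""
    covered = {r.server for r in rules if r.zone == zone}
    return sorted(set(servers) - covered)

# Hypothetical exported configuration.
RULES = [
    ForwardRule("dc-nyc-1", "cloud.corp.example", "10.50.0.2"),
    ForwardRule("dc-lon-1", "cloud.corp.example", "10.50.0.2"),
    ForwardRule("dc-nyc-1", "legacy.corp.example", "10.9.9.9"),
]
LIVE = {"cloud.corp.example": {"10.50.0.2"}, "legacy.corp.example": set()}
SERVERS = ["dc-nyc-1", "dc-lon-1", "dc-syd-1"]

stale = stale_rules(RULES, LIVE)
missing = servers_missing_zone(SERVERS, RULES, "cloud.corp.example")
```

In this made-up data the Sydney server never received the forwarder for the cloud zone, and one rule still targets an address that no longer serves its zone. The audit only reports; deciding whether a rule is still important remains a human call.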

[00:38:05.650] – Andrew
But again, the reality is that’s going to happen, so that’s part of the magic, and I don’t want to make this a BlueCat commercial, but that’s the pain, Ethan. You nailed it. And that’s part of what we’re doing: we’ve got this idea of namespaces where we can actually go navigate different authorities that might have the same zone without delegation and without this sort of complicated world of forwarding. And by the way, forwarding on its own isn’t complicated.

[00:38:34.060] – Andrew
What’s complicated is five thousand three hundred twenty nine different forwarding rules across different DNS servers that nobody knows how they got there or if they’re important anymore. That’s fragility. So, yes, we’re doing that and we’re trying to do that from a, you know, I’ve got this big focus here and I think as everybody should, you know, DHCP and DNS are core to service discovery. DHCP is like one of the first service discovery protocols out there, you know.

[00:39:00.220] – Andrew
Hello, do you have some configuration information? This is how every, you know, DHCP connected host starts the day, right? So in the world of, like, managing this stuff, though, it’s never been about service discovery, it’s been about up front configuration. Assume the end state you want, configure for that. And so what we’re also investing in is, OK, so we can learn a lot from these clouds. So why should I have to configure my system to speak to something that’s already configured?

[00:39:32.530] – Andrew
And so a lot of what we’re starting to do is just turn that around, saying, let’s go discover where the stuff is. Let’s go discover the best way to get to it. Let’s find the blind spots. And if somebody needs to resolve in a blind spot, let’s go put something there so we can resolve there and sort of changing, like trying to tackle this like you would building cloud technology, as opposed to trying to wedge in the old way of doing things into the newer domain.

[00:39:59.990] – Andrew
That was sort of abstract, but but the point is, since the stuff changes rapidly, you can’t expect somebody to keep up with configuration.

[00:40:08.870] – Ethan
Let’s pull this together with a troubleshooting example. I know you said you didn’t want this to be a BlueCat commercial. But, you know, Andrew, it’s OK to talk about BlueCat, you guys sponsored the show. And we want to know this. And so a piece of the puzzle here to help bring this together would be the recursion path you mentioned. When you begin to stray off of what is built into DNS and you have to do, I’ll just say, magic to make DNS work the way you want, then you need to be able to troubleshoot that when it doesn’t work the way you expect.

[00:40:38.390] – Ethan
How do I deal with troubleshooting that recursion path since it can be complex?

[00:40:43.820] – Andrew
You know, it starts with some level of visibility, which most companies don’t have today. So you might be having the problem, but Ned’s not having the problem. Works perfectly for Ned, you know, and so me as IT guy can’t go check from my desk. Doesn’t matter. You know, this is a regional issue, this is some part of the DNS arch. So what was different. How do I compare what was different. I need the data and the data is a huge part of this.

[00:41:11.060] – Andrew
I need to see what DNS queries came off of your machine versus Ned’s, and the way people tend to tap this data and get it today, it’s too far down the stream to have any attribution to the end user unless you’re going across multiple nodes. That’s what happens with DNS and each node jams in a unique, you know, query ID or message ID and that’s forwarded to the next node. And so I can try to collect all the stuff from all of my servers and then use advanced techniques to correlate it and say, oh, that must be your query.

[00:41:40.700] – Andrew
But if you two query at the same time, I don’t know whose query was on the next box. You know, yours was served from cache, Ned’s actually went through, hard to tell. And so a big part of what we do is just make sure that’s visible so I can see exactly what happened with each query. We actually jam stuff into the query. I mean, this is the private domain. None of the stuff leaks to the public. So we jam stuff into the query so that we can trace and understand that resolution path as well.
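The correlation problem Andrew describes can be shown with a toy simulation: every hop re-issues the query with a fresh message ID and its own source address, so two clients asking for the same name at the same moment look identical one hop downstream unless something in the query carries their identity along. This sketch does not reflect how BlueCat encodes its metadata, which the conversation does not specify; the `trace` field and the host names are hypothetical, and in a real deployment such a tag might plausibly travel in an EDNS(0) option.

```python
import itertools
import uuid

_next_id = itertools.count(1)

def client_query(client, qname, tagged):
    """A stub client query; `trace` is a hypothetical metadata field."""
    query = {"src": client, "qname": qname, "msg_id": next(_next_id)}
    if tagged:
        query["trace"] = uuid.uuid4().hex
    return query

def forward(query, hop, log):
    """One resolver hop: log what arrived, then re-issue with a new ID and source."""
    log.append({"hop": hop, **query})
    return {**query, "src": hop, "msg_id": next(_next_id)}

def candidates(log, hop, edge_entry):
    """Entries at `hop` that could belong to the query logged at the edge."""
    key = "trace" if "trace" in edge_entry else "qname"
    return [e for e in log if e["hop"] == hop and e.get(key) == edge_entry[key]]

def ambiguity(tagged):
    """Two clients ask for the same name via edge -> core; count core matches for one of them."""
    log = []
    for client in ("ethan-laptop", "ned-laptop"):
        query = client_query(client, "app.corp.example", tagged)
        forward(forward(query, "edge", log), "core", log)
    edge_entry = next(
        e for e in log if e["hop"] == "edge" and e["src"] == "ethan-laptop"
    )
    return len(candidates(log, "core", edge_entry))
```

Calling `ambiguity(False)` finds two equally plausible matches at the second hop, while `ambiguity(True)` narrows it to one, which is the whole argument for carrying a tag through the resolution path rather than reconstructing it from logs afterwards.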

[00:42:08.540] – Ethan
Jam stuff into the query as in you take a DNS query that is coming through, add some metadata to make it easier to do tracing.

[00:42:15.560] – Andrew
Yeah. And we add metadata on the way out as well. So as an administrator, I don’t necessarily need to wait for this stuff to be logged somewhere. I can actually just do a dig and inspect the query and see something, you know, understand this was resolved the wrong way. And now, like, there’s the thread. I see the difference, you know, and that’s a critical part of the process, is just visibility, you know, and we talked about one of the problems with visibility is just that, you know, these queries go through a resolution path and they might hop across two or three different servers and they might be cached here, there or somewhere else.

[00:42:54.260] – Andrew
You might have a NAT gateway from Cisco or Juniper, somebody, that’s actually changing the DNS queries, you know, and that’s the thing that’s actually causing the problems. Like, it’s so difficult to tell. And so the other side of visibility that’s hard is there’s just so much noise, just a ridiculous amount of noise. You know, the number of queries, like we sort of track it, on a user driven machine goes up over time.

[00:43:25.010] – Andrew
And I don’t have exact percentages. But we’re doing the analysis now, and it goes up over time, not because the user is doing more stuff, but because, you know, Google is changing its lookahead algorithms, like when you start typing stuff, you know, and so Ned did not query the word baby, followed by 83 different other words.

[00:43:45.590] – Ethan
Right.

[00:43:46.130] – Andrew
But that streamed off of his machine because Ned typed baby, for whatever reason, into Google, you know.

[00:43:54.110] – Andrew
And so you see this. You can’t just, you know, inspect the stuff and assume, you know, again, sorry, there’s just so much noise on the line that looking for anything interesting and meaningful becomes harder if I just see this firehose of data. And so we also try to figure out how to make the data relevant to the use case.

[00:44:15.680] – Ned
It’s my own fault for being such a big Justin Bieber fan. I think that’s on me. Yeah, we can we can pin that firmly on me. I think it’s what you said is really important because we started with the architecture, which makes sense. And then to call back to something we talked about earlier, using it in anger, that’s when you get to the operational reality and you need that visibility and monitoring to see what happens when you use it in anger and then resolve the issues you didn’t think would crop up when you developed the architecture.

[00:44:47.030] – Andrew
Yep. You know, one hundred percent, you know, and also there’s different DNS clients in the piece that we completely don’t control. Like, you know, what’s the DNS client that sort of comes with Windows? Or iOS or macOS, or CoreDNS or whatever the case, they all have their own esoteric, wonky behavior on top of everything. And so, you know, like we have customers whose queries are working from their Kubernetes domain if they’re UDP, but not TCP.

[00:45:16.790] – Andrew
And you’re just like, wait, why? Because the server is accepting UDP and TCP. So what’s different? With CoreDNS, when it’s DNS, you know, UDP or TCP, why would this fail, you know, and so sometimes they’re really esoteric because, you know, these clients that we’re sort of used to, the Windows client, the Mac client, you know, sort of comes back, or you see like your Linux boxes are doing 10 to 100 times the number of queries than your user driven devices and you’re like, what’s going on with that?

[00:45:44.800] – Andrew
And, you know, most Linux distributions don’t ship with any sort of DNS caching at all, that’s sort of changing with this systemd-resolved thing. And that’s a whole other discussion of complexity. But regardless. So is that a good thing? You know, and regardless, yeah, you have to start with seeing the data, but seeing the data in a way that makes sense. Like I can’t just jump into a haystack trying to figure this stuff out.
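For the UDP-versus-TCP puzzle Andrew mentions, it can help to send the same question over both transports and compare what comes back. The sketch below builds a minimal DNS query by hand using only the standard library; the one wire-level difference it shows is that DNS over TCP prefixes each message with a two-byte length. The helper names and the fixed message ID are assumptions for illustration, and any server address passed in is a placeholder to be replaced with a resolver you are allowed to query.

```python
import socket
import struct

def build_query(qname, qtype=1, msg_id=0x1234):
    """Encode a minimal DNS query: 12-byte header plus one question, recursion desired."""
    header = struct.pack("!HHHHHH", msg_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in qname.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", qtype, 1)  # class 1 = IN
    return header + question

def frame_for_tcp(message):
    """DNS over TCP prefixes each message with a two-byte length field."""
    return struct.pack("!H", len(message)) + message

def _recv_exact(sock, n):
    """Read exactly n bytes from a stream socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("connection closed mid-message")
        buf += chunk
    return buf

def query_udp(server, qname, timeout=2.0):
    """Send the query as a single datagram and return the raw response."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(build_query(qname), (server, 53))
        data, _ = s.recvfrom(4096)
        return data

def query_tcp(server, qname, timeout=2.0):
    """Send the length-prefixed query over TCP and return the raw response."""
    with socket.create_connection((server, 53), timeout=timeout) as s:
        s.sendall(frame_for_tcp(build_query(qname)))
        (length,) = struct.unpack("!H", _recv_exact(s, 2))
        return _recv_exact(s, length)

def rcode(response):
    """Low four bits of the flags word carry the response code (0 = NOERROR)."""
    return struct.unpack("!H", response[2:4])[0] & 0x000F
```

Running `rcode(query_udp(server, name))` and `rcode(query_tcp(server, name))` against the same resolver, from inside the environment that is failing, shows whether one transport times out, is refused, or returns a different response code. That narrows the search to the path (firewall, NAT, proxy) or the client before anyone starts changing server configuration.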

[00:46:11.390] – Ethan
We haven’t even talked about clients with a Web browser doing DoH built in skipping the OS layer completely.

[00:46:18.440] – Andrew
Yeah, 100 percent, which is another major issue inside and outside of the enterprise, which is this, you know, let’s get rid of one of the control planes anyway so that, again, that’s a whole nother discussion. But but this I don’t want to.

[00:46:33.860] – Ethan
It was unfair for me to even bring it up, but I just couldn’t help it. It seemed like we needed to at least make the point for this to be a complete description of the problem that we’re facing with DNS these days.

[00:46:43.100] – Andrew
One hundred percent. And I think there’s an under appreciation of the effort that goes into making sure this protocol, that when it’s working is dial tone. Historically, you know, like nobody knows. I meet all of these people that happen to work for one of my customers. And I’m like, you use me every day. They’re like, really? You know, how would they know? You know? But if we’re having lots of problems there, they might be like, oh, you’re BlueCat.

[00:47:09.980] – Andrew
You know, I heard somebody screaming, but, you know, so that’s part of the cool stuff about cloud, by the way, because all of a sudden, DNS is becoming way more than dial tone to more people than just, you know, a small group of dedicated professionals. There’s this article by some dude from Spotify, this blog years ago called In Defense of Boring Technology. And it went on around how Spotify was using DNS for service discovery versus ZooKeeper or something and the guy ended by saying, by the way, right now, this meets our requirements. But there’s things we need to do that we can’t do in DNS. So at some point this might not make sense anymore. But it’s part of this general under appreciation of technology protocols, services that have always worked fine and can do way more. But technologists tend to jump to the new thing and just sort of skip over the boring technology. And we obviously don’t think DNS is boring, but it’s nice to see it sort of front and center again.

[00:48:09.680] – Ethan
Andrew, this has been a fantastic discussion. We started out near the top of the show expressing the challenges with different IT groups as they move into cloud and different styles of cloud environments. We talked a bit about your report and then got into the whole specific examples of just DNS, something we take for granted that is so key, and then how that’s become fractured, complex, difficult to troubleshoot and fraught with challenges specifically. So Andrew if people want to find more information out about maybe that report, is that publicly available?

[00:48:41.120] – Ethan
Can they dig into that?

[00:48:42.500] – Andrew
Absolutely, we made it available at BlueCat networks dot com slash D 2 C.

[00:48:48.880] – Ethan
D2C like Day Two Cloud, I see what you did there, so fancy. OK, so you can get that report, dig into their research. And again, if you missed the research on the report, it wasn’t like this is a report that we commissioned so that, you know, you buy BlueCat, that wasn’t exactly what was in the report. This is something with a lot of statistics that’s worth your time about cloud adoption, what different folks are running into. I think a lot of you that are out there listening, you’re going to see yourself in this report a bit and maybe understand better how to change your approach or what some of your different options are. Andrew, are you public on the Internet? Could someone reach out on, I don’t know, Twitter or something like that and ask you questions?

[00:49:29.500] – Andrew
I’m at awertkin.

[00:49:30.730] – Ethan
Awertkin, very good. Straightforward enough. Well, thanks to BlueCat for appearing on Day Two Cloud and being a sponsor today, virtual high fives to you out there for tuning in. If you have suggestions for future shows, maybe you want us to dig into DNS a little bit more. I don’t know. Maybe you do, though. I don’t know. We’d love to hear whatever your suggestions are, hit Ned or I up on Twitter.

[00:49:50.890] – Ethan
We are both monitoring the at Day Two Cloud Show Twitter account or you can fill out the form on Ned’s fancy website, Ned in the cloud dot com. And if you like engineering oriented shows like this one, because I know you do, visit packetpushers dot net slash subscribe or just search for Packet Pushers in your podcatcher, you’ll find our entire lineup of shows. All of our podcasts, newsletters and websites are on the subscribe page. If you want to dig in even beyond the podcast, it’s all nerdy content designed for your professional career development.

[00:50:18.880] – Ethan
And until then, just remember, cloud is what happens while IT is making other plans.

Episode 101