Day Two Cloud 135: Infrastructure As Code Should Foster Infrastructure As Collaboration

Episode 135

On today’s Day Two Cloud we examine why Infrastructure as Code (IaC) is about more than just the tools that enable it. Guest Rob Hirschfeld puts forth the notion that while the tools are there for a reason, those tools have to serve a purpose.

He argues that IaC is about trying to build automation that’s code-like, but it’s too easy for individuals to focus on their specific coding language or domain and forget that IaC should support collaboration, re-use, and efficient operations. You have to look at how the tools foster efficiency and support teams, or you miss the whole point.

There’s also a bit of conversation here about culture: that is, how technology and infrastructure will affect culture and vice-versa. Rob reached out to us after hearing episode 127 on avoiding IaC pitfalls. Rob is CEO of RackN.

We discuss:

  • Ensuring the tools you use or build enforce behaviors you want to see in operations
  • Eliminating one-off solutions to problems
  • How IaC is about collaboration
  • IaC, CI/CD, and pipelines
  • Day two challenges of IaC
  • More

Takeaways:

  1. Embrace the heterogeneity
  2. Consider the IaC pipeline
  3. Collaboration is key

Sponsor: StrongDM

StrongDM is secure infrastructure access for the modern stack. StrongDM proxies connections between your infrastructure and Sysadmins, giving your IT team auditable, policy-driven, IaC-configurable access to whatever they need, wherever they are. Find out more at StrongDM.com/packetpushers.

Tech Bytes: Singtel

Stay tuned for a sponsored Tech Bytes podcast with Singtel about common misconceptions customers have about connecting their private networks to the public cloud. For instance, SD-WAN might seem like a simple option: just spin up an SD-WAN end point in your VPC and off you go. That’s fine for a single region in a single country, but things get complicated fast when you’re talking about hundreds of sites across different countries.

Show Links:

Day Two Cloud 127: Avoiding Infrastructure As Code (IaC) Pitfalls – Packet Pushers

@zehicle – Rob Hirschfeld on Twitter

RackN.com

Cloud2030 Podcast

Transcript:

 

[00:00:01.150] – Ethan
Sponsor StrongDM is secure infrastructure access for the modern stack. StrongDM proxies connections between your infrastructure and Sysadmins, giving your IT team auditable, policy-driven, IaC-configurable access to whatever they need, wherever they are. Find out more at strongdm.com/packetpushers.

[00:00:26.130] – Ned
Welcome to Day Two Cloud. Today we are expanding on a concept that we kind of dug into a few episodes ago with Tim Davis: infrastructure as code pitfalls and trends. But this time we got a different perspective from Rob Hirschfeld. He’s the CEO of RackN, and he reached out to us and said, hey, you didn’t talk about teams and collaboration. I feel like that’s a key component of what goes into a successful infrastructure as code practice. What jumped out to you in the conversation, Ethan?

[00:00:56.550] – Ethan
Well, it’s not just another culture conversation. I mean, it is. But we get more into how you structure infrastructure delivery around the idea of pipelines, and we build on the concept of DRY (don’t repeat yourself) that we talked about with Tim and take it a step further. So, sure, it’s a good principle. The DRY idea is a big deal when you’re writing code, but it can be a big deal for how you deliver infrastructure as code more broadly. And we talk about that, and that really captured my imagination.

[00:01:25.110] – Ned
Yeah. We really focus on the fact that it is code that we’re trying to approach here and using some software development techniques, but you can’t lose sight of the infrastructure as well. So enjoy this conversation with Rob Hirschfeld, CEO of RackN. Well, Rob Hirschfeld, welcome to Day Two Cloud. You are the CEO of RackN, and you are deeply invested in the world of infrastructure and infrastructure as code. And you reached out to us, man, you said, hey, I loved Episode 127 with Tim Davis, but I feel like you’re a little too focused on the tools. Right.

[00:01:55.660] – Ned
And the tools aren’t everything. It’s about more. There’s a bigger holistic picture. So tell me where I’m wrong. What did we miss?

[00:02:05.370] – Rob
The tools are there for a reason, and the tools are beautiful and they’re shiny. And we actually have really good tools, but they have to be serving a purpose. Right? Especially for infrastructure as code tools. They’re trying to build automation that’s code-like. Right? That’s the whole purpose of infrastructure as code. And it’s easy to get like, oh, I’m a Go programmer or Python, and I’m going to plant my flag on the moon for that language, and forget that it’s not about the language. It’s actually about how it fosters collaboration and reuse and modules and how it fits in your environment and how teams work together. Right? People aren’t coding today as sole practitioners. They’re coding as a team activity. And so you have to look at how the tools that we’re using foster and support the teams, or you’ve missed the whole dimension. You’ve almost looked at it backwards.

[00:03:00.370] – Ned
Right. Because it’s not just tools. We’re talking about DevOps. We’re talking about Infrastructure as code. We’re talking about people and process and like culture. In terms of culture, if we’re looking at an organization, do the tools inform the culture? Does the culture inform the tools, or is there like a Conway’s Law sort of thing going on here?

[00:03:24.570] – Rob
I love that question. I would actually add a third piece because I think the infrastructure that you’re working with, it’s almost a three legged stool from that perspective. And so I think that the tools definitely have an impact with that. And then so does the way the team is structured and how you organize the team and then how you interact with the infrastructure is all an important component for that. And so what you need to think through here is actually that they influence each other and how they work. And one of the things that really stood out in my mind listening to that, the Tim Davis podcast, which I thought was excellent, was that we weren’t talking about ways in which the tools reinforced team behavior. And that, to me, becomes part of the thing. We think about this a lot as we’re building tools with RackN is that are the tools that we’re using and the platforms we’re building and using as we build those tools and platforms reinforce what we want to see people doing in operations. And I’ll give you a really simple example. A lot of times we’ve talked about this where people go in and modify a server.

[00:04:39.210] – Rob
Right. It’s very bad for somebody to log in and fix a server by hand, and that’s great. And I think we’ve gotten people mostly out of doing that. But the same would be true if you modify a script to do a one off operation or one off action, you basically also modified something by hand. It’s not repeatable and reusable and then run it. What you really want to be doing is using tools that make it easy and reinforce the idea of I’m going to fix things in ways that create long term durable patterns, not one off solutions. And that feedback loop. What I like to think of as a success cycle, like if you do work like that, then other people can pick it up, other people can add to it, and then they can improve that, and then you get this. It’s a very code developer like cycle where you’re constantly improving the libraries and the components. And so we have to look at how we’re interacting with our infrastructure, code, tooling and platforms to create that virtuous cycle.

[00:05:39.610] – Ethan
Well, wait a minute. Are you talking about just workflows like a workflow and a nice UI that I interact with that other people can too. And because we all see it and we’re thinking about things the same way we all get along and build on each other, that sort of a thing or something else.

[00:05:57.450] – Rob
So the way I started thinking about it is very much like a CI/CD pipeline, but for infrastructure: what we’re literally calling an infrastructure pipeline, where the work that you’re doing can be connected to the next team’s work and the next team’s work and the next team’s work. So CI/CD pipelines start off with like one or two teams working together, like I do a build and I package it. And hey, I’ve made progress. But the CI/CD pipelines that we’re seeing evolve actually expand, right, all the way into production and work with the Ops teams. They add security teams. They might actually have observability components baked into them. So as those processes get more people involved, which we want, then that is actually part of having these systems connect together.

[00:06:46.830] – Ethan
So it’s the interaction of various disciplines within the IT groups that support an application delivery stack and making it easier for those different groups to interact with one another in this concept of infrastructure pipeline, because it’s not just if you’re a server oriented person standing up a server, it’s also all the things that go along with that, the networking and the storage and the security and then developers. I’m assuming we’re not talking exclusive of developers. Rob.

[00:07:18.190] – Rob
No, this is the fun thing about building pipelines. It’s always been the vision of DevOps is to be across organizations, why people pound on the table. When you say DevOps engineer, the purpose of DevOps is to connect all these groups together. It’s not to delegate the role. So, yeah, it’s exactly what we’re talking about. There’s another dimension with this, too, though. Even the tools need to play along better together than they do.

[00:07:45.170] – Ned
Do you have an example of that where two tools are not working well together, or you don’t have to call anybody out if you don’t want to?

[00:07:53.950] – Rob
Well, I will in that there’s this well known pattern that people have dubbed Terrible, which is Terraform and Ansible.

[00:08:03.190] – Ned
I haven’t heard that!

[00:08:04.300] – Rob
Together as a script.

[00:08:07.750] – Ethan
I had not heard that.

[00:08:09.610] – Rob
You hadn’t? Oh, goodness. And the reason why we will talk about it as Terrible is everybody’s doing it differently. It’s a snowflake. And Terraform is designed to have its own state file and its own data set, and Ansible is designed to have its own state file and its own data set. And this is where collaboration is so important in these conversations, because it’s not just can my people get along and collaborate, it’s can my tools get along and collaborate? And what we’ve been doing, because we’re so used to silo thinking, is we haven’t said, how do I take a tool and make sure that it can pull data from upstream in its use case and send data downstream in its use case? If my tools did that well, then it would also make them easier to connect together, and I wouldn’t be stretching a tool into a use case it wasn’t designed to do.
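Editor's illustration: the upstream/downstream handoff Rob describes can be sketched as a small adapter between the two tools. This is a minimal sketch, not something from the episode. It assumes the upstream data has the shape produced by `terraform output -json` (each output is an object with a `value` key) and that the downstream consumer wants an Ansible-style dynamic inventory dict. The output name `web_ips`, the group name `web`, and the function name are made up for the example.

```python
import json


def terraform_outputs_to_inventory(tf_output_json: str, output_name: str, group: str) -> dict:
    """Turn Terraform output JSON into an Ansible-style inventory dict."""
    outputs = json.loads(tf_output_json)
    # Each Terraform output is an object; the data itself lives under "value".
    hosts = outputs[output_name]["value"]
    if not isinstance(hosts, list):
        raise TypeError(f"output {output_name!r} must be a list of hosts")
    return {
        group: {"hosts": list(hosts)},
        # "_meta" carries per-host variables; empty here.
        "_meta": {"hostvars": {host: {} for host in hosts}},
    }


# Sample upstream data; "web_ips" is a hypothetical output name.
SAMPLE = json.dumps(
    {
        "web_ips": {
            "sensitive": False,
            "type": ["list", "string"],
            "value": ["10.0.0.5", "10.0.0.6"],
        }
    }
)

inventory = terraform_outputs_to_inventory(SAMPLE, "web_ips", "web")
print(json.dumps(inventory, indent=2))
```

The point of the sketch is the explicit contract: one tool publishes structured data and the next consumes it, instead of each team hand-wiring the two together differently.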

[00:09:06.580] – Ned
And I don’t want to make this a tool show because that was like, the whole point is not to make this show about tools. So maybe I think I want to go a different direction with this. When you’re thinking about an ideal setup for a team that’s practicing good infrastructure as code or even an organization. Who’s on an individual team, like, what sort of contributors do you expect to be on a team? And then do you have separate teams that just have an expertise they loan to you? What do you think of as your ideal structure, or is there no ideal structure?

[00:09:38.650] – Rob
Some of that is cultural fit, and this is one of those ones I bounce on in my career. Right. I do typically think that the idea of a full stack engineer is toxic in our industry.

[00:09:52.150] – Ned
Expand on that because a lot of people want to be a full stack engineer. They want to hire one. So why do you say that’s toxic?

[00:09:59.350] – Rob
It’s toxic because it’s a focus on individuals, not on teams. And it’s a focus on an individual who has enough skill to walk something through every component of what the development process needs to be. Now, I totally think it’s important for every technical person to understand other parts of the stack. But I’m imagining this huge backpack, right? You’re going to turn that person into the Sherpa for the environment and have them be responsible for every step of this process and then feel like they can’t ask for help, or feel like it’s all on them and they have to understand each piece. It’s taking something that’s super complex and burdening somebody with trying to understand it. Right? Now, it would be fine if we said, hey, Valve did a really nice job with this on the T-shaped engineers, where they’re like, look, we want people with broad knowledge, but some deep specialties. And so I think full stack engineer causes people to get distracted and then not realize that they do need help. So you had asked me, do I think that we should let people silo? And our experience deep in organizations is that the silos are not bad.

[00:11:22.150] – Rob
The networking team and the storage team and the compute team and the security team, those specialties are real, and there’s deep knowledge that’s necessary to do a good job with those things, those different areas of expertise, whether they’re in the cloud or on premises or edge. But the thing that gets toxic in those relationships is when they get territorial and don’t collaborate.

[00:11:49.820] – Ned
Right.

[00:11:50.100] – Rob
So this is where that becomes the real challenge. On the other side, that drives people to full stack engineers. It’s like, I don’t want to deal with all these other teams. They slow me down.

[00:12:00.490] – Ned
To a certain degree. You’re thinking about how do I reduce the friction between the silos but still maintain that high level of expertise where I do have the T at the top of the T that you were describing that’s general knowledge across a bunch of disciplines, enough that you can talk to if I’m a storage guy and I want to talk to the networking Gal, I know enough to speak her language, but not enough to do her job. And then I have, I guess, the middle of the T. That’s my deep expertise in the storage. So that’s kind of what you’re advocating, right?

[00:12:32.560] – Rob
That’s exactly what we’re seeing. Works very well when you enhance the collaboration. Now we’re back to collaboration, right? When you enhance the collaboration between those expertise and have a way for them to work together, that really does change the ball game.

[00:12:49.060] – Ethan
So what’s the tool rule here? I’m sorry, what is the tool role here? Is it to keep the pipeline not so siloed, so that you see the delivery stack as a whole integrated thing, of which your team, or maybe you as an individual contributor, have this expertise and handle this part of it. But you can see, oh, the security team is going to engage here and then add their stuff. But I can see it. It’s not hidden from me.

[00:13:17.030] – Rob
Right. And each team has tools that work really well for them. We don’t want to take tools out of people’s hands if they do what they’re doing and they’re natural and they fit, especially if they’ve evolved over time. That’s what I mean by having excellent tools. But if the tools can’t fit together, that’s the pipelining concept. If the tools don’t fit together. So that you’re exactly right. We want to be able to say, hey, security team, the work that you want to do, I can put into a pipeline. I know it’s going to work. If it’s not working, I can call you up and say, fix it or improve it, but I want to be able to use that work over and over again. And I don’t want to have to call the security team every time I want to run their part of the process through my environment.
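Editor's illustration: one way to picture the pipelining concept is a list of stages, each owned by a team, where every stage receives the context produced upstream and passes an extended context downstream. This is a minimal sketch under assumptions. The stage names, the team names, and the `Stage` and `run_pipeline` helpers are hypothetical and do not describe RackN's product.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Context = Dict[str, object]


@dataclass(frozen=True)
class Stage:
    name: str
    owner: str  # team to call when this stage breaks
    run: Callable[[Context], Context]


def provision(ctx: Context) -> Context:
    # Compute team's work: adds the hosts it created to the context.
    return {**ctx, "hosts": ["app-1", "app-2"]}


def security_scan(ctx: Context) -> Context:
    # Security team's work: consumes upstream hosts, records what it scanned.
    return {**ctx, "scanned": list(ctx["hosts"])}


def run_pipeline(stages: List[Stage], ctx: Context) -> Context:
    """Run each stage in order, handing its output to the next stage."""
    for stage in stages:
        try:
            ctx = stage.run(ctx)
        except Exception as exc:
            raise RuntimeError(
                f"stage {stage.name!r} failed; contact {stage.owner}"
            ) from exc
    return ctx


PIPELINE = [
    Stage("provision", "compute team", provision),
    Stage("security-scan", "security team", security_scan),
]

result = run_pipeline(PIPELINE, {"env": "staging"})
print(result)
```

The security team's stage can be reused in any pipeline without calling the security team each time; the `owner` field only matters when the stage fails.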

[00:14:03.040] – Ethan
The pipeline becomes the component of delivery that the humans rally around. They rally around the pipeline. They might be using their own individual tools still, but they are unified and get some flow to the delivery based on the pipeline. So as long as the tools can interface with the pipeline, we’re okay. Is that your argument?

[00:14:28.330] – Rob
That’s exactly right. It’s a coordination. The pipeline provides the coordination. It doesn’t enforce everybody. You don’t want to take agency away from individual teams. What you do want to do is you actually want to increase the agency of the individual teams and then minimize the disruption. This is a paradox here from a complexity perspective that I had to get my head around. Which is it’s okay to have a lot of tools or a lot of platforms or a lot of environments? Right. That is actually normal. The idea that you’re going to wake up one day and cut a whole bunch of stuff out of your existing environment is a fallacy. So the thing that you need to do is embrace that. It is complex and heterogeneous and stop yearning for if only those other teams would stop using their misguided tools and platforms and clouds. Right. And wake up to the way I do it and everything would be so much better. That’s actually not a helpful attitude. It’s much better to say, I’m going to reduce the complexity of my organization by allowing my work to flow smoothly through the organization and connect to the teams that have to consume it.

[00:15:45.820] – Rob
If I have nice boundaries between my work, it’s decoupled from that perspective, and then I can coordinate those actions. It actually changed the way I thought about what we had to do to manage complexity.

[00:15:57.050] – Ned
Now, one thing that I’ve seen in organizations, they’ve started to embrace this idea of having a platform team, a team that provides a platform for the rest of the teams, and to a certain degree, that could provide a bounded and homogenous environment all the other teams are using. So do you support that? It almost sounds like what you’re saying is, no, you got to embrace the chaos and the fact that everybody’s going to be using something different.

[00:16:22.970] – Rob
I think platform teams are a really good idea, and I’m watching the customers we interact with are forming them. The challenge is it feels like IT in the 90s.

[00:16:35.080] – Ned
Okay.

[00:16:35.900] – Rob
Which is that in the 90s we had PCs, we went from mainframes to PCs, and that basically. And we did the same thing in cloud and cloud governance. That’s actually where we are. It’s what we’re facing. So let me go back to the 90s analogy. What happened is people started buying PCs and saying, I’m just going to run Excel, Word and whatever on my PC and get around all this IT governance stuff. And we did that for a while. And then all of a sudden we had viruses and worms and support issues and all this stuff. And so IT showed up and started controlling what you could do on your PC and governing it. Right. And that was really hard, but it was a necessary swing in how those things work. So I think the platform teams are required. I think that we need to figure out how to make platform teams work. And we also have to have the platform teams acknowledge that they’re walking into a situation where there is a lot of governance challenge where the teams have already built their tools and chosen their clouds and things like that.

[00:17:41.310] – Rob
So the platform team can’t be the team of “No”, like IT used to be. They have to be enabling something that goes further, and they also have to have a mandate from management to say, yeah, you actually do have to spend the time collaborating with the next teams over because there will be impacts right? If you take a team that’s not doing a security scan as part of their pipeline and say, hey, now you have to start doing security scan, it’s going to slow them up. It’s going to take time and investment, and that is going to be an impact.

[00:18:20.070] – Ned
Right they’re…

[00:18:21.050] – Rob
A beneficial impact. But if you’re not used to it, it’s a negative one.

[00:18:26.080] – Ned
Yeah, there’s definitely a challenge for any application team that’s trying to get the new version of the application with some whiz bang feature out and they don’t want to be slowed down by the fact that, oh, now we have to add the security scan and what if it comes back with a whole list of things? Now I have to investigate all those vulnerabilities and my boss is saying, I need this application out tomorrow to meet whatever the made up deadline is with the promise they made to a client, I guess. How do you balance that in the organization? Is it a culture thing? Does it come from the top down?

[00:18:58.890] – Rob
I actually think it’s better when it comes the other direction, when it comes organically upwards. This is the other piece to this. And this is where infrastructure as code, to me, becomes a really powerful story in the platform team mission. Because right now, the way automation is structured is like, going back to my 90s analogy, the Visual Basic era, where people were just hacking their own code and writing their own thing. And even if they were using code they got from somewhere else, they couldn’t put it back. It wasn’t reusable, it wasn’t modular. And so I think part of the platform team’s job is to understand their ultimate goal is to build the pipelines and connect everything together. But on an individual team basis, they’re actually fostering infrastructure as code reuse and modularity. So every team benefits if they give up the toil of writing automation that they don’t want to maintain.

[00:19:58.070] – Ned
Right?

[00:20:00.750] – Rob
That’s a really big deal. And one of the things that we spend a lot of time asking is why automation isn’t more modular. One of the formation stories for RackN is we used to ship clouds from the factories at Dell and we did this MVP style. So literally we would ship a rack of servers that we built in the lab and then our CTO would fly at the same time and go help install. Our CTO now, he was on the team at Dell. But what would happen is every time we would deploy those servers, it would take an engineer building the rack up to the spec on site. And then this is where we got really frustrated. We’d come back home, be celebrating over beers, another cloud installed, yay, and we’d get a notification that there was a patch or a change or something that went on, and that was fine. We’d fix it for that customer we just installed. But, and the but is huge, it was impossible to take that fix or change or improvement and repeat it for the last five customers that we installed before that.

[00:21:13.990] – Ned
Right.

[00:21:15.970] – Rob
We had this problem where we could keep improving the automation that we would deploy day one. But it was so hard to reliably take anything we’d improved and fix a site we’d already built. Right. That just drove us nuts. It still drives me nuts. And I watch people write automation code that only works for their team. That’s the problem here. That’s what the platform teams can bring in and be welcomed in with open arms. It’s like, hey, here’s automation that you can use, and if you improve it, other teams will get the benefit.

[00:21:54.730] – Ned
It’s relatively easy to write code or a script or infrastructure as code that matches your individual use case because you know exactly what the inputs should be and what you’re trying to build. When you try to abstract that out to meet more than just your use case, maybe your company’s use case or something even more generalized than that, that’s hard. And people have stuff they’re trying to do. I mean, you founded a whole company to make these abstractions because people weren’t going to do that on their own. Right? So how does the platform team go about building these modular abstractions that work across the entire organization?

[00:22:39.020] – Rob
So part of this is using tools that reinforce those practices. So I’ll tell you how we solve it, and people can take their lessons and figure it out from there. There’s about three or four things that are really key for this. One of them is having well-defined inputs and understanding they’re small units of work. Very important. Right? This is so funny. This is coding 101, and we just haven’t applied coding practices. This is where my head explodes: we call it infrastructure as code, and then we do Git and YAML and we call it done. It’s not. The goal here is to make it code-like. And that means we have modular, small components. We have ways to define variables in ways that are understood. One of the challenges is we have a whole bunch of stuff where you can define variables willy-nilly. And most languages don’t do that, especially highly collaborative languages. They actually want class and typed things so that if somebody misuses a variable, they get feedback, the system stops them. And then the other thing that we did that it’s taken a long time to get it right.

[00:23:59.210] – Rob
But the benefits are amazing is we have made the code that goes into these systems immutable. And when you’re distributing these small modules of work, “compile” is the wrong word, but you put them into a bundle. And that bundle, when it shows up in the system, is an immutable artifact. So you don’t get this, oh, I modified two lines of code to fix my case. You actually go back, fix the bundle, and re-upload the unit of work, so the variables and the tasks and the parameters and the templates and all that stuff get set together as a unit, like a coding module, and then it can be distributed and redistributed and versioned. Version control is really important in this.

[00:24:41.510] – Rob
It ends up feeling very code like
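Editor's illustration: the two practices Rob describes here, typed inputs that reject misuse and immutable, versioned bundles, can be sketched together. This is a minimal sketch under assumptions and not RackN's actual bundle format. `Param`, `Bundle`, `build_bundle`, `check_inputs`, and the `ntp` example content are all hypothetical names chosen for the example.

```python
import hashlib
import json
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping, Tuple


@dataclass(frozen=True)
class Param:
    name: str
    kind: type  # expected Python type of the value

    def validate(self, value):
        # bool is a subclass of int, so reject it explicitly for int params.
        if self.kind is int and isinstance(value, bool):
            raise TypeError(f"{self.name} expects int, got bool")
        if not isinstance(value, self.kind):
            raise TypeError(
                f"{self.name} expects {self.kind.__name__}, got {type(value).__name__}"
            )
        return value


@dataclass(frozen=True)
class Bundle:
    name: str
    version: str
    params: Tuple[Param, ...]
    templates: Mapping[str, str]  # read-only view
    digest: str  # content hash, changes whenever the content changes


def build_bundle(name, version, params, templates) -> Bundle:
    """Package params and templates into one immutable, hashed unit."""
    payload = json.dumps(
        {
            "name": name,
            "version": version,
            "params": [[p.name, p.kind.__name__] for p in params],
            "templates": dict(templates),
        },
        sort_keys=True,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return Bundle(name, version, tuple(params), MappingProxyType(dict(templates)), digest)


def check_inputs(bundle: Bundle, values: dict) -> dict:
    """Reject undeclared or wrongly typed variables before anything runs."""
    declared = {p.name: p for p in bundle.params}
    unknown = sorted(set(values) - set(declared))
    if unknown:
        raise KeyError(f"undeclared variables: {unknown}")
    return {key: declared[key].validate(val) for key, val in values.items()}


PARAMS = [Param("server", str), Param("port", int)]
v1 = build_bundle("ntp", "1.0.0", PARAMS, {"ntp.conf.tmpl": "server {{server}}"})
# A fix is a new bundle version, never an in-place edit of v1.
v2 = build_bundle("ntp", "1.0.1", PARAMS, {"ntp.conf.tmpl": "server {{server}} iburst"})
checked = check_inputs(v1, {"server": "pool.example.org", "port": 123})
print(v1.version, v1.digest[:12], v2.version, v2.digest[:12])
```

Passing `{"port": "123"}` raises a `TypeError`, which is the "system stops them" feedback, and the read-only `templates` mapping means a one-off tweak has to go back through a rebuilt, re-versioned bundle.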

[00:24:45.290] – Ethan
[AD] We pause the podcast for a couple of minutes to introduce sponsors. StrongDM’s Secure Infrastructure Access platform. And if those words are meaningless, StrongDM goes like this. You know how managing servers, network gear, cloud VPCs, databases and so on. It’s this horrifying mix of credentials that you saved in Putty and in super secure spreadsheets and SSH keys on thumb drives and that one doc in SharePoint. You can never remember where it is. It sucks, right? StrongDM makes all that nasty mess go away. Install the client on your workstation and authenticate, policy syncs and you get a list of infrastructure that you can hit when you fire up a session. The client tunnels to the StrongDM gateway and the gateway is the middleman. It’s a proxy architecture. So the client hits the gateway and the gateway hits the stuff you’re trying to manage. But it’s not just a simple proxy, it is a secure gateway. The StrongDM admin configures the gateway to control what resources users can access. The gateway also observes the connections and logs who is doing what, database queries and kubectl commands, etc.

[00:25:47.760] – Ethan
And that should make all the security folks happy. Life with StrongDM means you can reduce the volume of credentials you’re tracking. If you’re the human managing everyone’s infrastructure access, you get better control over the infrastructure management plane. You can simplify firewall policy. You can centrally revoke someone’s access to everything they had access to with just a click. StrongDM invites you to 100% doubt this ad and go sign up for a no BS demo. Do that at StrongDM dot com slash packetpushers. They suggested we say no BS and if you review their website, that is kind of their whole attitude. They solve a problem you have and they want you to demo their solution and prove to yourself it will work. Strongdm dot com slash packet pushers and join other companies like Peloton, SoFi, YXT and Chime. Strongdm dot com slash packet pushers. And now back to the podcast. [/AD]

[00:26:45.890] – Ethan
It does end up feeling very code like Rob. And there’s a big question here for me. Because of that, we’re talking about developer centric skills being applied to infrastructure so that we can deliver infrastructure as code in a proper way. Not a myopic way, just my thing. I can build a script that does X.

[00:27:06.470] – Ethan
So what is the core expertise of a platform team? Is it that the platform team is made up of a bunch of devs who are dedicated to creating an infrastructure as code delivery platform?

[00:27:18.100] – Rob
Oh boy, you’re hitting on one of the other things that I see that actually makes me a little nervous because Operations is not typically. I think there’s a lot of development done in Dev, but it’s not done as code necessarily. And by code, I mean compiled code. So there’s a couple of products out there, and there’s some major Kubernetes projects that take the whole deployment of Kubernetes and compile it as a Go component and then distribute it. And the challenge with that is that Operations tends to need high transparency in how things operate, and also a degree of flexibility to run a script or add a script or modify something. And so the balance is you do want it to be code-like, but at the same time, you don’t want to create a scenario in which an operator has to feel like they have to learn how to program in Go in order to fix a problem or even, actually, it’s not even fix, to understand a problem. So a lot of the stuff that we’ve been doing, it ends up being Bash, right? Or PowerShell. So if you can look at an individual task and it’s still using Bash, it’s still scripty.

[00:28:44.030] – Rob
But the answer was not, hey, code everything in Python or Go and make it Python and Go. There’s a reason why Ansible and Terraform aren’t Chef. Actually, Chef and Puppet ended up being very Ruby and got some resistance from those perspectives. So now you don’t want the operators to become programmers to run infrastructure. And I actually think that if you have to compile your code to fix your infrastructure, I think you’ve potentially locked yourself into a risky pattern. There’s a lot of times when I’ll find something that’s not working right or we hit a new operating system and I’m like, oh, wait a second, on this operating system I can’t just do a yum. I’ll be very specific: Amazon’s Linux, which is fine otherwise and mostly uses yum, every once in a while does not use yum. And so you have to put in an “if family is this, add this extra command,” and then you’re on your merry way. And the fact that I can do that, update the template, rebundle it, and then send it back into the system for testing is huge. It’s absolutely huge. Right? It doesn’t require me recompiling a whole bunch of code in the process.

[00:30:04.100] – Rob
I can make the changes just like you would expect to if you’re an operator and move on.
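Editor's illustration: the "if family is this, add this extra command" fix can be sketched as a small lookup keyed on the `ID` and `ID_LIKE` fields of an os-release file. This is a minimal sketch under assumptions, shown in Python only to keep one language across these examples; Rob's own tasks are Bash or PowerShell. The override table is illustrative, and the single Amazon Linux entry is an assumed example of such an exception, not a statement about which packages need it.

```python
def parse_os_release(text: str) -> dict:
    """Parse KEY=value lines in the style of /etc/os-release."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"')
    return fields


# Default install command per OS family.
BASE_COMMANDS = {
    "debian": ["apt-get", "install", "-y"],
    "rhel": ["yum", "install", "-y"],
}

# Hypothetical per-distro exceptions, keyed by (os-release ID, package).
OVERRIDES = {
    ("amzn", "nginx"): ["amazon-linux-extras", "install", "nginx1"],
}


def install_command(os_release_text: str, package: str) -> list:
    """Pick the install command for a package on the described OS."""
    info = parse_os_release(os_release_text)
    distro = info.get("ID", "")
    # The one-off exception lives in data, not in a hand-edited script.
    if (distro, package) in OVERRIDES:
        return list(OVERRIDES[(distro, package)])
    for family in [distro] + info.get("ID_LIKE", "").split():
        if family in BASE_COMMANDS:
            return BASE_COMMANDS[family] + [package]
    raise ValueError(f"no install rule for distro {distro!r}")


AMZN = 'ID="amzn"\nID_LIKE="centos rhel fedora"\n'
UBUNTU = "ID=ubuntu\nID_LIKE=debian\n"

print(install_command(AMZN, "git"))
print(install_command(AMZN, "nginx"))
print(install_command(UBUNTU, "git"))
```

Adding a new exception is a one-line change to the table, which can then be rebundled and tested without recompiling anything.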

[00:30:09.180] – Ned
I think part of the idea behind compiled code is the idea that it runs faster. Right.

[00:30:14.180] – Ned
Because it’s compiled, it doesn’t have to go through an interpreter that needs to parse it and then turn it into whatever is actually going to execute it. So there’s definitely a perception that compiled code runs faster, but that’s not always the case. Right. I think that’s just a general perception. And that if you’re a real programmer, you compile. If you don’t compile, you’re not a real programmer, you’re just a scripter.

[00:30:40.130] – Rob
Well, and I think one of the other things that makes infrastructure programming so different than development programming is that infrastructure programming is reactive. There’s so much in infrastructure programming that is dealing with the complexities of the systems that you get and accepting them. There’s a Zen to it. Right? You can say, I wish I didn’t have to worry about this. And in development, you’re sort of in control of your whole environment. In infrastructure programming, you might get into a place where you’re like, okay, Amazon’s APIs are implemented this way, and Google’s are that way, and Azure’s are this way. They’re totally different. And you don’t get to rage quit because you can’t eliminate those differences. You have to Zen in and accept it. And then you have to code around the heterogeneity. But this is where the platform teams add a lot of benefit. You can come back and say, all right, these differences are networking differences or compute team differences. And if you can get them to have a shared place where they can work or shared segments in the pipeline where they can hand off that work section to section, then all of a sudden it transforms.

[00:31:53.580] – Rob
Instead of pulling your hair out because I’ve got the wrong infrastructure, I don’t know enough to build the network correctly, you can say, hey, network team, I need you to help build this thing in this way, and then reuse that code.

[00:32:07.390] – Ned
Okay? So it sounds like in terms of staffing up a platform team, what you’re really looking for are people that are comfortable with software development concepts, but they have a deep understanding of how infrastructure works and the challenges around it. And they know who to go talk to when they’re trying to put something together. And then they’re responsible for building these templates and these modules and these reusable components and maybe even an example pipeline that the other line of business teams are going to then leverage for their applications.

[00:32:41.120] – Rob
That’s one of the things about the SRE movement, if you will, that I really liked: the concept of toil and reducing toil. Right. And I think part of what you just described to me can be framed in a toil perspective. Now, you have to have the bandwidth or the capability and the investment to be collaborative in it. But if you can go to a team and say, you know what, instead of you having to worry about how the networking or storage or compute is set up, we can give you a standard library. Now, this is key. You’re not taking away their ability to do that work. What you’ve done is you’ve packaged somebody else’s expertise in a module that they can include in their pipeline.

[00:33:28.260] – Ned
Right?

[00:33:30.230] – Rob
With a developer, you don’t want to go to a development pipeline and say, developer, you now have to be a security expert when you write your code. What you’d really like to do is say, we’re going to put in auditing tools that check your code for you so that you can relax about that thing a little bit, and then everybody gets to work in that. But yeah, if you’re not showing up with stuff that works, and this has been part of the challenge of building up these pipelines, platform teams have to have enough working things that it is a benefit to the team when they show up, that you’re taking work off their plate. Right.

[00:34:08.680] – Ned
You’re reducing their toil.

[00:34:11.750] – Rob
Yes, you’re reducing their toil.

[00:34:13.450] – Ned
And do you find that platform teams are usually built out of existing teams that are excelling at what they’re trying to do within an organization? Like, if you just have one particular line of business app that has a team that’s just rocking and rolling, they built all this cool stuff, you go, you two, I’m going to borrow you and put you on a platform team. Is that the genesis that you see?

[00:34:38.550] – Rob
Not as much directly, although I think there are benefits to that. That can have its own downsides unless the people want to do that. If those individuals are already evangelizing this approach, then sure, it makes perfect sense. The way I see the platform teams evolving is actually out of shared use of tools. So some of what I’ve seen happening with platform teams is a response to, God, we’re using a lot of Terraform and nobody’s using it consistently and we aren’t sure how to secure it. And you get the same thing for Ansible. You get actually the same thing for a cloud platform. Like, hey, we’re using Amazon and the bills are crazy and we’re using it all these different ways. So what I’ve seen is that the platform teams are really reactive to: we have a consistent use of something, but we’re using it inconsistently, right? Or ubiquitous use, but inconsistent use. And the platform teams sort of arise out of the I can’t cross-train people, I can’t secure it. The security team shows up and says, okay, nobody’s consistent. So I can’t spend the time auditing all this stuff because it’s taking me too long to figure out what each person does and if they fix it.

[00:36:02.710] – Rob
And this is a key thing. This is actually why I think companies need to look at platform teams in a much more accelerated way than they are: if you go to fix something for a team and they’re using something bespoke, even if you understand it, you’re like, oh, I can read this, it’s great, I know the tool, but it’s bespoke, the time it takes and the risk to fix something so that it’s portable across different infrastructures is really high. And that’s bad, right? You don’t want your security team to show up or your cloud team or your hardware team to show up and then break the teams that are running on top of them. And so that consistency allows you to then have what we see happening, which is a dev, test, prod automation process.

[00:36:51.120] – Ned
Okay. It reminds me a lot of manufacturing basically where you want to have standard parts that different lines of business can use. Right?

[00:37:01.040] – Ned
I mean, this is like manufacturing 101. If I’m an automobile manufacturer and I produce 18 different types of cars, there are going to be common components in all of those cars. And I don’t need to redesign the widget that makes the window go up 18 times. I just need to make a window widget that works consistently across 18 vehicles. And that’s what the platform team is responsible for. Make the window go up and down.

[00:37:27.750] – Ethan
Going back to the episode that we recorded with Tim, Ned: don’t repeat yourself, effectively.

[00:37:32.690] – Ned
Right.

[00:37:33.750] – Ethan
It’s the DRY principle. Yeah.

[00:37:35.970] – Rob
DRY is so critical, although I think you do want to take it that one step further. It’s like, don’t repeat yourself across teams. So Tim’s comment was very much like, hey, I want my Terraform plan or my Ansible thing to be DRY. What I’m saying is you want your pipelines to be DRY.

[00:37:55.680] – Ned
Yeah. Take it that one step above, to the thing that’s your orchestrator, your automation platform. That also has to be consistent and repeatable across all these different lines of business.

[00:38:08.200] – Rob
And then you can actually start doing modular things, like where CI/CD pipelines have a lot of reusable pieces. Having each team running a pipeline doesn’t mean they’re running the same pipeline. That wouldn’t make any sense, but they should be running pipelines that are composed of consistent pieces. And if you can do that, this is back to the tools reinforcing themselves. You want to have a high degree of benefit for I’ve made this installer work for the scenarios of all these teams, because everybody’s now using that one process. And if they can reuse that process, I’ll give you a deep example from what RackN does. We spend a lot of time on RAID and BIOS configuration for customers, but most of our customers have teams that do RAID and BIOS configuration, and there are teams that resist giving up that skill, and they keep investing in it. And it’s not adding a lot of business value, but it’s a known part of the process. The challenge becomes, how do you eliminate the toil of having to reinvent that process? Because the way we do it is different than the way the team is doing it, and that creates churn and there’s angst.

[00:39:21.330] – Rob
But at the end of the day, it’s like, well, and this is where the value comes in. If we improve it because we find something with another customer or we work with a vendor and use their new tools, then nobody in the company has to figure that out. It’s been figured out, and that translates across the board into an acceleration. Right. You have to think of it this way: the platform team has the ability to significantly accelerate teams all across their organization if they can get people reusing code patterns. And if they fixed something, it could be, oh yeah, I fixed the Log4j bug in my Java deployer. Great. Everybody now gets the benefit of using that. A lot of companies are doing this team by team by team because they don’t have the process.
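The idea of different team pipelines composed of consistent pieces can be sketched as follows. This is only an illustration under stated assumptions: the stage names, team names, and the dictionary-based context are hypothetical and do not reflect how Digital Rebar or any particular CI/CD tool models pipelines.

```python
from typing import Callable, Dict, List

# A stage takes the current context and returns an updated copy.
Stage = Callable[[dict], dict]


def configure_raid(ctx: dict) -> dict:
    return {**ctx, "raid": "configured"}


def configure_bios(ctx: dict) -> dict:
    return {**ctx, "bios": "configured"}


def install_os(ctx: dict) -> dict:
    return {**ctx, "os": "installed"}


def deploy_app(ctx: dict) -> dict:
    return {**ctx, "app": "deployed"}


# One shared library of stages, owned by the (hypothetical) platform team.
SHARED_STAGES: Dict[str, Stage] = {
    "raid": configure_raid,
    "bios": configure_bios,
    "os": install_os,
    "app": deploy_app,
}

# Each team composes its own pipeline from the same pieces.
TEAM_PIPELINES: Dict[str, List[str]] = {
    "payments": ["raid", "bios", "os", "app"],
    "analytics": ["bios", "os"],
}


def run_pipeline(stage_names: List[str], ctx: dict) -> dict:
    """Run the named shared stages in order and return the final context."""
    for name in stage_names:
        ctx = SHARED_STAGES[name](ctx)
    return ctx


if __name__ == "__main__":
    for team, stages in TEAM_PIPELINES.items():
        print(team, run_pipeline(stages, {"host": "node-1"}))
```

The pipelines differ per team, but a fix to one shared stage, such as `configure_bios`, reaches every pipeline that includes it.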

[00:40:08.350] – Ned
Right. So it requires a certain level of discipline within the organization, across the teams and some sort of leadership to get the platform team up and running and actually get all the other teams on board with using what the platform team is creating. And I think that only happens, like you said, when the platform team is actually creating something of value that’s useful to all those teams.

[00:40:33.630] – Rob
Although I do think there’s an element, if you’re looking at the executive buy-in, and to me, the IaC, the infrastructure as code mantra here is I want my operations teams to look more like the development teams. And if my development teams were rewriting libraries that they could just pull in and use, I’d be upset. And we should be having that expectation. We’re deep, deep into this cloud and automation thing. The fact that we don’t have as much reuse out of automation pieces across teams or across the industry, that really troubles me. We have good tool reuse, but we don’t reuse something simple like how an install, a library, is updated.

[00:41:19.590] – Ned
Right. Well, Rob, this has been like a real fascinating and interesting conversation. I feel like we could probably go on for another 2 hours or something spinning on this, but unfortunately we are starting to run out of time. What are some key takeaways or main points that listeners should come away with?

[00:41:38.100] – Rob
Oh boy, there’s a lot of topics that we could spend whole podcasts on going deeper. So maybe I’ll try and hit those as some of these takeaways. I do think that we need to think through this concept of infrastructure as code pipelines and start figuring out how to connect things together. But you don’t need to connect an end-to-end pipeline. You can just connect two things together. Right. In a team, two tools together, two teams can collaborate better. It doesn’t have to be all or nothing. That’s really important. I think that when you look at how you’re using the tools and what the tools are doing, review what your process is and think about if it’s creating forward benefit and encouraging, right?

[00:42:25.580] – Rob
Sometimes there’s more work to do it, but it’s so important to create collaboration and have your tools reinforce the collaboration that I think that’s worth people sort of thinking through. And if you look at some of the tools you’re using and ask that simple question, it might have you regauge that from that perspective. And then the third one to me is embrace that things are heterogeneous and that complexity in itself is an expression of what we actually have to cope with. It’s not bad or evil, right? You don’t want to run around and just strip the complexity out of your systems. It serves a purpose. Sometimes it serves a really important purpose you might not even understand yet, but you have to be Zen about it. You have to come in and say, all right, I have complexity in my systems. Let me look at how do I manage it better so I’m not burned when something unexpected happens, or how do I decouple it from other complex systems. So those are three things that I think are really important as you build these systems. And if you do those things and think about it that way, I think it will give you a much more resilient, robust and calmer day with infrastructure.

[00:43:41.670] – Ned
If folks want to hear more from you, I know you’ve got a few different ways people can either reach out or listen in. So if you want to plug those, now would be the time to do it.

[00:43:50.820] – Rob
I’d be happy to. I’m zehicle, Z-E-H-I-C-L-E, on most platforms, very active on Twitter. And a lot of these things we’ve been talking about, you can download and try and experiment with what I’m talking about in our product called Digital Rebar. If you visit RackN.com, a lot of what we do is software. And we encourage people to download and try it and get first-hand experience. So I encourage people to check that out. We actually are building a lot of content on infrastructure pipelines also. So if you’re interested more about that, that’s definitely a good place to go for that. Also, I have one more. I’m giving you an outtake potential, but I’ve also been running an open roundtable session called Cloud 2030, and that has been really remarkable. We’ve been running over 18 months. Twice a week we have a DevOps discussion, and then we also have a strategy discussion, and people just show up and they talk about sort of big issues that are confronting our industry, where things are going. We sort of put an agenda of topics. We love to talk about Edge, but we talk about governance and standards and things like that. And I would encourage people, if they want to be part of a hallway track conversation, jump in and join that. They can find out more about that.

[00:45:19.600] – Rob
The site is the2030.cloud.

[00:45:22.080] – Ned
Rob, thank you so much for being a guest today on Day Two Cloud. And hey, listeners out there, stay tuned. After this, there is going to be a Tech Byte from Singtel. We’re going to be talking about cloud networking and troubleshooting performance issues. It should be a wild and crazy time.

[00:45:40.610] – Ned
Welcome to the Tech Bytes portion of our episode. We’re in a six-part series with Singtel about cloud networking. That is, how to make your existing wide area network communicate with cloud services in an effective way that maybe your legacy WAN isn’t able to. Today is part three of six, and we’re chatting with Mark Seabrook, global solutions manager at Singtel, regarding strange ideas and misconceptions customers have about connecting to the cloud. Mark, welcome back. One prevalent solution to connect private networks to cloud networks is SD-WAN. We, in fact, talked about SD-WAN in the previous episode, and it seems like it should be really straightforward. Right? Plumb an SD-WAN router into your cloud VPC, it joins the tunnel mesh and it all just works, right?

[00:46:28.490] – Mark
Yeah. So a little bit of a misconception with some customers. When people get into SD-WAN, sure, it looks very plug and play. And it is when you start off and you’ve got maybe only 10-20 sites, but you get up into the 500 sites, the thousand sites, very quickly, you add in some cloud connectivity, some web security. You can get very granular. You can have tunnels upon tunnels upon tunnels. And if you don’t plan it correctly, you let it get out of control. It can become problematic.

[00:47:14.320] – Ned
Right. I’m just envisioning in my head, like ten points, all connecting together to some clouds. And I can kind of hold that in my head. And then you said 500 sites. And I’m like, no, that’s just a rubber band ball. And I have no idea what’s going on. But let’s not overstate the complexity. Right. If I’m in a single region, in a single country, like you said, 10-20 sites, SD-WAN, that’s pretty functional out of the box. Right?

[00:47:40.120] – Mark
One of the other big things that you really need to plan and look at with an SD network is regions around the world. So, for example, if you’re a customer that’s only got sites in the United States, it’s totally free. The Internet is free. There’s no limitations. You’ve got ample bandwidth. If you’re doing an SD network across the globe, in other parts of the world where things are regulated, you have to really think deeply about this.

[00:48:11.520] – Ethan
Okay. You just brought up a point here that’s just hitting home. You’re talking about the connectivity from America being free. Not free as in dollars. Free as in there’s no restrictions. If you’re on the Internet and connecting from one point in the US to another point in the US, you’re not thinking about there’s some bottleneck where someone’s monitoring my traffic and going to throw some stuff away. There’s no government agency. Well, we can get into conspiracy theory. That would be super fun. There’s no restrictions on where I can push traffic. But you’re saying depending on where I am routing internationally, it gets complicated if I’m trying to push traffic into, you brought up the example of China, let’s say. That’s the point you’re making.

[00:48:54.080] – Mark
Yeah, absolutely. I mean, the most famous one is China. Everybody in the networking world sort of knows about the Great Firewall of China. But it’s not just the firewall. It’s the way Internet traffic is routed through the three main providers within China. And it’s very much province to province, and there are choke points. So we get a lot of customers who are in China, they’re on SD-WAN, but their corporate has them pointing to an AWS or an Azure target outside of China, even in the States, and that’s very problematic. So you really have to think around that when you’re designing from a plain piece of paper.

[00:49:41.490] – Ned
That is very interesting to me because I think of China as the Great Firewall. There’s just like a big border around it. And all Internet traffic that is outbound gets filtered or inbound gets filtered. But I never thought of from province to province because that’s a very different experience than what you see in the United States. So it’s literally if something’s coming from one province to another that might get filtered.

[00:50:03.930] – Mark
Yeah, filtered. And it’s also a choke point. So if you’re sending a packet from New Jersey to California, there’s a billion ways it can get there in the States, obviously, just based on BGP. Within China, for example, it’s a lot more regulated. So even traffic within China, there are a lot of choke points before you can even get out of China through what we commonly refer to as the Great Firewall.

[00:50:35.770] – Ethan
Mark, another challenge here, I think, is even though we’re in the Asia region trying to move traffic from, say, Hong Kong to China, depending, it could go all the way across the Pacific to the West Coast of the US and then back again.

[00:50:47.820] – Mark
Sure, absolutely. So if you’re using a regular vanilla DIA Internet, if you did a traceroute from Hong Kong to a target in China, most likely it’s going to go via California or a West Coast hub. So we kind of get around that by using our IP transit, where we have direct peerings with the three main Chinese telcos. So we can actually determine the route. So it’s a more direct pathway from Hong Kong into China, for example.

[00:51:27.870] – Ned
I was going to say going out to the West Coast, San Jose, that seems suboptimal in terms of routing. I know routing isn’t always the most optimal route, but that seems not great, not great performance.

[00:51:40.290] – Mark
And that’s another thing with SD-WAN. If you’re going pure Internet, there are a lot of what we just call suboptimal situations around the world. You can get traffic and it’s wrapped up in a tunnel and it will flow from India to the States back to London, for example.

[00:52:01.230] – Ned
Okay, interesting topologies. So what approach would you take if you’re designing an SD-WAN solution? Would you go with a very regional approach for that, or what’s your advice to avoid these kinds of pitfalls?

[00:52:16.950] – Mark
Yeah, absolutely. So a lot of our customers, especially our big customers, with say 1000 sites, we will keep them in a regional topology. So what that basically means, let’s just pick on the States. All of the sites in the US would point back to two Equinix data centers in an active failover. You’d have a 100 gig ring between those two Equinix sites back to their private data center, and then from the actual in-country data centers and private hubs, we will link them with point to point or a large MPLS to their other regions. So, for example, you might have a customer that divides their network up into the States, Europe, say in Germany, Middle East, and Asia.

[00:53:11.250] – Ethan
Yeah. What you’re highlighting here is even though SD-WAN with the tunnel overlay abstracts whatever the transport is underneath it away, if you want to get optimal performance, and depending on what the endpoints are that you’re trying to connect, you may need to do some, I guess we could describe it, Mark, as over-the-top routing engineering that is going to make sure those endpoints are communicating in an optimal way. You can’t just point your SD-WAN tunnel at the default gateway and say, good luck, little packet. You’re going to be fine. Because it may not be fine without that engineering that is going to regionally optimize traffic flow.

[00:53:50.410] – Mark
Sure, absolutely. The very last thing you want is to go to an SD-WAN topology, gain all the benefits of SD-WAN, and then have your Internet breakout going six times around the world before it hits the cloud target. So you want to keep everything kind of in region. As far as cloud connectivity is concerned, that’s always been our philosophy.

[00:54:15.180] – Ned
Okay. So if you’re taking advantage of some cloud services you want to hook into AWS or Microsoft Azure, you would have those cloud connect points that I think we talked about in a previous tech byte that you provide. Those would be in those Equinix data centers that you’ve set up as the two active failover sites for a region. Is that the design we’re talking about?

[00:54:36.270] – Mark
Yes and no. So, yes, a lot of the traffic will be routed to their data centers, and then it will hit the cloud through our SD connect product, which takes them in through a direct connection. A lot of other applications, and this is the beauty of SD-WAN, a lot of the other applications will just be inspected on the first packet at the site, go straight to the Internet breakout, and point towards a Zscaler node, and then from Zscaler, then into the cloud just through an Internet gateway.

[00:55:11.850] – Ned
Okay. So if I’m using Microsoft 365 at one of those branch sites, that’s the path that’s going to take. It doesn’t need to go over an ExpressRoute or anything. You can just use the public Internet.

[00:55:21.810] – Mark
Sure. The rule of thumb that I’ve seen with a lot of Fortune 100 is your Office 365 kind of low security, general office type tools, they’re going to go for an Internet breakout. We’re still going to put them through Zscaler to scrub them.

[00:55:42.740] – Ned
Sure.

[00:55:43.080] – Mark
But they’re going to go straight over the Internet to the most optimal cloud gateway in that territory. There’s other applications where through regulations, they’ll be pushed through a tunnel back to the Equinix hubs, maybe scrubbed through a corporate set of firewalls there, and then pushed into the cloud via an SD connect gateway.
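The steering rule of thumb Mark describes can be sketched as a small policy function. This is a hedged illustration only: the `App` type, the `regulated` flag, and the hop names are hypothetical labels for the paths mentioned in the conversation, not the configuration model of any SD-WAN product.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class App:
    name: str
    regulated: bool  # assumption: a single flag stands in for compliance rules


def steer(app: App, region: str) -> List[str]:
    """Return the ordered hops a branch site would use for this application."""
    if app.regulated:
        # Regulated traffic: tunnel back to the regional hub, scrub through
        # corporate firewalls, then enter the cloud over a direct connection.
        return [
            f"tunnel-to-{region}-hub",
            "corporate-firewall",
            "direct-cloud-connect",
        ]
    # General office tools: break out locally, scrub through a cloud security
    # proxy, then reach the nearest cloud gateway in that territory.
    return [
        "local-internet-breakout",
        "cloud-security-proxy",
        f"nearest-cloud-gateway-{region}",
    ]


if __name__ == "__main__":
    print(steer(App("office-suite", regulated=False), "us"))
    print(steer(App("payments-ledger", regulated=True), "us"))
```

Keeping the region in every hop name mirrors the point that both paths stay in region rather than crossing the globe before reaching the cloud.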

[00:56:07.650] – Ned
Okay. Yeah. And I know a lot of those SaaS applications like Office 365, they will figure out what the best endpoint is for you based on the DNS queries. So they don’t want you mucking around with that.

[00:56:21.180] – Mark
Sure. Absolutely. I mean, even, for example, Silver Peak, they will push out an updated list of the most optimal cloud targets over the Internet to all of the sites. So it’s kind of very automated.

[00:56:35.790] – Ned
Got you. So the main takeaway for me here is that you can’t just slap an SD-WAN into your networking and hope it works. There has to be some amount of planning behind what you’re doing.

[00:56:47.190] – Mark
Yeah, absolutely. I think at least on a global scale. And when you’re getting up into like 1000 sites, you really have to have a clean design. You really have to design it out on paper. Lots of drawings, lots of Visios. If you just put it all together and install it and people in one region build certain stuff, and people, IT staff in another region build other stuff, and they don’t all match together, you can get some real issues.

[00:57:18.880] – Ned
The packet that never finds a home.

[00:57:21.100] – Mark
Right?

[00:57:22.590] – Ned
Cool. Well, if folks want to hear more about your thoughts on SD-WAN, how can they reach out to you on the Internet?

[00:57:29.130] – Mark
Sure. Yeah. Just hit me up on LinkedIn under my name, and I’d love to talk to anybody.

[00:57:34.550] – Ned
All right. Well, Mark Seabrook from Singtel, thank you so much for joining us today. And hey, listeners, thank you for listening in to this Tech Byte. This was just part three of a six-part series, so we’re going to hear more on building cloud-ready networks with Singtel in upcoming episodes. Part four will be in a couple of weeks, and we’ll be tackling some real world customer scenarios so you can learn from their experience while building your own cloud-ready network.

[00:58:00.990] – Ned
Thanks to our guests for appearing on Day Two Cloud. And hey, virtual high fives to you for tuning in and sticking all the way to the end. If you have suggestions for future shows, we really do want to hear about them. You can hit us up on Twitter. It’s at Day Two Cloud Show. Or I do have a contact form on my website. It’s nedinthecloud.com. You can reach me through there and just share whatever’s in your brain that you want to share. You could even be a guest on the show. Imagine that. How crazy would that be? Did you know that Packet Pushers has a weekly newsletter? It’s true. It’s called Human Infrastructure Magazine, and it’s loaded with the best stuff we found on the Internet, plus our own feature articles and commentary.

[00:58:42.350] – Ned
It is free and it does not suck. You can get the next issue via packetpushers.net/newsletter. Until then, just remember, cloud is what happens while IT is making other plans.
