
Day Two Cloud 125: Scanning Infrastructure-as-Code For Security Issues

Episode 125


It’s always better to catch misconfigurations and vulnerabilities earlier in your pipeline rather than later. That’s especially true for cloud services where a simple configuration error can expose sensitive assets to the entire Internet.

On today’s Day Two Cloud podcast we discuss how to incorporate security checks early in your Infrastructure-as-Code (IaC) workflows to reduce risk. Our guest is Christophe Tafani-Dereeper, Cloud Security Engineer at Nexthink.

We discuss:

  • What shift-left means in software development
  • How DevSecOps fits into IaC practices
  • Common cloud security risks
  • Using static scans to spot misconfigurations
  • Tools available to help
  • Digging into Terraform examples
  • More


  1. Try to minimize the noise; focus on what matters to you
  2. Using IaC is a good opportunity to find misconfigurations before they get to production
  3. Shift left, but also start left!

Show Links:

@christophetd – Christophe on Twitter

Christophe on LinkedIn

Christophe’s Blog

Shifting Cloud Security Left — Scanning Infrastructure as Code for Security Issues – Christophe’s blog

Scanning Infrastructure As Code for Security Flaws – IaC Scanning DevSlop

NSA Releases Guidance on Mitigating Cloud Vulnerabilities – Cybersecurity & Infrastructure Security Agency

Starting Left rather than Shifting Left? – OWASP (PDF)

Introducing the State of Open Source Terraform Security Report – BridgeCrew

Infrastructure drifts aren’t like Pokemons, you can’t catch ’em all – driftctl

Shifting Cloud Security Left: Scanning Infrastructure as Code for Security Issues – OWASP DevSlop via YouTube


[00:00:06.770] – Ned
Welcome to Day Two Cloud. Today we are going to be talking about scanning infrastructure as code. Why? It’s critical if you’re trying to create a secure environment in the cloud. And one thing that really jumped out to me is that it’s not just scanning it once, it’s scanning it multiple times at the right time. What jumped out to you, Ethan?

[00:00:25.190] – Ethan
Well, scanning infrastructure as code. Maybe some people think we’re talking about scanning containers or virtual machines. No, we’re talking about the code that would stand that stuff up. So, like, Terraform. We get into the Terraform stuff specifically, like looking at your plans and so on. Are you about to create something that’s horrifyingly insecure? And Christophe really gets into it because of this tremendous blog post we’re going to discuss, Ned, where he reviews a whole bunch of tools that help us with this.

[00:00:49.870] – Ned
Yes. Our guest today is Christophe Tafani-Dereeper. He is a cloud security engineer. So enjoy this episode with Christophe. Christophe, welcome to Day Two Cloud. Let’s start with some introductions. Can you tell us a little bit about who you are and what you do?

[00:01:06.750] – Christophe
My name is Christophe. I’m a French guy, which you can figure out from my terrible French accent. I’m living in Switzerland, and in the past years I’ve been doing a lot of different things, from software development to a bit of operations. And now I’m working as a cloud security engineer at a Swiss software company.

[00:01:27.840] – Ned
Okay. So your focus has really been, at least recently, on security. And you have some development background.

[00:01:35.080] – Christophe
Yes, I did some Java software engineering, and I still love programming a lot. A few months ago I decided that you cannot really call yourself a cloud security engineer if you don’t code in Go. So now it’s good.

[00:01:51.270] – Ned
Okay. Yeah. Go is on my list of things to study, and it’s one of those things where I know I need to learn it, I just don’t have a great use case for it. But it sounds like security might be a good use case for it. The context for our conversation today is all around applying security to infrastructure as code. Before we do that, I want to talk about two terms that you brought up in a blog post. One is shifting left; the other one is DevSecOps.

[00:02:21.290] – Ned
So what does shift left mean in the context of software development?

[00:02:27.340] – Christophe
Yeah, sure. And if we talk about shift left and DevSecOps, maybe we’ll have to mention blockchain and machine learning as well. So generally there’s a lot of buzz around shift left and DevSecOps. But I think shift left is really about trying to get away from an old model where you have people that design, implement, test, and operate some code, and then you have security that comes in at the end. Shift left is really about trying to get security right from the beginning, trying to shorten the feedback loops that you get.

[00:02:59.720] – Christophe
So instead of having a waterfall kind of deployment that takes months and then security comes in, you try to have security included in the design, then in the testing, and at every phase of the software lifecycle. This is a bit abstract, but we’ll get into more depth later. Also, it’s something that has a lot of hype, as I said, but actually it’s nothing new. In the automotive industry and elsewhere it has been around for decades, and we probably wouldn’t have cars today if shift left hadn’t been applied to those industries, and the same goes for software engineering.

[00:03:39.410] – Christophe
I think the first mention of shift left that I found was more than 20 years ago. So there’s a lot of buzz around it. It’s something that is very useful, but let’s not forget that it’s not new, and we probably have a lot to learn from other industries as well.

[00:03:56.420] – Ned
Okay. So the idea is to move security earlier in the process. How does that apply to infrastructure automation and infrastructure as code?

[00:04:05.360] – Christophe
So when we do cloud, it’s because we have a chance to have all our infrastructure defined as code, which is something we might not have otherwise. If we go directly to the AWS console and create our instances manually, we don’t have that. The idea is really to be able to look at our definition of infrastructure, which is code, and to flag security bad practices, or even better, security flaws. That’s nice because it gives us very quick feedback that something is bad, and we can detect it even before it goes to production.

[00:04:38.780] – Ned
So catching the issues in the code well before it makes it out the gate and someone can exploit it in the production environment, that makes sense.
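As a concrete illustration of the kind of issue these scans catch, here is a minimal, hypothetical Terraform snippet that a static scanner would flag before it ever reaches production (resource names are invented for the example):

```hcl
# Hypothetical example: a security group that a static scanner would
# flag, because it opens SSH to the entire Internet.
resource "aws_security_group" "example" {
  name = "example-sg"

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"] # wide open -- flagged before deploy
  }
}
```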

[00:04:46.820] – Christophe
Right. And it’s a bit similar to what you would do with static code analysis. You don’t want to only find out about your application-level vulnerabilities once they’re being exploited. You also want to scan your code to find those vulnerabilities based on your Java or Go code. It’s the same approach, just for infrastructure.

[00:05:08.230] – Ned
Now, infrastructure in the cloud is vastly different from infrastructure on premises, in the way you deploy it and the services that are available. What are we trying to protect against in cloud infrastructure that would be different from traditional on-premises security?

[00:05:24.490] – Christophe
Great question. And actually, it’s a question we should ask ourselves every time we do security for anything. Otherwise, it’s very easy to secure something that doesn’t need to be secured, or to secure the wrong part of the system. Right. We don’t have a lot of very good numbers on this, but there’s an interesting report from the NSA from early 2020 where they basically looked at what kinds of vulnerabilities they found in the cloud. The conclusion is that most of it is due to misconfigurations and to poor access control, which can also be classified as a misconfiguration.

[00:05:58.950] – Christophe
What we see here is that most security breaches in the cloud are due to things that could have been avoided pretty easily. So we talk a lot about public S3 buckets because of insecure defaults in AWS, but it’s also things that are exposed to the Internet that shouldn’t be, and that sort of stuff.
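The insecure-defaults problem Christophe mentions is why, in Terraform, teams often pair every bucket with an explicit public-access guard rather than relying on platform defaults. A hedged sketch (the bucket name is illustrative):

```hcl
# Explicitly block public access on an S3 bucket, instead of trusting
# whatever the platform default happens to be.
resource "aws_s3_bucket" "logs" {
  bucket = "example-log-bucket" # illustrative name
}

resource "aws_s3_bucket_public_access_block" "logs" {
  bucket                  = aws_s3_bucket.logs.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```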

[00:06:18.970] – Ethan
So we talked about the primary cause of security breaches being misconfiguration: someone who kind of doesn’t know what they’re doing, or just took the default or something like that, leaving holes open. Are there other major causes of security breaches we should be thinking about, Christophe?

[00:06:36.730] – Christophe
So first, I would like to shift the blame from individuals to the companies and the way that the interfaces are designed. If I go to the AWS console and I create a public S3 bucket, it should be very hard for me to do it. And historically, AWS, and maybe some other cloud providers, though there’s been more visibility on AWS, made it basically very easy to screw up your configuration or to have insecure defaults. I think it’s really about that, as opposed to individual people.

[00:07:07.050] – Ethan
Wait, you’re pushing back on AWS defaults as being not secure enough?

[00:07:12.870] – Christophe
Generally, I just don’t want to get too political, but let’s get into it.

[00:07:23.530] – Ned
That’s all right.

[00:07:25.150] – Ethan
Well, this is actually important, because the thing we want to bring up here is a point of knowledge. That is, if I am a cloud practitioner standing up AWS infrastructure, it should be in my head that there is some default security posture that isn’t secure enough, and I should be thinking about that. So I don’t mean to make it political, Christophe. It’s more of just a practical thing for operators.

[00:07:53.390] – Christophe
Yeah, sure. There are a lot of ways that when you create infrastructure in AWS, or any other cloud provider, you can screw it up, and some of the defaults have historically made that very easy. Typically, a few years ago, when you created an S3 bucket, you could make it accessible to “authenticated users,” which actually meant accessible to anyone in the world who has an AWS account. And people understood it as accessible to everyone in my own AWS account. So there was a lot of confusion around that.

[00:08:25.010] – Christophe
And that’s why I say it’s also the responsibility of the cloud providers themselves to make sure that they have secure defaults. It’s a bit too easy to shift the blame to the individual people making the mistakes, as opposed to the companies that are not investing in usability and secure defaults.

[00:08:42.730] – Ned
Right. I think another good example of that is Microsoft when they first launched, I think, their SQL as a service. It launched with a public endpoint available to anyone by default. Sure, it was still protected by username and password, but that wasn’t a great setup, just exposing your SQL server to everyone. So they’ve changed that now. When you launch it, you have to give it a range of IP addresses to allow, and it blocks everything else by default. That’s a good move on the vendor side.

[00:09:09.310] – Ned
What can we do as practitioners to help combat security issues either before deployment or once they’re out in the wild?

[00:09:17.030] – Christophe
Yes. I think something that is probably the case for a lot of things in security is that you won’t be able to find all the security flaws in one go. So from my perspective, it makes a lot of sense to try to prioritize what kinds of things you want to find and what kinds of flaws you want to avoid. Typically, some people want to make sure that they don’t have too much exposure to the Internet, in which case you could focus on making sure that everything you launch is sitting inside a private network, or has a firewall, or has a specific set of IP ranges that can access it.

[00:09:54.680] – Christophe
Some people are more concerned about encryption, or maybe about identity and access management. How does access get provisioned? If someone leaves the company, does their access get removed automatically, et cetera. So I think the key is to focus on a class of vulnerabilities, especially if you’re a cloud company and you’ve had a repeating set of vulnerabilities for a few months or a few years. Typically, if you’ve seen people creating publicly available Elasticsearch clusters or things like that, it makes much more sense to focus on that.

[00:10:34.920] – Ned
Right. So first you have to know: what’s the thing I’m trying to protect against, what’s the security problem I’m trying to solve, and then apply that to the cloud infrastructure that people are standing up. You can’t just apply security blindly, because that’s too big of a surface, I guess, might be the right word for it. Let’s dig into some specifics here, and I want to focus on a blog post that you wrote. A very thorough post, I might add, that we’ll definitely include in the show notes. You were looking at Terraform and some static analysis security tools for Terraform.

[00:11:10.130] – Ned
So first, can you set the stage? What is a static analysis tool for Terraform? What is it trying to accomplish?

[00:11:17.580] – Christophe
So generally the idea is really to look at the Terraform code, parse it, and see if there is anything that stands out from a security standpoint. It might be a security group that has been opened to the Internet. It might be something more complex, like a security group that is attached to an instance sitting in a public subnet, with some more logic. But generally the idea is really to look at the infrastructure as code. In this case we are just talking about Terraform, but it could also be for CloudFormation, Pulumi, or other technologies, and flag any security issues.

[00:11:53.280] – Christophe
And I’m talking about Terraform code, but we’ll get back to it a bit later: there are also opportunities to look at the Terraform plan, which is going to be slower to do, but also gives us more context and allows us to have fewer false positives or false negatives.

[00:12:10.090] – Ned
Okay. So it’s kind of like a rules engine. Basically, it looks through the code, it’s got some rules or things that you want to look for, and it tries to match those rules against what it sees in the code. And if it sees something that matches, it raises that as a flag. Is that basically what each tool does?

[00:12:30.370] – Ethan
Okay. Is it pattern matching then, Christophe, or is it more, I don’t know, more intelligent as it looks at things? In your blog post, I think you gave an example of a wide-open 0.0.0.0/0 subnet. So is it pattern matching against that string to try to find vulnerabilities?

[00:12:52.510] – Christophe
Yeah. Generally it mostly does pattern matching. And there are some tools that try to have more capabilities, that build a graph of how the different resources are linked together and are able to apply rules based on that. One of the examples would be: if you have a VPC that doesn’t have a VPC flow log linked to it, you could flag that.
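The pattern-matching approach Christophe describes can be sketched as a toy rules engine. This is an illustration of the idea only, not any real tool's implementation; the rule and resource shapes are invented:

```python
# Toy rules engine: walk parsed resources and flag known-bad values.
RULES = [
    {
        "id": "SG_OPEN_TO_WORLD",
        "resource_type": "aws_security_group",
        "attribute": "cidr_blocks",
        "bad_value": "0.0.0.0/0",
        "message": "Security group ingress open to the entire Internet",
    },
]

def scan(resources):
    """Return a list of findings for resources matching any rule."""
    findings = []
    for res in resources:
        for rule in RULES:
            if res["type"] != rule["resource_type"]:
                continue
            values = res["attributes"].get(rule["attribute"], [])
            if rule["bad_value"] in values:
                findings.append((rule["id"], res["name"], rule["message"]))
    return findings

resources = [
    {"type": "aws_security_group", "name": "web",
     "attributes": {"cidr_blocks": ["0.0.0.0/0"]}},
    {"type": "aws_security_group", "name": "internal",
     "attributes": {"cidr_blocks": ["10.0.0.0/16"]}},
]
print(scan(resources))  # flags only the "web" security group
```

Real tools parse HCL into such structures first; the graph-based tools Christophe mentions additionally link resources (e.g., VPC to flow log) before evaluating rules.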

[00:13:14.130] – Ned
Okay. So speaking of the tools that you looked at, what were some of the evaluation criteria you used to differentiate between the different tools you were looking at?

[00:13:24.980] – Christophe
Sure. And I think something that is important when we evaluate things is to first lay down what is important, what we think is going to differ between the tools, and then do the comparison based on that. In this context, here’s what I found useful to look at. The first thing is: does the tool allow you to scan only the Terraform code, or does it also allow you to scan the Terraform plan? We have some tools that can do both, and some tools that can only scan the code or the plan.

[00:13:53.600] – Christophe
And basically this is going to define when you can run it, and how slow or fast it’s going to be. So it’s pretty important depending on the use case. Then it’s interesting to look at the maturity and the governance. Basically, who is using it, who is building it? Is it a company? Is it a single person that might just disappear tomorrow? And where does the tool want to go, what does it want to be in the future? Another thing that I think is very important is usability and developer experience.

[00:14:25.450] – Christophe
Is it going to be a single binary that you can just run? Is it going to be a Python thing that you need to install? How easy is it to use and to work with on a day-to-day basis? Something else that is interesting to look at is how the tool allows you to build custom checks. Typically, some are going to use Python to write your custom checks, some are going to use YAML, and depending on that, the checks are going to be easier or harder to write, but also easier or harder to test.

[00:14:54.180] – Christophe
Maybe you want to write some rules and have some unit tests for your rules. And finally, it’s interesting to see if the tool ships with built-in checks, and whether they support AWS, Azure, and GCP. Basically, do you have to write your own rules, or can you reuse something that is already provided with the tool?
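To illustrate why Python-based custom checks are easy to unit test, here is a toy check in the spirit of what Christophe describes. The class shape is invented for illustration; it is not the real API of Checkov or any other tool:

```python
# Hypothetical custom check: flag EBS volumes without encryption enabled.
class EncryptionCheck:
    id = "CUSTOM_001"
    resource_type = "aws_ebs_volume"

    def check(self, attributes):
        # Pass only when encryption is explicitly enabled.
        return attributes.get("encrypted") is True

def run_check(check, resource):
    """Apply a check to one parsed resource; None if not applicable."""
    if resource["type"] != check.resource_type:
        return None
    return "PASSED" if check.check(resource["attributes"]) else "FAILED"

# Because the rule is plain code, it is trivial to unit test:
check = EncryptionCheck()
assert run_check(check, {"type": "aws_ebs_volume",
                         "attributes": {"encrypted": True}}) == "PASSED"
assert run_check(check, {"type": "aws_ebs_volume",
                         "attributes": {}}) == "FAILED"
```

YAML-based rules trade this testability for being easier to write; that is the trade-off Christophe is pointing at.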

[00:15:15.130] – Ned
Okay, that’s a lot of criteria that you threw out there, so I just want to back up to a few of them. The first one was the type of scan, because you mentioned scanning the Terraform code directly or also scanning the Terraform plan. Why would I want to scan the Terraform plan? What would be the benefit to me if a tool can do that?

[00:15:34.820] – Christophe
Sure. So basically, when you scan only the Terraform code, you might lack some context. Typically, if you use Terraform input variables or data sources, if you are reading from a file, doing some concatenation, using maps or for_each, et cetera, depending on the case you’re going to miss things. Maybe a value is passed through a variable, and so if you only look at the code, you’re missing some context. So it makes sense also to look at the Terraform plan. It’s basically a trade-off, because when you look at only the code, it’s going to be very fast, but you’re going to have some false negatives and maybe some false positives.
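A small Terraform sketch of the context problem Christophe mentions: when the value arrives through a variable, a code-only scanner cannot tell what the final CIDR will be (the security group ID here is a placeholder):

```hcl
variable "allowed_cidr" {
  description = "CIDR range allowed to reach the service"
  type        = string
  # No default: a code-only scanner cannot tell whether this ends up
  # being an internal range or 0.0.0.0/0. Only the plan knows.
}

resource "aws_security_group_rule" "ingress" {
  type              = "ingress"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  cidr_blocks       = [var.allowed_cidr]
  security_group_id = "sg-123456" # placeholder
}
```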

[00:16:13.810] – Christophe
Whereas looking at the Terraform plan is going to be slower, because you need to generate the plan. You need to be authenticated against a real AWS account, which is going to take a bit of time. You need to generate the Terraform plan, and then you need to run the scan against it. Typically, you’re going to be more precise looking at the Terraform plan, but it’s going to be a bit slower, while looking at the Terraform code is going to be very fast, but you are going to miss a bit of context.

[00:16:40.960] – Ethan
So that’s the difference between looking at a plan that hasn’t been formed yet, where there are variables and placeholders, versus actually generating the plan, where now you’ve got specifics. When you analyze that specific plan, you see everything that’s going to be done, and you’re more likely to catch something.

[00:17:00.340] – Christophe
Right. And there are some tools that try to do something in the middle. They basically only look at the code, but they try to look at what the default value of a variable is and emulate that. They even sometimes support some basic Terraform features like functions, trying to get a bit of intelligence while just looking at the code.

[00:17:21.250] – Ned
Okay. Right. Yeah. One of the biggest challenges with infrastructure is that it’s hard to build a mock-up. You’re either applying it against a real cloud environment or you’re not. That can make the planning process a little more difficult. Another thing you mentioned was the maturity of the solution and the governance of the solution. One concern was, if there’s one person supporting this thing and they disappear off the face of the Earth, that’s kind of the end of the project. How mature can these platforms be, considering Terraform itself has only been around for about five years in a serious capacity?

[00:17:58.220] – Christophe
Yes. I think we are seeing a lot of interest in these kinds of technologies. If you had asked me the question two years ago, it would probably be a very different answer. But now what we see is that there are a lot of companies that have a commercial offering and also develop these tools as open source contributions. If we look at the five tools that I talk about in my blog post, four of them are developed and maintained by a company. So I would say that the space has grown pretty fast, going from a space where you have almost no tooling, or you have to do things very manually, to one where you have a good ecosystem of tools that are complementary and getting better.

[00:18:42.190] – Ethan
A bunch of tools, actually. Why don’t you give us an overview, Christophe, of the tools that you evaluated? Because there were several, and I think there were maybe some that you decided not to evaluate. Walk us through what you tested.

[00:18:54.310] – Ned
Yeah, sure.

[00:18:54.890] – Christophe
So first, maybe two small disclaimers. I don’t pretend to have a comprehensive list, and it was mostly a first evaluation for myself. So I didn’t include commercial tools that don’t have an open source or free version. And I also didn’t include tools that are more generic and could support Terraform, but don’t really support it as a first-class citizen. Typically, I’m talking about InSpec, Open Policy Agent, and Semgrep, although Semgrep, I think, has better support for Terraform.

[00:19:31.220] – Christophe
I didn’t include any of those tools in the comparison. So the tools that I included are tfsec, Terrascan, Checkov, Regula, and terraform-compliance. Tfsec has recently been acquired by Aqua Security. Terrascan is developed by Accurics. Checkov is developed by Bridgecrew. Regula is developed by Fugue, and terraform-compliance has an individual maintainer.

[00:19:57.130] – Ethan
Are all these tools more or less equivalent, and they just approach it in a different way? Or do I use different tools for different things, maybe at a different place in my workflow?

[00:20:09.190] – Christophe
Yeah. So some of these tools are going to be much easier to get started with. For instance, if I take tfsec, it’s a single binary that you can just download and run on any Terraform code without any kind of configuration. It’s very fast, and it’s very easy to look at the output. But it’s going to be maybe harder to customize, and when you do have custom rules, it’s going to be harder to write unit tests for them, as opposed to, for instance, Regula, which is going to be more extensible.

[00:20:40.870] – Christophe
With Regula, you can write rules using the OPA language, which means the rules are harder to write, but they’re also easier to unit test, and it’s easier to be extensible, to write your own checks, even things that are more complex. So I would say it boils down to the effort that you can put into it, and also which stage you are at. If you are starting from zero, a new company or a new infrastructure repository, it makes sense to use something that has a lot of built-in rules, because you can afford very quick feedback loops and can follow a more opinionated set of rules.

[00:21:17.300] – Christophe
But if you already have a big infrastructure code base, maybe you want something that is less opinionated and for which you can write very granular checks to start with.

[00:21:28.810] – Ethan
Now, when do I actually run these tests? Is this part of a pipeline? Or, if I’m writing something on my workstation, before I even pop it into the pipeline or add it to the repo, do I want to scan it there?

[00:21:43.390] – Christophe
It’s a great question, and I think the answer is there is no one good answer. But generally my approach would be to run it in different locations. The first thing that I really advocate for is that it should be very easy and doable for any developer, anyone writing infrastructure, to run it locally on their machine. And that’s because, if you’ve read the book The Phoenix Project, the second way of DevOps is shortening feedback loops. So I think it’s essential that as you are writing code, you are able to see if you are introducing any vulnerability or any flaw in your code.

[00:22:23.320] – Ethan
Well, maybe it’s not an either/or. Maybe I do what you just said, shorten that feedback loop by running it on my local workstation, making sure what I’m committing looks like good stuff. But then I could also run an additional check when that code is checked into the repo.

[00:22:39.570] – Christophe
Right. So maybe we can start from the very right and see how we can shift it left. And by shifting, I don’t mean it has to go only to the left; these are additional opportunities that we have to make the feedback loop shorter. Typically, it could run only on your main branch every night, right? That means your feedback loop has a lag of one day. Or maybe you’re going to be able to run it on every pull request. That still means that as a developer, I need to write my code, commit it, push it, and make a pull request.

[00:23:11.590] – Christophe
The next step is to run it on every commit, so I can have the checks before it’s even a pull request. The next step is to not run it on a commit in my version control system, but to run it on my laptop; maybe I have a command-line interface to run it and see the results. The next step is to run it automatically before I even make a commit, with a pre-commit hook, which is basically going to run when you have staged your code and you do the commit, and is going to cancel your commit if something is wrong. And if you want to push it even further, you could say that it should be a plugin inside Visual Studio Code that flags issues as you type code.

[00:23:55.400] – Christophe
Right? It’s the same as when you do linting or things like that. It’s something that has not really been developed yet. I think Checkov has a pretty basic extension for Visual Studio Code, but I think it’s a good opportunity and something that we’ll see more and more.

[00:24:08.440] – Ethan
That last one you mentioned sort of appealed to me and also scared me, because I was like, well, I don’t know what the final form is going to be yet, so then it’ll be flagging me because of something I did temporarily. But then we also know that temporary things tend to become permanent things, whether accidentally or on purpose. So maybe it’s not a bad idea anyway. But as you said, it’s early days before we even have a plugin like that for VS Code.
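The pre-commit-hook stage Christophe walks through is typically wired up with the pre-commit framework. As a hedged sketch, Checkov publishes a hook that can be pinned in `.pre-commit-config.yaml` (verify the current release tag against the project before relying on this):

```yaml
# Sketch of a pre-commit configuration that scans staged Terraform code
# with Checkov before every commit. The rev below is illustrative;
# pin it to a real release tag from the Checkov repository.
repos:
  - repo: https://github.com/bridgecrewio/checkov
    rev: 2.0.1
    hooks:
      - id: checkov
```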

[00:24:35.230] – Ned
I do like the idea of having it lint to a certain degree, but maybe not too aggressively, because I do just want to be able to write the code initially and then, like you said, before I commit, when I’ve staged it, do a fuller scan and pull out some stuff, because that should be close to the final version of my code. At what point would you try to run a scan against the Terraform plan? Is that something you would do at the PR stage, or when it’s being merged and pushed out to a deployment environment?

[00:25:06.930] – Christophe
I would say it should be aligned with what your continuous integration workflow looks like. If you already have something that is going to take your pull request and deploy it in a test environment, then I think it makes a lot of sense to include the Terraform plan scanning there, because you already have a Terraform plan and you can just scan it. If you don’t have any CI in place and you have to put in place a whole CI pipeline just for that, it’s going to be more challenging, of course.
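As a sketch of plan scanning in CI: `terraform show -json` renders a saved plan as JSON with a `resource_changes` list that a check can walk. The unencrypted-volume rule below is purely illustrative, and the inline plan stands in for a real `plan.json` file:

```python
# Generate the input with, roughly:
#   terraform plan -out=tf.plan
#   terraform show -json tf.plan > plan.json
# then load it with json.load() and pass it to scan_plan().

def scan_plan(plan):
    """Flag planned resources whose final ('after') attributes look risky."""
    findings = []
    for rc in plan.get("resource_changes", []):
        after = (rc.get("change") or {}).get("after") or {}
        # Illustrative rule: EBS volumes must be encrypted.
        if rc.get("type") == "aws_ebs_volume" and not after.get("encrypted"):
            findings.append(rc["address"])
    return findings

# Stand-in for a real plan.json, trimmed to the fields we read.
plan = {
    "resource_changes": [
        {"address": "aws_ebs_volume.data", "type": "aws_ebs_volume",
         "change": {"after": {"size": 100, "encrypted": False}}},
    ]
}
print(scan_plan(plan))
```

Because the plan contains final resolved values, this catches what variables and modules would hide from a code-only scan, which is exactly the trade-off discussed earlier.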

[00:25:34.990] – Ned
Okay. So again, it depends on the workflow, but definitely if you’re running Terraform plan, that’s when you should run the scan. And that rhymes, so I can remember it. That’s good for me. Now, one thing I tend to use in my Terraform code is a decent amount of modules, and that’s not necessarily code that I’m responsible for, and it can change over time as new versions of the module come out. How are these tools handling the code that’s stored in the modules?

[00:26:02.320] – Christophe
That’s a great question. If you just look at the Terraform code, most of the tools are not going to look at the code of the Terraform modules, because it would be slow, right? They would need to pull each module and look at its code, and maybe the module is using other modules. So when you are looking just at the Terraform code, you are not going to scan what the modules look like. But when you scan your Terraform plan, the plan is going to include any resource that is going to be created by the modules.

[00:26:31.690] – Christophe
Basically, if you scan your Terraform plan and you are using a module that uses an insecure resource, maybe a module that you have just taken from the Terraform Registry that is exposing a database to the Internet, then it’s going to flag that. And there is an interesting report from Bridgecrew, the makers of Checkov, I think it was last year, in 2020, where they basically looked at the Terraform modules pushed to the Terraform Registry, scanned them, and showed that 45% of them had some misconfigurations.

[00:27:06.250] – Christophe
It was more or less bad, but typically it was missing some logging, some backups, some encryption, or it was too exposed. And this was the case for Azure, for GCP, and for AWS. So I think the point here is that when we are using modules that come from a public registry, it makes it even more critical to scan your Terraform plan. Otherwise, you really are introducing resources that are misconfigured, or not configured as you would think they are.

[00:27:36.330] – Ned
Okay. Have you seen anyone scanning modules and then adding them to an internal registry of approved modules, so they don’t have to worry about a change being introduced by someone who’s maintaining that module?

[00:27:48.890] – Christophe
I’m not convinced it’s a good approach, because then you have to ask, who would approve that? And you start to have a bottleneck in the middle. So I would say that scanning the plan instead makes more sense.

[00:28:02.700] – Ned
Yeah, that makes sense, scanning the plan. It doesn’t matter what’s in the modules, because now you’re getting the actual resulting resources that are going to be created, and you can scan those for all the vulnerabilities. And if you see something that changes because of a module, you can always use an earlier version, or fork that module and start maintaining it yourself if you really wanted to. Once my infrastructure has been provisioned out in the world, ideally I’m going to follow the best practice of making all my changes through infrastructure as code, but we know sometimes that doesn’t happen.

[00:28:37.130] – Ned
People go in and make changes because it’s an emergency or something like that. Can any of these tools do a comparison between what exists in the target environment and what is in my Terraform code?

[00:28:49.500] – Christophe
And that's a great question, because obviously, if we are just looking at the Terraform code or Terraform plan, and we are deploying that in an account where everyone has administrator access, at some point one of two things is going to happen. People are going to modify things that were deployed a year ago via infrastructure as code, so now they have drifted, or they are just going to deploy additional resources that have not been scanned and that haven't been through the infrastructure as code process.

[00:29:18.940] – Christophe
Generally, I think it's a different scope, but there are a few tools that aim at solving that, and one of them is called driftctl. Basically, what it does is look both at what's in your AWS account and what is in your Terraform state. And it's going to say, hey, you have this resource deployed in your AWS account, but I don't see it in your Terraform state, which means that someone has created it manually, and maybe that's not what you want. In some cases, there are some changes that you can make manually that you would expect Terraform to see and to override.

[00:29:53.290] – Christophe
But actually, Terraform is not going to handle them well. Typically, if you create a new security group rule, Terraform is not necessarily going to consider that it needs to overwrite it on the next apply. It might consider that it's a new resource, a new security group rule resource, and it's just not going to handle it. Which means that even if you are deploying something via infrastructure as code, and you expect that any change to it will be overwritten, it might not be the case. And there is a very nice blog post from the makers of driftctl on that.
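At its core, the drift detection being described is a set difference between what the state file knows about and what the cloud API reports. This toy sketch is not driftctl's actual implementation, just the shape of the comparison, with resource identifiers invented for the example:

```python
# Toy sketch of what a drift detector does: diff the resource IDs
# Terraform's state knows about against what the cloud API reports.
# A real tool reads the state file and calls the provider APIs; here
# both sides are simplified to sets of made-up IDs.

def detect_drift(state_ids: set, live_ids: set) -> dict:
    return {
        # In the account but not in state: created by hand in the console.
        "unmanaged": sorted(live_ids - state_ids),
        # In state but gone from the account: deleted outside Terraform.
        "missing": sorted(state_ids - live_ids),
    }

state = {"sg-managed", "sgrule-allow-https"}
live = {"sg-managed", "sgrule-allow-https", "sgrule-allow-ssh-anywhere"}
print(detect_drift(state, live))
# → {'unmanaged': ['sgrule-allow-ssh-anywhere'], 'missing': []}
```

The `sgrule-allow-ssh-anywhere` entry models exactly the case discussed next: a hand-added security group rule that Terraform never sees, so `terraform plan` stays clean while the account has drifted.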

[00:30:28.670] – Christophe
We will put it in the show notes.

[00:30:30.830] – Ned
So yeah, to boil that down in my head a little bit, I'm imagining I have a security group that I create with Terraform, and then I create ten rules for it or something like that. And then someone goes into the console and adds a rule that allows SSH access from anywhere. It's not part of my Terraform configuration, and it's not something that Terraform is going to overwrite. So now I'm potentially vulnerable, and I have no idea, because my code looks great, right? Okay. So that makes sense in my head.

[00:31:03.370] – Ned
Do any of those tools also just scan the general cloud environment for potential security violations? Or is that a whole other group of products out there?

[00:31:12.490] – Christophe
Generally it's the same. I mean, drift detection is very much a security thing. I would say that drift detection helps find things that are vulnerable at runtime, but you have other ways to find them as well, because maybe sometimes you are not deploying things via Terraform, and that makes sense. Maybe it's a vendor thing that you have to deploy with CloudFormation. Maybe you have a Lambda function that is dynamically creating EC2 instances. Maybe you have an auto scaling group that is going to do that for you.

[00:31:41.950] – Christophe
Maybe you have a Bash script that your operations team is using, and that makes sense. In some cases you cannot use infrastructure as code, and those are things that you won't see. The set of products for this is generally called CSPM, Cloud Security Posture Management tools. And generally it's a separate set of tools and technologies. But there are a few offerings, especially from the backing company of Regula, which is called Fugue, and the backing company of Checkov, which is called Bridgecrew, where they try to use a single set of rules and a single tool to be able to scan the Terraform code and the Terraform plan and what is running in the AWS accounts at runtime.

[00:32:27.310] – Christophe
Now, obviously, this makes it a bit challenging, because you have to think about how you can reuse the same rules to scan the Terraform plan and what is running in your account. So I would say that generally people do it using different tooling, different sets of products. It's mostly commercial products, but there are also some open source tools like Prowler or Cloud Custodian, and I think the maturity on that will grow in the next few years, trying to get something that you can use at every phase and at runtime, and especially something that reuses the same set of rules. Because if you are not scanning for the same thing at every stage, then maybe you are going to end up with some vulnerabilities in production that you wish you had scanned for before.

[00:33:16.510] – Christophe
So it's challenging to maintain different sets of rules, to test them, and to make sure they avoid false positives. So I think we'll see some big enhancements on that in the next few years.

[00:33:26.660] – Ned
Yeah. I certainly dealt with the issues of having different scanning tools that all have their own rules, and trying to get them all the same across the different tool sets is maddening. So the idea of having, here's our dedicated rule set for whatever certification and security standard we want, and each tool just uses that rule set to scan, that would be ideal. I would really like that. So hopefully, hey, companies out there, get on that and let me know about it. In terms of writing the rules, or at least using existing rule sets:

[00:33:57.440] – Ned
Do all of these different tools come with a baked-in set of rules for best practices, or maybe for specific compliance needs, like PCI DSS or something along those lines?

[00:34:10.150] – Christophe
Yes. So Checkov, Regula, Terrascan and tfsec all ship with a set of rules that work for AWS, Azure and GCP. What they detect, or what they try to find, is going to be different for each tool, and sometimes it's going to be low severity, or maybe more informational, maybe more about logging or best practices than an actual security risk. So most of these tools provide something by default. I think it makes a lot of sense to review what they try to detect and to make sure to only include the rules that you care about, especially if you are already using infrastructure as code and you have a big set of repositories.

[00:34:52.290] – Christophe
Otherwise, if you run tfsec or any of these tools on a big infrastructure repository, you're going to have a lot of findings. And generally, when trying to integrate infrastructure scanning into something that already exists, I think it makes sense to do it the other way around. So instead of excluding some checks, start by including only a few that are high value and that don't generate a lot of false positives, to make sure that you're not detecting everything, but that what you do detect, you are pretty confident is an actual risk.

[00:35:21.980] – Ethan
So you're saying, if I've got a big repo already of Terraform stuff and I just tell the tools go for it with all their default settings, I'm going to be really scared by the results. Is that what you're saying?

[00:35:35.340] – Christophe
Yeah, right.

[00:35:37.240] – Ned
Just overloaded with information. It reminds me of, I don't know if any of you ever worked with System Center Operations Manager back in the day. It's a Microsoft product. When you turned this thing on, it had like 1,000 rules, and it would collect logs and just overwhelm you with information. And it was mostly false positives. So the idea of, let's start with everything turned off and then just slowly turn rules on, makes a lot more sense to me.

[00:36:05.730] – Ethan
Well, the high value stuff. Pick the big winners.

[00:36:09.330] – Ned
The low hanging fruit.

[00:36:10.250] – Ethan
The things, the S3 buckets that are unsecured, go for that stuff. And by the way, it's the same thing with intrusion detection systems. If you just let it go with defaults, you will be buried in alerts, millions of them, on even a small infrastructure. The important things get lost in the noise, you know what I'm saying? And so you really do want to ratchet back those alerts to get just the things that are going to be the big wins. And then, right, turn up the volume a little more.

[00:36:37.100] – Ethan
Detect a few more things and go from there.
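The "start small and ratchet up" approach the hosts describe amounts to an explicit allowlist of check IDs applied to the scanner's findings. Here is a minimal sketch; the check IDs are in Checkov's `CKV_AWS_*` style but are used purely illustratively, and the findings list is hand-made:

```python
# Sketch of the "start with a few high-value rules" approach: instead of
# acting on every built-in check, keep an explicit allowlist and filter
# the findings down to it. Grow the allowlist over time.

HIGH_VALUE_CHECKS = {"CKV_AWS_20", "CKV_AWS_24"}  # e.g. public S3 ACL, SSH open to world

def filter_findings(findings: list) -> list:
    """Keep only findings whose check ID is on the allowlist."""
    return [f for f in findings if f["check_id"] in HIGH_VALUE_CHECKS]

raw = [
    {"check_id": "CKV_AWS_20", "resource": "aws_s3_bucket.assets"},
    {"check_id": "CKV_AWS_99", "resource": "aws_s3_bucket.assets"},  # noisy/informational
]
print(filter_findings(raw))
# → [{'check_id': 'CKV_AWS_20', 'resource': 'aws_s3_bucket.assets'}]
```

Some scanners support this directly on the command line (Checkov, for instance, has `--check` and `--skip-check` flags), so you rarely need to post-filter yourself; the sketch just shows the logic.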

[00:36:39.370] – Ned
Is there a way to exclude certain resources from a rule if you want to set something up in a way that would technically violate the rule, but in this particular case it needs to be set up that way? Is that something you can exclude so you don't get that false positive all the time?

[00:36:54.300] – Christophe
Yeah, that's a good question, because you might want to write a rule that flags public buckets. But when you are deploying static assets for your website in a bucket, it makes sense, and you don't want to flag that. So generally, some of these tools allow you to suppress the findings directly in your Terraform code. Basically, if you are writing a resource "aws_s3_bucket" "something", you can add a comment just above that says, for this specific rule, I don't want to flag this resource, because it makes sense in this context.

[00:37:25.790] – Christophe
And I think Checkov, Regula, Terrascan and tfsec allow you to do that. And some tools also allow you to set this behavior at runtime.
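As an example of the inline-comment style of suppression: Checkov honors comments of the form `#checkov:skip=CHECK_ID:reason` inside a resource block. The sketch below just extracts such comments from raw Terraform source with a regular expression, so you can see the mechanism; the embedded HCL and the reason text are invented for the example, and a real scanner additionally ties each skip to the enclosing resource.

```python
import re

# Checkov-style inline suppression: `#checkov:skip=CHECK_ID:reason`.
SKIP_RE = re.compile(r"#checkov:skip=([\w.]+)(?::(.*))?")

def find_skips(terraform_source: str) -> list:
    """Return (check_id, reason) pairs for every skip comment found."""
    skips = []
    for line in terraform_source.splitlines():
        m = SKIP_RE.search(line)
        if m:
            skips.append((m.group(1), (m.group(2) or "").strip()))
    return skips

src = '''
resource "aws_s3_bucket" "website" {
  #checkov:skip=CKV_AWS_20:Public static assets, public-read is intended
  acl = "public-read"
}
'''
print(find_skips(src))
# → [('CKV_AWS_20', 'Public static assets, public-read is intended')]
```

Requiring a reason string after the check ID is worth enforcing in review: the suppression then documents *why* the exception is safe, which is exactly the static-website case Christophe describes.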

[00:37:36.830] – Ned
Okay, so if it's one of those other tools and I want to just suppress the rule for this specific configuration, I can do that by specifying it as a flag on the command line, as opposed to a comment. That's interesting. Okay, that's a good way to get around it. What about writing rules, writing your own rules? I think you mentioned a bunch of different ones. YAML is one, Regula has its own language. Which one did you find was the easiest to use, especially for someone who's not a deep programmer, and which one had the most robust language, if you will?

[00:38:09.980] – Christophe
So I think Checkov and tfsec are the easiest ones to get started with. Checkov allows you to write custom rules in Python, which means that it's pretty easy to write them and to write unit tests for them as well, and tfsec allows you to write very simple YAML rules. Basically, you can say, for this type of resource, if this field has this value, I want to flag it. Then it gets a bit harder when you want to write more complex rules, or when you want to write unit tests for your rule.

[00:38:38.720] – Christophe
So typically here, I see that Regula has a very good way to write the rules using the Rego language, which is the language used by Open Policy Agent, and which allows you to do some multi-resource correlation. So basically, to write things that are more complex. Like if you want to see when a security group is open to the world, and it's attached to an instance, and this instance is in a public subnet that has a route to the Internet. That allows you to write rules that are much more complex, much harder to write, but that also yield fewer false positives, flagging only actual issues given the context that you are in.

[00:39:11.500] – Christophe
It also makes it easier to unit test. Typically, Regula has a very good framework for how to unit test your rules. And basically, you go from infrastructure to infrastructure as code, and then you go from writing rules to writing rules as code to unit testing your rules. Right. So I think Regula has great documentation and a great workflow to do it. So it boils down to how complex and how advanced you want your checks to be.

[00:39:42.690] – Christophe
And also if you plan on writing a lot of custom checks, or if you plan to rely on what is already provided. Okay.
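The "rules as code that you can unit test" idea can be shown without any particular framework: a rule is just a function over a resource, and the tests pin its behavior, including regression tests for reported false positives. Everything below, the resource shape, field names, and CIDR values, is a simplified invention for illustration, not Checkov's or Regula's actual rule API:

```python
# Sketch of a rule-as-code with unit tests. The rule is a plain function
# over a simplified resource dict representing a security group rule.

def rule_ssh_open_to_world(resource: dict) -> bool:
    """Flag security group rules that allow port 22 from 0.0.0.0/0."""
    return (
        resource.get("type") == "aws_security_group_rule"
        and resource.get("from_port", 0) <= 22 <= resource.get("to_port", 0)
        and "0.0.0.0/0" in resource.get("cidr_blocks", [])
    )

# Unit tests double as regression tests for the rule's definition:
assert rule_ssh_open_to_world({
    "type": "aws_security_group_rule",
    "from_port": 22, "to_port": 22, "cidr_blocks": ["0.0.0.0/0"],
})
# Regression: SSH restricted to an internal network must not flag.
assert not rule_ssh_open_to_world({
    "type": "aws_security_group_rule",
    "from_port": 22, "to_port": 22, "cidr_blocks": ["10.0.0.0/8"],
})
print("rule tests passed")
```

When someone reports a false positive, you add the offending resource as a failing test case first, then fix the rule, which is exactly the regression-test workflow Christophe describes next.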

[00:39:50.360] – Ned
And you keep mentioning unit tests for your checks. Why is it so important to have unit tests for the custom checks that you’re writing?

[00:39:58.430] – Christophe
Yeah. And I think it's a kind of responsibility that security people have towards developers and engineers and people who actually work with the code. If I'm a security guy writing rules, I want to make sure that my rules are working as I intend them to, and for this purpose, I think it's a great idea to be able to write unit tests for them. Which means that you can also write regression tests: if someone is telling you, hey, you have a false positive because your rule is not working,

[00:40:25.040] – Christophe
well, you want to make sure that you are not going to reintroduce that again in the future in the definition of your rule, and you can write a test for it. So that's why I think it's important. And I wanted to also show in this comparison which tools make it easy to test your custom checks.

[00:40:44.490] – Ned
Okay. So test your checks that check your security for your cloud infrastructure, right? It sounds like turtles all the way down, but it does make sense. You want to unit test your tests to make sure they're doing what you expect them to, so you don't have those false positives. I'm curious, looking at all these different tools, I'm sure you thought of things that the tools don't do yet, or that you wish they would do. So if you could speak to the different vendors and tool makers out there, what are some things you wish that the tools did, or did better?

[00:41:14.340] – Christophe
I think as a first step, it's a good idea to be able to put these tools together and basically compare them, because competition is always good for the end consumer, right? And I like to think that even being able to show these tools side by side has already driven some improvements. So typically, if you look at the changelog of this blog post, you will see that over time, some of these tools have implemented new features to make sure that they weren't left behind the others. And I think that's great. Generally, in terms of what I personally think is missing, and what I would love to see in the future, it's more integration with IDEs.

[00:41:49.140] – Christophe
Checkov already has something on that, but it's pretty basic, and I don't think any other tool has something like that. So you don't have any built-in IDE integration that works really well, et cetera. It's also still hard to manage the rules centrally. Typically, to have some central visibility when you have 20 infrastructure repositories and you are running scans on them: how do you see your findings from a centralized place? How do you manage your rules and your exclusions centrally? And for now, that's only in commercial products.

[00:42:21.120] – Christophe
So I would love to see some work on the open source side as well. And, a bit similar to that, I would love to see more tools being able to support runtime scanning, which I think only Checkov and its commercial offering, and Regula and its commercial offering, are starting to do. The idea being really around having the same set of rules for the whole development lifecycle, and not having to maintain different sets of rules for that. Those are the three things that I think will be interesting to see in the future.

[00:42:59.190] – Ned
Yeah, I think the second one you called out there is probably something that's going to be mostly in the paid version of the products, because that's their upsell, right? How do we get you into the paid version? Well, you want this special feature, you've got to pay for that. And hopefully enterprises will, if they find value in it. But I understand the desire to have it available in the open source tool, especially for those of us who don't have a big enterprise behind us that's willing to throw lots of money at a project.

[00:43:25.670] – Ned
Well, this has been a fascinating conversation about security tooling, especially around Terraform, but I think these ideas could be broadly applied to any infrastructure as code. Christophe, could you please summarize things in maybe three key takeaways for the audience today?

[00:43:43.060] – Christophe
Yeah, sure. So I would say that the first takeaway is really, try to minimize the noise. We shouldn't try to detect all the security misconfigurations or all the security mistakes. We should try to focus on the things that matter for us and try to minimize the number of false positives. Otherwise, it's just going to end up being another tool that people run but don't actually take into account. The second one is that by using infrastructure as code, we have a very good opportunity to find these misconfigurations before they go to production.

[00:44:15.310] – Christophe
So using something to scan your infrastructure code and plan is a good way to make sure that things get detected before they go to production. And for the third one, I would say that it's great to shift left, but, and there was a great talk about that from Jeremy Mates at OWASP in March in Geneva, it's also important to start left. So basically, if you are scanning Terraform code, you need to make the people who are writing this code aware of what you are expecting from a security standpoint, and they shouldn't discover it only when their code gets scanned.

[00:44:50.320] – Ned
All right, so set the standard first and start already secure, and then you don't have to slap that security on later. That makes a lot of sense to me. If folks are looking to get more from you, they want to read some blogs or follow you on Twitter, where are some good places to find you on the internet?

[00:45:05.900] – Christophe
You can find me on Twitter and at my blog. I tend to post things maybe three or four times a year, so it's not much, but I'm trying to get it going, mostly related to AWS security. And yeah, that would be it.

[00:45:22.220] – Ned
We will include links to all that stuff in the show notes. Christophe, thank you so much for being a guest today on Day Two Cloud. And hey, virtual high five to you, listener out there, for tuning in. You are an awesome human being. If you have suggestions for future shows, or you'd like to be a guest on a future show, let us know, we want to hear from you. You can hit either of us up on Twitter, the handle is at Day Two Cloud Show, or you can fill out the form on my fancy website, nedinthecloud.com.

[00:45:47.030] – Ned
This is for all the vendors out there. If you've got a way cool cloud product you want to share with our audience of IT professionals, why not become a Day Two Cloud sponsor? You'll reach several thousand listeners, all of whom have problems to solve. Maybe your product fixes their problem, but we'll never know unless you tell them about your amazing solution. You can find out more at packetpushers.net/sponsorship. Until next time, just remember, cloud is what happens while IT is making other plans.
