
Day Two Cloud 131: Monitoring The Cloud From The Cloud

Episode 131


Today’s Day Two Cloud podcast delves into issues about monitoring all the things, including the notion of monitoring the cloud…from the cloud. Ned Bellavance and Ethan Banks discuss the pros and cons of DIY vs. using a service, differences between monitoring infrastructure stacks and applications, what to monitor and why, how to deal with all that data, the necessity of alerting, constructing meaningful dashboards, and more.

Sponsor: StrongDM

StrongDM is secure infrastructure access for the modern stack. StrongDM proxies connections between your infrastructure and Sysadmins, giving your IT team auditable, policy-driven, IaC-configurable access to whatever they need, wherever they are. Find out more at StrongDM.com/packetpushers.

Tech Bytes: Singtel

Stay tuned for a sponsored Tech Bytes conversation with Singtel. We introduce North American listeners to Singtel; get background on the global network services it provides, including Internet, MPLS, IP transit, and 4G/5G; and explain why you might want to consider Singtel for cloud connectivity. Our guest is Mark Seabrook, Global Solutions Manager at Singtel.

Show Links:

Grafana Labs

Nagios

@Ned1313 – Ned Bellavance on Twitter

@ecbanks – Ethan Banks on Twitter

Transcript:

[00:00:01.570] – Ethan
StrongDM is secure infrastructure access for the modern stack. StrongDM proxies connections between your infrastructure and sysadmins, giving your IT team auditable, policy-driven, IaC-configurable access to whatever they need, wherever they are. Find out more at strongdm.com/packetpushers.

[00:00:34.650] – Ethan
Welcome to Day Two Cloud. We've got a robust discussion for you today. Ned and I are going to get into thoughts about monitoring. It's really a free-flowing conversation: I was recently setting up some new monitoring, and Ned, of course, has got lots of background in monitoring his world. So we go back and forth on some ideas about that, and that's going to be followed up with a Tech Bytes 15-minute segment where we discuss with Singtel some aspects of cloud and cloud networking, specifically cloud networking performance from their perspective. So I hope you enjoy this conversation. We're going to just drop you in, literally. Ned and I got talking, we both just hit the record button, and you're dropping into a conversation where we're talking about, of all things, music.

[00:01:18.750] – Ned
So I went to a hardcore show. It was like SeeYouSpaceCowboy. Johnny Booth was playing just like a bunch of hardcore metal. And naturally I wore my Carly Rae Jepsen hoodie, which says, I really, really, really like you on the back, which. It’s adorable, right?

[00:01:40.650] – Ethan
If you’re listening to this, it came up because Ned is drinking out of a Carly Rae Jepsen mug.

[00:01:45.600] – Ned
Yeah.

[00:01:46.260] – Ethan
Ned, come on, man.

[00:01:47.540] – Ned
No, this is fun because I self consciously put the hoodie on just for fun. And I go to the show. It’s at a Polish American Hall, which gives you kind of an idea of the vibe, right? It’s in the basement of this hall. And I go up at one point to use the facilities. And as I’m walking out, the guy at the bar goes.

[00:02:07.200] – Ned
Oh, my God, Carly Rae Jepsen come here.

[00:02:10.230] – Ned
And he starts gushing about how he loves Carly Rae Jepsen and buys me a beer. And so the lesson I learned is that if you want to get free beer at a hardcore show, wear a Carly Rae Jepsen hoodie, because there are Carly Rae heads everywhere.

[00:02:26.790] – Ethan
I have no idea. I couldn’t even name one of her songs.

[00:02:31.830] – Ned
You could if you thought about it hard enough, I guess.

[00:02:35.880] – Ethan
So I have tickets I got for Christmas to go see Korn, Chevelle, and some other band in March. That will be fine. It'll be a bunch of old guys in the pit being fat and remembering what the 90s were like. But I'm actually there for Chevelle. I've seen Korn before and whatever. They're fine. I'm just not that depressed anymore, so it's a little hard to listen to Korn at my age.

[00:03:00.540] – Ned
I agree. It doesn’t hold up for me either. It was much more relevant when I was 20.

[00:03:05.220] – Ethan
Yeah.

[00:03:06.060] – Ned
Than when I’m 40.

[00:03:08.070] – Ethan
Chevelle for me. They’ve still been putting out new music. They’ve advanced their sound. It’s a very similar sound. They’re not trying to alienate people from album to album by going, oh, we’re going to be really experimental in this one and try keyboards. They’re not doing that at all. It’s still very much the same thing, but they’ve definitely advanced their sound where I still care about Chevelle in 2022. So I’m looking forward to that part of the show. Then there’s a third band I don’t even remember to look and see. I did not recognize them.

[00:03:37.000] – Ned
Yeah, cool, man.

[00:03:38.870] – Ethan
But that’s the only concert I have on the schedule with Covid and all that being what it’s been. It’s just been too miserable trying to get to shows. But maybe that’s behind us with Omicron, feels like.

[00:03:51.460] – Ned
It might be after January.

[00:03:53.150] – Ethan
Yeah, well, that's just it. Omicron feels like it's sweeping the world. It's not going to kill us all, but maybe it provides the immunity so that Covid becomes a thing we're going to have to deal with for the rest of our lives, but one that kind of fades into the background rather than being a decision point of whether we go to events and all that kind of thing.

[00:04:10.390] – Ned
It’s where pandemic changes to endemic.

[00:04:13.650] – Ethan
Yes.

[00:04:14.630] – Ned
It’ll just be like the flu.

[00:04:16.710] – Ethan
Yeah.

[00:04:17.690] – Ned
Yes.

[00:04:18.260] – Ethan
So Ned, yesterday I spent a goodly part of my afternoon working with Grafana Cloud. The whole point of that is I manage a bunch of IaaS-based web servers. Typically it's WordPress, so I've got NGINX as the web server running these things with certificates on them. It's Linux, probably Ubuntu, some LTS flavor of Ubuntu, and MySQL on the back end. For people that haven't run a WordPress service: you're running MySQL, or the MariaDB flavor of MySQL, whatever it is. And so I've been looking to up my game and monitor this stuff, because the web servers, depending on what organization I'm supporting... Like, if it's Packet Pushers, we keep getting more and more load every year, more RSS feeds that are getting pulled, and more people hitting sites and looking for podcasts and getting directed to us from search engines. And that's all critical for the infrastructure and for the operation of the business. So we've been getting away for a while with not having to monitor too closely. More like up/down service monitoring. Oh, your site's down. Oh crap. What's going on? Oh, the database crashed, reboot the server.

[00:05:28.930] – Ethan
Everything is fine, but I've been wanting to get more proactive with a bunch of stuff. And so enter Grafana Cloud. Grafana Cloud is the Grafana service, but all the monitoring is done in their cloud. It's SaaS. It's monitoring SaaS. You sign up, there's a free tier, which is what I'm playing with right now, where you can sign up and then say what integrations you want. You fire up the integration and it dumps a script for you that is already canned. They've got it automated with all your API keys and everything already just done for you in a single line of bash. Dump it in on the CLI and you're polling. It's that quick. I had to punch a couple of holes in the firewall for TCP 9095 and 12345 for the gRPC and HTTP listeners, and then that was mostly it, at least to monitor the Linux stuff. There's more to it to monitor NGINX: you've got to build out a JSON access log, but again, they give it all to you. It's literally copy-paste, it's a text file. And then you set your different server instances to log to that JSON-formatted log, which Grafana will then begin parsing and sending up to Grafana Cloud.
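
For reference, the setup Ethan describes looks roughly like the sketch below. This is not the exact script Grafana Cloud generates for you: it assumes ufw as the firewall, and the JSON field names are illustrative, since the integration's docs supply the precise log_format to copy.

```bash
# Open the agent's listener ports mentioned above (assumes ufw is the firewall)
sudo ufw allow 9095/tcp     # gRPC listener
sudo ufw allow 12345/tcp    # HTTP listener

# Hypothetical JSON access-log format for NGINX; treat the field names as
# illustrative, the Grafana NGINX integration provides the exact format.
sudo tee /etc/nginx/conf.d/json_access_log.conf > /dev/null <<'EOF'
log_format json_analytics escape=json '{'
    '"msec": "$msec", '
    '"remote_addr": "$remote_addr", '
    '"request_uri": "$request_uri", '
    '"status": "$status", '
    '"request_time": "$request_time", '
    '"http_referer": "$http_referer", '
    '"http_user_agent": "$http_user_agent"'
'}';
EOF

# Then point each server block at the new format, e.g.
#   access_log /var/log/nginx/json_access.log json_analytics;
sudo nginx -t && sudo systemctl reload nginx
```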

[00:06:45.650] – Ethan
Now, going back to my Grafana Cloud instance, I can see, with the dashboards that are again all canned for me as a starting point, that I'm beginning to get valid metrics off of these servers that are running this stuff. It felt very modern and hip, Ned.

[00:07:03.390] – Ned
Well, that’s us. We’re modern and hip. I think that’s clear.

[00:07:07.650] – Ethan
Well, back in the day, what would we run? I think you have some SCOM experience and some I don’t know what all else.

[00:07:14.750] – Ned
SCOM and Nagios I used for a while, and various other monitoring tools. The biggest challenge with most monitoring tools is not necessarily the setup, it's the tuning and making sure you're plugged into all the important things about your application. Because essentially what you really care about is, is my application up and is it functioning at an appropriate level? Can I meet whatever SLAs I've invented for myself? So that might mean the monitoring software has to keep track of what's going on with your operating system, but it also needs to keep track of what's going on with your MySQL database or MariaDB. And it might need to keep track of your web server, whether it's NGINX or Apache or something else. And it needs to understand how those applications are supposed to function and when something is wrong. So I guess from the Grafana, and especially Grafana Cloud, standpoint, were you able to tell it, this is the type of thing I'll be monitoring, so set up the appropriate collectors for it? Or does it just kind of collect everything and then you filter it once it's up in Grafana?

[00:08:19.620] – Ethan
No, it's the former. The integrations that I set up were three specific things: I set up integrations for just the Linux-based operating system, for MySQL, and for NGINX. They give you canned dashboards, as in, we think this is a good place to start for your dashboard. And so you immediately look at the dashboard, like for Linux, and you get a bunch of fairly common-sense metrics. If I can click over there, I can actually tell you some of what it is. Well, I'm on the NGINX screen right now, I just popped that up. I didn't have to configure any of this. It's built the dashboard that's telling me things like total requests, unique users right now, requests per status code, as in HTTP status code. Like, how many am I getting that are 200s? How many am I getting that are 301s or 302s? I have a bunch of redirects on this one particular site I'm testing with, so that metric actually matters to me. Like, I've got a bunch of 503s. I'm going to investigate that. Why am I getting... not a bunch of 503s, but enough that it's interesting. And they break it down in that way with some kind of common-sense graphs.

[00:09:22.730] – Ethan
So you get a good sense of the health of your web server, where things are at, and then other things that are nice, like top requested pages and user agents and referers, things you see in anything that does HTTP server log parsing, that are useful. And if you want, you can break out and dig into all the JSON for any individual request that you want. You can drill right in and it'll break it out. It doesn't show you raw JSON, it breaks it out for you in a nice way. Now if I want to build other panels, I can get as gnarly into it as I want. It lets me go into the dashboard, add a new panel, and then look at the data source and, oh boy, as you drill in, any crazy thing that you want. Which is especially overwhelming when I was looking at the Linux OS stuff, because there's so much data there, Ned, of things that you can monitor. Like if you were a network person back in the day with SNMP, well, there's a lot of SNMP MIBs you can pull. So which of them are useful? Grafana Cloud is giving me a starting point with the ability to get as gnarly and detailed as I want.

[00:10:34.200] – Ethan
Maybe I care about network queue depth, which might tell me if I'm running into network congestion and that kind of stuff. It's not showing me that by default, but I saw that it's there. So I think what this is giving me is, again, that good common-sense starting point where I can pretty immediately see good information on what's normal and probably set some baseline thresholds and alerting for when things go out of norm. Which, since I just started this, I'm not even 24 hours into monitoring this particular server, so I don't even know what's normal yet exactly.

[00:11:04.280] – Ned
Right. One of the other big challenges is actually having a solid baseline to compare performance to, and having some sort of feedback mechanism for when things are going wrong but it's not immediately obvious through the metrics that you're collecting. So users could be having a difficult time on your site because of the way it's designed or the way that some of the components are loading on their client side. That's not something you're going to get from server metrics, so that's a little more difficult to get from something like Grafana. I'd imagine you don't really have that problem as much, since I'm guessing a lot of your site is just rendering on the server side.

[00:11:43.490] – Ethan
Well, WordPress, yeah. I mean, it typically does everything rendered server-side, it's PHP, as opposed to, I don't know, a React framework or something where it would shove it to the browser and ask for the rendering to happen there. It is rendered server-side, typically. So database interaction on a WordPress site is really a key metric. Everything's getting shoved to and from that MySQL database, reads and writes in some cases, but mostly reads when you're rendering for folks. So you're really optimizing on the back end for speed. You have to have a server that's big enough to handle whatever your client load is. Caching is very important on a WordPress site as well. And so there's a lot of metrics. I'm looking at the MySQL page on my Grafana Cloud instance, the stuff that it showed me by default: what is my baseline of MySQL connections? It's telling me the client thread activity, some stuff I'm not familiar with, like MySQL temporary objects. I don't even know what that means, but it's showing me that. That's not something I'm actually familiar with.

[00:12:43.700] – Ned
You don’t know if you have to care.

[00:12:44.740] – Ethan
But MySQL sorts, MySQL slow queries. How many slow queries am I getting? Zero so far, which is good, because this is a lightly loaded server. I'd be pretty bummed if this thing was struggling. MySQL aborted connections, that would be really interesting to know. MySQL network traffic, memory use, all the kinds of stuff you would expect to see. Again, you're getting that good starting point. And again, from a monitoring standpoint, this is an IaaS install. It acts as a server. Basically, it's a VPS that lives in Vultr, Vultr being the provider of choice that I'm using here because they make a lot of things, spinning up and spinning down, easy. And if all you need is a small server, it's very cost effective. This feels like a good monitoring solution to me. It's a cloud-based monitoring tool, Grafana Cloud, for my cloud-based services that I'm using. I don't have a data center. I do all my work online, and so this feels good. Now, I guess I could spin up an EC2 instance with my own Grafana or something, but would I want to? Can you think of a reason why you'd even do that in the modern era?
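
A quick reference aside: the MySQL panels Ethan is reading off map to standard MySQL/MariaDB status counters ("temporary objects," for example, corresponds to the Created_tmp_* counters). A minimal sketch for eyeballing them directly on the server, assuming the mysql client and credentials are available:

```bash
# Standard MySQL/MariaDB status counters; the dashboard panels are built from
# the same values sampled over time. Created_tmp_* are the "temporary objects."
mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN (
  'Threads_connected',
  'Slow_queries',
  'Aborted_connects',
  'Created_tmp_tables',
  'Created_tmp_disk_tables'
);"
```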

[00:13:50.680] – Ned
I mean, it's more administrative work for you. Usually there are reasons you choose not to go with SaaS, because you and I have seen the trend over the last five years or so: try to go SaaS first, if you can. If you can't, try to pick more of a platform-as-a-service option, and then if all else fails, build it as IaaS, whether that's on-prem or in the cloud. And usually the reason you can't use SaaS is a couple of things. It could be compliance: your data can't live in somebody else's service, it has to live in yours. And that's become less and less of an issue. But the other usual reason is, I can't customize it in the way that I need to for my business, so I need to host it myself so I can add those customizations. So that would be one reason you might run into a roadblock with Grafana Cloud. I don't think you're likely to run into that, but I could definitely see folks out there that are already monitoring and are considering moving to Grafana Cloud going, oh, it's missing this feature, or I can't customize this thing because I don't have access to the guts of what's running Grafana.

[00:14:53.670] – Ned
I only have access to what they give me through the API.

[00:14:57.750] – Ethan
[AD] We pause the podcast for a couple of minutes to introduce sponsor StrongDM's secure infrastructure access platform. And if those words are meaningless, StrongDM goes like this. You know how managing servers, network gear, cloud VPCs, databases, and so on is this horrifying mix of credentials that you saved in PuTTY and in super-secure spreadsheets and SSH keys on thumb drives, and that one doc in SharePoint you can never remember where it is? It sucks, right? StrongDM makes all that nasty mess go away. Install the client on your workstation and authenticate, policy syncs, and you get a list of infrastructure that you can hit. When you fire up a session, the client tunnels to the StrongDM gateway, and the gateway is the middleman. It's a proxy architecture. So the client hits the gateway and the gateway hits the stuff you're trying to manage. But it's not just a simple proxy. It is a secure gateway. The StrongDM admin configures the gateway to control what resources users can access. The gateway also observes the connections and logs who is doing what, database queries and kubectl commands, et cetera. And that should make all the security folks happy.

[00:16:03.060] – Ethan
Life with StrongDM means you can reduce the volume of credentials you're tracking. If you're the human managing everyone's infrastructure access, you get better control over the infrastructure management plane. You can simplify firewall policy. You can centrally revoke someone's access to everything they have access to with just a click. StrongDM invites you to 100% doubt this ad and go sign up for a no-BS demo. Do that at strongdm.com/packetpushers. They suggested we say no BS, and if you review their website, that is kind of their whole attitude. They solve a problem you have and they want you to demo their solution and prove to yourself it will work. Strongdm.com/packetpushers, and join other companies like Peloton, SoFi, Jax, and Chime. Strongdm.com/packetpushers. And now back to the podcast. [/AD]

[00:16:58.050] – Ethan
Yeah, I feel the same way. Why would I host this myself? I don't need the extra headache of maintaining a Prometheus and Grafana install. Grafana Cloud is just doing all of that for me, and I will pay them. Right now I'm on the free... they're calling it a free trial. I don't think there is a permanent free tier. I think I've got to pay them a minimum of $50 a month to monitor all the things.

[00:17:20.370] – Ethan
Okay, but it feels like the right thing. I don't have data governance concerns. These are aggregated metrics. Since I'm monitoring NGINX, it is seeing every query that's coming inbound, all the IP addresses and all that stuff. So there is knowledge of potentially proprietary data coming to and from that, which I guess could be a concern depending on what sort of a company you are and what sort of data you deal with, where you'd be concerned if that data got leaked in some way or another. But it feels like I'm getting a lot of insight into what's going on in this very simple web application. And once I get into the alerting stack, the thresholding and alerting stack, which I haven't even tapped yet, I'm going to have better insight into what's really going on. There's another feature here where my Grafana Cloud account will integrate into a Slack channel. I haven't done anything with it, but I set it up; took all of five seconds because they made it really easy. So now, what alerting do I set up? What's interesting? So when things begin underperforming, when I've got my baselining data in place... I'll let it run for at least a week, and 30 days would actually be better, see what's normal.

[00:18:32.410] – Ethan
And then I'll have enough intelligence to be able to set appropriate levels of thresholding, and then alerting that could do one of two things. I could see it going: I get an alert in Slack, and I'm assuming this is how this is going to work for me, that something's out of baseline and I can be the human and investigate. Or, depending on what the nature of it is, I could actually trigger an automation action at that point. Now, historically, I've never done too much of this, because the kinds of things where a threshold is exceeded and you trigger an automation can be scary unless it's very well scoped. I know that if this happens, I'm going to take this specific action. If you know that stuff, that feels pretty safe and controlled. But if it's a more complex kind of a thing, or if there's some nuance to what being out of threshold actually means, do I actually want to automate an action to handle that situation? Most of the time I probably don't, but we have auto scaling groups for a reason, I guess, you know what I mean? So there are certain situations like that that are pretty well defined, and you're like, yeah, spin up another one.
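
For a sense of what that thresholding can look like once a baseline exists, here is a minimal sketch in Prometheus-style alerting rule syntax. Grafana Cloud manages alert rules through its own UI, so treat this purely as a mental model: the metric names come from the standard node and MySQL exporters, and the thresholds are made up.

```yaml
groups:
  - name: wordpress-hosts
    rules:
      # Disk filling up on the web server (node exporter metrics)
      - alert: HostLowDiskSpace
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% disk space left on {{ $labels.instance }}"

      # Slow queries starting to pile up (mysqld exporter metric)
      - alert: MySQLSlowQueries
        expr: rate(mysql_global_status_slow_queries[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "MySQL is logging slow queries on {{ $labels.instance }}"
```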

[00:19:40.010] – Ned
You also want to set an effective maximum and a cooldown time and a warm-up time for all that stuff, so you're not spinning up 100 instances because someone's DDoSing you. You want to limit that so you limit your spend. And then also, how quickly does it spin up, and how quickly does it spin down and drain out sessions? So there's definitely a lot to consider there if you're going to set up any kind of automation. And I'm not sure how that would work when you're doing VPS instances, as opposed to using EC2 instances behind a load balancer or something along those lines.
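
Ned's caps and cool-downs translate into a few flags on an auto scaling group if you are in the EC2-behind-a-load-balancer world he mentions. A rough sketch with the AWS CLI; the group name, launch template, subnets, and target group ARN are placeholders, and the numbers are illustrative.

```bash
# Cap the group size and give it breathing room between scaling events so a
# traffic spike (or a DDoS) can't spin up instances without limit.
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name web-asg \
  --launch-template "LaunchTemplateName=web-launch-template,Version=\$Latest" \
  --min-size 2 \
  --max-size 6 \
  --default-cooldown 300 \
  --health-check-type ELB \
  --health-check-grace-period 120 \
  --vpc-zone-identifier "subnet-aaaa1111,subnet-bbbb2222" \
  --target-group-arns "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/web/0123456789abcdef"

# Session draining on scale-in is handled separately, via the target group's
# deregistration delay on the load balancer side.
```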

[00:20:14.380] – Ethan
It caught my attention because it's theoretically possible. Now, scaling out WordPress is an art. I've read enough about it to know that it's not the simplest thing in the world, because WordPress is really designed to be run with a MySQL instance supporting a web server instance. As soon as you get into the world of, I want multiple front ends to be able to handle client requests, or go regional or whatever, now you've got a database synchronization challenge that, from most of the reading I've done, WordPress wasn't really built for. So you're a bit on your own. Again, from what I understand. If you're listening out there and you're like, ah, Banks, it's super easy to scale out WordPress, let me tell you how... great, you'll be on the show. Just send me an email and we'll make that happen. But mostly it seems like that's hard work. And you can usually... like, Vultr has an API, and I'm sure if I spent enough time to figure it out, I could scale up the instance. But when you scale up, you don't scale back down. That's a one-direction thing. They're happy to take

[00:21:19.810] – Ned
Usually.

[00:21:20.200] – Ethan
money for the scaling up. They don’t want to scale your instance back down, though.

[00:21:23.350] – Ned
No, that is a challenge. Now, does Grafana also plug into the caching layer of your sites? If you’re using something like Cloudflare, will it pull metrics from that and sort of correlate that with your web servers to see can you do caching better or is it having issues caching information?

[00:21:43.950] – Ethan
I think the answer to that is, where can you do an integration? Grafana runs on an agent, so the key way it is gathering data is there's an agent living somewhere that is gathering that data and then shoving it up, which I assume is how they're gathering the data centrally. That agent's got to be able to push that data up. So can I do Cloudflare stuff? I'm taking a look at the integrations.

[00:22:08.850] – Ned
Right. They would need to have some sort of pull model for that data through an API, because Cloudflare isn't going to let you install the Grafana agent on their servers. So I've got to imagine they have some type of integration with a lot of different things that don't allow you to install an agent.

[00:22:23.870] – Ethan
Exactly. And there is some of that. So for example, there is an integration for AWS CloudWatch metrics. I haven't set that up, I'm not running anything in AWS at the moment, but there would be a way I could at least pull from that. I don't see anything for Cloudflare specifically, but they do for other services, like GitHub. That's kind of interesting. Jenkins: monitor your Jenkins CI/CD server. What does that integration look like? I don't know. What would I be monitoring with my pipeline and graphing in Grafana? It's an interesting thought. You can do integration with Kubernetes if you're running a Kubernetes cluster. And I don't want to limit this to what these integrations are. They've probably got, looks like, about 30 of them or so listed here, of a variety of different services. Some of them are pretty common, Docker and Envoy, etcd, ones that you've heard of, and then a few that, at least in my experience, are a little bit more obscure or more dev oriented. To get back to the core of the question, you're trying to monitor the entirety of your stack as requests come through. I think that's the point you're trying to make, right?

[00:23:29.220] – Ethan
And so if Cloudflare cache, say that three times fast, is part of the equation, can you monitor that in Grafana? I'm going to assume the answer is yeah, probably, somehow. But it may not be as easy as, oh, they have an integration for me already in Grafana Cloud and I just have to click the button and paste the script and it works. It's probably going to be a little more challenging than that, again, since it all seems to revolve around that agent that's got to sit somewhere and get that data, then be able to hand that data off to the dashboards and graph it for you. Then again, going back to the question we really started this discussion with: what is it you're actually doing with the data? Because I have been guilty in the past, as a young engineer, of, I'm going to gather all the things, and then you realize you're filling up your disk space with metrics and you're not actually doing anything with it. What is the point of that? And so I would usually end up, depending on what the application was that I was tracking, and my background is networking, so I would tend to be very network-centric, building a set of monitors, some that were unique to a given environment, depending on what the application was that was being built, or being delivered I should say, that were custom for that and that could raise a red flag.

[00:24:45.750] – Ethan
I know if this particular thing happens on this piece of equipment, then I'm probably having this problem. And so I would learn over time that these were the sorts of things that I needed to monitor. It's a different world now. Those were environment-specific sorts of monitoring that I was doing, and I feel like I'm probably going to find some things like that that I care about with this new everything's-cloud-based stack that I'm building, but I don't know what those are going to be yet. You know what I'm saying, right?

[00:25:14.990] – Ned
Another big thing about metrics is figuring out what's important to you. There was a promise of machine learning and AI being able to kind of figure that all out for you. And I was looking over the Grafana Cloud site a little bit. It looks like they do have a certain element of machine learning that they've added into the platform, which might not be something you could run locally. So that's another reason to go with SaaS: they can develop and deploy these services that don't make sense for a single customer running locally, but make sense in the context of, we're going to run this for all 1,000 or 2,000 of our customers. So it makes sense to dedicate a fleet of machine learning servers that have all the special GPUs in them and whatnot to crunch the numbers and provide some insight back to the customer.

[00:26:02.380] – Ethan
So I'm sitting in my Grafana machine learning... I'm in my cloud account. This is not sponsored by Grafana, by the way. Although, hey, if anyone from Grafana is listening to this, we'd love to have a more detailed conversation if you come on board as a sponsor. But I'm in the Grafana machine learning page, and the way they're pitching it is forecasting: train ML models to forecast time series metrics into the future. So if you want to use this... of course, I'm an infrastructure guy, Ned, so I think about things in terms of capacity planning. How much of whatever the resources am I going to need going forward, which you can use for budgeting, planning, staying out in front of things. And again, they're pitching it, in my mind, for that. Metrics forecasts, anomaly detection, adaptive alerts are the other things that they're talking about here. Forecasting for capacity planning, which you just said. Anomaly detection: detect unexpected behaviors in user or system behavior. Right. So how do I know what's normal? Well, I was just talking about how I should let it gather a week of data, probably a month would be better. Maybe I should let the ML figure that out for me, Ned.

[00:27:02.410] – Ethan
It’d be really hip and cool. There’s a big button here. Initialize Grafana ML. We should do that.

[00:27:11.010] – Ned
Basically, hit it and you can hear the engines whirring up in the background.

[00:27:15.050] – Ethan
Well, there's another feature here that's actually really interesting. They're saying adaptive alerts take into account seasonality to reduce alert fatigue. Yeah, it's end of month, and so for this app it gets hot, or unusual backup cycles, or end-of-year kind of stuff that goes through. That would be useful. Things go out of band or out of threshold for known reasons. Do I want to be alerted about that? Probably not. And if ML knows that cyclically things happen and it can adjust for that, that's not a bad thing. I still don't know if this is ML or just statistical analysis, but we're going to allow it. It's ML. We're going to allow it, because that's what everyone calls it anyway.

[00:27:58.560] – Ned
As long as they’re not claiming it’s AI.

[00:28:01.530] – Ethan
I don't see AI on the page. Doesn't mean it's not buried in the docs somewhere, because there's another link, read more in the ML docs, that I'll have to tap into and see what that's all about. We'll see what artificial intelligence is happening. Turn your Grafana into Skynet. You can do it with our artificial intelligence and machine learning.

[00:28:22.170] – Ned
No, you can just hook it into GPT-3 and then it can write the incident reports for you. I think that's really the integration you're looking for.

[00:28:31.890] – Ethan
I've got another question for you now. You lived with the SCOM and the Nagios stuff in the past, and I've worked with those systems, where very specific points of information can be monitored, right? They give you all of the generic stuff that's kind of generically useful, but then if you want to be, like, in a Microsoft AD environment and target something very specific, you can do that too. Do you have use cases for those products? Are you kind of rethinking how monitoring should be done? The context is, it's hard for me as an infrastructure guy, that's been my background for years, building systems, installing software on metal, making it do the thing so that it serves up an application quickly and redundantly and stuff. I keep thinking about things that way and I'm wondering, should I not be thinking about things that way? Has your mind shifted on how you do monitoring these days?

[00:29:20.630] – Ned
I was never a huge fan of SCOM to begin with. I just felt the interface was difficult to use. It required an immense amount of tuning out of the box, it wasn't very smart about that tuning, and it made it very difficult to customize alerts and set proper thresholds and group servers together appropriately and all that. Maybe it's gotten better, but really what happened is Microsoft shifted their focus over to Azure Monitor and building out that product. So I don't want to get into a big, this is what happened to Microsoft and why their product has changed over the last ten years, because that's a whole other podcast. Have I thought about monitoring differently? I think there's two approaches you can take to monitoring. There's the very role-specific monitoring approach, where, like you said, you're a network person, so you care about the network metrics and alerts and thresholds. Or you're the application person, so what you care about is response times on your app and the client level of happiness, perhaps. Or you're a security person, so what you really care about is intrusion attacks and DDoS and that kind of thing. So I think there's role-specific monitoring.

[00:30:32.880] – Ned
And then there’s more of a holistic monitoring approach where it’s what we really care about is the application. At the end of the day, we’re trying to deliver a service to clients and we need to measure whether or not we are delivering that service effectively. And as long as we’re meeting whatever that number is, then I don’t care about the rest of the stack.

[00:30:53.910] – Ethan
I'm trying to measure if I'm delivering my application effectively for my client, which may mean you don't have just one monitoring system; you may need multiple. So, for example, Grafana is going to be very good for measuring what's happening on that server where that agent is positioned. That's its perspective. But if I need to know that my European clientele are being effectively served from my Chicago data center, I need to be monitoring with a different tool, probably entirely from Europe, to understand what that user experience is. So I wouldn't want anyone to listen to this and think they can just, soup to nuts, get it all done with Grafana Cloud. I don't think that's the case. I think it's one valuable tool that is infrastructure and app stack oriented, and that out of the box is giving me, very quickly, default dashboards that are immediately useful for me. I found a broken plugin that was redirecting clients for me, or redirecting web queries that were coming in to a remapped URL taxonomy structure that I set up. It was broken, redirecting people to nothing. I had no idea, because it was working when I set it up two years ago.

[00:32:03.050] – Ethan
And then just looking at the dashboards, it revealed that problem to me. It's not going to reveal a problem to me like, where this server is positioned is just sucking for all of your clients that are in some particular part of the world or whatever else. I might need a different tool for that.
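
One low-tech way to get that outside-in view is simply to time requests from a vantage point near the users, for example a small VM in a European region. A sketch using curl; the URL is a placeholder, and dedicated synthetic monitoring tools do this for you on a schedule from many locations.

```bash
# Run from a box in the region you care about; prints where the time goes for
# a single page fetch (DNS, TCP, TLS, first byte, total).
curl -o /dev/null -s -w \
  'dns=%{time_namelookup}s tcp=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s total=%{time_total}s code=%{http_code}\n' \
  https://example.com/
```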

[00:32:20.160] – Ned
Or you just need to know to ask the question. And I guess that’s what it really comes down to is the type of monitoring that you’re interested in is focused on detecting problems based off of metrics and inferring stuff about that based on your knowledge of how infrastructure functions. A more advanced monitoring case might involve actually talking to clients and customers and figuring out what their experience looks like and what matters to them and having that drive what you actually want to monitor.

[00:32:51.510] – Ethan
Did you say talk to clients? What does that mean?

[00:32:55.780] – Ned
Yes. Well, in the context of your world, it would be getting feedback from those who listen to the podcast and use the website and go, did you have a good experience? Are you happy? Would you like to take a survey?

[00:33:10.690] – Ethan
I'm being silly, but of course I'm just underscoring your point that you can look at metrics all day long and infer, but there's nothing like actually talking to the human consuming the service and seeing what their experience is like.

[00:33:23.300] – Ned
Right. So ultimately if you’re delivering a service to humans, it probably behooves you to talk to those humans at more than one point in the process and go, what is it you actually want me to deliver? How do you expect it to be delivered? And are you happy with the way it’s being delivered today and let that drive the rest of your monitoring strategy?

[00:33:43.630] – Ethan
I like it. I like it. Ned. Mic drop right there.

[00:33:46.660] – Ned
Boom. Done.

[00:33:48.730] – Ethan
Well, Ned, coming up next we have a Tech Byte, which is a short 15-minute conversation with a sponsor. And the sponsor today is Singtel. Singtel does a lot of stuff related to cloud and cloud networking, and we've got several conversations scheduled with them over the next several shows. Enjoy this first one in the series, which, as you and I are recording this now, we haven't even recorded with Singtel yet, so we're getting ready to prep for those shows next week. But that's coming up. Stay tuned to listen as we nerd out with the nice folks at Singtel.

[00:34:19.370] – Ned
Welcome to the Tech Bytes portion of our episode. We're in a six-part series with Singtel about cloud networking, that is, how to make your existing wide area network communicate with cloud services in an effective way that maybe your legacy WAN isn't able to do. Today is part one of six, and we're chatting with Mark Seabrook, Global Solutions Manager at Singtel, about, well, who is Singtel? Mark, give us the background.

[00:34:46.630] – Mark
Hey, guys. Yeah. So Singtel, Singapore Telecom. We’ve been around 140 years and we’re basically at the crossroads of where most of the APAC connectivity meets. So we’re in a pretty good spot to deliver our services.

[00:35:03.850] – Ned
Interesting. And what would those services be? What does Singtel have on offer?

[00:35:09.190] – Mark
So basically everything telecom. So anything layer one, layer two connectivity. We have customers that want a 100 gig wave, a 100G E-Line, from anywhere in APAC to the States, back to Europe, down to Australia. We do a lot of global Internet rollouts across the world. We have a large MPLS infrastructure, 428 PoPs across the globe. IP transit: we have a number of IP transit gateways all over the world. SD Cloud Connect. We also do 4G, 5G. We have, for example, 770 million handsets across all of our subsidiaries in the APAC region.

[00:35:56.110] – Ned
Okay, what is a cloud connect? I haven’t heard that terminology before.

[00:35:59.920] – Mark
So we've got basic cloud connects in all of the usual places, all of the Equinix centers around the world. We also have a software-defined portal called SD Connect. So basically what that is: you buy a port at any of our locations around the world, whether that be a ten gig; we give you a portal and then you set up VLANs into different cloud targets. So you could have some of it going to AWS, some of it going to Google, Azure, Alibaba, Oracle. A lot of our customers will use that product to hairpin traffic for multi-cloud topologies.

[00:36:39.770] – Ned
Okay, so this is sort of the Direct Connect, ExpressRoute, all of those services that you hear about from the common cloud providers. You're providing a direct connection from my point of presence into those public clouds.

[00:36:51.890] – Mark
Correct.

[00:36:52.400] – Mark
You can also set up some of the VLANs to go from PoP to PoP. So you could have, say, 70% of your bandwidth going into cloud targets and say 30% going to an ad hoc layer two connection to one of your other PoPs around the world.

[00:37:09.310] – Ethan
But that’s if I’m in a data center where Singtel is offering that service like Equinix you mentioned.

[00:37:15.910] – Mark
Correct. For example, with our SD Connect, we're in over 33 data centers in Singapore. We're in seven in Hong Kong, we're in Tokyo, Australia, the States, Europe, the UK. So, yeah, anywhere where we're located. However, we do have a lot of customers where we can extend that out via local layer two loops or other connectivity such as MPLS, Internet, et cetera.

[00:37:43.910] – Ethan
So you'll hand me, as Singtel, you'll hand me a circuit. You'll go all the way down to layer one, or you'll get layer one from somebody and hand it off to me. You've got a bunch of ways you'll do that. It can be Internet, MPLS, a variety of different IP transits. You've got access into a bunch of different clouds that I can, from a dashboard kind of a thing it sounds like, plumb into, including my own multi-cloud. But I'm going back to the 'Sing' part of Singtel being Singapore. Does this mean I need to be an Asia-based customer to take advantage of this?

[00:38:18.860] – Mark
No, not at all. So we have cloud connect customers all around the world. We have IP transit customers all around the world. A lot of our E-Line layer two circuits don't even touch Singapore. So, for example, we have customers that want connectivity from the US down to Australia. We'll do that. The golden triangle for us is Singapore, Hong Kong, Tokyo. But outside of that, we can do anything as far as connectivity is concerned.

[00:38:48.860] – Ned
That brings up another interesting question and something I want to touch on in terms of data governance, because if I’m working with Singtel and I’m setting up circuits across all these different international boundaries, you mentioned China, which could be a concern for some folks. And of course, the European Union. Is there anything that Singtel is doing to help me with my data governance concerns?

[00:39:11.150] – Mark
Yeah. So typically, a lot of our big global customers will set up their network in a regional hemisphere, site-to-hub and then superhub-to-superhub topology. That way we can keep data within territories, within continents, following GDPR, for example, in the European Union, and some of the other restrictions that we get as well, including mainland China.

[00:39:40.270] – Ned
Okay. So you have a model that they can follow where they have these hubs, where they could do the traffic inspection, whatever they need to do to make sure traffic is staying in its region and then link those various hubs together. Is that sort of the model that you provide to them?

[00:39:55.030] – Mark
That seems to be the model that a lot of our big Fortune 100 customers are going with, especially with SD-WAN. They're keeping it site to hub, multiple hubs within a region, so within a hemisphere. So they'll lump together the US, Europe, Middle East, Asia, and then we'll link those together at a higher level.

[00:40:17.810] – Ethan
All right, Mark, as a network nerd, I want to understand more of the detail about how Singtel connects me up in the cloud. We don't have to get into bits and bytes and packets, but I have kind of a clue of how AWS Direct Connect works. If I'm in Equinix, let's say, I'm probably plugged into one of their switches in the rack and they'll assign me a VLAN or give me a port that then plumbs me into AWS, and it's direct and magical at that point. If I'm consuming a Singtel service to connect me to the cloud, is it a similar kind of thing?

[00:40:48.290] – Mark
Yeah, sure. I mean, basically with our SD Connect, you would buy a port anywhere in the world at any of our locations, and it's purely portal based. So once you go into the portal, you set up your VLANs, and you can point them to any cloud provider that we have connectivity with. It can go to an AWS VPC, it can go into Google, it can go into Alibaba, Oracle, Azure. As I mentioned before, we have a lot of customers that use this product for hairpinning multi-cloud environments. So a lot of our customers don't just use AWS; they use AWS, they use a physical data center, and they might use Azure. So in that situation, they could set up VLANs to both the cloud targets plus their physical data center, and we can do a BGP network that will link all three together.

[00:41:54.770] – Ethan
Okay. You put some pieces together there for me. I can actually plumb together all those disparate pieces that make up multi-cloud. As you said, whatever public cloud I have presence in, plus my local data center, because I know we're all going to cloud, but we still have our physical data centers, don't we? Somehow. But I can plumb all those together in a common network and do just IP packet exchange like I would on any network. And Singtel is the transport, is the medium, that's connecting all of those for me?

[00:42:26.130] – Mark
Absolutely. Yes.

[00:42:27.490] – Ned
I’m curious, what level of assistance do I get from Singtel when it comes to rolling out or figuring out what my cloud network strategy is? Do you just sell me the ports and say, good luck, go configure things, or is it a little more hands on? Is there some more help that you get from the Singtel team?

[00:42:46.350] – Mark
Yeah. So, I mean, typically any of our customers, especially in the Fortune 100 space, that are going to buy any kind of cloud connectivity off of us, they're probably already a customer. They probably already have MPLS, global point-to-point networks with us. They're working with us day to day. We have engineers, product managers, project managers, technical specialists in every country. We will work with them to design exactly what they need as far as the cloud goes, based on what we already know about them, from what they're doing with us today or what they want to do with us in the future.

[00:43:25.550] – Ned
So, Mark, what you’re telling me is that a lot of these folks who are getting their cloud network together, they’re already customers, which means Singtel already has a pretty good idea of what their business requirements are, how they like to approach networking. And that gives you a bit of a head start in terms of helping them with rolling out their cloud network. Now, are you doing the actual design and implementation as well with your teams?

[00:43:51.290] – Mark
Yeah, absolutely. I'd say this: most of our customers, you get to know their individual DNA through dealing with their existing network. So you know that some customers are very hands on. Some customers are hands off; they want the carrier to do as much as possible. So we're very flexible. We'll work with whatever their constraints are, whatever they want or feel comfortable with. If they want a full white-glove approach to cloud connectivity, we would do that for them. If they want the keys handed over so they can take care of everything themselves, they can do that through portals.

[00:44:31.490] – Ned
Okay. So it’s really choose your own adventure when it comes to the customer. But if they want that full white glove experience, that’s something that Singtel can provide to them.

[00:44:40.280] – Mark
Sure, absolutely. And we've got people all over the world. So, like, in the US, in the UK, Germany, all over APAC, Australia, throughout the world. We own Optus, which is the second biggest carrier in Australia. We have stakes in Bharti Airtel in India, Globe in the Philippines, Telkomsel in Indonesia, and AIS in Thailand.

[00:45:06.890] – Ned
You’ll meet the customer, wherever they are, whatever language they happen to speak, you’ve probably got someone who can help them out with what they need to do.

[00:45:14.880] – Mark
Absolutely. Yeah, absolutely.

[00:45:17.810] – Ned
Well, Mark, if people want to know more about Singtel or more about you, if they want to talk to you, are you a social person? Is there somewhere they can find you?

[00:45:25.890] – Mark
Yeah, absolutely. So any customer, any prospect you can hit me up on LinkedIn under my name, probably the best place, and then we can direct you to the right team to suit your needs.

[00:45:40.040] – Ned
Excellent. And we will include a link to your LinkedIn in the show notes. Thanks for joining us, Mark. And hey, listeners, thank you for listening. This was just part one of a six-part series, so we're going to hear more on building cloud-ready networks with Singtel in upcoming episodes. Part two will be in a couple of weeks, and we're going to itemize the ways in which a cloud-ready WAN is complex.

[00:46:05.030] – Ethan
Hope you enjoyed the bit with Singtel. We did. Now, Ned, before we close the show, I know you're doing more magic Pluralsight stuff in 2022. Can you give us a quick preview of what's coming up? You popular instructor, you.

[00:46:16.800] – Ned
Absolutely. I am currently working on a Getting Started with Terraform Cloud course, which assumes you already are using Terraform and now you want to use their cloud software-as-a-service, which is pretty awesome now that I've been digging into it. So watch for that course to drop sometime in February of 2022.

[00:46:33.670] – Ethan
Very cool. Thanks to you for listening. Welcome to 2022. I know this is not the first show of the year, but I hope things are going very well for you. If you have topics that you want Ned and me to cover on Day Two Cloud, we would love to hear about that. You can hit us up on Twitter; Ned and I both monitor the Day Two Cloud Show account. Or Ned's got his website, nedinthecloud.com, and you can submit a request to him from there. And until then, just remember: cloud is what happens while IT is making other plans.
