
Day Two Cloud 117: How Akamai Helped Transform IBM Cloud Console’s Architecture (Sponsored)

Today on the Day Two Cloud podcast we have a sponsored show with Akamai and an Akamai customer, IBM Cloud. When IBM rebuilt its Cloud Console from a monolithic application to microservices, the company turned to Akamai.

Akamai was essential in helping IBM Cloud Console improve application performance while also supporting the routing, failover, and availability across six global data centers.

Akamai is the world’s largest distributed edge compute platform. To deliver and protect your digital experiences worldwide, tap into Akamai for its unrivaled intelligence, performance, and scalability. Because Akamai has 10x the locations of the nearest competitor (that’s over 4,000 locations!) you are always closer to your end user with Akamai – significantly cutting down on latency. With EdgeWorkers, your development teams can focus on building while letting Akamai take care of the challenges of scaling globally.

Our guests are Pavel Despot, Sr. Product Marketing Manager at Akamai; and Tony Erwin, Senior Technical Staff Member/Architect at IBM.

We discuss:

  • The difficulties of moving from a monolith to microservices
  • Akamai’s approach to improving IBM Cloud Console performance
  • How Akamai accelerated non-cacheable content such as APIs
  • Developing the routing logic to support a microservices architecture
  • Working across six global data centers
  • Migrating the application while also keeping the service alive
  • Akamai’s web application firewall and DDoS protection
  • More

Show Links:

Akamai.com/packetpushers

@akamai – Akamai on Twitter

Pavel Despot on LinkedIn

IBM Cloud

Automatically load balance auto scaling services via Akamai APIs – YouTube

Evolving the IBM Cloud console with microservices: A Node.js success story – IBM

Global IBM Cloud Console Architecture – Tony Erwin

TonyErwin.com – Tony Erwin’s blog

@tonyerwin – Tony Erwin on Twitter

Tony Erwin on LinkedIn

Transcript:

[00:00:06.590] – Ethan
Welcome to Day Two Cloud. We’ve got a sponsored show with Akamai today. Akamai is chatting with their customer, IBM Cloud. Yeah, that IBM Cloud, the one you’ve heard of, and their product, IBM Cloud Console, the thing that you would interact with if you’re consuming IBM Cloud services. They had to rebuild this thing soup to nuts and went from monolith to microservices.

[00:00:27.840] – Ethan
But, Ned, not just because microservices are so cool...

[00:00:31.230] – Ned
No, and they were literally repainting and redesigning the airplane while flying the darn thing, and they managed to get it done. They managed to land this sucker. And what was really important about it was they weren’t just doing it because it was some new cool technology. They weren’t adopting Kubernetes for Kubernetes’ sake. They were trying to solve real technical challenges they were having with their monolithic application by adopting microservices, by going to the edge, and adding this complexity for real, tangible benefits.

[00:01:02.720] – Ned
And I just love to see complexity being added for a good reason, as opposed to a fashionable one.

[00:01:08.680] – Ethan
So please welcome our guest nerds for the day: Pavel Despot, senior product marketing manager at Akamai, and his customer, Tony Erwin, senior technical staff member and architect over at IBM Cloud. So, Tony, I want to lead off the conversation with you here. This conversation is about the IBM Cloud console and its migration. You did some refactoring. You moved it to a new platform. Okay, that’s a lot. There’s a lot going on there. To help you listeners understand the challenge, first of all, what does the IBM Cloud console app do? And what were the big drivers that were forcing you to do this migration?

[00:01:46.820] – Tony
Sure. Yeah. Thanks, Ethan. The IBM Cloud console is the Web front end to IBM Cloud. And IBM Cloud provides IaaS and PaaS functionality and a bunch of managed services for AI and IoT, etc.

[00:02:03.360] – Ethan
This would be the thing I, as an IBM Cloud consumer, would be interfacing with to consume IBM Cloud services.

[00:02:07.680] – Tony
That is correct. That is correct. There’s also a CLI and API, but this is really where people start. We started off a number of years ago. Our first release was a totally monolithic application, Java on the server side, and we were using a single-page app on the front end with the Dojo toolkit. There were a lot of things that were good early on, but we really struggled with some aspects of the monolithic architecture: the code was fragile, it was easy to break things, we had resiliency problems, it was difficult to deploy updates, and it was hard for other teams.

[00:02:45.150] – Tony
We wanted to make this a more open system for teams across IBM to plug into. That was difficult. We were kind of locked into a technology stack, and performance was a big issue for us, so we knew we needed to find a better way to scale and handle these things. And so we started looking into microservices and refactoring. One of the first of many replatforming and refactoring activities was starting to break that monolith down into microservices.

[00:03:13.900] – Ethan
I was going to say, this sounds like all the classic monolith kinds of challenges, especially when you go to scale. And so you answered my next question: you wanted to move to microservices. So, Pavel, I want to turn kind of the same question over to you. You end up being the platform that IBM is looking to for hosting Cloud Console. How did that all fit in?

[00:03:32.680] – Pavel
It started very much like a commerce kind of approach, meaning a commerce site kind of approach: performance. Especially given a monolithic app, not being distributed, as most of them aren’t, this was a very common issue. Performance was a huge issue. And even though this didn’t necessarily have maybe the same abandonment issues, if my onload event doesn’t fire quickly enough or if my screen doesn’t paint, it was directly associated with customer sat. To your point, Ethan, if you go in there and interact with the CLI, with the docs, and it’s pokey, people tend to not be particularly happy, especially when you’re trying to set up five VMs and a bunch of security groups and so forth.

[00:04:23.770] – Pavel
So that’s how we really got started, and we took the usual kind of performance approach to it, which was: for static content that can be cacheable, cache it and distribute it. By running it through the CDN, it automatically gets distributed across thousands of locations around the world. That helped with the static content. The next thing we had to look at was other non-cacheable types of content. What can we do there? API calls, right? As Tony mentioned, as we started splitting things off, API calls, auth, these things generally aren’t cacheable, so we started looking at ways to accelerate both the API calls and to help the UI render more quickly, really digging in there to get the user to be able to start interacting with the page as quickly as possible. And then I think the last piece, the third piece, as we got in earnest into the splitting of microservices, is just all the routing, which became a really core piece. That routing logic, that failover. How does one define performance?

[00:05:32.120] – Pavel
Where do you want it to go? Can I send it anywhere? Are there geo restrictions? Is GDPR a thing? Spoiler: it was. So that was how we started, right? I wouldn’t say purely performance, but we started really focusing on that. And then we moved to really help support all the work Tony and team were doing to develop all these microservices from a routing, failover, and availability perspective.
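To make that first step concrete, here’s a minimal sketch of the cacheability split Pavel describes, written as a Node.js origin in TypeScript. The paths, TTLs, and responses are illustrative assumptions, not IBM’s actual configuration; the point is just that static assets advertise themselves as cacheable so a CDN can distribute them, while API and auth responses do not.

```typescript
// Minimal sketch of a cache-policy split at the origin, using Node's
// built-in http module. Paths and max-age values are illustrative only.
import { createServer } from "node:http";

const server = createServer((req, res) => {
  if (req.url?.startsWith("/static/")) {
    // Static assets: let the CDN cache them and distribute to the edge.
    res.setHeader("Cache-Control", "public, max-age=86400");
    res.end("/* JS, CSS, images... */");
  } else if (req.url?.startsWith("/api/")) {
    // API and auth responses are per-user, so never cached. These are
    // the calls that get accelerated via routing and connection
    // optimizations rather than caching.
    res.setHeader("Cache-Control", "no-store");
    res.end(JSON.stringify({ ok: true }));
  } else {
    res.statusCode = 404;
    res.end();
  }
});

server.listen(8080);
```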

[00:05:56.900] – Tony
So we had our Java monolith in a couple of data centers at the time, and we just wanted to put our Node apps in the same data centers. It was two or three at the time, and we kept expanding. But we also had to keep running. We couldn’t just stop and rewrite the monolith and start over. We had actual users, and we had offering managers and the execs at IBM wanting new function as well. So we had to find a way to balance, you know, keeping what we had running alongside our new microservices.

[00:06:36.450] – Tony
And we kind of did that for really a couple of years, until we finally got rid of our monolith completely. But the short answer is, still, everything is deployed to IBM Cloud. At some point along the way, we changed from Cloud Foundry to Kubernetes using the IBM Cloud Kubernetes Service, and so that was also kind of a rearchitecture. But ultimately, today we’re running in nine to ten IBM Cloud data centers using Kubernetes.

[00:07:11.200] – Ned
And one of the things that we know, since Pavel’s here, is that you as IBM decided to go with Akamai for at least portions of the Cloud console, and I’m curious why you chose that. You had stuff in your data centers, and I think of Akamai as a CDN primarily. So you said you didn’t have a CDN. Okay, that makes sense. But was there more to it than just the CDN component of Akamai? And maybe, Pavel, you want to jump in on some of that as well?

[00:07:43.390] – Tony
Yeah, just to say, we didn’t change where things were hosted; we didn’t start hosting anything on Akamai. Akamai became more of an overlay, really, on top of the hosting that was done on IBM Cloud. And our first interest, as I mentioned a few minutes ago, was performance. We were very concerned about the performance of downloading all of our static resources and things. That was a big reason we started looking for CDN capabilities. And so we started working with Pavel and team, and as we got into it, we found Akamai had a lot of cool things besides CDN.

[00:08:26.630] – Tony
I’ll let Pavel talk more about that. But like the Kona security suite to help with DDoS protection and all that kind of stuff. And Pavel, I don’t know if you could talk better about all the cool things that Akamai provides.

[00:08:44.060] – Pavel
Yeah, you’re absolutely right. It definitely started around performance. We really looked at it in two ways. One was just the absolute load time, right? Because we understood that this isn’t quite your usual commerce site, where you directly tie abandonment and cart click-through and all that kind of stuff to performance. But at the same time, go to a console and try to start up a bunch of clusters and issue a bunch of CLI commands and get some documentation and have it be pokey, and see how happy you are about that general service, right?

[00:09:16.530] – Ethan
It’s okay. So not an ecommerce site, I’m not buying a pair of shoes. I’m consuming IBM Cloud Web services through the console. Performance still matters. Cart abandonment is potentially a thing if it’s too slow. I see.

[00:09:28.880] – Pavel
So while I’d say maybe not the direct KPIs matter, we looked at it very much as a user experience thing. So of course, right, to your point, having started off with the CDN, the low-hanging fruit, given that we initially started with more concentrated data centers, was like, hey, look, we’ve got to get the static content out there. So that was the flip-it-on step for whatever we could cache. And of course, there are requirements on can you cache this, can you not? Is the script public domain? All those different things.

[00:10:01.040] – Pavel
But once we worked through that, that was obviously, from a [inaudible 00:10:03] standpoint, the first thing we looked at when you’re looking at a user experience problem.

[00:10:07.960] – Ethan
Pavel, when you were talking about those static images, that’s bread-and-butter CDN stuff. Now, did that content live in just the ten data centers that we mentioned earlier, or is that actually spread globally, so it kind of didn’t really matter so much what data center it came from?

[00:10:26.510] – Pavel
The latter. Initially, when we had Dallas, and I think London came next, and I think Sydney, it was two or three data centers. Wherever those data centers were, that monolith was serving up all that static content. By virtue of kind of putting, quote unquote, the CDN in front of it, as soon as somebody requested something, it’s there, populated and distributed. So back to your question, I think it was Ethan, back to your question: where was the application? The second you put in caching, right, especially 4,400-location distributed caching, you have some content there automatically.

[00:11:06.320] – Ethan
Wait a minute. Now, that’s interesting, because when you say static content on a website, I’m thinking there’s a bunch of objects that are pretty static thrown on a website: pictures and some scripts and things like that. But you’re talking about some of the guts of the application could have been distributed that way as well.

[00:11:21.920] – Pavel
Potentially, yeah. Certainly the low-hanging fruit was the static images and CSS. At the time we didn’t start with that, because the next thing we did was start looking at HTML optimizations for performance. It was a pretty hefty framework, so there were a number of things beyond the caching that we could do to the UI to make it render faster. So we went through a whole set of tuning things there. The term that everyone loved to use at the time was front-end optimization techniques: JavaScript minification, all that other kind of fun stuff. So that was the next step where we went for performance.
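As a rough illustration of that front-end optimization step, here’s a hedged sketch of JavaScript minification using the terser library. The episode doesn’t name the exact tooling involved, so treat this as one representative technique, not Akamai’s implementation.

```typescript
// Sketch of JavaScript minification, one classic front-end optimization:
// strip comments, shorten identifiers, and shrink the payload that has
// to travel to the browser. Uses the terser library; the input snippet
// is purely illustrative.
import { minify } from "terser";

const source = `
  function greetUser(userName) {
    // comments and long identifiers disappear after minification
    var greeting = "Hello, " + userName + "!";
    console.log(greeting);
  }
  greetUser("Tony");
`;

const result = await minify(source, { compress: true, mangle: true });
console.log(result.code);
// => something like: function greetUser(n){console.log("Hello, "+n+"!")}greetUser("Tony");
```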

[00:12:01.950] – Tony
We had a pretty good amount of JavaScript running on our pages. So if you’re caching JavaScript out there somewhere, you can kind of think of at least the JavaScript part of your application as, I guess, now hosted outside of IBM Cloud. Both places, really.

[00:12:20.090] – Ned
And that front-end optimization you’re talking about, did that require code changes by IBM to the cloud console? Or is that something where you could take what the cloud console was giving you from the origin, transform it a little bit on the Akamai side, and then distribute it out to the clients?

[00:12:36.700] – Pavel
The latter. That was using kind of the early innings of edge compute, where you’re not just doing, say, proxy caching, admittedly in more infant stages, but looking at it and going, oh, okay, I can minify on the fly, or I can start doing image adaptation in a world before WebP and some of these other formats. This started with looking at, hey, is it compressible? If it is, and maybe I’m on a slow connection, which we can also detect through logic, compress that image; or if it’s a PNG, turn it into a JPEG, that kind of thing. So you’re starting to see the inklings of compute, maybe not in the traditional sense where I package up some code in a container or a function and deploy it.

[00:13:21.940] – Pavel
But even back then, there was that logic, back to your point, to not have to make Tony change the code in the monolith, because that was never going to happen for the reasons we just heard 10 seconds ago. So we couldn’t really change anything, especially not until the common UI was split out, and then rainbows and unicorns and everything. But at the time, there were a few things that just didn’t work out. Things like, we tried asyncing some JavaScript to help your onload timers, because again, before the recent Google timers, that was what you looked at: onload and time to interactive. And we found that some of those techniques just wouldn’t work with the structure of the site, with the client-side library. So we just had to abandon it.

[00:14:08.940] – Ned
Okay. To what degree were you giving feedback to the IBM team as they rearchitected their application, saying, hey, you know, with our service and the way that we can optimize things, if you make these code changes, it will actually improve the performance of your app.

[00:14:24.740] – Pavel
I think we worked pretty closely together. I mean, by the time Tony brought us in, he and his team already had a vision of, like, hey, this monolith is going to be slowly carved up into pieces. But definitely on the performance side, initially it was: how do we do this? Because, like we said, caching is easy. Well, it’s easy when you have a bunch of servers to distribute the stuff everywhere. But there’s a lot of other things. So we talked a lot about that.

[00:14:55.570] – Pavel
There was a lot of back and forth, like, hey, can you change this script? Like, hey, when we async it or when we defer the execution, the whole page doesn’t render, and they’d look at it and go, yeah, we can’t do that. So there was a lot of that back and forth. But then later, I’d say especially during the splitting up into services, that’s when we really worked together to get into the routing. And like, hey, move this, unite the host names. That example, Tony, that you mentioned earlier, with slash catalog and having all these proxies.

[00:15:26.360] – Pavel
How do we manage all that? Because remember, we started with Dallas, London, and Sydney, and now it’s like ten data centers, and I don’t even know how many clusters.

[00:15:38.750] – Tony
And you know, one thing related to what Pavel is saying there, in case it’s not clear: once we had Akamai in place, every single request that went to a console host name went through Akamai first. Right? So this includes the static resources. It also includes all of our API calls. So basically, and it sounds too simple, Akamai becomes a proxy for the stuff we’re hosting in IBM Cloud. Essentially, we’ve got this big proxy out there that can do a lot of cool things.
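The “big proxy” model Tony describes is simple enough to sketch. Below is a minimal TypeScript reverse proxy using Node’s built-in http module: every client request arrives here first and is forwarded to the origin, and the response callback is the interception point where a real edge platform would layer in caching, security scanning, or rewriting. The origin host name is a hypothetical stand-in.

```typescript
// Minimal reverse-proxy sketch: every request hits the intermediary
// first, which forwards to the origin and relays the response back.
import { createServer, request } from "node:http";

const ORIGIN_HOST = "origin.example.internal"; // hypothetical origin

createServer((clientReq, clientRes) => {
  // Forward the request to the origin, preserving method, path, headers.
  const upstream = request(
    {
      host: ORIGIN_HOST,
      port: 80,
      path: clientReq.url,
      method: clientReq.method,
      headers: clientReq.headers,
    },
    (originRes) => {
      // Interception point: a real edge proxy could cache the response,
      // rewrite it, or apply WAF rules here before relaying it.
      clientRes.writeHead(originRes.statusCode ?? 502, originRes.headers);
      originRes.pipe(clientRes);
    }
  );
  clientReq.pipe(upstream);
}).listen(8080);
```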

[00:16:10.170] – Ethan
So I’ve got a question here about all of this. I’m listening to this process. It’s a live app; Cloud console needs to be accessible by the users. You’re doing a major back-end architecture revision. You’re moving components around to different parts of the planet, effectively, as different things are broken off, turned into microservices, and redeployed. How did you keep all of this live while migrating? Seriously, how did you actually pull that off? Because in my mind, this would be like, no, we’re going to have a flag day or a hard cutover day, and we’re going to move to the new thing, and it’s all going to be great.

[00:16:43.010] – Ethan
But this gradual migration thing sounds impossible.

[00:16:45.980] – Tony
It wasn’t easy, with all the things inherent about being on the cloud and microservices. And sometimes I joke that if we had known everything we were going to encounter along the way, maybe we just wouldn’t have done it. That’s not entirely true. But we didn’t know what we didn’t know at the time, and we learned how hard it was. And Pavel can talk some, too, about, like, Akamai has a staging network and those sorts of things. But we did a lot of POCs.

[00:17:20.500] – Tony
When we first had microservices, we started having the monolith and the microservices running separately. We would have a separate deployment of a microservice, for example, really running side by side with the other one. But we weren’t sending real user traffic to it yet. Right? So we were able to kind of test and see it was doing what we wanted it to do. And then, fast forward to the end of that: we basically ended up just changing where the host name pointed. Instead of pointing to the original monolithic app, when we felt the microservice version was better, we basically just flipped the host name. That makes it sound simpler than it was, perhaps.

[00:18:01.780] – Pavel
I think, though, to your point, from the routing standpoint, you have the way harder job than the routing, because, to your point, right, it’s this globally distributed proxy. If you wanted to switch slash catalog, not to keep beating this example to death, but sorry, if you want to swing slash catalog over, right, you could literally blink out your origin, line up another set of Kube clusters or whatever you want, and just point the entire world to the other thing. And if it didn’t work, you rolled it back. And that service didn’t even have to exist. Right at the edge, you could even just say: failover.

[00:18:42.780] – Pavel
You know what? Here’s an “I’m sorry” page. Not that anyone wants to fail whale, but worst case scenario, you could mitigate that, versus if you didn’t have that additional layer. I liked your term there, Tony. If you didn’t have that additional layer and your elastic IP or whatever just blinks out of existence, or your ELB’s gone, it’s gone until that comes up, until you roll it back, until you update your DNS. Whereas here, client DNS never really changed. You would have to be an extremely, extremely astute and security-minded person to even start figuring out that that origin...

[00:19:23.090] – Pavel
Sorry for the CDN term, but that catalog, where that service lived, changed. And that’s kind of the nice thing, right? It gives him more flexibility to swing traffic over. Now, of course, the back end of that, synchronizing services and any persistent data, is a different nightmare that I’ll let Tony ruminate over. But at least that one wasn’t as much of an issue. I think the other thing, just from my perspective, is the way we always had the environments, your prod, dev, staging, and using that in conjunction with the staging and production networks of the edge. I think that really worked well, because essentially we always had the lower environments where we tested everything out. And even then, because you don’t want to break staging on people...

[00:20:16.940] – Pavel
So if you did anything on Akamai, right? Okay, fine, we didn’t mess anyone up. Let them do whatever tests they need to do, copy it over to the other environment, do the same thing. And then when you’re ready, just activate in production and say, during this maintenance window, we will globally activate this new Akamai config or whatever. And if you want to go back, you just go back, similarly to how you did the load balancing. So I think the use of the higher-layer environments helped us. I know we found a few issues like that.

[00:20:48.596] – Ned
Right.

[00:20:48.650] – Tony
And in terms of maintenance windows, I guess I would point out our goal was always zero downtime during all of this. Sometimes you go to a website and it says, well, we’re under maintenance for the next hour or whatever. That was maybe more common back then, but you still see it today; I’ll get email from such-and-such website that they’re going to take downtime. We never wanted to take downtime, and that was our big goal with the host name stuff we talked about here. We were able to do that, assuming we didn’t break the Akamai config when we deployed it to production, which, if we’re being honest, we may have done a time or two. But otherwise it should just work.

[00:21:38.390] – Ethan
Well, in theory. You said host name, and using DNS host names to deal with the cutover, which I’ve used that method a bunch of times too, is fraught with peril, because a lot of times it’s as simple as caching, whether or not TTLs are honored by the DNS client, and so on. So, a point of clarification here: when you’re talking about host names and changing host names, are you talking about, when you were cutting over a service, keeping the same host name and updating the IPs that might respond for that host name? Or actually using a brand-new host name that pointed that service to somewhere in Akamai, and dealing with that in code, perhaps, so that the user or client would never have to deal with it?

[00:22:21.150] – Pavel
The client would see the same thing. They would never see a change in host name. Clients change IPs all the time based on where edge networks route and where you’re fastest and all that, but that would never have changed. What changed was just a new endpoint, a DNS name or set of IPs, where you go into the edge and say, okay, as of now, when I click okay, start sending it to this. Well, we recommend an FQDN as best practice, but start sending it to this endpoint. Right? Let’s not hard-code our IPs, that’s a step backwards, but update your FQDN programmatically or through the UI, and that’s where it goes when you click okay.
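Here’s a sketch of what that programmatic cutover might look like from the operator’s side, assuming a hypothetical edge-configuration REST API. The URL, resource shape, and names below are invented for illustration and are not Akamai’s actual API.

```typescript
// Hedged sketch of a programmatic origin swing: the client-facing host
// name never changes; the edge is simply told to forward to a new
// origin FQDN. Everything here is a hypothetical stand-in.
async function swingOrigin(service: string, newOriginFqdn: string): Promise<void> {
  const res = await fetch(
    `https://edge-config.example.com/v1/services/${service}/origin`,
    {
      method: "PUT",
      headers: { "Content-Type": "application/json" },
      // Per the discussion: reference an FQDN, never a hard-coded IP,
      // so the new origin can itself move or scale behind the name.
      body: JSON.stringify({ origin: newOriginFqdn }),
    }
  );
  if (!res.ok) throw new Error(`cutover failed: ${res.status}`);
}

// Swing /catalog traffic to a new cluster; roll back by calling again
// with the old FQDN if anything looks wrong.
await swingOrigin("catalog", "catalog-v2.us-south.example.internal");
```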

[00:22:59.370] – Ned
Right, so the client is not involved in that routing decision. As far as they’re concerned, nothing has changed, but you’re handling the routing on the back end and just following whatever the update is from IBM. So it’s more programmatic than waiting for a DNS host name to cache out.

[00:23:14.380] – Pavel
Correct. Because to your point, Ethan, I am convinced that people just remove the code for DNS TTL adherence in every product ever.

[00:23:26.300] – Ethan
Well, exactly. Yeah, it’s been so painful. You try to bleed off your connections to the old service so you can light up the new one and cut everybody over to it, and it’s like, why are they all sticking to the old service? Do they not honor TTLs? And the answer is no. Actually, they don’t.

[00:23:39.170] – Pavel
They don’t. And then especially if you’re trying to do only layer three/four DNS load balancing. But that’s a topic for another podcast, perhaps.

[00:23:51.260] – Ned
So you have this application. It’s very complicated. You’ve gone through the migration process, and I’m curious. And now we can kind of get into the architecture of what’s going on on the Akamai side. Walk me through what happens now when a client requests a connection to the cloud console. How does that whole connection process start? And what are all the kind of layers that it walks through?

[00:24:12.990] – Pavel
Sure. I like the term proxy, because that is a technically accurate term for what it does. It proxies the connection. The difference between this and having a Node.js or an nginx proxy in your VPC? It’s functionally still a proxy, just moved down the street from me and from you and from everyone else. And think of how it works the same way as when I type in cloud dot IBM dot com. DNS happens, of course, not to be too elementary, but the short story is that out of all these 330,000 servers you hear of, you will get an IP for one of those.

[00:24:53.600] – Pavel
Right? Me here in Cambridge, Tony in Austin, you folks wherever you are. So that’s always the first step. And that would be the same step if your nginx happened to be in your VPC, right? You get an IP address, the host name resolves. If you look at any Akamai-delivered website, dig on Akamai dot com, you’ll see, just like most CDNs, you do that through DNS. But the short story is the user connects to the edge, which is very close, especially in prod, very close to the user.

[00:25:26.810] – Pavel
Now, when we started, of course, what would happen? I would just ask for my GeoCities JPEG. I’d ask for the HTML and all that good stuff, and the page loaded. Or even for videos, right? For my MP4s, I would say, get me NFL dot com slash video dot MP4, and then I’d get it, and I’d wait forever. But it was a lot faster because I was getting it from close by. That still holds true today. Of course, we have to deal with things that aren’t these static objects.

[00:25:57.460] – Pavel
But that first part is still the same. And the fact that that first part is still the same is really what let us do this migration in the flow, because the user connects to us, to Akamai, to the edge server, and we will allow connections to any server you tell us; we give it certs, all the usual HTTP stuff. And then from there, we have the ability to cache the request, forward it, scan it for security. If it’s an image, we can optimize the image. Because again, it’s a proxy, just a very distributed one.

[00:26:29.820] – Pavel
So back to your question of how it flows: user connects to the website, DNS resolution happens, and the user’s browser or client, right, we shouldn’t say just browser, it’s not just a CDN; the same thing holds true for my phone when it’s using Akamai to make API calls, connects to that edge server. And that acts as a proxy with all these functions: we can handle that request, route it, or respond on behalf of the service, and increasingly even run some compute to respond. But that’s how the client looks at it.

[00:27:02.300] – Pavel
All of this is with HTTPS, that goes without saying. So before I get pilloried: when I say HTTP, by default that is HTTPS, TLS 1.2 or higher, because we are all security-minded people.
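You can observe that first DNS step yourself. The short TypeScript sketch below, using Node’s built-in dns module, resolves the console host name; run from different networks, it returns different edge IPs, which is the DNS-based mapping Pavel is describing.

```typescript
// Tiny sketch of the first step in the flow: the client resolves the
// console host name and gets back the IP of a nearby edge server, not
// the origin data center.
import { resolve4, resolveCname } from "node:dns/promises";

const host = "cloud.ibm.com";

// CDN-fronted hosts typically CNAME to the edge platform's domain.
console.log("CNAME chain:", await resolveCname(host).catch(() => "(none visible)"));

// The A records are edge servers chosen for this resolver's location.
console.log("Edge IPs near this resolver:", await resolve4(host));
```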

[00:27:14.960] – Ned
Okay. And so once the client has connected, right, I’m at the edge server, and then the edge server is going to make, probably, at least hopefully, intelligent decisions about where to send the various requests for the different components of the application, because I’m guessing we’ve got ten data centers to choose from. Does the edge server then say, okay, you’re connecting from Austin and I see there’s a data center in Dallas, so I’m going to route you there as opposed to sending you over to Sydney? Is that part of what Akamai does, or is that something that IBM tells you to do?

[00:27:50.940] – Pavel
So it is part of what Akamai does, based on what we’re trying to accomplish. For performance, that is obviously the best way to go, because you can say, all right, I know you’re here; here’s the best-performing data center out of the umpteen that you’ve configured; automatically send it there unless the health check fails, then reroute and do all that good stuff. However, not all services, micro though they may be, are stateless. Sometimes we have to consider state. Sometimes we have to consider sessions and stick services to a certain place. So that changes slightly what you just described, because maybe on the first request you go, yeah, person’s in Austin, Dallas is the best one to send it to.

[00:28:43.910] – Pavel
But on the next request, and the third, and any subsequent request, if there’s logic saying, hey, this is a session, then even if Dallas stops being the best-performing option, it has to go there. That’s where the session has been established. So basically, my point in saying that is we would work with all the different individual services to say, hey, look, if flat-out performance is all you need, great, yeah, we can do this really easily. Just put in your IPs and we’ll sort it out.

[00:29:15.760] – Pavel
But does the application allow you to do that with traffic? Because if it doesn’t, we still did it, but then you take the approach of: anyone out of session goes to the fastest; anyone in session, as indicated by a cookie or a query string parameter, whatever is used to indicate session, usually one of those two, gets stuck to the target. Because you don’t want the person to necessarily have to re-log in, or have state be replicated across your data centers, or whatever. Whatever you have to do, that was what drove the decision.
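Here’s a small sketch of the two-tier routing decision Pavel outlines: out-of-session requests go to the fastest healthy data center, while in-session requests stay pinned to the data center where the session was established. The cookie name and the idea that the session value encodes its home data center are assumptions made for illustration.

```typescript
// Sketch of sticky-vs-fastest routing at the edge.

interface DataCenter {
  id: string;
  healthy: boolean;
  latencyMs: number; // measured latency from this edge to the DC
}

const SESSION_COOKIE = "CONSOLE_SESSION"; // hypothetical session indicator

function pickDataCenter(cookies: Map<string, string>, dcs: DataCenter[]): DataCenter {
  const session = cookies.get(SESSION_COOKIE);
  if (session) {
    // In-session: honor stickiness. Here the session value encodes the
    // data center it was established in, e.g. "dal10:abc123".
    const pinnedId = session.split(":")[0];
    const pinned = dcs.find((dc) => dc.id === pinnedId && dc.healthy);
    if (pinned) return pinned;
    // Pinned DC is unhealthy: fall through and re-establish elsewhere.
  }
  // Out of session (or failover): fastest healthy data center wins.
  const healthy = dcs.filter((dc) => dc.healthy);
  if (healthy.length === 0) throw new Error("no healthy data centers");
  return healthy.reduce((a, b) => (a.latencyMs <= b.latencyMs ? a : b));
}
```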

[00:29:51.410] – Ned
That’s an interesting question. Could you, as a future development, a future improvement, move that state from Dallas to the edge? So now if a new session does have to be created because Dallas is down, not that that would happen, right, Tony? Dallas doesn’t go down, but...

[00:30:10.420] – Tony
It doesn’t.

[00:30:11.430] – Ned
If six backhoes cut all the fiber to Dallas in a massive.

[00:30:17.680] – Tony
The F-5 tornado comes through or something else.

[00:30:20.650] – Ned
Why not? Is that something where the edge server could be holding that session, that state, and then just route it to a different location, assuming the logic is there on both sides of the app?

[00:30:32.980] – Pavel
You bring up a topic that is extremely dear to me. At the time, we had no such approach. We had no way to distribute any sort of data, structured or unstructured, across the edge, aside from caching, of course, which we will agree to just call objects and not pull on that thread. We had no other way to do that short of putting a service somewhere in a cloud provider, which, this being the edge, was silly. These days, though, and I’m not necessarily suggesting it works for everything, because you still have to call it...

[00:31:12.940] – Pavel
But there is now the ability to have a globally distributed key-value store across the edge. So before, we said it’s a proxy and it handles HTTP and WAF and DDoS and image manipulation and all that. Additionally, there is this notion of a globally distributed key-value store, and one of the places we’ve seen interest, one of the applications of that, is session IDs. Think access control; think content access, media folks, people who have streams, that sort of thing; people who have logins and want to centrally manage the auth. So the auth endpoint that you have in your cloud authenticates you once; the edge sees that and goes, okay, the endpoint says it’s fine, store that. So you don’t have to go check in with it constantly. Right? And what that requires is the ability to distribute your key, your user key, and your auth token and whatever else. But at the time, no, we couldn’t do that.

[00:32:19.880] – Ned
That brings up a great point, though. Because if you can do that, if you can say, hey, IBM service, I want you to trust the Akamai edge points, those edge points are allowed to hold a token, an authentication. Now you can cache that at the edge. And if I have to connect to a different location, as long as it trusts that same endpoint, it says, yep, I see you have the token, that’s a valid token, and I trust you as a source for that token. Now the user or client doesn’t have to log in again, because that token is being cached for them. That’s a pretty cool use case. I like that.
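Here’s a hedged sketch of that token-caching pattern, assuming a hypothetical key-value interface at the edge and a hypothetical central auth endpoint. Neither is a real Akamai or IBM API; the point is the fast-path/slow-path shape.

```typescript
// Sketch: cache a validated auth token in a distributed edge KV store
// so subsequent requests skip the round-trip to the central auth
// endpoint. Interface and URL are hypothetical stand-ins.

interface EdgeKVStore {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, ttlSeconds: number): Promise<void>;
}

async function isAuthorized(token: string, kv: EdgeKVStore): Promise<boolean> {
  // Fast path: the edge has already seen this token and the auth
  // endpoint vouched for it.
  if ((await kv.get(`auth:${token}`)) === "valid") return true;

  // Slow path: check with the central auth service once...
  const res = await fetch("https://auth.example.com/validate", {
    method: "POST",
    headers: { Authorization: `Bearer ${token}` },
  });
  if (!res.ok) return false;

  // ...then remember the verdict at the edge for a short window, so
  // a failover to another data center doesn't force a fresh login.
  await kv.put(`auth:${token}`, "valid", 300);
  return true;
}
```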

[00:32:56.890] – Pavel
And then we’re already going down the path for some layer three/four stuff, because, especially once you enable security and everything, we’ve already got the allow lists on most layer three/four firewalls. Right? On ingress, people will put the IP addresses of the edges in their allow list. So you’ve already got some degree of trust established, especially if you’re doing that allow/deny list. Some people also use mutual auth, but who doesn’t love managing certs? So you’re already kind of going down that road.

[00:33:34.270] – Pavel
I think, just like you’re seeing with the functions and some of the other stuff beyond caching, it’s obviously going to go up the stack. You’re doing key-value pairs, so obviously you’re going to do some compute. And it’s never going to be my XXL five-GPU instance, but that’s not what this architecture is for.

[00:33:55.680] – Ethan
So, Tony, I’ve got a question for you. Knowing what you know now, having done this whole migration, are there some things you would have changed about the migration process, handled some things differently, maybe?

[00:34:08.280] – Tony
Yeah, probably.

[00:34:10.680] – Tony
I know it definitely brought some new challenges along. I think we really underestimated pipelines, for example, CI/CD pipelines. With the monolith, you make a change, you redeploy the whole thing, and we really had one team contributing. If you have teams from throughout IBM contributing that want to be able to deploy their microservices on their own schedules, that changes the pipeline quite a lot, right, compared to how it worked with the monolith. Nowadays, I feel like we have a really, really good pipeline, but that was something that we probably underestimated early on.

[00:34:53.600] – Tony
Monitoring and troubleshooting was a big issue. Before, if there was a problem in the monolith, it was like, okay, let’s go look at the monolith logs and look at the monolith code. But once you have microservices, it’s, which microservice is having problems? And one thing about being a console is you’re kind of the canary in the mine shaft. Sometimes you’ll start experiencing problems in the cloud before even the responsible teams start seeing them. And certainly if someone has a problem, they come to the console team first and say, why is the console broken?

[00:35:32.550] – Tony
I can’t log in, or I can’t get my list of resources. It became a matter of self-preservation, really, to get a better system in place, to be able to do simple things like looking at all of our inbound and outbound requests and seeing, okay, this API call to this cloud service is failing every time. Right? It’s not a console problem, so to speak; we need to get that service team working on it. And of course, when those things would happen, we would be like, well, what can the console do to be more resilient to that API call failure?

[00:36:09.690] – Tony
So we were always trying to improve those things, because we had so many developers from across IBM building what we call UI plugins. Say the Kubernetes service team or the Watson team or one of those teams wants to come in and provide a UI. How do they do that? Right? We got to the point where we couldn’t hand-hold every single one of these teams that wanted to do this. So we ended up essentially developing a developer’s toolkit inside of IBM, where we have sample apps and best practices and the pipeline I mentioned and test frameworks. All that stuff really evolved over time as the need arose. Again, something that maybe we underestimated early on.

[00:37:05.010] – Tony
But of course, you can only do so much at once, right? So all this was very agile, and we just tried to make things better and better as we went.

[00:37:16.900] – Ned
It certainly sounds like you had a lot going on already, and it sounds like you still have a lot going on. In terms of services and features that Akamai now has versus maybe what they had a few years ago, is there anything that you would have architected or approached a little differently? Like, say, if that key-value store had been available when you first started this migration, are there some things you think you might have adopted or done differently as part of the migration and rearchitecting?

[00:37:46.580] – Tony
Yeah. I was thinking about the key-value store a little bit while Pavel was talking about it, because I don’t think we’ve talked extensively about it in the past. I mean, that is kind of interesting, because we try to make our applications as stateless as possible. But as you alluded to, there’s, like, user tokens and stuff that, sometimes just for efficiency’s sake, we want to keep in a local cache. So when a request comes in, it’s like, okay, now we can use that user token to call an API.

[00:38:16.270] – Tony
If we weren’t somewhat sticky with sessions, like Pavel mentioned, and were just flip-flopping around, we were able to handle that eventually, but it would be a little bit more overhead. If I switched from Dallas to Sydney for some reason, if that were to occur, we’d have to call the back-end APIs to get that token again, essentially. We could use cookies and that sort of thing, but it just added overhead. I can’t think of a whole lot else we keep in session state, but I know we’ve been very careful not to send our tokens to the browser. Would we even put them in an external service? I’m not sure, but it’d be interesting to talk about.

[00:38:58.220] – Ned
I’m sorry if I just made more work for you, Tony.

[00:39:01.450] – Tony
Yeah.

[00:39:02.860] – Pavel
Yeah.

[00:39:03.730] – Tony
Everybody always has more work for me.

[00:39:09.140] – Ned
Well, guys, this has been an absolutely fascinating conversation, and I’d like to give you a chance to summarize all the goodness that we had in the conversation down to a few quick key takeaways for our audience. Pavel, why don’t you get us started?

[00:39:23.600] – Pavel
Sure. So I think the first main thing is that there are a ton of difficulties in migrating, both operationally and in logging, and moving some of that routing logic and some of that visibility to the edge really helped us with those difficulties. Beyond that, as the microservices grew and there were more of them, we needed health checks, additional complexity. That logic where we managed all that load balancing, managed that failover for reliability, was another big one. And then, ultimately, I think what we were able to do is set up a really good architecture for the progression and the growth down this whole architectural path of microservices.

[00:40:12.540] – Ned
Awesome, Tony, any final thoughts from you?

[00:40:16.110] – Tony
Yeah, I would say we’re definitely very happy with the migration to the microservices architecture. I think it addressed a lot of our problems. We had limited blast radius for changes, so it’s easier for people to put in code changes and not break things; increased resiliency; better performance; flexibility in technology. And I could go on and on about the benefits we saw there. And I guess the biggest thing: the relationship with Akamai has been really good throughout all this. But I will say the biggest thing that really improved our experience and gave us the most uptime benefits...

[00:41:00.300] – Tony
Microservices helped with that, but using Akamai to do the geo load balancing, so we could fail over from an unhealthy data center to a healthy one, that was just huge for us. So thanks, Pavel, for the help in making that happen.

[00:41:18.520] – Pavel
You got it.

[00:41:20.940] – Ethan
Well, guys, it’s been a fun conversation. The microservices conversation, when it works, and then the problems that come up and how you solve those. Always keeping it real, keeping it real here on Day Two Cloud. And Pavel, a question for you: if people want to dig in and find out more about this stuff, the things that Akamai has to offer and so on, where would you send them?

[00:41:40.520] – Pavel
I would send them to our recently launched, shiny new Akamai dot com slash packetpushers site to find out more. There are links, there’s even, I believe, a couple of reference architectures and some of the write-ups that we did with Tony, and cloud dot IBM dot com.

[00:42:00.230] – Ethan
Excellent. And if you go to the show notes, that’ll be at day two cloud dot IO and packet pushers dot net, because we post the show in two places to make it even easier for you to find things, you’re going to find that link, Akamai dot com slash packetpushers. You’ll find some links about cloud dot IBM dot com.

[00:42:14.780] – Ethan
You’ll find a link to YouTube where Pavel is speaking on automatically load balancing and auto-scaling services via Akamai APIs. Tony’s got a blog post that he wrote about a lot of this migration process and so on. So there’s plenty there for you to dig in and find more information. If you’re a Twitter kind of person, you can follow Akamai on Twitter, and Tony Erwin is up there at Tony Erwin. These folks are on LinkedIn, and again, all that’s in the show notes at both day two cloud dot IO and packet pushers dot net.

[00:42:43.500] – Ethan
Our thanks to Akamai for sponsoring today’s show. And hey, listener, a virtual high five for tuning in. You’re awesome. If you have a suggestion for future shows, we would love to hear it. Hit either Ned or me up on Twitter; we’re both monitoring the at day two cloud show Twitter account, which you should follow. And if you’re not a Twitter person, hey, go to Ned’s fancy website, Ned in the cloud dot com, fill out the form there, and let us know the next show you’d like to hear.

[00:43:07.250] – Ethan
Did you know that you don’t have to scream into the technology void alone? The Packet Pushers podcast network has a free Slack group, and it is open to everyone. Visit packet pushers dot net slash slack and join. It is a marketing-free zone for engineers to chat, compare notes, tell war stories, and solve problems together. Again, that’s packet pushers dot net slash slack. And until then, just remember: cloud is what happens while IT is making other plans.

Episode 117