
D2C218: What’s Inside The AI Magic Box?

Episode 218


AI and machine learning are being more widely used in IT and elsewhere. Today’s episode opens the AI magic box to better understand what’s inside, including software and hardware. We discuss essentials such as training models and parameters, software components, GPUs, networking, and storage. We also discuss using cloud-based AI platforms vs. building your own in-house, and what to consider when assembling your own AI infrastructure.

Episode Guest:

Fredric Van Haren is CTO & Founder of HighFens Inc. He’s an expert in AI and High Performance Computing (HPC). He has over 20 years of hands-on experience in high-tech and has built large HPC & AI environments for several customers from the ground up. He provides technical leadership and strategic direction to organizations aiming to become leaders in today’s AI market. He’s also co-host of the Utilizing AI podcast.

LinkedIn | X

Sponsor: DoiT

An award-winning strategic partner of Google Cloud and AWS, DoiT works alongside more than 3,000 customers to save them time and money. Combining intelligent software with expert consultancy and unlimited support, DoiT delivers the true promise of the cloud at peak efficiency with ease, not cost. Their technology is backed by deep multicloud expertise in the analytics, optimization and governance of cloud architecture, as well as specializations in Kubernetes, AI, and more. Learn more at doit.com.

Episode Links:

HighFens

HighFens Blog

MLCommons

Transcript:

This episode was transcribed by AI and lightly formatted. We make these transcripts available to help with content accessibility and searchability but we can’t guarantee accuracy. There are likely to be errors and inaccuracies in the transcription. 

Ethan Banks (00:00:01) – Today’s podcast is sponsored by DoiT. Reduce your cloud spend by improving your cloud efficiency with DoiT, an award-winning strategic partner of Google Cloud and AWS. Find out more at doit.com. That’s doit.com.

Ned Bellavance (00:00:17) – Welcome to Day Two Cloud, you awesome human being. Today we are getting into the nuts and bolts of AI. This isn’t going to be architecture; we’re getting into true infrastructure. And to help us out with that, we have a special guest, Fredric Van Haren. He’s the CTO and founder of HighFens Inc, and he knows a thing or two about AI, doesn’t he, Ethan?

Ethan Banks (00:00:38) – Yeah, more than a thing or two. I mean, he has a hands-on engineering approach with deep knowledge of both the hardware and software stacks that are used to do AI computing, and he understands the business of AI deeply. I had a feeling, even though we chatted with him for, I don’t know, about 45 minutes, Ned, that we were just scratching the surface.

Ned Bellavance (00:00:57) – Oh, absolutely. So enjoy this conversation with Fredric Van Haren.

Ned Bellavance (00:01:00) – Fredric, welcome to Day Two Cloud. We’re pretty pumped to have you here because we hear you know a thing or two about AI. So why don’t you introduce yourself to our lovely listeners and tell them a little bit about how you got into the world of AI?

Fredric Van Haren (00:01:13) – Well, first of all, thanks for having me. So, you know, AI has been around for a long time, but traditionally HPC was really the driver, and that was driven by the fact that everything was about coding, not about data. My background is speech and telecom. So when we talk about speech, we’re talking about text-to-speech, speaker verification, and speech recognition. And I did that as a data scientist until 2005, when I had what I call the monologue: the CEO of the company called me into his office and said, I’m stripping you of all your groups; we’re going to do something new, we’re going to be doing something data driven. I thought that was more of an excuse to lay me off, but apparently this really was an opportunity for me to do something else, what today we call data-driven research.

Fredric Van Haren (00:02:04) – Right? So speech was really driven in the early days by coding. If you wanted to do something, you had to code it, while data driven is a lot more intuitive, because you can use historical data to build a model that represents the future. So I did that from 2005 to 2016. Some of the projects I worked on were Apple Siri version one and other projects like Dragon NaturallySpeaking, which was a desktop product. And then in 2016, I decided, based on feedback from some hardware vendors, to help the hardware industry understand how they needed to design hardware for the AI market. So it’s not that long ago, 2016. I did a lot of NDA work. The great thing about NDA work is that it’s very interesting and innovative. The drawback is you can’t talk about it, right? So, you know, try to get new customers when you realize that the only thing you can say is, yeah, I’m doing some NDA work for company X and Y and Z, but I can’t really tell you anything else.

Fredric Van Haren (00:03:04) – So I did that from 2016 to 2019. In the meantime, I started hiring people from my old team, and we are now working more and more with enterprises. So you can look at us as an organization that is bridging the hardware vendors and the end users. With our experience and knowledge, we can bridge the two and help both hardware vendors as well as end users. So anyway, that’s how I got involved in AI. The NDA work was really great from an innovation standpoint; we helped organizations understand, progress, and evolve accelerator cards, you know, like GPUs. But enough of that, right? So that’s really a little bit of background on where I came from and what I’m doing today.

Ned Bellavance (00:03:53) – Well, that’s some fantastic perspective. And when you said, you know, speech to text, my mind immediately went to Dragon NaturallySpeaking, because I used it. And, you know, if we’re going back to that era, I’m like, wow, yeah, I remember using that program, and when I was using it, for me, it felt like it was just a magic black box.

Ned Bellavance (00:04:15) – You know, I put speech in and text came out, and it was pretty close to what I actually said, which was fantastic. But I had, like, no idea what components were involved. And that’s kind of how I understand AI today. So can you expand on that a little bit? What are the primary components inside that magic AI box?

Fredric Van Haren (00:04:34) – Right. So first let’s look a little bit at what we really mean when we talk about AI. You probably heard terminology like a model, and something like training and inference. So first of all, it’s data driven. The input is data, and the output is really a model, which is a statistical representation of all the data you own. So for example, you could have millions and millions of pictures of cats and dogs, and you send that through the model, and you hope that in the end the AI model can tell you if it’s a cat or a dog. And if you look inside of a model, there are what we call parameters.

Fredric Van Haren (00:05:14) – So when you read about ChatGPT, for example, you would see something like, there are 100 million parameters. So what is a parameter, and what does that have to do with the model? Well, the model is really a mathematical representation of your data, and a parameter is an unknown. So what that means is, when somebody says I have a model and there are 100 million parameters, it basically means that during the training phase you were analyzing the data and you were trying to find and resolve the 100 million unknowns. And if you think a little bit about math, this sounds like a pretty big task. If you go back to the math days, you know, if you had two unknowns, you needed to have two equations in order to solve it. If you have 100 million unknowns, that means you need at least 100 million equations to solve this, right? And because your data is always different, it kind of, you know, goes exponentially. So in short, training is feeding your data into a model and finding a reasonable value for those unknowns.

Fredric Van Haren (00:06:28) – And then once you have that, that’s your model. Your model has a known value for each of the unknowns. And so inference is applying those numbers. In inference, for example, you show a picture of a cat, and from then on it’s all math. You just do the calculations based on the values you obtained. And that’s how you do inference. That’s very high level. Now, when we talk about the components, we’re still using data, we still need compute, and we still need networking.
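
To make the training-versus-inference split concrete, here is a minimal sketch in Python using scikit-learn. The two-feature cat/dog data and the choice of logistic regression are invented purely for illustration; they are not something discussed in the episode.

```python
# Training finds values for the unknowns (the parameters);
# inference just plugs new data into those fixed values.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Pretend features: [ear_pointiness, snout_length]; label 0 = cat, 1 = dog.
cats = rng.normal(loc=[0.8, 0.2], scale=0.1, size=(500, 2))
dogs = rng.normal(loc=[0.3, 0.7], scale=0.1, size=(500, 2))
X = np.vstack([cats, dogs])
y = np.array([0] * 500 + [1] * 500)

# Training: solve for the unknowns. This toy model has only three
# parameters (two weights plus a bias); large models have billions.
model = LogisticRegression().fit(X, y)
print("learned weights:", model.coef_, "bias:", model.intercept_)

# Inference: apply the fixed parameters to a new example, which is pure math.
new_animal = np.array([[0.75, 0.25]])
print("P(cat), P(dog):", model.predict_proba(new_animal)[0])
```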

Ethan Banks (00:07:03) – Before you dive any deeper into the hardware components and all of that: about parameters, you intrigued me here. Okay, so parameters are a bunch of unknowns, and the more parameters we have, the more unknowns we’re solving for, feeding all of this data into these parameters and resulting in a model, a mathematical representation of all the data. I’m going to be feeding a bunch of different data with different ranges and values, perhaps, into the model. So how do we arrive at a specific number, or is it a range of numbers, that ends up residing in a parameter?

Fredric Van Haren (00:07:40) – Right. So when you open the covers and you look at what a parameter is, a parameter is a fractional number going from 0 to 1. And sometimes when you go deep, deep into the formulas, the mathematical formulas, you will see that they actually call this a weight. And so each parameter has a certain weight. Some parameters, some indications, might have very little input into the outcome. And so eventually the math will say, for recognizing a cat, this dog feature has a low weight. So you come up with these ideal numbers for the weights, such that when you pass along a cat picture, that value will automatically be diminished, while when a dog picture comes along, the dog feature will be elevated, and there is a tendency to say it’s a dog versus a cat.

Ethan Banks (00:08:36) – So I can have a model that is designed to help an AI identify this is a dog versus this is a cat. And the parameters would be in that model to help the AI solve that specific problem. So depending on the nature of the problem I want to solve, I may have a greater or smaller number of parameters.

Fredric Van Haren (00:08:57) – The parameters are an indication of the complexity of the problem you’re trying to solve. So I’ll give you an example. If you’re trying to build a model for cats and dogs, you might focus on a feature that is specific to a dog and a cat, and that will come with a certain amount of parameters. But if you say that you want to do the fur, the eyes, the nose, the paws, all of that, every time you ask for more and more, those become parameters that need to be solved. This is maybe also a good place to look at the difference between machine learning and deep learning, which is kind of deep diving into the whole thing, because that also helps indicate how many parameters you’re going to need. But in general, a parameter is an indication of the complexity and how deep you’re trying to solve something, right? If you just want to figure out, for example, the difference between a car and a bicycle, at its most basic you could say, I’m going to look for tires, right? If I count more than two tires, I’m going to call it a car.

Fredric Van Haren (00:10:00) – If it’s two tires or less, I guess I’m going to say it’s a bicycle.

Ethan Banks (00:10:06) – So let’s relate this to the CAPTCHA problems that we’re asked to solve from time to time. Pick all the tiles that represent a motorbike, and it could be a close-up of the motorbike, where there’s a handlebar and there’s a gas tank and there’s a tire, and all of those things represent the motorbike. So when I’m doing that kind of problem solving to identify myself as human, I assume I’m actually training an AI in some way, feeding a series of parameters to help it solve something.

Fredric Van Haren (00:10:33) – I think with the CAPTCHA thing, I don’t think there is AI involved, to be honest. I mean, it is possible, but...

Ethan Banks (00:10:39) – Always assumed I was helping Google solve an AI problem or something. But anyway.

Fredric Van Haren (00:10:43) – I mean, I would suspect that in this case they would kind of use a frame, right? So there’s an outline of a bicycle. And so the outline is a bunch of pixels.

Fredric Van Haren (00:10:53) – And if those pixels appear in a certain, you know, box, then at that point it will say, oh, the bicycle is in that one. Now, CAPTCHA is really tricky, because they have two or three pixels of a traffic light in a neighboring area, and then they say, hey, you know, click on all the boxes with the traffic light. So I don’t know about the CAPTCHA, but I mean, AI is being used all over the place nowadays, right? It’s not limited to certain areas; today AI is all over the place, and we can see that there is a certain amount of customization around AI.

Ned Bellavance (00:11:29) – I’ve failed to prove I am human on many occasions by not being able to correctly pick all the things.

Ethan Banks (00:11:37) – It’s because you’re not. Come on, Ned. We’ve talked about this.

Ned Bellavance (00:11:37) – You’re harsh, Ethan. But getting back to where I think you were going with things, to talk about the difference between the training of the model and inferencing based on that model, I assume they have two different workload profiles and two different stacks that they’re using to do their work.

Fredric Van Haren (00:12:01) – Right. So, I mean, let’s take an example. Let’s assume we are responsible for delivering weather forecasting for the United States, right? And so we have a lot of data, we have billions and billions of parameters; it’s very complex to identify the weather. So think about the amount of hardware to process billions and billions of data points in a short amount of time, because we’re talking about weather forecasting, right? If I want to know what the weather is going to look like in an hour, it doesn’t make sense that it’s going to take me a week to figure this out. So from an infrastructure perspective, I’m going to have to have a decent amount of hardware in order to address and process and analyze that data in a reasonable amount of time. So we can suspect that the training side, hardware-wise, is a pretty decent amount of hardware, you know, storage, networking, compute. Now, when we look at the inference side, or the ability to deliver that weather report, we have a few ways to do this.

Fredric Van Haren (00:13:04) – One way to do it is that one of us queries the AI model and says, what’s the weather going to look like in an hour? We take that data and we post it on a static website, and every hour we update that. Now, from an inference perspective, one user asking for this report is not a big deal, right? It takes a few seconds, maybe, let’s say, half a minute to generate a report. So in that particular case the inference is a lot lighter than training; from a hardware perspective, you know, inference is not a big deal. Now imagine that instead of this being static, we expose the model to all consumers, meaning that there might be a few hundred thousand people who want to check the weather, and instead of one person hitting it, 100,000 people are going to hit it at the same time. And so that’s the difference between the two. So can you always say ahead of time what inference versus training is going to be from a hardware perspective? No.

Fredric Van Haren (00:14:06) – But you can control a little bit how that’s going to happen. Now let’s take an example with Siri. With Siri, you can imagine there’s a lot of data coming in and a lot of training happening, but at the same time they have a few million users using Siri at the same time. So it is very possible that their inference setup is very close to their training setup, and maybe even bigger, right? Because inference is sensitive to real time. When you ask Siri something, you don’t want to hear about it in five minutes, right? You want to hear it in less than a second, while training is more batch driven. In training you’re going to process that data, but whether a piece of data gets processed in five hours or in five minutes or in five seconds is of no use to you, right? The only thing you want is the outcome, which is the model, and you don’t care in what order the data has been processed. So it’s always a very difficult question to answer.

Fredric Van Haren (00:15:06) – Most of the time training is where most of the hardware is, but again, it depends on how you do your deployments.
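
The weather example describes two very different serving patterns for the same trained model. Here is a rough Python sketch of the two; run_forecast_model() is just a stand-in for whatever the real inference call would be.

```python
import json
import time

def run_forecast_model(region: str) -> dict:
    # Placeholder for the real inference call (seconds of GPU work per query).
    return {"region": region, "forecast": "rain", "generated_at": time.time()}

# Pattern 1: query the model once an hour and publish a static result.
# Inference load stays tiny no matter how many people read the file.
def hourly_refresh() -> None:
    with open("forecast.json", "w") as f:
        json.dump(run_forecast_model("US"), f)

# Pattern 2: every user hits the model directly. 100,000 concurrent
# callers means 100,000 inference runs, so the inference cluster may
# need to be as big as the training cluster, or bigger.
def handle_user_request(region: str) -> dict:
    return run_forecast_model(region)
```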

Ethan Banks (00:15:15) – So talk software, then. What would the software components for infrastructure and orchestration be? Is that going to be things we’ve heard of, Kubernetes and KVM and Linux, or is it, like, special AI-focused applications that maybe we haven’t heard of?

Fredric Van Haren (00:15:27) – So in the AI world we talk about data pipelines. What is a data pipeline? A data pipeline is kind of the life of data, going from input all the way to a model, all the way to inference. And so there is no one way to do it; it’s multiple stages. Nobody’s going to tell you, we’re using GPUs from the first moment we see data to the moment we deploy, and nobody’s going to say, we do all of that with CPUs. At the same time, the software frameworks in each of the steps or phases in a data pipeline can be different. Think about it: data pipelines can take maybe hours, it can be days, it can be weeks.

Fredric Van Haren (00:16:13) – So when we look at containers and Kubernetes and bare metal and virtualization, it really depends on the piece of the data pipeline you’re dealing with. Some of the organizations that have been doing AI for a long time will go with bare metal, right? They’re not going to use virtual machines, because with virtual machines there’s an overhead, and although the overhead is small, if you have a thousand servers and your overhead is a few percent, that’s a lot of overhead. So I can tell you that historically, bare metal was really popular. That shifted; with innovation and new software frameworks, AI is driven more and more to containers. And another way to look at it is this: although Nvidia sells GPUs, what really sells the GPUs for Nvidia is the software, right? You can have a very expensive piece of hardware, but it’s no use if you don’t have the software for it. So what Nvidia is doing for AI is they went to all the different verticals.

Fredric Van Haren (00:17:24) – They delivered software that is specific for that vertical, and they delivered it in a very consumable container. So they have a container repo, and if you want to do, let’s say, image recognition or any other kind of AI, you can download a container from them. So you will see that in data pipelines, more and more organizations will download and consume those containers from Nvidia and from others, right? Nvidia is not the only one, but Nvidia seems to be the most popular. The container-driven pieces of the puzzle for data pipelines are really, really important. Now, whether you use Rancher or you use AKS or anything else, that is less relevant from a hardware vendor perspective; it’s more relevant to you what you use, right? At some point you have a container and you can deploy that container. Kubernetes is really popular. So some of the tools are orchestrators, right? Kubernetes is an orchestrator, and there’s an orchestrator of the orchestrator, if you want, in data pipelines: the orchestrator of the data pipeline orchestrates, or uses, an orchestrator like Kubernetes.

Fredric Van Haren (00:18:38) – It doesn’t have to be, but it all depends on the comfort zone of the end user. So it’s a long story to say that there are many choices. You can still do bare metal, that’s more traditional, although I do see containers definitely being dominant. And whether it’s Kubernetes or some other orchestrator on top of it, that’s more of a customer choice.
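
As a concrete illustration of the orchestrator-of-the-orchestrator idea, here is a sketch that submits one pipeline stage as a Kubernetes Job running an NVIDIA NGC container on a GPU node, using the official Kubernetes Python client. The image tag, the job name, and the train.py entrypoint are assumptions made for the example, not details from the episode.

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

container = client.V1Container(
    name="trainer",
    image="nvcr.io/nvidia/pytorch:24.01-py3",  # an NGC container; tag assumed
    command=["python", "train.py"],
    resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
)

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="train-stage"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never")
        )
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```

A data pipeline orchestrator such as Airflow or Kubeflow Pipelines would sit one level above this, submitting a job like this for each stage.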

Ned Bellavance (00:19:03) – Gotcha. Okay. Yeah. Because when you mentioned Nvidia, the only thing I know about Nvidia when it comes to software is CUDA. There’s this CUDA thing that they created, and they made it very accessible for developers to use, and that gave them sort of an edge. I didn’t think about the fact that they may have also created these sorts of customized environments for different industry verticals, but when you put it that way, I’m like, yeah, why wouldn’t you do that? Go out and say, what’s the ideal environment you want to run your models in? Here, let’s build that for you, so you can consume our hardware a lot more easily.

Fredric Van Haren (00:19:35) – Right. And I think one of the reasons why hardware vendors have to do that is because AI has become a time-to-market thing. Meaning, you know, ten, fifteen years ago, and maybe not even that long ago, people would do AI on a single GPU, right? And they were doing AI, and that was a leg up compared to the competition. Now everybody is doing that, and so it’s not sufficient to say, oh, I have a GPU and I’m doing AI. Now there’s this massive competition, or race if you wish, to deliver higher accuracy based on more data, with more complex algorithms. And it’s a race against time, right? So if you look at ChatGPT, you know, they had thousands and thousands of GPUs and they ran that for three months. And now there’s a competition to do it in less than three months: you know, can we do it in six weeks, eight weeks? And it’s not just a reflection of the market; it’s also very important for hardware vendors.

Fredric Van Haren (00:20:35) – The hardware vendors are trying to deliver those containers as a way to reduce the time to market. So maybe a shorter way to say it is that Nvidia is trying to help people not reinvent the wheel. If the algorithm has been proven to recognize cats and dogs, there is no need for you as an individual to do that all over again.

Ned Bellavance (00:20:56) – Right? Okay. And when we’re talking about the container scheduling, it sounds like there is an orchestrator living above Kubernetes that is kind of leveraging Kubernetes or something similar as a scheduler: go run these containers and spray them across this hardware in a way that makes sense and bin-packs as efficiently as possible.

Fredric Van Haren (00:21:20) – Right. And that’s because the data pipelines are in different stages, and the stages are different, and the hardware requirements can also be different in each stage. In the early stages it’s all about data; it’s less about compute and more about processing the data: the cleaning, the ingestion, the storing, the labeling.

Fredric Van Haren (00:21:39) – And then there is a next phase where it’s more compute intensive. So, you know, AI is not like buying an Oracle database system, right, where you have all the known parameters and a few questions, like how many users and how much data. There are so many different things. And maybe another way to look at it is that AI is a living thing, right? AI is about continuous learning. So you might have petabytes of data today, but a successful AI project will be deployed, and when you deploy it, you get new data, and that new data needs to be processed again. So you can even say that a successful AI project will have a more challenging hardware environment as time goes by, while for a model that is not being used, you’re not going to get additional data, so you’re not going to have as many challenges growing or scaling as a very successful one. From my background, when I started in 2005, we were doubling capacity every 18 months.

Fredric Van Haren (00:22:45) – And when I say capacity, I mean everything, right? Storage, compute, networking; everything except the amount of people working on the project. From a hardware perspective it was all about scale, right? So, you know, sometimes we talk to people that say, oh, you know, I have a system and I’m good for three years, the traditional corporate mentality, right? I buy something for three years and then I’m done. That’s a very, very bad strategy.
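
A quick back-of-the-envelope calculation shows why the buy-for-three-years approach breaks down if capacity really doubles every 18 months, as Fredric describes:

```python
# Doubling every 18 months means hardware sized for today covers only
# a quarter of the demand three years out.
months = 36
doubling_period = 18
growth = 2 ** (months / doubling_period)
print(f"Capacity needed in {months} months: {growth:.0f}x today's")  # 4x
```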

Ned Bellavance (00:23:14) – Right. And that doubling that you mentioned, the increase in hardware: what type of hardware are we talking about here? Are these like single- and dual-socket 1U servers with as much memory as possible crammed in, and GPUs? Or I guess the GPUs change the form factor of the hardware. Because do you want to put as many GPUs in a single server as possible, or are you trying to spread that out across multiple servers? What does the hardware architecture look like for that?

Ethan Banks (00:23:41) – Today’s sponsor, DoiT, can help you with your cloud challenges. Maybe you want to maximize your cloud use while controlling your costs. Perhaps the issue is balancing resource utilization while delivering agile IT. Maybe you just can’t get good support from your cloud partners. DoiT can help. An award-winning strategic partner of Google Cloud and AWS, DoiT works with over 3,000 customers to save them time and money. DoiT combines intelligent software with expert consultancy and unlimited support to deliver cloud at peak efficiency with ease. The DoiT team knows multicloud, cloud analytics, optimization, governance, Kubernetes, AI and more. Work with DoiT to optimize your cloud investment so you can stay focused on business growth. Learn more at doit.com. That’s doit.com.

Fredric Van Haren (00:24:33) – So anything that is compute intensive, you always try to accelerate, right? When the IBM XT came out, there was an empty socket on the motherboard, and that was for the coprocessor. For people who really needed to do some number crunching, you know, at, what was it, 4.77 MHz...

Fredric Van Haren (00:24:51) – Right. You still needed a little coprocessor, you know, to crank it all up and move forward. It’s still the same concept, right? You still want to do things as fast as possible. And the data pipeline, because it’s about data: in certain cases, when you move data around, you don’t need a GPU for that, right? You can do that with something else. Whether it’s a single socket or dual socket or quad socket, or a Raspberry Pi if you’re at the edge, it all depends on your workload. But once you get to the number crunching piece, right, the piece about the 100 million unknowns, it’s all about math. It’s all matrix operations, and you do them over and over and over, because each time you show the training model a new picture of a cat or dog, it has to do the whole thing all over again. And so how do you accelerate these things? You want to use a processor that is very efficient at doing those specific math operations.

Fredric Van Haren (00:25:55) – The CPU as we know it is very, very flexible, but it’s not that great if you have to do the same thing over and over. There are processors that are a lot better for that. And the reason why we’re using GPUs for this processing today sounds a little bit weird, but the history is that it’s a graphics processing unit. Every computer came with a little processor, and the only thing that processor had to do is take data and put pixels on the screen fast enough that our eyes wouldn’t see it flicker. Right. So that’s the task of a GPU: it only has to do one thing, put those pixels on the screen. But the data is always different. So if you extrapolate that to what we’re doing today, that’s exactly what Nvidia is doing. Nvidia took the concept of the GPU doing one thing and doing it really well, at a high level. So GPUs get faster and faster. That’s the other thing: every two years Nvidia comes out and says, oh, guess what.

Fredric Van Haren (00:27:01) – You know, same form factor, well, maybe a little bit more power, maybe the same power, but it processes data twice as fast. Right. And it’s a horrible thing to hear if you just spent $10 million on buying GPUs and they’re being phased out. I mean, we all have that feeling with an iPhone or a new Samsung phone, you know, where you just bought one and then there’s an announcement, oh, there’s another one, a better one, a faster one, and so on. So in general, CPUs are still very key; they are kind of the super glue, right, in between the stages, moving data around. But the number crunchers are what I would call the accelerator cards. The GPUs you know are really from Nvidia, but AMD has, you know, a series of accelerators, and Intel has FPGAs, if you’re familiar with FPGAs, and ASICs. And the reason why there are so many different flavors is that for certain jobs, like analyzing text for large language models, or analyzing images, or analyzing videos, there are different algorithms, and those different algorithms can be accelerated in different ways.

Fredric Van Haren (00:28:19) – So what I’m saying here is, if you know what you’re doing and you look at what you need to do from a text or video or audio perspective, you can pick the one you want. Nvidia has done a great job of saying it all doesn’t matter, we do it all, which is true. But if performance is key to you, you will pick the one that works the best for you, right?

Ned Bellavance (00:28:42) – You’re talking about specialization. The CPU is a general purpose thing, so it’s not super efficient at any one given thing, but it can do a lot of stuff. We step down to the GPU that’s more targeted at doing a very specific task. And now you’re talking about, well, I want you to break that down and be more specialized in this very, very specific use case. And so what you’re saying is there are certain chips that are designed for very specific use cases. Does that have anything to do with like a tensor processing unit? Because I know I’ve heard that terminology as well. I think that’s from Google, the TPU.

Fredric Van Haren (00:29:20) – Right. So now we’re deep diving a little bit into the terminology. You heard me talk about matrices, right? A matrix, if you go back to your high school days, can have multiple dimensions. You can have zero dimensions, which is a single term; you really don’t have an array, it’s just a constant, and we call that a tensor of dimension zero. Then there is a tensor of dimension one, which is where you have elements in one dimension, and a tensor of dimension two has two different dimensions. And so when hardware vendors say tensor cores, they are basically telling you, hey, our hardware knows really well how to do matrix operations; it’s efficient at doing that. The way you can look at it is, you can implement in software how to do matrix calculations, but we all know that if you implement such a thing in hardware, it’s always faster than doing it in software.

Fredric Van Haren (00:30:25) – Right. And so this is exactly what it’s about. The hardware vendors are trying to tell you, hey, we know how to do matrix operations really well. If you have to choose between somebody who doesn’t do it and somebody who does, and speed is all you think about, then tensors, the tensor cores, that’s what it’s all about.
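
The tensor ranks Fredric walks through map directly onto array dimensions. This small NumPy sketch (illustrative only, not from the episode) shows rank 0, 1, and 2 tensors and the kind of matrix multiply that tensor cores accelerate in hardware:

```python
import numpy as np

scalar = np.array(3.0)               # rank 0: a single constant
vector = np.array([1.0, 2.0, 3.0])   # rank 1: one dimension
matrix = np.array([[1.0, 2.0],       # rank 2: two dimensions
                   [3.0, 4.0]])
print(scalar.ndim, vector.ndim, matrix.ndim)  # 0 1 2

# Training and inference repeat operations like this an enormous number of
# times, which is exactly what tensor cores are built to speed up.
weights = np.array([[0.2, 0.8],
                    [0.5, 0.5]])
print(matrix @ weights)
```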

Ned Bellavance (00:30:48) – How important is memory in these systems? Because we talked about CPUs and GPUs, but I imagine that data’s got to reside somewhere. So how much memory is going to go into these systems, and how important is that memory to the functioning of the models and the training?

Fredric Van Haren (00:31:06) – Right. So let’s look at it from the other angle. We have GPUs, and these GPUs have a certain amount of processing capability. Now, we all know that GPUs need data to process, right? And so the data needs to be ready, because you don’t want the GPU to be idle. The fastest way to make sure that the GPU never runs out of data is to put that data in memory.

Fredric Van Haren (00:31:34) – Now it’s like a domino effect, right? How does the data get into the memory? Well, the data gets into the memory from the storage devices. So now you need to have very fast, low latency storage devices that can feed those thousands and thousands of GPUs, to the point where there is enough data in memory that the GPU can read fresh data from the memory on the GPU. And so there are workloads and algorithms that are heavy on memory, meaning that they require a lot of memory on the GPU, because they have to query a lot of that data in order to make a decision, as opposed to making a decision on a smaller amount of data. But it’s really a domino effect, right? You start with the capabilities of your accelerator, because the capabilities of your accelerator define, you know, the latency, how much data you need to have in your GPUs. And so when you design, you always start with the accelerator, and then you go all the way down.

Fredric Van Haren (00:32:45) – Now remember, a successful AI will double every 18 months to two years. So in reality, you know, it’s like you’re chasing your tail the whole time, right? It’s not like, oh, I have X amount of gigabytes on board, and, just like with PCs, you would say I’m going to buy the maximum amount of memory. The memory on the GPUs is also expensive, right? And a GPU nowadays is already expensive. So there is a dilemma: am I going to spend more money on doubling the amount of memory on my GPU, or am I going to buy more GPUs? And those are decisions you kind of have to make based on the frameworks and the software frameworks you’re using.
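
One common way to keep the accelerator from going idle is to overlap data loading with compute. Here is a hedged PyTorch sketch of that pattern; the random tensors stand in for a real dataset, and the knob values are illustrative rather than recommendations:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset: 10,000 fake images and binary labels.
dataset = TensorDataset(torch.randn(10_000, 3, 32, 32),
                        torch.randint(0, 2, (10_000,)))

loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=4,      # CPU workers prepare batches in the background
    pin_memory=True,    # page-locked host memory speeds up copies to the GPU
    prefetch_factor=2,  # each worker keeps a couple of batches queued
)

device = "cuda" if torch.cuda.is_available() else "cpu"
for images, labels in loader:
    # non_blocking lets the host-to-GPU copy overlap with GPU compute
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # ... forward and backward passes would go here ...
    break
```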

Ethan Banks (00:33:38) – Yes. This also gets into some of the work that’s being done by the Ultra Ethernet Consortium: keeping systems primed up with data and using Ethernet as a transport. So if data is coming from one system to another for computing, Ethernet is viable as a transport, to minimize loss and keep everything very low latency.

Ethan Banks (00:33:56) – So you have no idle GPUs. That just popped into my head when you were talking about how we’ve got to keep the memory buffer full so that we can keep them crunching numbers.

Fredric Van Haren (00:34:04) – Kind of bringing it back to the previous conversation about the CPU: the CPU is the one that connects to the bus and everything else. If you look at an NVMe drive, an NVMe drive can actually move data around without the help of the CPU. That’s done on purpose, because talking to the CPU adds latency. You know, it’s kind of funny that we say that, right? At some point the CPU was considered the fastest thing around. But if you put eight expensive GPUs in a box and you have one or two CPUs, those CPUs are actually the bottlenecks. So you’re going to do whatever you can to eliminate that, right? That’s why on the storage side we’re talking about NVMe, and why on the networking side we’re talking about RDMA. So DMA, direct memory access, was always in all the computers.

Fredric Van Haren (00:34:56) – But then they came up with a protocol to do DMA from one box to another box, remote DMA, without the CPU really knowing too much about it. Because if you ask the CPU for an opinion, the CPU will give you an opinion, with the necessary latency. So can we eliminate the CPU? Not really, not at this point, just because all the hardware designs are around the CPU. If you remove the CPU, you remove the bus; I mean, you remove pretty much everything, right? But from an AI perspective, the CPU on the motherboard is probably the weakest link, because it’s blocking you from making a lot of progress. So you will see more and more technology acknowledging that the CPU is a necessary evil, but doing its best not to rely on it for any transaction.

Ethan Banks (00:35:51) – You mentioned NVMe drives. What is the typical configuration there? Is it just a single NVMe drive, or maybe a few, attached to that bus? Or are we dealing with storage arrays, NAS devices, anything like that for persistent data, versus what’s stored in memory?

Fredric Van Haren (00:36:09) – Yeah. So when you look at training, there are typically your CPU servers and GPU servers. If we look at the GPU servers, we talked about the GPUs. The GPUs have some memory, and the memory needs to deliver the data as fast as possible to the GPUs. The next step is to have local NVMe drives as an additional buffer. It’s like a layered system, okay? So now the local NVMe drives have the next batch of data that they can feed to the GPU, or to the memory of the GPUs. And then the local drives on your server are attached to storage in one of two ways, because we’re talking about petabytes; we’re not talking about, you know, a little NAS we’re going to attach to the server. We’re talking about something a little bit more complex. And so the two models we see applied: one is the traditional SAN, right? So you have a cluster of servers, most of them based on InfiniBand because of the low latency. You know, the difference between InfiniBand and Ethernet performance-wise is relatively the same, but on latency,

Fredric Van Haren (00:37:25) – InfiniBand is still the winner. So a SAN is one option; the other one is hyperconverged. Hyperconverged basically means, let’s assume you have 100 servers, and you kind of use the local NVMe drives to create the file system across all of them. There are some benefits to it; again, think latency. Hyperconverged is coming up because, you know, we always talk about, do you bring the compute to the data, or the data to the compute, or the storage to the compute, and so on. Hyperconverged kind of solves that a little bit, in the sense that you have both in the same box, to a certain degree. But if you ask and look around, most people will probably gravitate toward a SAN solution.

Ned Bellavance (00:38:14) – Yeah. I recall one of the challenges with hyperconverged was that you’re still dealing with data locality. So you may have, like, three copies of a piece of information stretched across your eight-node cluster. If the thing requesting that information doesn’t happen to be on one of those three nodes, it still has to pull it from the other nodes.

Ned Bellavance (00:38:34) – So you’re going across whatever medium is interconnecting that hyperconverged cluster. And maybe that’s InfiniBand. But in my experience it was usually Ethernet that was connecting all of them.

Fredric Van Haren (00:38:45) – It depends how it’s done, right? The funny thing about NVMe drives, because they don’t rely on the CPUs, is that if you use RDMA, you could have a situation where the local drive having to go through the CPU is actually slower than talking to a neighboring server over RDMA that doesn’t use the CPU. So it’s a little bit tricky. It’s hard to believe, but depending on your configuration, you can end up in a situation where staying local isn’t necessarily the fastest way, as crazy as it might sound.

Ethan Banks (00:39:20) – Another comment on the InfiniBand versus Ethernet latency. InfiniBand is still the king, but there’s the expense of InfiniBand and the increasing difficulty of finding people that know how to run an InfiniBand network. I know, again, the Ultra Ethernet Consortium is working on that problem, updating Ethernet so that you’re getting the same performance characteristics you would get out of InfiniBand,

Ethan Banks (00:39:41) – but being able to operate it in an Ethernet environment. So I’ll be interested to see how the UEC does over time. But if those standards come to fruition, it means there may be less reliance on InfiniBand over time, because you can get the same result with Ethernet, using Ultra Ethernet specifications to get that same delivery characteristic for your data.

Fredric Van Haren (00:40:02) – Yeah. No doubt, no doubt. I mean, I think InfiniBand has been around for a long time. The cost of InfiniBand in the early days was because of the licensing fee, right? Not the cost to implement it; it was really a licensing fee. As we know, Mellanox is the number one InfiniBand provider, acquired by Nvidia, and Nvidia spent a decent amount of time helping Mellanox build tools around it. But I agree with you, the complexity of managing InfiniBand is higher than Ethernet. On cost, I definitely want to debate you on that; I’ve seen solutions where Ethernet was more expensive, and I’m talking hardware components, where Mellanox is selling those switches at a very reasonable price.

Fredric Van Haren (00:40:49) – The biggest issue today with InfiniBand is availability.

Ned Bellavance (00:40:53) – There seem to be supply chain issues all over the place right now in terms of AI, because it is so ascendant and so many organizations are doing their best to get some sort of AI program going, and they go to order a GPU and they’re told, oh, it’s like a six-month or a 12-month wait just to get some GPUs in. Regarding organizations that are looking to build up some sort of internal AI practice, we’ve talked about all the different hardware options, and it sounds like there are a lot of decisions to be made here in terms of your data pipeline and the various hardware and software stacks you might use, and it sounds like it all comes down to understanding the nature of your workload and its profile. Are there some official measurement tools or metrics that folks are using to make those hardware and software decisions?

Fredric Van Haren (00:41:47) – Yeah, definitely, there are. There’s an organization, MLCommons, that delivers the MLPerf tool, and MLPerf does baseline metrics.

Fredric Van Haren (00:41:56) – So they say, for this particular piece of hardware, for this storage solution, with this particular data set, this is the performance you should expect. Now, when you do baseline metrics, it’s only useful if you are doing something similar to what the benchmark is benchmarking, right? It doesn’t make much sense to use a benchmark for video if the only thing I want to do is benchmark large language models, right? It’s different. But I think in AI in general, benchmarking is really key, for the simple reason that a model is a statistical representation, right? Even when the model is expected to say yes or no, and you say yes, internally the model is not going to say it’s 100% yes, right? It’s most likely going to say something like, I think for 82% it’s a yes, and a no might be 7%, and maybe the rest might be, I actually had no idea what you just said. And so those are kind of the challenges that you have to deal with and that are problematic to solve.
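
The 82% yes, 7% no intuition comes from the fact that a classifier emits scores that get turned into probabilities, commonly with a softmax. The raw scores below are invented just to reproduce those numbers; nothing here comes from a real model:

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    exps = np.exp(scores - np.max(scores))  # subtract the max for stability
    return exps / exps.sum()

raw_scores = np.array([2.3, -0.2, 0.25])  # classes: yes, no, "no idea"
for label, p in zip(["yes", "no", "no idea"], softmax(raw_scores)):
    print(f"{label}: {p:.0%}")
# Roughly: yes 83%, no 7%, no idea 11%
```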

Ned Bellavance (00:43:17) – Really, before you think about designing any kind of hardware or software solution, you have to know what you’re actually going to be doing with your AI program, right?

Fredric Van Haren (00:43:26) – Right. And people have difficulty believing that, but AI is really trial and error. When you look internally at what the math is doing: when you feed data to a training model, it cuts it into batches, right? A batch is just a unit of data, and then it’s going to do all the calculations based on that batch. Now, the first batch that is given to a training model, it’s like blindfolding somebody and asking them to throw darts at a dartboard, right? You absolutely have no idea where it’s going. So if you ever did one of the tutorials about recognizing handwritten digits or anything like that, you will see that the first output of the first batch is pretty much horrible, right? Because it’s really the unknown. And it’s very difficult to ask somebody who has never done or implemented an AI solution what this is going to look like, right? So when we talk to people, we always recommend going into the public cloud.

Fredric Van Haren (00:44:32) – And try stuff, you know, at a small scale, right? Try stuff, figure out what sticks, and then, with whatever sticks, you can go on premises and buy some hardware, or even stay in the public cloud if you want to. And then once you figure it out, you can scale. But it’s very, very difficult to figure out what to do unless it’s very similar to something that other people have done. So, for example, if you were to start a self-driving car business today, you have an idea of where to start, because a lot of people have thrown stuff at the wall and you don’t have to reinvent the wheel there. But it’s not easy, and a lot of people make mistakes from that perspective. And, you know, we have to give Nvidia marketing some credit there, because when people start with AI, they say, oh, what do I do? And then they go to Nvidia, and Nvidia will say, hey, you want to do AI? That’s really easy.

Fredric Van Haren (00:45:33) – Just buy Nvidia GPUs. And then the customer says, but how do I do efficient AI? And Nvidia will tell them that’s an even easier solution: you just buy more GPUs.

Speaker 6 (00:45:45) – Right? Right.

Fredric Van Haren (00:45:48) – So is it wrong? It’s not wrong if the GPU is the right thing for you, but it created kind of a problem in the market, right? Because, as I mentioned earlier, people are now buying GPUs not just to do AI; they’re buying GPUs from a time-to-market perspective. Organizations are hoarding GPUs, right? They’re buying GPUs because they know that if they buy them, they can use them to improve and go faster, but it also means that their competitors won’t have access to those GPUs. And so Nvidia is kind of in this catch-22: do we make more GPUs? But at the same time, we’re developing new GPUs, so we don’t want to make too many of the old GPUs while we’re working on the new ones. And then the market is saying, oh, I want to do ChatGPT.

Fredric Van Haren (00:46:39) – I also want to build a large language model, so I’m going to buy 10,000 GPUs. And then you’re surprised that when you want to buy one GPU, they basically tell you, well, it’s between six and nine months.

Ned Bellavance (00:46:53) – So that’s great advice. I mean, what cloud really excels at is the ability to experiment without a large upfront expenditure. And so for organizations that are trying to get started, start with the cloud, figure out what your needs are, and then, if the cost and efficiency make sense, bring it on prem, or just keep it in the cloud.

Ned Bellavance (00:47:13) – I think we could talk about AI for another four or five hours, but unfortunately we don’t have that kind of time and we only have you for a set amount of time today. So I think we’re going to wrap up here. But before we do that, Fredric, is there anywhere people can go to get more information about AI from you? Do you have a particular blog, podcast, or video that you like to point people at?

Fredric Van Haren (00:47:40) – Yeah, so I do a decent amount of presentations and I also write blogs, and that’s all on the company website, which is HighFens. HighFens is h-i-g-h-f-e-n-s dot com, and on the front page there are links to the most recent postings. But you can find all kinds of blogs and even interviews where we talk about basic AI and all these components. And you can also find me on LinkedIn under Fredric Van Haren.

Ned Bellavance (00:48:12) – Awesome. Well, Fredric, thank you so much for being a guest today on Day Two Cloud. And hey, listeners out there, virtual high fives to you for tuning in. If you have suggestions for future shows, we want to hear about those suggestions. You can hit us up on Twitter at Day Two Cloud Show, or go to our website, Day Two Cloud, and fill out the handy request form. If you like engineering-oriented shows like this one, visit packetpushers.net/subscribe. All of our podcasts, newsletters, and websites are there. It’s all nerdy content designed for your professional career development.

Ned Bellavance (00:48:46) – Until next time, just remember: cloud is what happens while IT is making other plans.
