Building blocks of a cloud-native AI data centre – part I
By Werner Coetzee, Business Development Executive at Data Sciences Corporation
The cloud revolution has changed the face of IT operations. Business leaders are looking for platforms that serve a broad user base and deliver non-stop services dynamically, while providing high performance at the lowest possible cost.
With HPC and supercomputers becoming more prevalent in commercial use cases and forming part of primary compute environments, new supercomputers must be architected to deliver performance as close to bare metal as possible, but do so in a multi-tenant fashion. Historically, supercomputers were designed to run single applications (as discussed in our previous article, 'Super Compute in Enterprise AI').
It stands to reason, then, that a cloud-native supercomputer is needed to deliver on these demands, which are fast becoming table stakes for AI in the enterprise. Cloud-native supercomputer architecture aims to maintain the fastest possible compute, storage and network performance while meeting cloud services requirements, such as least-privilege security policies and workload isolation, data protection, and instant, on-demand AI and HPC services. It seems so simple!
Cloud-native supercomputing key ingredients (CPU, GPU, DPU)
Cloud-native supercomputing needs vital ingredients to deliver on those requirements. At the core are CPUs and GPU accelerators. Without getting too technical, CPUs are the brains of any computing platform and handle all the standard management and precision tasks needed to run a server. They have one weakness: they process their instructions largely in serial. That's where GPUs come to the rescue. GPUs process tasks in parallel; they handle instructions that don't require the CPU's precision, but execute them far faster.
To illustrate, imagine your favourite fast food that you just ordered on your favourite delivery app. Now let's say that 20 houses in your area also ordered from the same fast-food place. If the delivery service only had one scooter, it would have to deliver the food in serial, like a CPU, going from one house to the next. At five minutes per delivery, it would take 100 minutes to complete all 20 orders. In contrast, if the delivery service behaved like a GPU, all 20 orders would be delivered simultaneously by 20 scooters, reducing the delivery time by 95 minutes. (This is a hypothetical scenario, please bear with me!)
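For the programmatically inclined, the arithmetic behind the analogy can be sketched in a few lines of Python (a toy model using the figures from the example above; it is not a real scheduler):

```python
# Toy model of the delivery analogy: serial (CPU-like) vs parallel (GPU-like).
ORDERS = 20           # houses that placed an order
MINUTES_PER_TRIP = 5  # time for one delivery

# One scooter: deliveries happen one after another, like serial CPU execution.
serial_time = ORDERS * MINUTES_PER_TRIP

# Twenty scooters: all deliveries happen at once, like parallel GPU execution.
parallel_time = MINUTES_PER_TRIP

print(serial_time)                  # 100 minutes
print(parallel_time)                # 5 minutes
print(serial_time - parallel_time)  # 95 minutes saved
```

The saving grows linearly with the number of orders, which is exactly why parallel execution dominates for workloads made of many independent tasks.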
Now that you're wondering what fast food to get, let's get back into the digital world. CPU with GPU acceleration is key to any AI and HPC success, but these building blocks alone don't meet the cloud-native requirements set out earlier. So, to move from traditional supercomputing to cloud-native supercomputing, we need the next fundamental building block, the data processing unit (DPU).
Dealing with cloud-native tasks
As brilliant as CPUs and GPUs are at delivering the raw performance AI workloads need, in a traditional supercomputer architecture the cloud-native tasks would have to take some of that processing power: shared platform tasks, security and isolation, data protection, and the many other infrastructure management and communication tasks that make multi-tenant supercomputing work. The trade-off in AI processing power is potentially significant.
This is where the DPU becomes critical. All the tasks required to enable cloud-native supercomputing are offloaded from the CPU and GPU to the DPU for processing.
Think back to our fast food example. Imagine the scooter drivers had to package each order and decide which scooter it should go on to reach the correct house. Not only would there be chaos with mixed-up orders, but even if they got it right, it would take time and delay delivery to the customer. The answer is to let the drivers focus on driving, navigating and getting to the house as quickly as possible. The task of isolating the orders, packaging them correctly and getting them to the right scooter is offloaded to an order controller at the fast food place.
On the same principle, migrating the cloud-native-specific tasks from the CPU and GPU to the DPU ensures maximum performance in executing AI and HPC tasks, along with the security and isolation that multi-tenant environments require, optimising both performance and return on investment.
The cloud-native supercomputer needs this third brain to build faster and more efficient systems. In traditional supercomputers, workloads sometimes must wait while the CPU handles a mundane communication task; this is called system noise. The DPU, or third brain, allows AI compute tasks and communication and management tasks to run smoothly side by side.
Today's AI enterprise needs a turnkey, cloud-native supercomputing solution that can dynamically respond to the demands of its users, delivering resources right-sized to every workload with secure multi-tenancy. Purpose-built for demanding enterprise environments using NVIDIA BlueField DPUs and NVIDIA Base Command Manager, NVIDIA DGX SuperPOD provides multi-tenancy with safe isolation of users and data, without sacrificing bare-metal performance.
A cloud-native supercomputer, such as NVIDIA DGX SuperPOD, offers innovators access to the highest-performance infrastructure with the same assured security found in the cloud.
Now go collect your food!
Our next article will explore how on-premises cloud-native AI infrastructure complements public cloud AI platforms.
Data Sciences Corporation is a leading IT solutions provider and emerging technologies systems integrator. Contact us for more information about our NVIDIA AI/ML supercomputers for the enterprise.
Watch our video explaining the difference between CPU and GPU here: https://www.linkedin.com/feed/update/urn:li:activity:6853945933231722496/
To learn more about the NVIDIA BlueField DPU watch this video: https://www.linkedin.com/feed/update/urn:li:activity:6856583741754736640/