Building blocks of a cloud-native AI data centre – part II

Issued by Data Sciences Corporation
Johannesburg, 14 Feb 2022

Beyond the brain

In our previous article, we took a brief but enjoyable tour of the three compute brains found in modern AI supercomputers: the CPU, GPU and DPU. These brains work together to provide one of the fastest data analytics platforms commercially available to enterprises today. However, the brains are just one part of the platform; we need to go beyond the brain to build a comprehensive cloud-native AI data centre!

Layers upon layers

We must explore the three key building blocks of the platform. We will discuss them as layers that build up to a complete platform. As already mentioned, the NVIDIA DGX A100 platform is a 5-petaflop "Megamind" compute platform that houses the three brains discussed above. This is the first building block or layer, known as the compute layer.

Network nervous system

Next, we need to connect multiple brains to scale the compute power to match our workload and data demand. To do this, we need to have a sort of "nervous system" network that allows messages to move between the compute brains. In IT, we call this the networking layer. Yes, not very original but most definitely practical.

The networking layer in AI supercomputing can get complex, but we will stick to the high-level concepts for this blog. Essentially, we need to build high-speed data transfer networks that allow all the components to talk to each other and move immense amounts of data from point A to point B, C or D. In the supercomputer world, we rely on two types of networks. The first is the ubiquitous workhorse of the internet and all-conquering data switcher, commonly known as Ethernet. The second is the ultra-fast network built for extreme data sizes and parallelised algorithms, aptly named InfiniBand.

The Ethernet network's primary function is to handle all the everyday traffic between computers and to connect the entire AI supercomputer platform to the enterprise's regular local area network (LAN).

The InfiniBand network handles all the heavy-lifting data traffic. It moves vast amounts of data between the computers and keeps the travel time to every part of the compute layer as short as possible. The InfiniBand network also has a unique function whereby it can directly interact with one of the compute brains, the data processing unit (DPU). This means that the InfiniBand network extends the compute layer into the network layer; this capability is essential in AI supercomputers.

These two network technologies are used to create two or even three super-fast "nervous systems" that facilitate all the data traffic between the compute layer and the third layer, which we will delve into next.

Data is brain fuel

Data is not ethereal; it does not lack material substance. In a way, data can be likened to water. Like water, data can flow and move. It can be stored, frozen or locked, and it can evaporate if you are not careful. The point is, data has mass.

Vast amounts of data need to live somewhere, so they need to be stored. Hence, following in our tradition of unoriginal but functional naming conventions, IT gives you the storage layer. Yes, you guessed right: the storage layer for our cloud-native AI supercomputer platform comprises its own set of brains, a nervous-system-like network and a stomach. Yes, I may be stretching my body analogy a bit far here, but bear with me.

Technically speaking, we store data on data drives. There are many types of drives, enough to warrant a complete blog series of their own. But for our purpose, we will use the term flash storage. Flash storage, or SSD-based storage, is the fastest data platform commercially available now. As seen in our previous layers, speed is a golden thread that ties all our components together.

So, how does it all work? Well, the AI supercomputer is designed to process vast amounts of data. We must change the data's state from being stored to being moved quickly, and enterprise-grade storage platforms are designed to do just that. The same InfiniBand nervous system network that carries data between compute brains can also fetch the data from the storage platform. The brain demands the data at a phenomenal speed. This means that we need super-fast flash storage to feed the brain! (Somewhere in here is a vague but valuable connection with the stomach converting food into the energy that the brain and body need to function optimally.)
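To put "phenomenal speed" into rough numbers, here is a minimal back-of-the-envelope sketch in Python. It estimates how long it takes to move a data set over a link of a given speed. The data set size and the two link speeds are purely hypothetical figures chosen for illustration, not specifications of any product mentioned in this article, and the function name transfer_time_seconds is our own. Real-world transfers will also be slower than this idealised figure because of protocol and storage overheads.

```python
def transfer_time_seconds(dataset_bytes: float, link_gbps: float,
                          efficiency: float = 1.0) -> float:
    """Idealised time to move dataset_bytes over a link of link_gbps gigabits/s.

    efficiency is the assumed fraction (0 < efficiency <= 1) of the nominal
    link speed that is actually achieved.
    """
    if dataset_bytes < 0:
        raise ValueError("dataset_bytes must not be negative")
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    bits_to_move = dataset_bytes * 8            # 8 bits per byte
    bits_per_second = link_gbps * 1e9 * efficiency
    return bits_to_move / bits_per_second


TERABYTE = 1e12  # decimal terabyte, in bytes

if __name__ == "__main__":
    dataset = 10 * TERABYTE  # hypothetical 10 TB training data set
    # Hypothetical link speeds, for comparison only.
    for label, gbps in (("slower link, 10 Gb/s", 10), ("faster link, 200 Gb/s", 200)):
        minutes = transfer_time_seconds(dataset, gbps) / 60
        print(f"{label}: about {minutes:.1f} minutes")
```

Under these assumptions, a twenty-fold faster link cuts the waiting time twenty-fold, which is why both the network and the storage behind it must keep pace with the compute brains.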

At Data Sciences Corporation, data is in our blood, and we offer various flash-based data platforms that are uniquely suited to this type of work. For example, we partner with Pure Storage, with its FlashBlade platform, and VAST Data to provide a best-in-class AI supercomputer storage layer.

We have spent the last two articles exploring the various hardware building blocks of our cloud-native AI data centre. Hopefully, we have given you a high-level insight into what goes into an AI supercomputer.

Data Sciences Corporation is a leading IT solutions provider and emerging technologies systems integrator. Contact us for more information about our NVIDIA AI/ML Supercomputers for the enterprise.

For more information about AI servers for the most complex AI challenges, visit: https://datasciences.co.za/nvidia/.

To enjoy our videos on Going Beyond the Brain, follow this link: https://www.linkedin.com/feed/update/urn:li:activity:6859488966761046016/.

Werner Coetzee, Business Development Executive at Data Sciences Corporation