Revolutionizing Data Center Efficiency with the NVIDIA Grace Family

Global data is projected to reach 175 zettabytes by 2025, driving exponential growth in data processing demand. This demand contrasts sharply with the slowing pace of CPU performance improvements: for more than a decade, semiconductor advancements have not kept up with the pace predicted by Moore’s Law, leaving a pressing need for more efficient computing solutions. 

NVIDIA GPUs have emerged as the most efficient way to meet these growing compute needs. Their ability to handle complex tasks and process workloads in parallel maximizes the work done per unit of energy consumed, making them 20x more energy-efficient than traditional CPUs for a wide range of data center workloads, including AI, high-performance computing (HPC), data processing, and video and image processing. 

As more applications become accelerated, innovation in CPUs is needed to maximize data center efficiency. Accelerated computing requires innovation across the full stack, from hardware to software, platforms, and applications across multiple domains. 

NVIDIA has consistently delivered breakthrough GPUs and networking. But while GPUs excel at parallel workloads, CPUs are still required for serial tasks. To fully unlock acceleration in the modern AI data center, we need a new CPU architecture with the following features:

High per-core performance

Massive memory bandwidth

Low power consumption

Sufficient cores to run the services needed

Great connectivity for tight GPU and CPU collaboration 

The NVIDIA Grace CPU is the first CPU designed by NVIDIA to power the AI era:

72 high-performance, power-efficient Arm Neoverse V2 CPU cores

NVIDIA Scalable Coherency Fabric (SCF), which enables fast data movement between the CPU cores, memory, and I/O

High-bandwidth, low-power LPDDR5X memory

900 GB/s coherent NVLink Chip-to-Chip (C2C) connection with NVIDIA GPUs or CPUs 

The NVIDIA Grace CPU powers multiple NVIDIA products. It can pair with either NVIDIA Hopper or NVIDIA Blackwell GPUs to form a new type of processor that tightly couples the CPU and GPU to supercharge generative AI, data processing, and accelerated computing. 

The NVIDIA Grace CPU is also a world-class standalone data center CPU. It pairs with a second NVIDIA Grace CPU to create the NVIDIA Grace CPU Superchip. Offered as a compact two-socket module, the Grace CPU Superchip delivers 2x the performance of leading traditional CPUs in the same power envelope.

Next-generation data center CPU performance efficiency

Data centers are constrained by power and space, which means that infrastructure must deliver maximum performance at the lowest possible power. 

The NVIDIA Grace CPU Superchip delivers outstanding performance, memory bandwidth, and data-movement capabilities with leadership performance per watt, providing generational gains in energy-efficient CPU computing for the data center. It also provides the versatility and performance needed for foundational data center workloads such as microservices, data analytics, graph analytics, and simulation.

Figure 1. NVIDIA Grace CPU Superchip performance compared against x86 2S servers

Figure 2. NVIDIA Grace CPU Superchip performance per power (CPU + memory power) compared to x86 2S servers

Configurations: NVIDIA Grace Superchip with 480 GB of LPDDR5X, AMD EPYC 9654 with 768 GB of DDR5, and Intel Xeon Platinum 8480+ with 1 TB of DDR5. OS: Ubuntu 22.04. Compilers: GCC 12.3 unless noted below. Power for energy efficiency includes measured CPU + memory power.

Compression: Snappy (commit af720f9a3b2c831f173b6074961737516f2d3a46; N instances in parallel)

Microservices: Google Protobufs (commit 7cd0b6fbf1643943560d8a9fe553fd206190b27f; N instances in parallel)

Seismic data processing: SPECFEM3D four_material_simple_model (HPC SDK 24.3)

CFD: OpenFOAM Motorbike Large v2212

MD: CP2K RPA 2023.1

Weather: WRF CONUS12km (x86: ICC 2024.01)

Climate: NEMO Gyre_Pisces v4.2.0

Weather: ICON QUBICC 80 km resolution

Data analytics: HiBench K-means Spark (HiBench 7.1.1, Hadoop 3.3.3, Spark 3.3.0; Grace: NVHPC 24.5, x86: Intel 2021.4)

Graph analytics: GAP Benchmark Suite BFS (arXiv:1508.03619 [cs.DC], 2015)

Results subject to change.

As problem sets grow, the ability to scale out to multiple nodes is critical. The NVIDIA Grace CPU Superchip has also demonstrated performance scaling across multiple nodes in a popular computational fluid dynamics (CFD) application. 

Figure 3. NVIDIA Grace CPU Superchip multi-node scaling on OpenFOAM

OpenFOAM v2312. Input: Motorbike 35M and 68M cells. Intel x86 platform results computed on “Eos,” NVIDIA DGX SuperPOD H100 systems; OS: Ubuntu 22.04; Compiler: 2024.0.1. Grace Superchip results computed on an internal NVIDIA MGX evaluation cluster composed of 16 Supermicro MGX ARS-221GL-N nodes with NVIDIA Grace Superchip 480 GB and NVIDIA InfiniBand ConnectX-7 NDR400; OS: Ubuntu 22.04; Compiler: GCC 13.10.

Customer momentum

Customers are quickly adopting the NVIDIA Grace family of products for generative AI, hyper-scale deployments, enterprise compute infrastructure, high-performance computing (HPC) and scientific computing deployments, data analytics, intelligent edge platforms, and more.

For example, NVIDIA Grace Hopper-based systems deliver 200 exaflops, or 200 quintillion calculations per second, of energy-efficient AI processing power for HPC. 

Many HPC centers are deploying CPU-only systems based on the NVIDIA Grace CPU. 

Customers such as Murex, Gurobi, and Petrobras are seeing compelling performance results in the financial services, analytics, and energy verticals, demonstrating the benefits of NVIDIA Grace CPU and NVIDIA GH200 solutions.

High-performance CPU architecture 

The NVIDIA Grace CPU was engineered to deliver exceptional single-threaded performance, ample memory bandwidth, and outstanding data movement capabilities, all while delivering a large leap in energy efficiency compared to traditional x86 solutions. 

To achieve this combination of high performance and outstanding energy efficiency, the NVIDIA Grace CPU Superchip incorporates many newly developed architecture innovations:

NVIDIA Scalable Coherency Fabric

Server-grade LPDDR5X with ECC

Arm Neoverse V2 Cores

NVLink-C2C

NVIDIA Scalable Coherency Fabric

A key challenge is keeping all the cores, caches, memory, and high-speed system I/O free of bottlenecks so that applications get the most out of the architecture. NVIDIA Scalable Coherency Fabric (SCF) (Figure 4) is a mesh fabric and distributed cache architecture designed by NVIDIA to meet those challenges, scaling cores and bandwidth in a power- and area-efficient manner. 

NVIDIA SCF also extends memory coherency to a second NVIDIA Grace CPU in a Superchip configuration, or to a GPU in an NVIDIA Grace Hopper or NVIDIA Grace Blackwell configuration.

The CPU cores and SCF cache partitions are distributed throughout the mesh, while cache switch nodes route data through the fabric and serve as interfaces between the CPU, cache memory, and system I/Os. 

SCF provides over 3.2 TB/s of total bisection bandwidth to keep data flowing between the CPU cores, NVLink-C2C, memory, and system I/O. The SCF reduces bottlenecks in data-movement-heavy applications, such as graph analytics, where NVIDIA Grace delivers up to 2x the performance of leading x86 servers.

Figure 4. NVIDIA Grace CPU and the NVIDIA SCF

Server-grade LPDDR5X with ECC 

Data center CPUs require high-bandwidth, high-capacity memory subsystems. At the same time, these memory subsystems must be power-efficient to ensure that as much power as possible is allocated to the CPU cores. 

The NVIDIA Grace CPU Superchip uses up to 960 GB of server-class, low-power double data rate 5X (LPDDR5X) memory with error-correcting code (ECC). The NVIDIA Grace memory subsystem delivers up to 500 GB/s of bandwidth while consuming only about 15W of power, substantially lower than standard dual in-line memory module (DIMM)-based designs. 
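
Sustained memory bandwidth like this is typically measured with a STREAM-style triad. The following minimal sketch is illustrative only (not the official STREAM benchmark); the array size and the compile line in the comment are arbitrary assumptions:

```c
/* A minimal STREAM-style triad sketch (illustrative only).
 * Compile, for example, with: gcc -O3 -mcpu=native -fopenmp triad.c -o triad
 */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1L << 27)   /* 128M doubles, ~1 GB per array */

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;

    /* First-touch initialization places pages on the local NUMA node. */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];          /* triad: two reads, one write */
    double t1 = omp_get_wtime();

    /* Bytes moved: 3 arrays x N elements x 8 bytes */
    printf("Triad bandwidth: %.1f GB/s\n",
           3.0 * N * sizeof(double) / (t1 - t0) / 1e9);
    free(a); free(b); free(c);
    return 0;
}
```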

This design strikes the optimal balance of bandwidth, energy efficiency, capacity, and cost for large-scale AI, HPC, and cloud workloads.

Arm Neoverse V2 cores 

Even as the parallel compute capabilities of GPUs continue to advance, workloads can still be gated by serial tasks run on the CPU. For maximum workload acceleration, a fast and efficient CPU core is critical to system design. 

At the heart of the NVIDIA Grace CPU are the Arm Neoverse V2 CPU cores. Neoverse V2 cores are optimized to deliver industry-leading performance per thread while also providing exceptional energy efficiency compared to traditional CPUs.

The NVIDIA Grace CPU Superchip integrates up to 144 high-performance Arm Neoverse V2 cores, each with four 128-bit Scalable Vector Extension version 2 (SVE2) single-instruction, multiple-data (SIMD) pipelines, to deliver 2x the data center performance efficiency of the latest-generation x86 servers.
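
While compilers auto-vectorize for SVE2 given the right flags, SVE’s vector-length-agnostic model is easy to see in a small intrinsics sketch. This example is illustrative, assuming GCC or Clang with Arm C Language Extensions (ACLE) support; the function name is made up:

```c
/* Vector-length-agnostic SVE sketch: the same binary adapts to any
 * hardware vector width. Compile on Grace with: gcc -O3 -mcpu=native -c sve_add.c
 */
#include <arm_sve.h>

void vec_add(float *dst, const float *x, const float *y, long n) {
    for (long i = 0; i < n; i += svcntw()) {  /* svcntw(): 32-bit lanes per vector */
        svbool_t pg = svwhilelt_b32(i, n);    /* predicate masks the loop tail */
        svfloat32_t vx = svld1(pg, x + i);
        svfloat32_t vy = svld1(pg, y + i);
        svst1(pg, dst + i, svadd_x(pg, vx, vy));
    }
}
```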

NVLink-C2C

To create the NVIDIA Grace CPU Superchip with up to 144 Arm Neoverse V2 cores without bottlenecks when moving data between the chips, the NVLink Chip-to-Chip (C2C) interconnect provides a 900 GB/s direct connection between the two CPUs.

A typical server architecture has two sockets, each composed of multiple dies, and may present up to eight non-uniform memory access (NUMA) domains, over 800W of CPU and memory power, and 500 GB/s of bandwidth between the NUMA nodes.

The Grace CPU Superchip uses a clean and simple memory topology. With only two NUMA nodes, 500W of CPU and memory power, and the 900 GB/s high-bandwidth NVLink-C2C interconnect, the Grace CPU Superchip helps alleviate NUMA bottlenecks for application developers and users.

Figure 5. Comparison of NVIDIA Grace and x86 system architectures
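
As a rough software-level illustration of this simpler topology, the hedged sketch below uses the standard Linux libnuma API (not an NVIDIA-specific interface) to query the node count and bind an allocation to one Grace socket; the node index and allocation size are illustrative:

```c
/* Compile with: gcc -O2 numa_demo.c -o numa_demo -lnuma */
#include <stdio.h>
#include <numa.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    /* A Grace CPU Superchip typically reports 2 nodes, one per Grace CPU,
     * versus up to 8 on some multi-die x86 2S servers. */
    printf("Configured NUMA nodes: %d\n", numa_num_configured_nodes());

    /* Bind a 1 GiB allocation to node 0 (the first Grace socket). */
    size_t bytes = 1UL << 30;
    void *buf = numa_alloc_onnode(bytes, 0);
    if (!buf) return 1;
    /* ... node-local work on buf ... */
    numa_free(buf, bytes);
    return 0;
}
```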

In Grace Hopper configurations, this connection provides unified cache coherence with a single memory address space that combines system memory and GPU high-bandwidth memory (HBM) for simplified programmability. This coherent, high-bandwidth connection between the CPU and GPU makes over 600 GB of fast memory available to the GPU and is key to solving the most complex AI and HPC problems.

NVIDIA Grace Hopper

As AI evolves from pilot projects to mainstream use, it is increasingly being integrated into conventional CPU-based workflows and enterprise applications. This integration blurs the boundaries between CPUs and GPUs, necessitating a new kind of converged accelerated computing architecture. 

Traditionally, accelerators connect to the CPU over PCIe, which can bottleneck data transfer, and the processors have separate memory pools.

The NVIDIA Grace Hopper architecture brings together the groundbreaking performance of the NVIDIA Hopper GPU with the versatility of the NVIDIA Grace CPU in a single Superchip, connected with the high-bandwidth, memory-coherent 900 GB/s NVIDIA NVLink Chip-to-Chip (C2C) interconnect. At roughly 128 GB/s of bidirectional bandwidth for a PCIe Gen 5 x16 link, that makes NVLink-C2C 7x the bandwidth of PCIe Gen 5. 

NVLink-C2C memory coherency increases developer productivity, performance, and the amount of GPU-accessible memory. CPU and GPU threads can concurrently and transparently access both CPU and GPU resident memory, enabling you to focus on algorithms instead of explicit memory management.
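
As a minimal illustration of what this coherency means for code, the hedged sketch below uses the standard CUDA runtime API from host code to check whether the GPU can coherently access ordinary pageable (malloc) system memory, the property NVLink-C2C provides on Grace Hopper; the file name and compile line are assumptions:

```c
/* Compile with: gcc unified.c -o unified -I/usr/local/cuda/include \
 *               -L/usr/local/cuda/lib64 -lcudart
 */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
    int pageable = 0;
    cudaDeviceGetAttribute(&pageable, cudaDevAttrPageableMemoryAccess, 0);
    printf("GPU can access pageable host memory: %s\n", pageable ? "yes" : "no");

    if (pageable) {
        /* On Grace Hopper, a kernel could read and write this ordinary
         * malloc'd buffer directly over NVLink-C2C, with no cudaMemcpy. */
        float *data = malloc(1024 * sizeof(float));
        for (int i = 0; i < 1024; i++) data[i] = (float)i;
        /* ... launch kernels that dereference data directly ... */
        free(data);
    }
    return 0;
}
```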

Figure 6. Grace Hopper architecture overcomes PCIe bottlenecks 

One example of an emerging workload that blends CPU and GPU processing is retrieval-augmented generation (RAG). RAG workloads have gained adoption in the enterprise due to their ability to ground LLMs in a corporate knowledge base, thereby reducing model hallucination. 

RAG requires the constant conversion of internal corporate documents and digital assets into embeddings that are stored in a vector database for fast retrieval during the inference phase. Enterprises serving RAG workloads typically resort to running the embedding generation, vector database creation and indexing, and vector search phases of the workload on CPUs, reserving GPUs for the inference phase.

With NVIDIA Grace Hopper, enterprises can run every phase of the RAG workload on a unified accelerated computing architecture. This accelerates RAG workload performance by up to 1.5x on the popular Llama 2 70B model compared to a system that couples an H100 GPU with a traditional x86 CPU. 

The RAG workload draws on the performance of the NVIDIA Grace CPU, the converged CPU and GPU memory, and the 900 GB/s NVLink-C2C to accelerate all the non-inference phases, while the NVIDIA Hopper GPU accelerates the inference phase.

In addition to the unique innovations in the Superchip itself, NVIDIA Grace Hopper is offered in a modular MGX server design, the GH200 NVL2, which connects two Superchips through NVLink in a single server, simplifying deployment and scale-out for mainstream LLM inference. 

IT leaders and decision-makers aiming to balance cost-effectiveness with user experience often use a model-sharding strategy to serve mainstream LLMs in production. This involves dividing a single model across multiple GPUs connected by low-latency, high-bandwidth networks. 

This method increases the number of users that can be served, reducing costs while also ensuring good user experiences. It also enables organizations to start with a smaller setup and scale out by adding GPUs as demand grows. 

The modular single-node design of the NVIDIA GH200 NVL2 makes it ideal for mainstream LLM model serving and scale-out architectures.

By embracing this new breed of hybrid accelerated Superchips, with their converged memory and simplified programming model, IT leaders and decision-makers expanding or retrofitting their data centers can lay a robust foundation that accommodates not only traditional serial-processing applications and AI-enhanced applications but also the next generation of AI-driven innovations.

NVIDIA Grace Blackwell

NVIDIA GB200 NVL72 connects 36 NVIDIA Grace CPUs and 72 NVIDIA Blackwell GPUs in a rack-scale design, supercharging generative AI, data processing, and high-performance computing. 

NVIDIA Blackwell features 208B transistors and a second-generation Transformer Engine. It supports fifth-generation NVIDIA NVLink, which provides 1.8 TB/s of bidirectional throughput per GPU, delivering unparalleled acceleration for the GPU-to-GPU communication that occurs when trillion-parameter models are deployed across multiple GPUs with combinations of parallelism techniques. 

GB200 NVL72 is delivered as a liquid-cooled, rack-scale solution with a 72-GPU NVLink domain that acts as a single massive GPU, enabling 30x faster inference on state-of-the-art trillion-parameter mixture-of-experts (MoE) LLMs. 

Leading cloud service providers have announced plans to adopt NVIDIA Grace Blackwell.

Standard software infrastructure built on the Arm software ecosystem

All major Linux distributions, and the vast collections of software packages they provide, work perfectly and without modification on NVIDIA Grace. Applications, libraries, dependencies, utilities, tools, and so on can be trivially installed using your OS package manager.

Many household-name applications, both closed and open source, provide optimized executables for Arm. The Arm Developer Hub provides a showcase of selected software packages for AI, cloud, data center, 5G, networking, and edge. This ecosystem is enabled by Arm standards, such as the Arm Server Base System Architecture (SBSA) and the Base Boot Requirements (BBR) of the Arm SystemReady Certification Program.

NVIDIA Grace implements these standards and also uses the popular Neoverse microarchitecture, so software optimizations for other widely available Arm CPUs directly benefit NVIDIA Grace as well. For more information about how to install and configure software, see the NVIDIA Grace documentation.

Figure 7. NVIDIA Grace family software ecosystem

In addition to the broader Arm software ecosystem, the NVIDIA software ecosystem is available and optimized for NVIDIA Grace. The NVIDIA HPC SDK and every CUDA component have Arm-native installers and containers. NGC also provides deep learning, machine learning, and HPC containers optimized for Arm.

NVIDIA is also actively expanding its software ecosystem for Arm CPUs. Recently, NVIDIA launched NVIDIA Performance Libraries (NVPL), a suite of high-performance math libraries for Arm CPUs. These libraries are drop-in replacements for most x86 math libraries and are highly tuned to maximize Grace CPU performance. 
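
To illustrate the drop-in model, the sketch below is ordinary CBLAS code with no NVPL-specific calls; assuming NVPL is installed, only the link line changes. The library name in the comment follows NVPL’s documented naming scheme and may differ on your system:

```c
/* Plain CBLAS source; to use NVPL, change only the link line, e.g.:
 *   gcc -O2 dgemm.c -o dgemm -lnvpl_blas_lp64_gomp
 * (check the exact library name in your NVPL installation)
 */
#include <stdio.h>
#include <cblas.h>

int main(void) {
    enum { N = 2 };
    double A[N * N] = {1, 2, 3, 4};
    double B[N * N] = {5, 6, 7, 8};
    double C[N * N] = {0};

    /* C = 1.0 * A * B + 0.0 * C (row-major) */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0, A, N, B, N, 0.0, C, N);

    printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```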

NVIDIA also ships upstream Arm optimizations in builds of open-source tools, such as Clang, so developers who don’t want to wait for regular releases can still build code that performs optimally.

Porting and optimizing software for Arm and NVIDIA Grace

The NVIDIA Grace CPU is a standards-based design that is fully compatible with the broad Arm software ecosystem, so most of the porting work has already been done.

Recompiling application source code natively on NVIDIA Grace with optimal compiler flags, as described in this post, can boost application performance and efficiency. Most applications can be compiled using any modern, standards-compliant, multi-platform compiler without modifying the application source code: 

Figure 8. Software runs on the NVIDIA Grace family as-is, using existing tools

Here are the basic steps for compiling an application on NVIDIA Grace:

Install software dependencies: Use your operating system’s package manager to install the same compilers, libraries, toolchains, runtimes, frameworks, and so on that you would use on any other CPU. All recent versions of popular dependencies are available for NVIDIA Grace.

Use standards-compliant compilers: Use GCC, Clang, or NVHPC compilers exactly as you would on any other CPU. If you’re using a vendor-specific compiler such as AOCC, update your build system to invoke a standards-compliant, multi-platform compiler such as NVHPC. These multi-platform compilers can also be used on the original system, improving application portability.

Optimize compiler flags: Remove all architecture-specific flags such as -mavx, -march, and -mtune for GCC and Clang, or any -tp flags for NVHPC. In their place, add -mcpu=native for GCC and Clang. NVHPC automatically detects native compilation on NVIDIA Grace and selects optimal flags, so no additional flags are needed. You may also want to enable link-time optimization with -flto for GCC and Clang, as shown in the sketch after this list.
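
As a concrete, illustrative example, the tiny kernel below needs no source changes when moving from x86 to Grace; only the compiler flags change (the x86 flags shown are one plausible starting point, not taken from this post):

```c
/* saxpy.c: the source needs no changes when moving to Grace.
 * x86 build (before):  gcc -O3 -march=skylake-avx512 -c saxpy.c
 * Grace build (after): gcc -O3 -mcpu=native -flto -c saxpy.c
 * On Grace, -mcpu=native enables Neoverse V2 tuning and SVE2
 * auto-vectorization of the loop below.
 */
void saxpy(long n, float a, const float *x, float *y) {
    for (long i = 0; i < n; i++)
        y[i] += a * x[i];
}
```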

Following these simple steps can produce optimized application binaries for NVIDIA Grace in a matter of minutes.

For more information about application porting and optimization, see the NVIDIA Grace CPU Benchmarking Guide. This guide includes precise step-by-step instructions for building and running common benchmarks (STREAM, HPL, HiBench, protobuf, and so on) and applications (WRF, OpenFOAM, SPECFEM3D, NAMD, and so on) on NVIDIA Grace. 

It also provides high-level developer guidance on Arm SIMD programming, the Arm memory model, and language-specific guidance for C/C++, Fortran, Java, Python, and Rust. 

Use this guide to help you realize the best possible performance for your particular NVIDIA Grace system.

Summary

The NVIDIA Grace CPU is designed for the modern data center. With 72 high-performance Arm Neoverse V2 cores, an NVIDIA-designed high-bandwidth fabric, and high-bandwidth, low-power memory, it delivers up to 2x the performance of leading traditional CPUs in the same power envelope. 

The Grace CPU has a fast coherent link to connect with other Grace CPUs or with either NVIDIA Hopper or NVIDIA Blackwell GPUs to form a new type of processor that tightly couples the CPU and GPU to supercharge generative AI, data processing, and accelerated computing. 

The NVIDIA Grace CPU is a standards-based design that is fully compatible with the broad Arm software ecosystem and most software will just work.

For more information, see this collection of sessions from GTC 2024.
