NVIDIA launches liquid-cooled A100 GPU; "handheld server" production modules go on sale soon

At Computex 2022, NVIDIA announced the liquid-cooled A100 PCIe GPU to meet customer demand for high-performance, carbon-neutral data centers. It is the first liquid-cooled design among mainstream server GPUs.

NVIDIA also announced that a number of leading manufacturers have adopted the world's first system designs based on its in-house data center CPU, and that more than 30 global technology partners released the first NVIDIA Jetson AGX Orin-based edge AI and embedded computing systems at Computex.

NVIDIA is currently building its strategy around the three pillars of data center silicon: the CPU, the GPU, and the DPU. The goal is to help its partners drive a new wave of data center transformation and build modern AI factories. The CPU manages the operation of the entire system, the GPU supplies the core computing power, and the DPU handles secure network communications and provides in-network computing to optimize overall performance.

Brian Kelleher, senior vice president of hardware engineering at NVIDIA, revealed that NVIDIA updates each chip architecture on a two-year cadence, focusing on the x86 platform one year and on the Arm platform the next. Whatever customers and the market prefer, NVIDIA's architectures and platforms will support both x86 and Arm.

Ian Buck, vice president of NVIDIA's accelerated computing business, said that if all the world's AI, high-performance computing, and data analytics workloads ran on GPU servers, NVIDIA estimates this would save more than 12 trillion watt-hours of electricity every year, equivalent to taking 2 million cars off the road each year.

Liquid-cooled GPU: same performance, less power consumption

Liquid cooling was born in the mainframe era and is maturing in the AI era. Today it is widely used in the world's fastest supercomputers in the form of direct-to-chip cooling. NVIDIA GPUs are already 20 times more energy efficient than CPUs in AI inference and high-performance computing, so adopting liquid cooling is a natural next step for accelerated computing.

NVIDIA estimates that switching all of the world's CPU servers running AI and high-performance computing to GPU-accelerated systems could save up to 11 terawatt-hours of energy annually. That is roughly the energy more than 1.5 million homes consume in a year.

Today, NVIDIA released its first data center PCIe GPU with direct-to-chip cooling. This liquid-cooled GPU, which reduces power consumption while maintaining performance, is now in the sampling stage and is expected to be officially released this summer.

Equinix, a global service provider that manages more than 240 data centers, has validated the A100 80GB PCIe liquid-cooled GPU in its data centers as part of the company's comprehensive approach to sustainable cooling and heat capture.

In separate tests, Equinix and NVIDIA both found that a liquid-cooled data center can run the same workloads as an air-cooled facility while consuming about 30 percent less energy. NVIDIA estimates that the power usage effectiveness (PUE) of liquid-cooled data centers could reach 1.15, well below the 1.6 of air-cooled data centers.
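The relationship between the two PUE figures and the reported energy saving can be checked with a short calculation. The sketch below uses the standard definition of PUE (total facility energy divided by IT equipment energy) and assumes, purely for illustration, that the IT load is identical in both facilities; the function names are hypothetical.

```python
def facility_energy(it_energy_kwh: float, pue: float) -> float:
    """Total facility energy for a given IT load: IT energy multiplied by PUE."""
    if pue < 1.0:
        raise ValueError("PUE cannot be lower than 1.0")
    return it_energy_kwh * pue


def relative_saving(pue_baseline: float, pue_improved: float) -> float:
    """Fraction of facility energy saved at the same IT load."""
    return 1.0 - pue_improved / pue_baseline


AIR_COOLED_PUE = 1.6      # figure quoted in the article
LIQUID_COOLED_PUE = 1.15  # figure quoted in the article

# Assumption: an arbitrary IT load of 1,000 kWh, used only to make the numbers concrete.
air_total = facility_energy(1000.0, AIR_COOLED_PUE)
liquid_total = facility_energy(1000.0, LIQUID_COOLED_PUE)
saving = relative_saving(AIR_COOLED_PUE, LIQUID_COOLED_PUE)

print(f"air-cooled: {air_total:.0f} kWh, liquid-cooled: {liquid_total:.0f} kWh")
print(f"relative saving: {saving:.1%}")
```

Under these assumptions the saving works out to about 28 percent, which is in line with the "about 30 percent" reported from the tests.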

Liquid-cooled data centers can also fit twice as much computing into the same space. This is because the liquid-cooled A100 GPU occupies only one PCIe slot, while the air-cooled A100 GPU occupies two.
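The density claim follows directly from the slot counts. As a minimal sketch, assuming a hypothetical server with eight usable PCIe slots (the slot count is not from the article):

```python
def gpus_per_server(pcie_slots: int, slots_per_gpu: int) -> int:
    """Number of GPUs that fit when each card occupies a fixed number of slots."""
    if pcie_slots < 0 or slots_per_gpu <= 0:
        raise ValueError("slot counts must be positive")
    return pcie_slots // slots_per_gpu


SERVER_SLOTS = 8  # hypothetical chassis, chosen only for illustration

air_cooled = gpus_per_server(SERVER_SLOTS, slots_per_gpu=2)     # dual-slot card
liquid_cooled = gpus_per_server(SERVER_SLOTS, slots_per_gpu=1)  # single-slot card

print(f"air-cooled GPUs per server: {air_cooled}")
print(f"liquid-cooled GPUs per server: {liquid_cooled}")
```

With an even number of slots the liquid-cooled configuration holds exactly twice as many cards; real systems may be limited by power or cooling-loop capacity rather than by slots alone.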

Later this year, at least a dozen system makers, including ASUS, H3C, Inspur, Nettrix, Supermicro, and xFusion, will use the liquid-cooled GPUs in their products.

NVIDIA reportedly plans to follow the liquid-cooled A100 PCIe card next year with a version that uses the H100 Tensor Core GPU, which is based on the NVIDIA Hopper architecture. In the near term, NVIDIA plans to bring liquid cooling to its own high-performance data center GPUs and NVIDIA HGX platforms.

Dozens of NVIDIA Grace CPU-based servers will ship next year

Grace is NVIDIA's first data center CPU built for AI workloads. The chip is expected to ship next year and will be available in two form factors.

Grace Hopper, shown on the left in the figure above, is a single superchip module designed to accelerate large-scale AI, high-performance computing, cloud, and hyperscale workloads. It connects the Grace CPU directly to the Hopper GPU at the chip level. The CPU and GPU communicate via NVLink-C2C, an interconnect technology with bandwidth of up to 900 GB/s.

According to Brian Kelleher, Grace will transfer data to Hopper 15 times faster than any other CPU and increase Hopper's working data size to 2TB.
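To give a rough sense of scale for these figures, the sketch below estimates how long it would take to stream a 2 TB working set across a 900 GB/s link. It is a back-of-the-envelope calculation only: it assumes decimal units, sustained peak bandwidth, and no protocol overhead, none of which the article states.

```python
GB = 10**9   # decimal gigabyte, in bytes
TB = 10**12  # decimal terabyte, in bytes


def transfer_seconds(data_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Ideal transfer time: data size divided by sustained bandwidth."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return data_bytes / bandwidth_bytes_per_s


PEAK_BANDWIDTH = 900 * GB  # "up to 900GB/s" peak figure quoted in the article
WORKING_SET = 2 * TB       # working data size quoted in the article

seconds = transfer_seconds(WORKING_SET, PEAK_BANDWIDTH)
print(f"ideal time to stream the full working set: {seconds:.2f} s")
```

Under these idealized assumptions the full working set moves in a little over two seconds; sustained real-world throughput would be lower.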

NVIDIA also offers the Grace CPU Superchip, which interconnects two Grace CPU chips through NVLink-C2C. It features 144 high-performance Armv9 CPU cores, memory bandwidth of up to 1 TB/s, and twice the energy efficiency of existing servers. The entire module, including 1 TB of memory, consumes only 500 W.

In addition to NVLink-C2C, NVIDIA also supports UCIe, the still-evolving chiplet standard released earlier this year.

Today, NVIDIA is announcing four Grace reference designs for standard data center workloads:

  • CGX for cloud gaming
  • OVX for digital twins and Omniverse
  • HGX Grace for high-performance computing and supercomputing
  • HGX Grace Hopper for AI training, inference, and high-performance computing

NVIDIA then announced the HGX Grace and HGX Grace Hopper systems, which will provide the Grace CPU Superchip and Grace Hopper Superchip modules along with their corresponding PCB reference designs. Both are designed as 2U high-density server chassis that NVIDIA's OEM partners can reference and modify.

Dozens of Grace-based server models from ASUS, Foxconn Industrial Internet, GIGABYTE, QCT, Supermicro, and Wiwynn are expected to begin shipping in the first half of 2023.

The first Jetson AGX Orin servers and devices released

The NVIDIA Isaac robotics platform has four pillars: first, creating the AI; second, simulating the robot's operation in a virtual world before trying it out in the real world; third, building the physical robot; and fourth, managing the entire lifecycle of a fleet of deployed robots.

When it comes to building and deploying real-world physical robots, NVIDIA Jetson has become the AI platform for edge and robotics, with more than 1 million developers, more than 150 partners, and more than 6,000 companies using Jetson for volume production.

Jetson AGX Orin combines an NVIDIA Ampere architecture GPU with Tensor Cores, 12 Arm Cortex-A78AE CPU cores, next-generation deep learning and vision accelerators, high-speed interfaces, faster memory bandwidth, and multimodal sensor support. It delivers 275 trillion operations per second (TOPS), making it the equivalent of a "handheld server".

It is pin-compatible with its predecessor, the NVIDIA Jetson AGX Xavier, and shares the same form factor, while delivering 8x the processing power.

The Jetson AGX Orin developer kit has been available globally through resellers since March, and production modules will be available in July starting at $399. The Orin NX module measures just 70mm x 45mm and will be available in September.

For edge AI and embedded computing applications, more than 30 global NVIDIA partners, including AAEON, ADLINK, and Advantech, released the first batch of NVIDIA Jetson AGX Orin-based production systems at Computex, covering servers, edge devices, industrial PCs, carrier boards, AI software, and other categories.

These products will be available in fan-cooled and fanless configurations with a variety of connectivity and interface options, and will meet the specifications of key industries such as robotics, manufacturing, retail, transportation, smart cities, and healthcare, as well as ruggedized applications.

To accelerate the development of autonomous mobile robots (AMRs), NVIDIA is also introducing Isaac Nova Orin, an advanced compute and sensor reference design for AMRs.

Nova Orin consists of two Jetson AGX Orin modules and supports two stereo cameras, four wide-angle cameras, two 2D lidars, one 3D lidar, eight ultrasonic sensors, and more. The reference architecture will roll out later this year.
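The sensor suite described above can be written down as a simple configuration record, which makes the totals easy to check. This is only an illustrative sketch: the class and field names are hypothetical and do not correspond to any NVIDIA API, and the counts are the ones listed in the article (which also mentions unspecified additional sensors).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NovaOrinConfig:
    """Hypothetical record of the Nova Orin reference design as listed in the article."""
    jetson_agx_orin_modules: int = 2
    stereo_cameras: int = 2
    wide_angle_cameras: int = 4
    lidars_2d: int = 2
    lidars_3d: int = 1
    ultrasonic_sensors: int = 8

    def camera_count(self) -> int:
        """Stereo plus wide-angle cameras."""
        return self.stereo_cameras + self.wide_angle_cameras

    def lidar_count(self) -> int:
        """2D plus 3D lidars."""
        return self.lidars_2d + self.lidars_3d

    def sensor_count(self) -> int:
        """All listed sensors; the compute modules are not counted."""
        return self.camera_count() + self.lidar_count() + self.ultrasonic_sensors


config = NovaOrinConfig()
print(f"cameras: {config.camera_count()}, lidars: {config.lidar_count()}, "
      f"listed sensors in total: {config.sensor_count()}")
```

Counting only the sensors named in the article, the design covers six cameras, three lidars, and eight ultrasonic sensors.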

The Jetson platform also has full NVIDIA software support. To address the needs of specific use cases, several NVIDIA software platforms have been added: NVIDIA Isaac Sim on Omniverse for robotics; Riva, a GPU-accelerated SDK for building speech AI applications; DeepStream, a streaming analytics toolkit for AI-based multi-sensor processing and video, audio, and image understanding; and Metropolis, an application framework, developer toolset, and partner ecosystem that combines visual data and AI to improve operational efficiency and safety across industries.
