Military Embedded Systems

Optimizing the edge through distributed disaggregation


July 26, 2022

Anton Chuchkov

Mercury Systems

The paradigm for scaling rugged mission-critical processing resources at the edge is evolving rapidly. Disaggregating processing is now enabling low-latency, network-attached everything at the edge with high-speed Ethernet connectivity, from GPU servers to NVMe-over-fabric storage devices.

As processing and storage performance continue to improve exponentially to keep pace with the demands of the digital world, new computing architectures must be considered. With edge environments imposing tight constraints on power, footprint, and latency, disaggregating compute resources is emerging as a new way of architecting edge processing.

For edge-computing applications in the defense and aerospace field, mission platforms are typically required to stay active far longer than the underlying processing components. Consider that CPU manufacturers, such as Intel, release a new generation of x86 server-class processor every two to three years. To maintain state-of-the-art computing capabilities on a given platform, the default tech refresh approach taken by systems integrators is to respecify new server configurations with the latest processors, which translates to racks of equipment being swapped out every few years.

With each processor generation, new innovations roll out, including a doubling of PCIe bandwidth, more PCIe lanes for greater hardware support, faster memory speeds, and updated security features. Each new processing refresh, however, creates a growing thermal challenge. Intel server-class CPUs, for example, have seen thermal design power (TDP) ratings double over the last four generational refreshes, from a range of 50 W to 145 W in the Broadwell generation to the current range of 105 W to 300 W in third-generation Xeon Scalable processors. As such, swapping an older server for an updated replacement may conflict with limited power budgets.

Processing pushed to the edge

Despite these challenges, advanced computing resources continue to move from data centers to deployed edge platforms, adding efficiency and new capabilities to applications such as radar signal processing. Such high-performance edge systems must be able to rapidly allocate – and re-allocate – parallel processing resources to handle data streams from multiple sensor sources through various types of algorithms, such as deep learning/machine learning (ML) neural networks for artificial intelligence (AI).

To optimize architectures, certain computing tasks are assigned to traditional CPUs, while other hardware, such as graphics processing units (GPUs), takes on the math-intensive duties for which parallel processing is well suited. Notably, GPUs have proven to exceed the capabilities of general-purpose processors in compute- and data-intensive use cases involving inferencing and training.

An example use case is cognitive radar, which applies AI techniques to extract information from a received return signal and then uses that information to improve transmit parameters, such as frequency, waveform shape, and pulse repetition frequency. To be effective, cognitive radar must execute those AI algorithms in near-real-time, which in turn requires powerful GPUs in the processing chain. In AI inference benchmark tests performed by NVIDIA, an A100 GPU outperformed a CPU by 249x. Offloading tasks such as inferencing and training to GPUs removes the need to overspecify CPUs, which in turn presents an opportunity to decrease TDP.
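As a simple illustration of that offload (not NVIDIA's benchmark code), the minimal sketch below hands a batch of radar returns to a trained model on the GPU, falling back to the CPU when no GPU is present. It assumes the PyTorch framework; the stand-in model and tensor shapes are purely notional.

```python
# Minimal sketch of offloading inference to a GPU (illustrative only; the stand-in
# model and tensor shapes are hypothetical, not NVIDIA's benchmark setup).
import torch

def run_inference(model: torch.nn.Module, returns_batch: torch.Tensor) -> torch.Tensor:
    """Run a trained model on a batch of sensor returns, on the GPU when one is present."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    with torch.no_grad():                     # inference only, no gradient bookkeeping
        batch = returns_batch.to(device)      # copy the batch into GPU memory
        scores = model(batch)                 # math-intensive work runs on the GPU
    return scores.cpu()                       # return results for downstream decision logic

# Example: a notional classifier and a batch of 64 range-Doppler maps (1 x 128 x 128 each)
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(128 * 128, 8))
print(run_inference(model, torch.randn(64, 1, 128, 128)).shape)  # torch.Size([64, 8])
```

The CPU here only stages data and acts on results; the parallel math lands on the accelerator, which is the behavior that lets the CPU specification relax.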

The mission needs to keep up

Incremental power improvements gained from offloading tasks from CPU to GPU add up, but are not enough to keep pace with the needs of the edge environment. At the 2022 NVIDIA GTC event, Lockheed Martin Associate Fellow Ben Luke described this problem with power, latency, and sensor data at the edge: “One of the big challenges in modern sensors is that the data rates are ever increasing … there is also a strong desire to move that processing … closer to the edge, and that results in size, weight, and power constraints that are pressing to that architecture.”

Although tech refreshes may initially be driven by CPU life cycle hurdles, there are clear advantages to updating to the latest hardware. Every processing generation brings critical improvements that enable a system to keep pace with the accelerating growth of sensor data and to mitigate adversaries’ advancements. Directly related to Luke’s comments is the hardware’s ability to deliver lower latency and a shorter time to decision.

On a datacenterHawk podcast about the future of edge computing and AI, Rama Darba, director of solutions architecture at NVIDIA, stated, “You cannot have AI or computational decision being made in the cloud via real time; there’s latency issues, there’s computational challenges.” Information that is not current can no longer support an informed decision. Particularly at the edge, making real-time decisions with inference-focused hardware running a trained model depends heavily on low latency.

Distributed processing enablers

The rugged data center at the edge can immediately benefit from disaggregation by embracing hardware such as data processing units (DPUs). DPUs, such as NVIDIA’s BlueField shown in Figure 1, are sometimes described as smart NICs [network interface cards] with additional integrated functionality, such as CPU processing cores, high-speed packet processing, memory, and high-speed connectivity (e.g., 100 or 200 Gb/sec Ethernet). Working together, these elements enable a DPU to perform the multiple functions of a network datapath acceleration engine.

[Figure 1 | Shown is an NVIDIA BlueField DPU card and key components. (Photo courtesy of NVIDIA.)]

One function that is particularly important to edge applications is the ability to feed networked data directly to GPUs using direct memory access (DMA), without any involvement by a system CPU. More than just a smart NIC, a DPU can also serve as a standalone embedded processor, using a PCIe switch architecture to operate as either root or endpoint for GPUs, NVMe storage, and other PCIe devices. This enables a shift in system architectures: rather than specifying a predetermined mix of GPU-equipped and general-compute servers, the DPU allows GPU resources to be shared wherever they are most needed.

Enter the disaggregated distributed processing paradigm

A useful way of understanding the paradigm shift from the status quo to the newly enabled system architecture is to view the data center as a single pool of processing resources rather than as a set of servers, each with a dedicated function. In the status quo, individual servers perform fixed tasks: some handle storage, others parallel processing, and others general services. While this model is essentially disaggregated by function, it lacks the critical element of distributing those functions across multiple systems.
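A toy sketch can make the distinction concrete: instead of each server owning its own GPU, any node on the network borrows an accelerator from a shared pool and returns it when finished. The class, method, and resource names below are purely illustrative and do not represent any vendor’s API.

```python
# Conceptual sketch of the pooled-resource model described above (illustrative only;
# names are hypothetical, not any vendor's API).
from dataclasses import dataclass, field

@dataclass
class ResourcePool:
    """A network-visible pool of accelerators shared by many nodes."""
    free: list = field(default_factory=lambda: ["gpu0", "gpu1", "gpu2"])
    in_use: dict = field(default_factory=dict)    # task id -> accelerator

    def allocate(self, task_id: str) -> str:
        if not self.free:
            raise RuntimeError("no accelerator available; queue or scale the pool")
        gpu = self.free.pop()
        self.in_use[task_id] = gpu
        return gpu                                 # any node on the fabric may use it

    def release(self, task_id: str) -> None:
        self.free.append(self.in_use.pop(task_id))

# Dedicated-server model: each task is pinned to the GPU inside its own chassis.
# Pooled model: tasks from any node borrow from the shared pool as needed.
pool = ResourcePool()
gpu = pool.allocate("radar-track-42")
print(f"radar-track-42 assigned to {gpu}")
pool.release("radar-track-42")
```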

Consider the block diagram of a distributed, disaggregated sensor processing architecture (Figure 2). Mission-critical information such as sensor data is sent to the GPU-enabled system for parallel processing, relayed through the DPU over high-speed networking, and shared with any networked server for action.

[Figure 2 | A block diagram shows a use case of a data processing unit in a platform.]

Such an architecture also maintains low latency end-to-end, from sensor to GPU to networked server, irrespective of CPU generation in the server stack. To facilitate this new architecture, products such as Mercury’s rugged distributed processing 1U server disaggregate GPU resources and distribute insights directly onto the network without a standalone x86 host CPU (Figure 3).

[Figure 3 | A block diagram shows the makeup of the Mercury rugged distributed processing server.]

Distributing resources across the network allows a greater portion of them to be used. Instead of specifying a GPU in every system and using only a fraction of each one, fewer GPUs can be shared across a greater number of systems, mitigating the trend toward rising thermal loads. Related to using fewer GPUs, NVIDIA’s Darba identified cost reduction as another key improvement from such an architecture: “One of the great advantages is that now, because you’re not in a place where you know you’re locked, having to run this application on this server, you could actually have huge reductions in server cost and server sizing.”
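A back-of-the-envelope calculation shows why pooling uses fewer devices to power, cool, and buy. The figures below, an eight-node stack in which each node keeps a GPU busy only about a quarter of the time with 50 percent headroom for overlapping peaks, are assumptions for illustration rather than measured data.

```python
# Back-of-the-envelope comparison of dedicated vs. pooled GPUs (assumed figures,
# not measured data).
import math

nodes = 8
avg_utilization = 0.25          # fraction of one GPU each node actually keeps busy
headroom = 1.5                  # margin for moments when requests overlap

dedicated_gpus = nodes                                        # one GPU per chassis
pooled_gpus = math.ceil(nodes * avg_utilization * headroom)   # shared over the fabric

print(f"dedicated: {dedicated_gpus} GPUs, pooled: {pooled_gpus} GPUs")
# dedicated: 8 GPUs, pooled: 3 GPUs -> fewer devices to power and cool
```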

DPU use cases are not limited to GPUs and parallel processing. For instance, the GPU card could instead be a pool of drives, networked and appearing as local storage to any system. Whether the pooled resource is parallel processing or storage, making it available over the network enables future scaling and refreshes to newer, more capable hardware without a complete overhaul of existing systems and without compromising the power budget or low latency.
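To picture how a networked pool of drives can appear local, the toy stand-in below mimics that behavior; it is purely conceptual and does not represent the NVMe-over-fabrics protocol or any vendor’s storage API.

```python
# Conceptual stand-in for fabric-attached storage that "appears local" (illustrative
# only; real deployments use NVMe-over-fabrics, not this toy class).
class FabricNamespace:
    """A remote pool of blocks addressed as if it were a local drive."""
    def __init__(self, block_size: int = 4096, num_blocks: int = 1024):
        self.block_size = block_size
        self.num_blocks = num_blocks
        self._blocks = {}                 # sparse map: block index -> bytes

    def write_block(self, index: int, data: bytes) -> None:
        assert 0 <= index < self.num_blocks and len(data) <= self.block_size
        # In practice, a fabric transport (e.g., RDMA or TCP) carries this to the
        # DPU-attached drive pool.
        self._blocks[index] = data

    def read_block(self, index: int) -> bytes:
        assert 0 <= index < self.num_blocks
        return self._blocks.get(index, bytes(self.block_size))

# Any node on the network addresses the namespace as if it were a local disk.
ns = FabricNamespace()
ns.write_block(0, b"mission log entry")
print(ns.read_block(0)[:17])  # b'mission log entry'
```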

Hardware that not only enables disaggregation, but also distribution of resources, presents an opportunity to align the needs of rugged mission-critical platforms with the latest technology through an innovative approach to architecting systems.

Anton Chuchkov is a product manager for the Edge Operating unit at Mercury Systems, focusing on rackmount products; he is responsible for introducing the latest industry technologies to the rugged market. He has worked in product management and applications engineering roles at the chip, board, and system level for more than eight years. Anton holds a bachelor’s degree in electrical engineering from Stony Brook University. Readers may reach the author at [email protected].

Mercury       https://www.mrcy.com/
