Convergence comes to GPU processing for C4ISR

We’ve seen it before: the trend toward convergence that folds formerly discrete functions into solutions that used to be vertically distinct. These days, for example, word-processing programs are hard to distinguish from page-layout programs. Or consider: your phone is now a web browser. The same trend is now starting to affect how embedded systems designers leverage GPUs for Command, Control, Communications, Computers, Intelligence, Surveillance, and Reconnaissance (C4ISR) applications such as signals intelligence (SIGINT) and electronic warfare (EW). Until recently, a designer looking to take advantage of the large matrix of math-friendly single-precision floating-point cores that make up a GPU had few options. The most common approach was to go with an appropriate dedicated GPU from NVIDIA, such as their Fermi or Kepler families, or one of AMD’s Radeon devices. Times have changed, however; today, GPU options have expanded to include Intel’s Core i7 products, whose built-in GPU functionality and AVX math library support continues to grow. On the FPGA front, we’re starting to see devices with built-in floating-point hardware, while discrete GPU devices are delivering expanded functionality as well.

One of the key factors helping to make alternatives to dedicated GPUs more attractive for C4ISR system designers is the huge role that latency plays in SIGINT and EW applications. GPUs come out of the video and gaming markets, where they are used to drive millions of pixels. While military system designers have put the massive floating-point throughput and internal pipelining architecture of GPUs to good use, these devices can’t deliver the low latency needed for some of the most demanding applications. For many of these sense-and-respond applications, dataflow must take place in nanoseconds and can literally be a matter of life and death. For some EW and SIGINT applications the latency performance of dedicated GPUs is acceptable, while for others it’s not. The latency penalty of dedicated GPUs results mostly from their use of multiple lanes of PCI Express (PCIe) to move data off-chip. In contrast, Intel’s mobile-class quad-core Core i7 processors place an embedded GPU right next to the CPU. For example, the latest-generation Core i7, the 4th-generation “Haswell” device, features a GT2 embedded GPU on the silicon. The device’s GPU and four processor cores are all interconnected at their last-level cache. Because data doesn’t have to go off-chip to be processed, latency is greatly reduced compared to a dedicated GPU device. Even better, one can reasonably assume – based on history and current Internet scuttlebutt – that the next generation of Core i7s will boast an even larger GPU. If Intel decides to double the GPU and increase the size of the cache, the result could be performance that rivals today’s discrete GPU devices.
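To see why the off-chip hop matters, a back-of-envelope model helps: a transfer costs a fixed overhead plus size divided by bandwidth, and since 1 GB/s moves one byte per nanosecond the arithmetic stays simple. All figures below are illustrative assumptions for the sake of the comparison, not measured or vendor-specified values.

```python
# Back-of-envelope latency model: fixed overhead + size / bandwidth.
# All numbers here are illustrative assumptions, not specifications.

def transfer_ns(n_bytes, gb_per_s, overhead_ns):
    """Time in nanoseconds; at 1 GB/s, one byte moves per nanosecond."""
    return overhead_ns + n_bytes / gb_per_s

# A 4 KiB block of samples sent to a discrete GPU over PCIe
# (assume ~8 GB/s effective bandwidth, ~1 us of driver/DMA setup):
pcie_ns = transfer_ns(4096, 8.0, 1000.0)   # 1512.0 ns

# The same block read by an on-die GPU through the shared
# last-level cache (assume ~100 GB/s, ~40 ns access overhead):
llc_ns = transfer_ns(4096, 100.0, 40.0)    # about 81 ns

print(pcie_ns, llc_ns)
```

Under these assumed numbers the on-die path is more than an order of magnitude quicker, which is the whole argument for keeping the data on-chip.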

Convergence is also increasing on the software side. One trend that can help increase the use of general-purpose processors as math engines for C4ISR is the growing popularity of OpenCL for parallel programming. Today, OpenCL can be used to program both the AVX vector engines in the Core i7’s CPU cores and the device’s embedded GPU. Increasingly, though, OpenCL support is also found on FPGAs from leading vendors such as Altera and Xilinx, with Texas Instruments supporting OpenCL on its ARM devices. The promise is a nearly “universal” programming language that can be used to program the heterogeneous architectures typically found in C4ISR systems.
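For a flavor of that portability argument, here is a minimal OpenCL kernel of the kind that could, in principle, be compiled for a CPU, an embedded GPU, or an FPGA. The kernel and helper names are illustrative, and the Python below only carries the kernel source plus a plain-Python model of what each work-item computes, so no OpenCL runtime is assumed.

```python
# A hypothetical element-wise multiply kernel. Each work-item handles
# one index, which is what lets the same source map onto CPU SIMD
# lanes, GPU shader cores, or FPGA pipelines.
KERNEL_SRC = """
__kernel void vmul(__global const float *a,
                   __global const float *b,
                   __global float *out)
{
    int i = get_global_id(0);
    out[i] = a[i] * b[i];
}
"""

def vmul_reference(a, b):
    """Pure-Python model of the kernel: one multiply per work-item."""
    return [x * y for x, y in zip(a, b)]

print(vmul_reference([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
# -> [4.0, 10.0, 18.0]
```

The same kernel source would be handed to each vendor's OpenCL compiler; only the host-side device selection changes between targets.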

Another convergence trend seen on the FPGA front is the addition of floating-point units into the devices. One could ask whether the resulting device looks more like a GPU than an FPGA. In addition to being programmable with OpenCL, FPGAs hold another powerful advantage over GPUs: they are extremely flexible when it comes to high-speed I/O. Whether you require low-voltage differential signaling (LVDS) or high-speed serial interfaces, FPGAs offer much greater flexibility than discrete GPUs with their multiple lanes of PCIe.

GPU vendors aren’t slowing down either: One convergence trend is the addition of ARM cores into discrete GPUs. Formerly, discrete GPUs required an Intel device to function as a proxy for system management. With a built-in ARM processor, however, the GPU can become much more autonomous: It will be able to directly receive data, stand itself up, issue its own context switching and dynamic cache allocation, and the like. This setup can free discrete GPUs from the massive latency hit they now endure while changing modes, for example.

The good news is that all of this convergence, effectively expanding the periodic table of compute elements, will give C4ISR system designers many more choices. With OpenCL becoming virtually ubiquitous across so many platforms, system integrators will be able to preserve much of their IP from tech refresh to tech refresh and from application to application. The application will then determine whether it is best to use a general-purpose processor, an FPGA, or a discrete GPU.

Marc Couture, Senior Product Manager, Intel, PowerPC, and GPGPU-based DSPs, ISR Solutions Group