Intel Architecture: All set for high-performance military embedded signal processing applications
The latest Intel Core i7 dual-core processors provide users with an alternative to Power Architecture processors for DSP applications.
Historically, processors from the PowerPC family, now known as Power Architecture processors, have been the dominant choice for implementing Digital Signal Processing (DSP) in high-performance embedded military applications that take advantage of open-system COTS products. Today, however, beginning with Intel Core i7 dual-core processors, low-power, high-performance Intel Architecture processor technology provides, for the first time, an attractive alternative for designers of DSP engines for the rugged deployed COTS signal processing space.
Signal processing evolution
Since the 1990s, processors from the PowerPC family (also known as Power Architecture) and their AltiVec floating-point vector math unit have been the dominant choice for open-system COTS boards used in high-performance embedded military DSP applications. These applications include radar, signals intelligence, sonar, and image processing. Previously, such systems were largely implemented with specialized processors such as the Intel i860, the Texas Instruments 320C40, and the Analog Devices SHARC. These processors were popular because of their floating-point performance.
In the late 1990s, the COTS market turned to the PowerPC processors developed by an alliance of Apple, IBM, and Motorola (later Freescale) and intended for personal computers. The resultant high-performance microprocessor was based on a RISC architecture, but it was the introduction of the AltiVec instruction unit in the Motorola PowerPC 7400 (“G4”) that changed the signal processing landscape.
Signal processing experts were quick to recognize that the floating-point capable AltiVec unit could greatly accelerate the inner-loop processing found in common functions such as Fast Fourier Transforms (FFTs). AltiVec’s ability to perform up to four simultaneous floating-point multiplies and additions was, at the time, revolutionary.
FFT performance on an Intel Core i7 processor
One of the most common signal processing algorithms is the FFT. The FFT implementation shown in Figure 1 is the version included in the Intel Integrated Performance Primitives (IPP) library.
This example uses 32-bit single-precision complex floating-point samples. The FFT is implemented for different sizes, and the number of cycles per sample has been measured. The results were profiled on an Intel Core i7 processor running at 2.67 GHz. The processor has four cores, but these tests only use a single thread. (Note that the Intel Core i7 processor utilizes the Intel Microarchitecture in a 32 nm fabrication process.)
The IPP implementation of an N-point FFT uses, for each point, a complex multiplication taking six operations (4 MUL and 2 ADD) and a complex addition taking two operations (2 ADD). Over the log2 N stages of the transform, this amounts to 8N log2 N floating-point operations. Dividing that count by the measured number of cycles gives the floating-point operations per cycle, from which the sustained GFLOPS performance can be derived. A single core is capable of 20 to 30 GFLOPS for FFT execution, which at best is more than 90 percent of theoretical capability.
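The conversion from measured cycles per sample to sustained GFLOPS can be sketched in a few lines of Python. This is a minimal illustration that uses the operation count of 8N log2 N quoted above; the function names are hypothetical, and any cycle figure passed in is the caller's own measurement, not a value taken from Figure 1.

```python
import math

def fft_flop_count(n_points):
    """Operation count used in the text: 8 * N * log2(N)."""
    return 8 * n_points * math.log2(n_points)

def sustained_gflops(n_points, cycles_per_sample, clock_hz):
    """Convert a measured cycles-per-sample figure into sustained GFLOPS."""
    # Floating-point operations attributed to each sample of the transform.
    flops_per_sample = fft_flop_count(n_points) / n_points
    # Operations retired per clock cycle, then scaled by the clock rate.
    flops_per_cycle = flops_per_sample / cycles_per_sample
    return flops_per_cycle * clock_hz / 1e9
```

For example, a 1,024-point FFT that hypothetically took 10 cycles per sample on a 2.67 GHz core would correspond to 8 operations per cycle, or about 21.4 GFLOPS sustained.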
In the meantime, Intel continued to develop the floating-point capability of its own processors, including a vector-processing unit generically known as Streaming SIMD Extensions (SSE), first introduced in the Pentium III processor. Intel has continually added features and new instructions, culminating in the current implementation, SSE 4.2.
Like AltiVec, SSE is a 128-bit wide processing unit, capable of simultaneously operating on four 32-bit floating-point values. SSE also features support for double-precision floating point, a feature that was never included in AltiVec. (Note that Freescale has decided not to include the AltiVec unit in its latest high-performance processor, the QorIQ P4080. The P4080, announced last year, is an excellent CPU for single board computer designs because of its eight cores, integrated memory controllers, and Serial RapidIO interface; however, it features a regular floating-point capability that is not the vector processor type required to attain the floating-point performance needed for signal processing applications.) In multicore Intel processors, each core has its own SSE unit, so the raw floating-point performance scales with the number of cores.
Additionally, Intel x86 processors are classic CISC processors. Successive generations of Intel processors continue to dispatch more instructions per clock. Since many more instructions are executed per clock cycle and the code density is higher, Intel processors can perform more than twice as much useful work per clock cycle as a Freescale RISC processor. As a result, beginning with Intel Core i7 dual-core processors, the low-power, high-performance advantages of the Intel Architecture processor technology can be used for the first time to design products such as DSP engines for the rugged deployed COTS signal processing space.
Intel Architecture meets signal processing performance needs
The latest generations of Intel Architecture processors are produced on 45 nm and 32 nm process technologies and are based on the Intel Microarchitecture, which includes many features that suit high-performance, power-efficient execution of signal processing workloads.
To support high instruction throughput, the Intel Microarchitecture contains a sophisticated memory subsystem. In a quad-core processor, each core contains a first-level instruction cache (32 KB 4-way), a first-level data cache (32 KB 8-way), a second-level unified cache (256 KB 8-way), and a third-level cache of up to 8 MB 16-way that is shared among all the processor cores. With 2 or 3 DDR3 memory controllers, the processor can provide a peak memory bandwidth of 17.1 or 25.6 GBps. This high-throughput capability is required to support the multi-gigabit rates for the processing of the sample streams in military signal processing applications such as radar.
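The peak bandwidth figures quoted above can be reproduced with simple arithmetic. The sketch below assumes DDR3-1066 memory (about 1,066.67 million transfers per second) on 64-bit (8-byte) channels; the article does not state the memory speed, so both values are assumptions chosen because they match the quoted numbers, and the function name is hypothetical.

```python
def peak_bandwidth_gbytes(channels, transfers_per_sec, bus_bytes=8):
    """Peak memory bandwidth in GB/s: channels x transfer rate x bytes per transfer."""
    return channels * transfers_per_sec * bus_bytes / 1e9

# Assumed transfer rate for DDR3-1066: 3200/3 million transfers per second.
DDR3_1066 = 3200e6 / 3

two_channel = peak_bandwidth_gbytes(2, DDR3_1066)    # roughly 17.1 GB/s
three_channel = peak_bandwidth_gbytes(3, DDR3_1066)  # roughly 25.6 GB/s
```

Under these assumptions, each controller contributes about 8.5 GBps, so the peak figure scales directly with the number of memory controllers.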
Support for the efficient implementation of high-throughput signal processing is based on SSE instructions, which are extensions to the standard Intel Instruction Set Architecture (ISA). Including the latest generation, SSE 4.2, there are more than 300 SSE instructions. SSE operations work from a set of 16 128-bit wide XMMx registers, capable of simultaneous operation on four packed floating-point values, as well as other formats.
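To make the packed format concrete, the following Python sketch emulates one 128-bit register as a 16-byte image holding four 32-bit floating-point values and applies a multiply to all four lanes at once. It is an emulation for illustration only, not SSE code, and the helper names are hypothetical.

```python
import struct

LANES = 4  # four 32-bit floats fill one 128-bit register

def pack_register(values):
    """Pack four floats into a 16-byte (128-bit) little-endian image."""
    if len(values) != LANES:
        raise ValueError("a 128-bit register holds exactly four 32-bit floats")
    return struct.pack("<4f", *values)

def packed_multiply(reg_a, reg_b):
    """Lane-wise multiply of two packed registers; each product is stored as a 32-bit float."""
    a = struct.unpack("<4f", reg_a)
    b = struct.unpack("<4f", reg_b)
    return struct.pack("<4f", *(x * y for x, y in zip(a, b)))
```

A vector unit performs the four lane multiplications in a single instruction, whereas this emulation loops over them; the data layout is the point of the sketch.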
Effective implementation of signal processing algorithms requires efficient use of all resources on the processor platform, so the ability to parallelize algorithms across multiple cores in a linear manner is essential. Parallelized scaling across the multiple cores of an Intel Microarchitecture-based platform can be executed for common operations used in signal processing such as complex multiplication, or for more computationally intense algorithms. A threading model can be used to implement the complex multiplication algorithm with parallel execution.
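One possible shape for such a threading model is sketched below in Python: the input vectors are split into contiguous chunks, each chunk is multiplied by its own worker thread, and the partial results are reassembled in order. The function names and the chunking scheme are illustrative assumptions, not the implementation behind the published results. Note also that standard CPython threads do not run this arithmetic in parallel, so the sketch shows the partitioning only, not the performance scaling.

```python
from concurrent.futures import ThreadPoolExecutor

def multiply_chunk(a, b, start, stop):
    """Element-wise complex multiply of a[start:stop] and b[start:stop]."""
    return [a[i] * b[i] for i in range(start, stop)]

def parallel_complex_multiply(a, b, n_threads=4):
    """Split the vectors into contiguous chunks, one per worker thread."""
    if len(a) != len(b):
        raise ValueError("input vectors must have the same length")
    n = len(a)
    # Chunk boundaries; chunk sizes differ by at most one element.
    bounds = [n * t // n_threads for t in range(n_threads + 1)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [
            pool.submit(multiply_chunk, a, b, bounds[t], bounds[t + 1])
            for t in range(n_threads)
        ]
        result = []
        for future in futures:  # collect in submission order to keep element order
            result.extend(future.result())
    return result
```

Because each chunk is independent, no locking is needed between workers, which is the property that lets this kind of workload scale with the number of cores on hardware that runs the threads concurrently.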
A single quad-core Intel Core i7 platform can be used to execute the complex floating-point multiplications. The results depicted in Figure 2 show the expected linear performance scaling from one to four threads, as additional cores and SSE vector units are employed in the algorithm. The eight-thread case demonstrates that additional efficiency can be obtained from the hyper-threading feature of the cores, even though the floating-point calculation resources of each core remain the same between the four-thread and eight-thread cases.
Curtiss-Wright’s first multiprocessor DSP board products will be based on the recently announced dual-core Intel Core i7 610e. The first two products based on the Intel Microarchitecture are the CHAMP-AV5 6U VME64x DSP engine and the SVME/DMV-1905 SBC. Additionally, the CHAMP-AV7, an OpenVPX Ready (VITA 65) variant of the CHAMP-AV5 that is also based on the dual-core Intel Core i7 processor, is scheduled for release in the summer of 2010. Using two 2.53 GHz dual-core Intel Core i7 processors, the CHAMP-AV5 delivers performance rated up to 81 GFLOPS. With 4 MB of cache and two hardware threads per core, the Core i7 can process larger vectors at peak rates significantly greater than was possible with previous AltiVec-based systems.
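The 81 GFLOPS rating is consistent with a simple peak-rate calculation. The sketch below assumes that each core can complete eight single-precision floating-point operations per clock cycle (one four-wide SSE multiply and one four-wide SSE add); that per-cycle figure is an assumption not stated in the article, and the function name is hypothetical.

```python
def peak_gflops(processors, cores_per_processor, clock_ghz, flops_per_cycle=8):
    """Theoretical peak: total cores x clock rate (GHz) x operations per cycle."""
    return processors * cores_per_processor * clock_ghz * flops_per_cycle

# Two dual-core processors at 2.53 GHz, as described for the CHAMP-AV5.
champ_av5_peak = peak_gflops(2, 2, 2.53)
```

Under that assumption, four cores at 2.53 GHz give 80.96 GFLOPS, which rounds to the quoted 81 GFLOPS.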
Intel Corporation 408-765-8080 www.intel.com
Curtiss-Wright Controls Embedded Computing 703-779-7800 www.cwcembedded.com