Signal processing FPGAs with hard floating-point – No compromise

Since the dissolution of cutting-edge digital signal processor (DSP) product lines, designers have been forced to develop using either FPGAs whose fixed-point DSP blocks are time-consuming to design with, or floating-point general-purpose graphics processing units (GPGPUs) that leave performance on the table in high-end signal processing systems. But now, with the release of Altera’s Generation 10 FPGAs, which integrate hardened IEEE 754-compliant floating-point operators, why compromise?

The days of high-end, general-purpose DSPs effectively ended a decade ago with the demise of Analog Devices’ TigerSHARC roadmap, after Texas Instruments (TI) had previously discontinued its own high-end roadmap. Since then, TI has brought back some high-end parts for targeted applications, but it still has not presented a roadmap for increased performance. There are still plenty of application-specific processors (ASICs/ASSPs) and low-end processors available, but general-purpose applications requiring high performance must now rely almost exclusively on GPUs and FPGAs.

FPGAs have long been used to implement high-performance DSP algorithms, but they require specialized and complex development. Many years ago, vendors made a big leap by adding hard DSP blocks within the gate array, significantly improving signal processing performance and simplifying algorithm implementation. Unfortunately, those hard DSP blocks were all fixed-point.

Floating-point clearly a superior format

Much can be said about the numerical representation of data for signal processing. Fixed-point offers numerous options, each with its own benefits and limitations, including integer, fractional, signed, unsigned, block exponents, and combinations thereof; however, most of the complications and tradeoffs of fixed-point representations can simply be avoided by using a floating-point format (a concise overview of these issues can be found in Michael Parker’s book “Digital Signal Processing 101”). Suffice it to say that no one prefers fixed-point over floating-point: floating-point is clearly the superior data format, and as a result virtually all algorithmic development and simulation is done in floating-point during system design. A high-level comparison of the relevant characteristics of fixed- and floating-point implementations can be found in Table 1.

Table 1: A high-level comparison of fixed- and floating-point.
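The tradeoffs summarized in Table 1 are easy to demonstrate numerically. As a minimal sketch (the Q1.15 format and test values here are illustrative, not drawn from the article), consider quantizing real values into a common 16-bit fixed-point format and back:

```python
# Sketch: quantization and overflow hazards of 16-bit fixed-point vs. float.
# Q1.15 (1 sign bit, 15 fractional bits) is a common DSP format; the values
# below are made up for illustration.

def to_q15(x):
    """Quantize a real value in [-1, 1) to Q1.15 (signed 16-bit fixed-point)."""
    raw = int(round(x * 32768))
    # Saturate rather than wrap on overflow
    return max(-32768, min(32767, raw))

def from_q15(raw):
    """Convert a Q1.15 integer back to a real value."""
    return raw / 32768.0

x = 0.123456789
q = to_q15(x)
print(from_q15(q))                     # close to x, but with only ~15 bits of precision
print(abs(from_q15(q) - x) < 2**-15)   # quantization error bounded by 1 LSB

# Saturation: fixed-point clamps where floating-point would simply carry the value
print(from_q15(to_q15(1.5)))           # clamps just below 1.0, not 1.5
```

Every one of these effects (quantization, saturation, range limits) is something a fixed-point FPGA design must budget for explicitly, while an IEEE 754 single-precision format absorbs them automatically over a far wider dynamic range.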

Several years ago, GPUs added floating-point capabilities to their parallel array of compute engines, creating the “GPGPU,” a general-purpose processor useful for much more than just graphics. Because they were originally architected for graphics processing rather than signal processing, however, GPGPUs still suffer from a number of challenges in general-purpose signal processing applications, including high power consumption, single-instruction/multiple-data (SIMD) programming complexities, memory bottlenecks, and often disappointingly low effective processing rates relative to peak capabilities. GPGPUs are further limited by a lack of flexibility, restrictions on the memory types they can support, and the inability to interface directly to analog converters for signal I/O, or to any I/O for that matter. Finally, the GPU can become data-starved unless a high degree of calculation is performed on each data point, since the host CPU must provide data to the GPU over a PCIe link, which is a liability for most traditional stream-based signal processing applications. FPGAs have therefore continued to thrive in high-performance signal processing applications despite their lack of floating-point resources.

FPGA designers have become extremely skilled at fixed-point implementations, but those implementations still carry a significant cost. Once algorithm simulation is completed in floating-point, there is typically a further six- to 12-month effort to analyze, convert, and verify the floating-point algorithm as a fixed-point implementation. First, the floating-point design must be converted to fixed-point manually, which requires an experienced engineer. Second, any later changes to the algorithm must be converted manually again; likewise, any optimizations made to the fixed-point algorithm in the system are not reflected back into the simulation. Third, as problems arise during system integration and testing, debug time increases inordinately, as the cause could be an error in the conversion process, a numerical-accuracy problem, or a defect in the algorithm itself. The advantages of floating-point are so compelling that “soft” floating-point is often implemented using the hard fixed-point DSPs, which has its own deleterious impact on design time while consuming significantly more FPGA resources and reducing performance.
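To illustrate the kind of detail the manual conversion described above must get right, here is a hypothetical sketch of a single fixed-point multiply: two 16-bit Q1.15 operands grow to a 32-bit Q2.30 product, which must then be rounded and saturated back to 16 bits, and every such choice has to be verified against the floating-point reference.

```python
# Sketch of one step in a manual float-to-fixed conversion: a Q1.15 multiply.
# Bit growth (16b x 16b -> 32b) must be handled explicitly; the format choices
# here are illustrative, not taken from any particular design.

def q15_mul(a, b):
    """Multiply two Q1.15 integers, returning Q1.15 with rounding and saturation."""
    full = a * b                         # Q2.30 intermediate: bit growth
    rounded = (full + (1 << 14)) >> 15   # round-half-up back to Q1.15
    return max(-32768, min(32767, rounded))  # saturate to the 16-bit range

half = 1 << 14                 # 0.5 in Q1.15
print(q15_mul(half, half))     # 8192, i.e. 0.25 in Q1.15
```

In floating-point the same operation is simply `a * b`; in fixed-point, the word lengths, rounding mode, and saturation behavior at every node of the datapath are design decisions that must be analyzed and re-verified whenever the algorithm changes.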

Therefore, since the demise of dedicated high-end DSP processors, applications requiring high-end signal processing have been forced to choose between the lesser of two evils: fixed-point FPGAs or floating-point GPGPUs. While both of those options have had a great deal of effort put into them to make them less “evil” via tricks, techniques, and prebuilt intellectual property (IP), these options have generally required significant compromise – until now.

Floating-point comes to FPGAs

With the recent introduction of its Generation 10 FPGAs, Altera has become the first to integrate hardened IEEE 754-compliant floating-point operators in an FPGA, as shown in Figure 1. These hardened floating-point DSP blocks change the decade-old paradigm of having to choose between the lesser of two evils, and remove the need to compromise.

Figure 1: The variable-precision DSP block of Altera’s Generation 10 FPGAs is shown here in floating-point mode.

The floating-point computational units, both multiplier and adder, are seamlessly integrated with the existing variable-precision fixed-point modes. This provides a 1:1 ratio of floating-point multipliers to adders, which can be used independently or combined as a multiply-adder or multiply-accumulator. Since all the complexities of IEEE 754 floating-point are handled within the hard logic of the DSP blocks, no programmable logic is consumed, and floating-point designs can support clock rates similar to those of fixed-point designs, even when 100 percent of the DSP blocks are used. In addition, designers retain access to all the fixed-point DSP processing features used in their existing designs for backward compatibility, and can easily add or upgrade all or part of a design to single-precision floating-point as desired.
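As a purely behavioral reference (this models the dataflow, not the actual hardware interface or timing), the two combined modes described above reduce to the familiar multiply-add and multiply-accumulate kernels:

```python
# Behavioral model of the DSP block's combined floating-point modes
# (illustrative only): a multiply-adder computes a*b + c in one pass, and
# feeding the result back as the addend yields a multiply-accumulator,
# which chains naturally into a dot product.

def mult_add(a, b, c):
    """Multiply-adder mode: one multiplier feeding one adder."""
    return a * b + c

def mac(pairs):
    """Multiply-accumulator mode: the accumulator feeds the adder back."""
    acc = 0.0
    for a, b in pairs:       # one operand pair per clock in the hardware analogy
        acc = mult_add(a, b, acc)
    return acc

print(mac([(1.0, 2.0), (3.0, 4.0)]))   # 1*2 + 3*4 = 14.0
```

Dot products of exactly this shape are the inner loop of FIR filters, FFT butterflies, and matrix operations, which is why a 1:1 multiplier-to-adder ratio maps so directly onto signal processing workloads.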

With thousands of floating-point operators built into these hardened DSP blocks, Arria 10 FPGAs are available from 140 GigaFLOPS (GFLOPS) to 1.5 TeraFLOPS (TFLOPS) across the 20 nm family. Altera’s 14 nm Stratix 10 FPGA family will use the same architecture, extending the performance range up to 10 TFLOPS, the highest ever in a single device. FPGAs can now compete directly with GPGPUs for raw processing performance without compromising the FPGA’s traditional advantages of inherent flexibility, support for a variety of memory and I/O types, and the ability to connect directly to signals.
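The peak figures above follow directly from counting the hard operators: each DSP block in floating-point mode can contribute one multiply and one add per clock. A back-of-envelope sketch with hypothetical numbers (the block count and clock rate below are placeholders, not published device specifications):

```python
# Back-of-envelope peak throughput: each hard floating-point DSP block
# contributes 2 FLOPs (one multiply + one add) per clock cycle.
# The block count and fmax below are hypothetical placeholders.

def peak_gflops(dsp_blocks, fmax_mhz, flops_per_block=2):
    """Peak GFLOPS = blocks * FLOPs-per-block * clock (MHz) / 1000."""
    return dsp_blocks * flops_per_block * fmax_mhz / 1e3

# e.g. 1,500 blocks at a 500 MHz clock -> 1.5 TFLOPS peak
print(peak_gflops(1500, 500))   # 1500.0 GFLOPS
```

The same arithmetic explains the family range: peak throughput scales linearly with the number of DSP blocks in the device and the clock rate the design closes timing at.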

Floating-point simplifies FPGA development

The addition of native floating-point also greatly improves the ability to leverage higher-level languages, tools, and compilers when coding FPGA applications, thus addressing any lingering ease-of-use concerns. Existing model-based flows such as Altera’s DSP Builder Advanced Blockset and MathWorks’ MATLAB and Simulink tools, as well as compilers for higher-level languages such as OpenCL, can now produce far better and more efficient results because they no longer have to manage fixed-point numerical issues such as bit growth, truncation, and saturation; integers can instead be restricted to the roles they serve best, such as semaphores, memory indexing, and loop counters.

While OpenCL, an open-standard counterpart to CUDA, can be used for both FPGAs and GPGPUs, there are notable differences in how algorithms are implemented. GPGPUs use a parallel-processor architecture, with thousands of small floating-point multiply-add units operating in parallel; the algorithm must be broken up into thousands of threads, which are mapped onto the available computational units as data becomes available. FPGAs, on the other hand, use a pipelined logic architecture in which the thousands of computational units are usually arranged into a streaming dataflow circuit that operates on vectors; this arrangement is more typical of signal processing functions such as FFTs, filters, or Cholesky decomposition.
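The contrast can be sketched in plain Python (illustrative only, not OpenCL): the GPGPU style maps an independent work-item computation over every element, while the FPGA style streams samples through a fixed pipeline of registers, here a hypothetical 3-tap filter:

```python
# Illustrative contrast between the two execution models (not OpenCL code).
# Coefficients and data are made up for the example.

# GPGPU style: thousands of independent work-items, one per output element.
def gpu_style(xs):
    return [x * x for x in xs]            # each "thread" computes one element

# FPGA style: samples stream through a pipeline of registers (a 3-tap FIR).
def fpga_style(xs, taps=(0.25, 0.5, 0.25)):
    z1 = z2 = 0.0                          # pipeline (delay-line) registers
    out = []
    for x in xs:                           # one new sample per clock
        out.append(taps[0] * x + taps[1] * z1 + taps[2] * z2)
        z1, z2 = x, z1                     # shift the delay line
    return out

print(gpu_style([1, 2, 3]))                # [1, 4, 9]
print(fpga_style([1.0, 0.0, 0.0]))         # impulse response: [0.25, 0.5, 0.25]
```

In the first model, throughput comes from how many independent work-items can run at once; in the second, every stage of the pipeline is busy on every clock, so throughput is one result per cycle regardless of how much arithmetic the pipeline contains.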

Floating-point enables FPGAs to compete in high-performance computing

While FPGAs are relatively new to high-performance computing, they can provide some compelling advantages. First, due to the pipelined logic architecture, the latency for processing a given piece of data is much lower than on a GPU; this can be a key advantage for applications such as financial trading algorithms. Second, FPGAs with native floating-point achieve four to eight times higher GFLOPS per watt than GPGPUs. This improved efficiency can be critical in applications such as high-performance embedded computing (HPEC), where an FPGA can perform far more computations within a limited power budget; it is also becoming a huge advantage in big-data processing and data centers due to reduced operating costs. Third, the FPGA offers incredibly versatile and ubiquitous connectivity: it can be placed directly in the data path and process data as it streams through. Altera has specifically added the option of data streaming to its OpenCL tools, in compliance with the OpenCL vendor-extension rules.

FPGAs with native floating-point easy to deploy

One final and oft-overlooked aspect of deploying signal processing implementations is the availability of quality hardware. Even the greatest FPGA with the most robust and easiest-to-use development tools won’t do much good if it takes a team of hardware engineers six to 12 months to build and debug a board to host the application. Fortunately, as with GPGPUs, high-quality deployable board-level solutions for FPGAs with native floating-point are available off-the-shelf. Vendors such as BittWare offer a variety of PCIe and embedded formats such as AMC. Complete with drivers, system integration software, FPGA development kits (FDKs), board monitors, and board support packages (BSPs) for OpenCL, boards such as the A10PL4 and A10P3S greatly lower the barrier to entry for implementing high-performance signal processing on FPGAs (Figures 2 and 3).

Figure 2: BittWare’s A10PL4 low-profile PCIe board is based on Altera’s Arria 10 FPGA, and includes 8x 10 GbE, 2x 40 GbE, or 2x 100 GbE network interfaces and up to 32 gigabytes of DDR4.

Figure 3: This block diagram depicts BittWare’s A10P3S half-length PCIe board with Arria 10 FPGA.

With integrated hard floating-point DSPs, high-level development tools, and sophisticated processing boards, FPGAs now have everything needed for efficiently and effectively implementing high-performance signal processing applications. There’s no longer a need to compromise.

Jeff Milrod received his bachelor’s degree in physics from the University of Maryland, and MSEE degree from Johns Hopkins University. After gaining extensive design experience at NASA and business experience at Booz, Allen & Hamilton, Jeff started Ixthos in 1991, one of the first companies dedicated to COTS DSP. He ran Ixthos until it was acquired by DY4 Systems (now Curtiss-Wright Defense Solutions) in 1997 before taking the helm of BittWare, where he is President and CEO.

BittWare, Inc. www.bittware.com j_milrod@bittware.com