Floating-point coprocessors enable FPGAs to replace DSPs


June 01, 2012

Jeff Milrod

BittWare

A coprocessor can greatly improve the productivity and algorithmic flexibility of an FPGA, thereby enabling it to handle a larger part of a signal processing implementation.

Using FPGAs for embedded military computing isn’t a new idea. Wikipedia calls it “reconfigurable computing” and traces it back to the 1960s. In theory, tailored hardware runs faster and uses less power than programmable CPUs, hence greatly improving Size, Weight, and Power (SWaP). Academics have thoroughly tested this idea. Adopters have shipped it. The reviews are in and they are mixed. FPGA computing sometimes falls short in practice. However, FPGA computing can be successful now, and that success can be extended with floating-point coprocessors that enable FPGAs to replace DSPs.

FPGAs are great – but not perfect (yet)

Outstanding FPGA success stories indeed exist. What do they have in common?

The system already contains FPGAs managing real-time I/O.
That I/O drives computation.

A marriage between I/O and computation describes a large percentage of DSP applications, and indeed FPGAs are frequently used in signal processing applications. But what additional factor is required for the successful deployment of FPGAs?

FPGA computation succeeds when algorithms are “data independent.” Put another way, if the algorithm contains few “if” statements or can be expressed as a state machine, FPGA tools can translate that algorithm into efficient Register Transfer Level (RTL) code, hardware’s assembly language.
A second key insight is that FPGAs are appropriate platforms when the data-independent algorithm is “mature.” This is because expressing computations using gates requires more effort to optimize and to test than the programmable DSP alternative. Thus, to meet development schedules, the algorithm can’t be changing with every firmware release.

Identifying when things work well implies things aren’t so great in other cases. Using FPGAs is questionable when an algorithm isn’t mature (or is ever changing in response to new threats or modes), or if it is “data dependent” (that is, the algorithm changes based upon the specific data flowing through the chip). However, this doesn’t mean FPGAs won’t evolve to close the gap.
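The distinction between data-independent and data-dependent algorithms can be illustrated with a minimal C sketch. Both functions and their names (`fir_sample`, `count_until_threshold`) are hypothetical examples chosen for illustration, not code from any FPGA or coprocessor toolchain. The first performs the same sequence of multiply/accumulate operations whatever the sample values are; in the second, the amount of work depends on the data itself.

```c
#include <assert.h>
#include <stddef.h>

/* Data independent: the loop always runs n_taps times and performs the
 * same multiply/accumulate sequence, whatever the sample values are. */
float fir_sample(const float *taps, const float *history, size_t n_taps)
{
    float acc = 0.0f;
    size_t i;

    assert(n_taps > 0);
    for (i = 0; i < n_taps; ++i) {
        acc += taps[i] * history[i];
    }
    return acc;
}

/* Data dependent: the loop exits as soon as a sample reaches the
 * threshold, so the number of iterations is decided by the data. */
size_t count_until_threshold(const float *samples, size_t n, float threshold)
{
    size_t i = 0;

    while (i < n && samples[i] < threshold) {
        ++i;
    }
    return i;
}
```

A fixed-length loop like the first maps naturally onto a pipeline of multipliers and adders, while the early exit in the second is the kind of control flow that tends to be easier to express on a programmable processor.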

Prior attempts to extend the FPGA

An early solution was to put CPU cores inside an FPGA. This approach began with “soft” cores and extended to include hard cores. In 2002, Xilinx integrated “hard” PowerPC cores. Today industry leaders Altera and Xilinx both offer FPGAs with ARM cores inside. Theoretically, this could close both the maturity and data-dependency gaps.

Unfortunately, integrated cores have only successfully addressed the low end of the data-dependent and maturity challenges. We say “low end” because integrated cores have had narrower feature sets and lower clock speeds than separate DSP and embedded microprocessors. They are simply not powerful enough for most signal processing applications. However, integrated cores are ideal for out-of-band applications, such as hosting a USB protocol stack or controlling the processing of data through the rest of the FPGA.

Another attempt to bridge the gap has been High-Level Synthesis (HLS) tools and C-to-RTL or C-to-gates tools. Conceptually, these tools allow developers to abstract the FPGA or program it in standard C, thereby enabling rapid algorithm changes. In practice this hasn’t worked out. Low-level hardware dependencies are legion, so the abstractions and C code end up depending on language extensions that deviate significantly from standards, generating code that instantiates a runtime architecture that is effectively a soft processing core, or both. Arguably these tools have reduced the gap, but specialized design skills and extensive compiles and simulations are still required. These problems are exacerbated with data dependencies.

A new approach to extend FPGAs

One new idea has potential for simultaneously closing the internal core/signal processing gap and the maturity gap. Rather than embed “low-end” hard or soft cores in the FPGA or try to abstract the FPGA to make it programmable with standard methodology, an external chip full of processing resources could provide standard C language design flows to the FPGA – a coprocessor to the FPGA, so to speak.

Ideally, it would be a highly efficient coprocessor tightly integrated with the FPGA to extend the performance of the FPGA while providing straightforward C programmability. Having it use floating point would further simplify the design and implementation of new and changing algorithms. Leveraging the trend to multicore would bring impressive peak performance numbers and could dramatically lower power consumption – over 30 GFLOPS per watt – making it more effective at addressing embedded military and SWaP demands.

The coprocessor could sit directly on the FPGA fabric, looking to VHDL or Verilog tools just like an embedded soft or hard core. This tight integration would give the coprocessor direct access to data inside the FPGA and allow very fine-grained interaction with the FPGA, and vice versa.

This new approach, a coprocessor for the FPGA, would combine the best of both worlds. It maintains the uncompromised strengths of the FPGA while bridging the maturity and data dependence gaps by leveraging the ease of use and power efficiency of a programmable multicore processor. As shown in Figure 1, such a device extends the capabilities of FPGAs to replace DSPs in many, if not most, embedded signal processing systems.

Figure 1: Adding a coprocessor to the FPGA can eliminate the need for DSPs.


A new approach to extend FPGAs – Implemented

What should this new coprocessor architecture look like? Perhaps the chip should feature a tiny core optimized for floating-point calculations that is integrated with a high-performance mesh network to allow scalable multicore implementations.

The core isn’t a PowerPC or MIPS design; it implements a new instruction set designed for efficiency and coprocessing. One detail that illustrates the design trade-offs: the instruction set is built around a floating-point multiply/accumulate instruction (great for DSP), yet it is unusual in not offering an integer multiply instruction. Think no-frills, optimized computing. The result of this tight focus is that each core is tiny and can deliver 800 megaflops in 25 milliwatts, or 32 GFLOPS per watt. By comparison, a wristwatch consumes three times more power and runs for a year off a button battery.
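The efficiency figure follows directly from the two per-core numbers quoted above. As a small worked check in C (the helper name `gflops_per_watt` is a hypothetical one introduced here for illustration):

```c
#include <assert.h>

/* MFLOPS per milliwatt is numerically equal to GFLOPS per watt,
 * because both units are scaled by the same factor of 1000. */
double gflops_per_watt(double mflops, double milliwatts)
{
    assert(milliwatts > 0.0);
    return mflops / milliwatts;
}
```

Calling `gflops_per_watt(800.0, 25.0)` returns 32.0, which matches the 32 GFLOPS per watt stated for a single core.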

The coprocessor reduces system development cost and directly bridges the FPGA’s gaps of requiring algorithmic maturity and data independence, by enabling out-of-the-box execution of applications written in regular ANSI C. It does not use any C subset, language extensions, SIMD, or other “funny stuff.” Standard GNU development tools are supported including an optimizing C compiler, simulator, GDB debugger with support for multicore, and an Eclipse multicore IDE.

Higher-level tools and abstractions such as OpenCL, multicore profilers, and optimized libraries further enhance the opportunity for the coprocessor approach to improve productivity.

Seeing the opportunity this new approach provides to users of FPGAs, BittWare partnered with a startup to develop just such a floating-point coprocessor for FPGAs. The resulting chip is the Anemone coprocessor for FPGAs.

The coprocessor uses 16 cores to balance performance with I/O to the FPGA, since fine-grained acceleration of an FPGA is all about data movement and synchronization. If more FLOPS are required, additional coprocessors can be gluelessly added to create seamless arrays of larger core counts (Figure 2). Future generations will boast up to 64 cores each and will deliver 96 GFLOPS of double-precision floating-point processing while achieving efficiencies exceeding 50 GFLOPS per watt.

Figure 2: Floating-point coprocessor for FPGAs


To make this new approach readily available for COTS military deployments, Anemone is available on an FMC card from BittWare, shown in Figure 3. Carrying a total of 64 cores on 4 coprocessors, it can be integrated onto any FMC carrier FPGA card, facilitating rapid deployment on both 3U and 6U VPX, convection or conduction cooled.

Figure 3: FMC board with four floating-point coprocessors for FPGAs
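For a rough sense of scale, the per-core figures quoted earlier (800 megaflops and 25 milliwatts) can be multiplied out across 16 cores per coprocessor and 64 cores per FMC card. The sketch below is a back-of-the-envelope estimate only: it assumes perfectly linear scaling and counts core power alone, ignoring I/O, interconnect, and memory, so the results are upper bounds rather than figures stated in the article. The macro and function names are hypothetical.

```c
#include <assert.h>

#define MFLOPS_PER_CORE     800UL  /* per-core throughput quoted in the article */
#define MILLIWATTS_PER_CORE 25UL   /* per-core power quoted in the article */

/* Peak throughput in MFLOPS, assuming every core runs at its quoted rate. */
unsigned long peak_mflops(unsigned long cores)
{
    assert(cores > 0);
    return cores * MFLOPS_PER_CORE;
}

/* Power drawn by the cores alone, in milliwatts (no I/O or memory). */
unsigned long core_power_mw(unsigned long cores)
{
    assert(cores > 0);
    return cores * MILLIWATTS_PER_CORE;
}
```

Under these assumptions, one 16-core coprocessor peaks at 12,800 MFLOPS (12.8 GFLOPS) and the 64-core card at 51,200 MFLOPS (51.2 GFLOPS), with the cores on the card drawing about 1,600 milliwatts.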


From the perspective of the FPGA, the coprocessor looks much like an embedded core. The coprocessor endpoint core sits directly on Altera’s Avalon fabric. This tight integration gives it direct access to data inside the FPGA (and vice versa). The coprocessor software tools support fine-grained interaction with the FPGA, as well as direct host access and code debug through the FPGA. Of course, the coprocessor is much faster than any internal core and uses very little power.

FPGAs made perfect?

It has long been understood that the inherent flexibility of an FPGA does not come for free. Many attempts have been made to mitigate these costs, but none has proven effective or achieved even moderately wide adoption. Adding an external coprocessor to FPGAs is a new approach that promises to finally succeed in bringing C to FPGAs. This offers system designers the best of both worlds: the flexibility and massive resources of FPGAs combined with the ease of use and power efficiency of a programmable multicore processor, thus eliminating the need for DSPs. This approach could prove ideal for embedded applications in the evolving modern-day military that increasingly require high performance, productivity, flexibility, and adaptability – all while improving SWaP.

Jeff Milrod, realizing the futility of pursuing a career in music (and reluctantly admitting that his dad was right), went back to school and got a Bachelor’s degree in Physics from the University of Maryland and later an MSEE degree from The Johns Hopkins University. After gaining extensive design experience at NASA and business experience at Booz, Allen, Hamilton, Jeff merged his technical expertise with his improvisational skills, starting Ixthos in 1991 – one of the first companies (along with BittWare) dedicated to COTS DSP. He ran Ixthos until it was acquired by DY4 Systems (now Curtiss-Wright Controls Defense Solutions or CWCDS) in 1997. Jeff left in 1998 and took the helm of BittWare, where he is President and CEO.

BittWare 603-226-0404 www.bittware.com