Case study: Developing high-performance radar applications using the VSIPL++ API

August 14, 2009

Don McCoy

CodeSourcery

The challenge was to implement high-performance Synthetic Aperture Radar (SAR) for two platforms in one week.

The need for high-performance Signal- and Image-Processing (SIP) applications is driving interest in parallel and multicore hardware for military embedded systems. However, programming such complex architectures can increase development costs by reducing developer productivity and code reuse. Leveraging a library that implements the high-level VSIPL++ API provides a way for SIP software developers to take advantage of the performance potential of parallel and multicore hardware systems while satisfying schedule and cost constraints.

The problem: Getting from prototype to high-performance production code efficiently

Consider the following challenge. A software developer has a Scalable Synthetic Aperture Radar (SSAR) benchmark expressed in 50 lines of MATLAB code. The developer needs to implement that benchmark to achieve good performance on two very different systems – a conventional x86 processor and a heterogeneous, multicore Cell Broadband Engine (Cell/B.E.) processor – in just one week.

For background, Synthetic Aperture Radar (SAR) is used for a variety of imaging and remote sensing applications, including reconnaissance, surveillance, and terrain mapping. To standardize benchmarking of this common algorithm, MIT Lincoln Laboratory created a scalable synthetic SAR application as part of the High Performance Embedded Computing (HPEC) Challenge[1]. The benchmark uses synthetic (and scalable) data to demonstrate the intense computational requirements found in actual systems. The raw radar returns are processed by means of a 2D Fourier matched filtering step, a spatial frequency interpolation step, and a transformation back to the spatial domain. Key mathematical operations, as in any SAR application, include FFTs, matrix multiplication, and interpolation. Thus, a software development technique that offers benefits for the SSAR benchmark is likely to work for larger SIP applications as well.

The first step: Choosing a library-based solution

If high performance were the only goal, the developer could consider coding the algorithm in a low-level language such as C or assembly code. But the time constraints (only a week), combined with the need to develop for both an x86 and a Cell/B.E. processor, make this approach unworkable.

With a portable library, though, the same application code will run on more than one system. And a library allows a developer to program for an unfamiliar architecture, such as the Cell/B.E., in a familiar language using familiar development tools. The key is to find a library at the right level of abstraction – high enough to provide the necessary primitives for the application domain but low enough to allow for efficient implementations and thus high performance.

For the SSAR benchmark, the VSIPL++ API provides the right level of abstraction. VSIPL++ is an open standard[2], high-level API for parallel high-performance signal and image processing. It is defined by the High Performance Embedded Computing Software Initiative (HPEC-SI – www.hpec-si.org), a consortium of industrial, academic, and governmental partners, with sponsorship from the Air Force Research Laboratory. Its goal is to simultaneously deliver productivity, portability, and performance. VSIPL++ defines a pure C++ interface for operations – including FFTs, filters, linear system solvers, and other mathematical functions – that allow SIP applications to be written at the problem domain level.

For the SSAR benchmark challenge, CodeSourcery used Sourcery VSIPL++, a library that provides an optimized implementation of the VSIPL++ API on x86, Power Architecture, and Cell/B.E. processors with useful extensions to the base VSIPL++ specification.

The next step: Implementing SSAR in VSIPL++

Using the VSIPL++ library, CodeSourcery implemented SSAR in C++ in just four days. To illustrate the relative advantage of the VSIPL++ API for developer productivity, Figure 1 shows three different implementations of the SSAR algorithm's fast time filter: (1) MATLAB, (2) simple, unoptimized C[3], and (3) VSIPL++. Mathematically, they all perform the same computation, but the VSIPL++ version, like the MATLAB version, is easy to understand because it is expressed in SIP primitives such as FFTs and matrix multiplication.

Figure 1: The VSIPL++ implementation of the SSAR fast time filter, like MATLAB, is much more compact than the unoptimized C.

Ignoring the setup of the filter coefficients, the VSIPL++ version of the fast time filter requires a single line, performing two data-parallel operations sequentially. By contrast, the C reference implementation is more verbose and thus more error prone. In addition, because the C code is iterative, it is more difficult to divide among multiple processors or to optimize for different architectures.
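
The two-step structure described here (a Fourier transform followed by a multiplication with the filter coefficients) can be sketched in standard C++ as follows. This is not the VSIPL++ code from Figure 1 and it uses no VSIPL++ calls. The names dft and fast_time_filter are hypothetical, a naive O(n^2) DFT stands in for a library FFT, and the row-wise orientation of the data is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;
using Row = std::vector<cplx>;
using Matrix = std::vector<Row>;

// Naive O(n^2) forward DFT of one row; a stand-in for a library FFT.
Row dft(const Row& x) {
    const std::size_t n = x.size();
    const double two_pi = 2.0 * std::acos(-1.0);
    Row out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cplx sum(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double angle =
                -two_pi * static_cast<double>(k * t) / static_cast<double>(n);
            sum += x[t] * std::polar(1.0, angle);
        }
        out[k] = sum;
    }
    return out;
}

// Hypothetical fast time filter: transform every row, then scale each
// frequency bin of every row by the matching filter coefficient.
Matrix fast_time_filter(const Matrix& raw, const Row& coeffs) {
    Matrix spectrum;
    spectrum.reserve(raw.size());
    for (const Row& row : raw) {
        assert(row.size() == coeffs.size());  // one coefficient per bin
        spectrum.push_back(dft(row));         // step 1: Fourier transform
    }
    for (Row& row : spectrum) {
        for (std::size_t k = 0; k < row.size(); ++k) {
            row[k] *= coeffs[k];              // step 2: apply the filter
        }
    }
    return spectrum;
}
```

Even in this compact form, the explicit loops carry indexing details that a library expression hides, which is the point the comparison in Figure 1 makes.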

The C and VSIPL++ implementations were benchmarked on both a conventional Xeon processor running at 3.6 GHz and a Cell/B.E. processor running at 3.2 GHz. The entire front-end processing chain was run over the data 10 times, and the timings were averaged. On the Xeon platform, the VSIPL++ library used the Intel Performance Primitives (IPP) library v5 and Intel Math Kernel Library (MKL) v7.21 as well as FFTW v3.1.2. On the Cell/B.E. platform, the VSIPL++ library used the Cell Math Library v1.0 and FFTW v3.2-alpha3. The VSIPL++ code needed no changes to run on both architectures; the VSIPL++ library used these underlying math libraries without explicit direction from the developer.
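
A repeat-and-average measurement of this kind could be sketched as below. The helper name mean_seconds is hypothetical, and the use of std::chrono::steady_clock is an assumption; the article does not say which timer the benchmark used.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Runs `work` `iterations` times back to back and returns the mean
// wall-clock time per run, in seconds.
double mean_seconds(const std::function<void()>& work, int iterations) {
    assert(iterations > 0);
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i) {
        work();
    }
    const std::chrono::duration<double> elapsed = clock::now() - start;
    return elapsed.count() / iterations;
}
```

Calling mean_seconds with the processing chain wrapped in a lambda and an iteration count of 10 would mirror the procedure described above.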

Because development costs vary roughly linearly with the number of lines of code, it is important to look at performance and Source Lines Of Code (SLOC) count together to understand the relationship between developer effort and performance benefit. Here, the VSIPL++ version offers both productivity and performance improvements over the C reference implementation on two very different processors. The VSIPL++ version requires 48 percent fewer lines of code and achieves speedups of 68x and 146x on the Xeon and Cell/B.E. platforms, respectively. Tables 1a and 1b show the source line counts and performance of the VSIPL++ and C implementations of the core 2D Fourier matched filtering and interpolation routines.

Tables 1a & 1b: SLOC vs. performance: VSIPL++ requires 48 percent fewer lines of code than C and yet runs 68 times faster even on Xeon.

The VSIPL++ approach offers particular advantages on the Cell/B.E. processor because it relies on the VSIPL++ library's implementation of the SIP primitives. Several of the primitives used in the SSAR algorithm, such as two-dimensional FFTs and the matrix multiplication operations used for filtering in the frequency domain, are computationally intensive and also involve significant data movement to and from the eight Synergistic Processing Elements (SPEs) of the Cell/B.E. processor. Getting good performance from the SPEs requires carefully balancing the input and output of data with the computations being performed. Optimizing a C implementation to achieve comparable performance would greatly increase program complexity and thus would drive up development time and cost.

The extra mile: Unlocking hardware's potential with strategic optimization

Finally, opportunities for optimization were investigated in the VSIPL++ implementation for the Xeon and Cell/B.E. processors. For example, in the fast time filter computation (see again Figure 1), the VSIPL++ code performs an FFT over the whole data set followed by a matrix multiplication. Given large enough data sets, the data produced by the first step no longer fits in cache by the time the second step reads it, so memory accesses in the second step cause cache misses on a Xeon processor, leading to expensive reads from main memory. Restructuring the computation so that each row is processed one at a time (that is, taking the FFT of a row and then immediately performing the vector multiplication on it) results in a 1.6x speed improvement for this portion of the code. The change is trivial to implement, requiring only a net increase of eight lines of code, yet it yields a 20 percent improvement in the execution time of the entire front-end stage.
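
The difference between the two loop structures can be illustrated in standard C++. This is a sketch under stated assumptions, not the code from the benchmark: the function names are hypothetical, a naive DFT stands in for the library FFT, and the data is assumed to be stored row by row. Both functions compute the same result; only the order in which memory is visited differs.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;
using Row = std::vector<cplx>;
using Matrix = std::vector<Row>;

// Naive O(n^2) forward DFT, standing in for a library FFT.
Row dft(const Row& x) {
    const std::size_t n = x.size();
    const double two_pi = 2.0 * std::acos(-1.0);
    Row out(n);
    for (std::size_t k = 0; k < n; ++k) {
        for (std::size_t t = 0; t < n; ++t) {
            const double angle =
                -two_pi * static_cast<double>(k * t) / static_cast<double>(n);
            out[k] += x[t] * std::polar(1.0, angle);
        }
    }
    return out;
}

// Two passes: transform the whole matrix, then revisit every row to multiply.
// For large matrices the early rows may have left the cache by the second pass.
void filter_two_pass(Matrix& data, const Row& coeffs) {
    for (Row& row : data) {
        row = dft(row);
    }
    for (Row& row : data) {
        assert(row.size() == coeffs.size());
        for (std::size_t k = 0; k < row.size(); ++k) row[k] *= coeffs[k];
    }
}

// One pass: finish each row (transform, then multiply) before moving on,
// so the multiplication touches data that was just written.
void filter_row_at_a_time(Matrix& data, const Row& coeffs) {
    for (Row& row : data) {
        assert(row.size() == coeffs.size());
        row = dft(row);
        for (std::size_t k = 0; k < row.size(); ++k) row[k] *= coeffs[k];
    }
}
```

Any speed difference between the two variants depends on the data size and the cache hierarchy, so this sketch makes no claim about reproducing the 1.6x figure.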

On the Cell/B.E. processor, profiling reveals that interpolation takes almost 40 times longer than matched filtering. A large amount of time is spent in a loop over data in the "range" direction (perpendicular to the flight path), performing a polar-to-rectangular coordinate conversion. For each output pixel, contributions from several input samples, one for each side lobe of the sinc function used in the interpolation, are summed to calculate the pixel's intensity and phase. This computation cannot be expressed well using VSIPL++ primitives.
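
The weighted sum at the heart of this step can be sketched as a truncated sinc interpolation in standard C++. This is an illustration only: the function names, the window parameter half_width, and the handling of samples outside the data are assumptions, and the polar-to-rectangular mapping that decides where each output pixel falls is not shown.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Normalized sinc: sin(pi x) / (pi x), with sinc(0) = 1.
double sinc(double x) {
    const double pi = std::acos(-1.0);
    if (std::abs(x) < 1e-12) return 1.0;
    return std::sin(pi * x) / (pi * x);
}

// Estimates the complex value at fractional index `pos` by summing the
// `half_width` nearest samples on each side, each weighted by the sinc
// of its distance from `pos`. Indices outside the data are skipped.
cplx sinc_interpolate(const std::vector<cplx>& samples, double pos,
                      int half_width) {
    assert(half_width > 0);
    const long center = static_cast<long>(std::floor(pos));
    const long count = static_cast<long>(samples.size());
    cplx sum(0.0, 0.0);
    for (long i = center - half_width + 1; i <= center + half_width; ++i) {
        if (i < 0 || i >= count) continue;
        sum += samples[static_cast<std::size_t>(i)] *
               sinc(pos - static_cast<double>(i));
    }
    return sum;
}
```

Because the weights are real and the samples are complex, the sum carries both the intensity and the phase of the interpolated value. The irregular, data-dependent indexing is what makes this loop a poor fit for whole-array primitives.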

To improve performance, CodeSourcery used a VSIPL++ API extension available in Sourcery VSIPL++ called "user-defined kernels." User-defined kernels allow the developer to write a high-performance computational kernel and still leverage the data-handling aspects of the VSIPL++ library. A hand-coded kernel with 208 lines of code reduces the interpolation time from 4.23 seconds to 0.18 seconds, making it more than 23 times faster than the original implementation.

On the Xeon, the final optimized code runs more than 82 times faster than the C reference implementation. On the Cell/B.E., it runs 5.7 times faster than on the Xeon and more than 1,400 times faster than the C reference implementation. Even modest, easy-to-implement changes can significantly improve performance.

Combining performance, productivity, and portability with VSIPL++

Using a library implementing the open-standard VSIPL++ API made possible the development of a complex application in far fewer lines of code than are necessary in C. Out of the box, this code outperformed the C reference implementation. With limited changes to address performance bottlenecks, performance was further enhanced. And the application remained portable across vastly different architectures.

References:

1. Haney, R., et al. The HPEC Challenge Benchmark Suite. Proceedings of the Ninth Annual High-Performance Embedded Computing Workshop (HPEC 2005), Lexington, MA, September 2005.

2. VSIPL++ Specification 1.0. Georgia Tech Research Corporation, 2005 (also available online at www.hpec-si.org).

3. SSCA #3 – SAR sensor processing, knowledge formation, and file I/O. High Productivity Computer Systems (HPCS) website, www.highproductivity.org/SSCABmks.htm, 2007 (Meuse, T.: MATLAB implementation source code, Meng-Ju and McMahon, J.: C implementation source code).

Don McCoy is a software developer in CodeSourcery's High Performance Computing group. He has worked for CodeSourcery since 2005 as a member of the Sourcery VSIPL++ development team. Prior to joining CodeSourcery, Don had more than 10 years of experience developing embedded software applications related to high-speed data processing. He holds a B.S. in Applied Physics from the University of Delaware. He can be reached at don@codesourcery.com.

CodeSourcery 888-776-0262 www.codesourcery.com