PCA devices are programmable computing devices that possess a significantly greater degree of reconfigurability than traditional general purpose computing devices. PCA devices are designed to deliver performance approaching that of Application Specific Integrated Circuits (ASICs) on a wide variety of tasks, and to rapidly adjust to meet changing, task-specific needs. This ability will increase the total efficiency of a computing platform by allowing high utilization of a large portion of the computing resources during all phases of a mission or application. For instance, in signal processing applications the resources may be configured to efficiently support data parallel operations, and then reconfigured (morphed) to support analysis and knowledge processing when data is available.
DARPA is supporting PCA architecture and chip development for four systems: the Raw architecture at the Massachusetts Institute of Technology [3], the Stanford University Smart Memories project [4], the MONARCH project at the University of Southern California Information Sciences Institute [5], and the TRIPS system from the University of Texas at Austin [6].
PCA devices have a pool of configurable computing resources on a single integrated circuit, connected by a configurable communication network. Computing resources consist of such elements as arithmetic logic units (ALUs), floating point units (FPUs), and various types of memory. Each resource typically can operate in one of several modes, and has some configurability within a mode. Some PCA devices support aggregation of several of the computing resources to behave as a single more complex and powerful type of resource. Computing resources are typically placed into a repeating tiled arrangement, with each tile consisting of one or more instruction processors and associated local memory. Communication networks typically consist of a fixed data path from each computing tile to external I/O and memory, as well as a configurable local path from each tile to one or more of its neighboring tiles. The notional, generic depiction of a PCA device shown in Figure 1 suggests some of these key features. Specific PCA devices, however, vary in whether they have one or multiple processors per tile, the number and type of interconnect networks, and so forth. Although the four PCA systems supported by DARPA all feature a tiled architecture, they vary widely in their details.
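The tiled arrangement described above can be pictured with a small data structure. The following is a minimal sketch only, assuming a rectangular grid with north, south, west, and east neighbors; the class names, the per-tile processor count, and the local memory size are hypothetical and do not describe any of the four DARPA-supported devices.

```python
from dataclasses import dataclass, field


@dataclass
class Tile:
    """One tile; processor count and memory size are illustrative placeholders."""
    row: int
    col: int
    processors: int = 1
    local_memory_kib: int = 64
    # Configurable local links to neighboring tiles, all disabled at first.
    links: set = field(default_factory=set)


class TileGrid:
    """A rows x cols array of tiles with configurable nearest-neighbor links."""

    def __init__(self, rows, cols):
        self.rows = rows
        self.cols = cols
        self.tiles = {(r, c): Tile(r, c) for r in range(rows) for c in range(cols)}

    def neighbors(self, row, col):
        """Return the coordinates of the adjacent tiles that exist on the grid."""
        candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
        return [(r, c) for r, c in candidates
                if 0 <= r < self.rows and 0 <= c < self.cols]

    def connect(self, a, b):
        """Enable the configurable local path between two adjacent tiles."""
        if b not in self.neighbors(*a):
            raise ValueError(f"tiles {a} and {b} are not adjacent")
        self.tiles[a].links.add(b)
        self.tiles[b].links.add(a)


grid = TileGrid(4, 4)
grid.connect((0, 0), (0, 1))
print(grid.neighbors(0, 0))
print(grid.tiles[(0, 1)].links)
```

A corner tile has two neighbors and an interior tile has four, so the configurable local network in this sketch grows with the grid while each tile keeps its own local memory.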
Tiled computing resources with dedicated local memory increase the efficiency of processors by reducing the average distance between a processing element and the memory used by that element. With clock rates on general purpose processors constantly increasing, this distance matters once it exceeds the distance that a signal carrying memory data can travel during one clock cycle. Each of the processing cores in a tiled configuration is typically smaller and less capable than the processing core on a traditional CPU, but is able to achieve higher utilization. The presence of several such tiles on a single IC allows very high bandwidth and low latency communication between processing cores, which in turn allows applications to be parallelized more effectively than on platforms with less efficient inter-process communication networks (e.g., symmetric multiprocessors or cluster computers).
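The relationship between clock rate and reachable distance can be made concrete with a short calculation. This is a rough sketch, not a wire-delay model: it computes the distance covered in one clock period at a given fraction of the speed of light. The function name and the `velocity_fraction` parameter are assumptions introduced here; a fraction of 1.0 gives only a physical upper bound, and real on-chip interconnect is considerably slower.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def distance_per_cycle_mm(clock_hz, velocity_fraction=1.0):
    """Distance in mm covered in one clock period at a given fraction of c."""
    if clock_hz <= 0:
        raise ValueError("clock_hz must be positive")
    return velocity_fraction * SPEED_OF_LIGHT_M_PER_S / clock_hz * 1000.0


# Upper bound (velocity_fraction = 1.0) for a few illustrative clock rates.
for ghz in (0.5, 1.0, 2.0, 4.0):
    upper_bound = distance_per_cycle_mm(ghz * 1e9)
    print(f"{ghz:4.1f} GHz: at most {upper_bound:6.1f} mm per cycle")
```

Because the reachable distance is inversely proportional to the clock rate, doubling the clock halves it, which is why keeping memory close to each processing element becomes more valuable as clock rates rise.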

Figure 1. Generic PCA device architecture.
Two other aspects of PCA architectures contribute to their high performance. One is the high ratio of device area dedicated to computation resources such as ALUs and memory to the area devoted to control overhead. The second is the unusually high degree of control over the configuration and allocation of those resources available to the programmer and compiler.
These features allow PCA architectures to achieve a significantly higher utilization than is typical for modern traditional CPUs, for a wide variety of applications. To achieve these levels of performance, however, the application development tools must be very effective at utilizing the disparate and configurable resources present in a PCA system. In particular, the tools must be capable of identifying the parallelism in an application and partitioning that application to take advantage of the specific resources of a particular PCA chip. New languages and a new framework have been developed that allow application developers to expose data dependencies and opportunities for parallelism to the build chain, and allow the configuration space of PCA platforms to be expressed in a structured and analyzable way.
The Morphware Stable Interface (MSI) has two main goals: to reduce the effort required by tool developers, and to allow productive development of high performance portable applications. The level of effort required for tool development is reduced by standardizing and abstracting multiple portability layers within the MSI. Creating a portable virtual machine abstraction of PCA hardware, as well as portable application level APIs, reduces development effort by presenting new tools and architectures with a common abstraction target.
There are many situations in which a change, or morph, in the configuration of a PCA system may be desired. The Morphware Forum has identified and categorized these situations to aid in identifying the hardware and software services required by applications, operating systems, and run-time resource managers. The categorization is based on three orthogonal attributes of a morph [7]. These are:
· whether the morph is initiated directly by an API call within the application code, or is initiated by the run-time system or compiler invisibly to the application programmer;
· whether the physical resources allocated to the application must change or stay the same; and
· whether the components of the application (or the entire application) continue to execute or are reloaded or replaced.
The set of morph types resulting from these attributes is summarized in Figure 2. At this time morph type 4a, where the platform configuration is determined by the build tools at compile time and set by the run-time environment at load time, is the only morphing type supported by the MSI. Support for additional morph types will be developed in the future.
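The three attributes listed above can be expressed as a small data structure. The sketch below treats each attribute as a two-valued choice, which is an assumption made for illustration; all class and member names are hypothetical, and no attempt is made to reproduce the type labels of Figure 2 (such as 4a).

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Initiator(Enum):
    APPLICATION_API = "explicit API call in the application code"
    RUNTIME_OR_COMPILER = "run-time system or compiler, invisible to the programmer"


class Resources(Enum):
    UNCHANGED = "allocated physical resources stay the same"
    CHANGED = "allocated physical resources must change"


class Execution(Enum):
    CONTINUES = "application components continue to execute"
    RELOADED = "application components are reloaded or replaced"


@dataclass(frozen=True)
class Morph:
    """One point in the space spanned by the three orthogonal attributes."""
    initiator: Initiator
    resources: Resources
    execution: Execution


# Enumerate every combination of the three attributes.
ALL_MORPHS = [Morph(i, r, e) for i, r, e in product(Initiator, Resources, Execution)]

for morph in ALL_MORPHS:
    print(morph.initiator.name, morph.resources.name, morph.execution.name)
```

Under the two-valued assumption the attributes span eight combinations. The published taxonomy may group or subdivide them differently.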
Many high performance applications process data from a sensor network to solve a physical problem. Examples range from radar and sonar surveillance systems to audio and video multimedia devices. These applications can often be intuitively described by a directed graph of well-defined, computation-intensive tasks (“kernels”). Each kernel in the graph receives data from the system input or from other kernels, processes it, and passes the modified data to still other kernels or to the ultimate system output. The concept is shown in Figure 3. A common constraint of such applications is that data must pass through the graph fast enough to keep up with the real-time sensor input stream. The need for high performance, flexible implementations of applications of this type is one of the major motivators for DARPA’s development of PCA technology.
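A linear chain is the simplest case of such a kernel graph, and it can be sketched with Python generators. The three kernels below (`scale`, `moving_average`, `threshold`) and their parameters are hypothetical stand-ins chosen for brevity; they are not taken from any PCA application.

```python
from collections import deque


def scale(samples, gain):
    """Kernel: multiply every incoming sample by a fixed gain."""
    for x in samples:
        yield gain * x


def moving_average(samples, width):
    """Kernel: emit the mean of the most recent `width` samples."""
    window = deque(maxlen=width)
    for x in samples:
        window.append(x)
        if len(window) == width:
            yield sum(window) / width


def threshold(samples, level):
    """Kernel: flag samples that exceed a detection level."""
    for x in samples:
        yield x > level


# Wire the kernels into a chain: input -> scale -> moving_average -> threshold.
sensor_input = [1.0, 2.0, 3.0, 4.0, 5.0]
graph = threshold(moving_average(scale(sensor_input, gain=2.0), width=2), level=5.0)
detections = list(graph)
print(detections)  # [False, False, True, True]
```

Each kernel consumes data only from its upstream neighbor and hands results downstream, so samples move through the chain one at a time rather than being stored in full between stages.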

Figure 2. Taxonomy of Morph Types.

Figure 3. Illustration of the streaming computation concept. (Figure courtesy of MIT.)
To get a sense of the type and scale of a particular example of a streaming sensor application, consider the Integrated Radar-Tracker (IRT) benchmark application [8]. The IRT is an end-to-end specification of a modern intelligence, surveillance, and reconnaissance (ISR) radar system. Motivated by a space-based radar application, it embodies all of the major attributes required in a defense-oriented PCA application test: both streaming and data-dependent threaded computation with multiple sub-types of each (e.g., fast transforms vs. vector-matrix arithmetic in the streaming elements); heavy computational loads; and multiple application-level parallelization and morphing opportunities. Developed by MIT Lincoln Laboratory (MIT/LL), the benchmark consists of a MATLAB simulation that serves as an executable specification, sample data sets, spreadsheets for estimating the computational loading of the application, and instructions for installation and operation. The benchmark is being developed in stages; as of this writing, the initial version of the IRT, which does not yet include all of the functionality described below, is available to the PCA community at the Morphware Forum web site.
Figure 4 shows a very high level view that divides the IRT application into two major blocks, ground moving target indication (GMTI) and feature-aided tracking (FAT), based on computational load and processing type. The FAT block has two options, signature- or classification-aided tracking (SAT or CAT). The total system load is a strong function of the processing parameters and the number of targets to be tracked; one set of default parameters provided by MIT/LL gives an estimated load in excess of six TeraFLOPS (TFLOPS).

Figure 4. High-level decomposition of the IRT benchmark.
The IRT embodies both major computing types important in PCA systems, streaming and threaded. These classes of computing are described further in the next section. The large majority of the processing in the GMTI block consists of a variety of signal processing operations such as polyphase FIR filtering (for the subband analysis and synthesis), matrix-vector operations (adaptive beamforming and space-time adaptive processing), fast Fourier transforms (Doppler filtering), and correlation or convolution (pulse compression). All of these are examples of streaming operations. They apply fixed (except for parameter choices), data-independent kernels to incoming data samples on a continuous or block basis, producing modified data streams or blocks that are input to the next kernel in the functional flow graph. Operations are sequenced in a dataflow manner, with each kernel able to “fire” when all of its inputs are available. Because of the fixed kernels and lack of data-dependent control flow and data production, the system can be deterministically scheduled.
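The firing rule can be illustrated with a small dataflow executor. This is a minimal sketch under stated assumptions: the `Kernel` class, the `run` function, and the three example kernels are hypothetical, and visiting the kernels in a fixed order stands in for a real static schedule.

```python
from collections import deque


class Kernel:
    """A fixed function that may fire once every input queue holds a token."""

    def __init__(self, name, func, num_inputs):
        self.name = name
        self.func = func
        self.inputs = [deque() for _ in range(num_inputs)]
        self.consumers = []  # (kernel, input_port) pairs fed by this kernel

    def connect(self, consumer, port):
        self.consumers.append((consumer, port))

    def can_fire(self):
        # An empty deque is falsy, so this is true only if all inputs have data.
        return all(self.inputs)

    def fire(self):
        args = [q.popleft() for q in self.inputs]
        result = self.func(*args)
        for consumer, port in self.consumers:
            consumer.inputs[port].append(result)
        return result


def run(kernels, sink):
    """Visit kernels in a fixed order, firing ready ones, until none can fire."""
    outputs = []
    progress = True
    while progress:
        progress = False
        for kernel in kernels:
            if kernel.can_fire():
                result = kernel.fire()
                if kernel is sink:
                    outputs.append(result)
                progress = True
    return outputs


double = Kernel("double", lambda x: 2 * x, num_inputs=1)
square = Kernel("square", lambda x: x * x, num_inputs=1)
add = Kernel("add", lambda a, b: a + b, num_inputs=2)
double.connect(add, 0)
square.connect(add, 1)

for sample in (1, 2, 3):
    double.inputs[0].append(sample)
    square.inputs[0].append(sample)

results = run([double, square, add], sink=add)
print(results)  # [3, 8, 15]
```

Because no kernel's behavior depends on the data values, the order of firings in this sketch is the same for every input stream of the same length, which is the property that makes deterministic scheduling possible.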
In contrast, the operations that implement the tracking processing represented by the FAT block are highly data-dependent, with the computational load depending not only on the number of targets to be tracked but also on the actual physical behavior (kinematics) of those targets. As a result, even though most of the load comes from arithmetic calculations, the number and sequencing of those calculations vary with the sensor scenario. In addition, the tracker uses stored databases (reference signatures for a mean-squared error calculation, and a dynamic database of track histories for the kinematic tracker) as well as incoming sensor data.
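The dependence of load on the scenario can be illustrated with a simplified signature match. The sketch below is not the IRT algorithm: it only assumes that each observed target is compared against every stored reference signature by mean-squared error, and the function names, signature names, and values are hypothetical.

```python
def mean_squared_error(a, b):
    """Mean of the squared differences between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("signatures must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def best_match(observed, references):
    """Return (name, error) for the stored signature closest to the observation."""
    scored = ((name, mean_squared_error(observed, ref))
              for name, ref in references.items())
    return min(scored, key=lambda pair: pair[1])


def match_all(observations, references):
    """Match every observed target and count the signature comparisons made."""
    matches = [best_match(obs, references) for obs in observations]
    comparisons = len(observations) * len(references)
    return matches, comparisons


# Hypothetical stored database of reference signatures.
references = {"type_a": [1.0, 2.0, 3.0], "type_b": [3.0, 2.0, 1.0]}

# The number of observations, and hence the work, depends on the scenario.
observations = [[1.0, 2.0, 2.0], [3.0, 2.0, 2.0], [1.0, 1.0, 3.0]]
matches, comparisons = match_all(observations, references)
print([name for name, _ in matches], comparisons)
```

The comparison count is the product of the number of observed targets and the number of stored signatures, so the arithmetic load in this sketch cannot be fixed ahead of time the way it can for the streaming kernels.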
The IRT can be parallelized in a number of ways, including both data-parallel and pipeline-parallel approaches, with varying levels of granularity for each. The IRT also supports application-level morphing of various granularities in both functional complexity and time scale. For instance, reconfiguring the same PCA resources from GMTI to FAT processing would require a major functional reorganization that would probably have to complete with very low latency, while operator-selected parameter changes would require only a relatively minor scaling of the existing functional flow, and would be infrequent and tolerant of greater latency. These parallelization and morphing options provide ample opportunities to exercise PCA and MSI capabilities.
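The difference between the data-parallel and pipeline-parallel approaches can be sketched with two hypothetical kernels. In the data-parallel form, every worker runs the whole kernel chain on a different block; in the pipeline-parallel form, each kernel has its own worker and blocks flow between them through queues. Python threads are used only to show the structure and are not expected to give a real speedup; the stage functions and block contents are assumptions.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def stage_a(block):
    """Hypothetical first kernel: scale each sample in the block."""
    return [2 * x for x in block]


def stage_b(block):
    """Hypothetical second kernel: reduce the block to its sum."""
    return sum(block)


def data_parallel(blocks, workers=2):
    """Every worker runs the whole kernel chain on a different block."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda b: stage_b(stage_a(b)), blocks))


_DONE = object()  # end-of-stream marker passed down the pipeline


def _run_stage(func, inbox, outbox):
    """Apply one kernel to each block from inbox until the end marker arrives."""
    while True:
        block = inbox.get()
        if block is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(func(block))


def pipeline_parallel(blocks):
    """Each kernel owns one worker; blocks pass through the stages in order."""
    q_in, q_mid, q_out = queue.Queue(), queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=_run_stage, args=(stage_a, q_in, q_mid)),
        threading.Thread(target=_run_stage, args=(stage_b, q_mid, q_out)),
    ]
    for t in threads:
        t.start()
    for block in blocks:
        q_in.put(block)
    q_in.put(_DONE)
    results = []
    while True:
        item = q_out.get()
        if item is _DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


blocks = [[1, 2], [3, 4], [5, 6]]
print(data_parallel(blocks))      # [6, 14, 22]
print(pipeline_parallel(blocks))  # [6, 14, 22]
```

Both forms produce the same results; they differ in how work is mapped to resources, which is the kind of choice that a partitioning tool would make for a particular PCA chip.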