High Performance Memory Communication Architectures for Coarse-grained Reconfigurable Computing Systems
Ph.D. Thesis

Coarse-grained Configurable Architectures

Existing reconfigurable architectures are subdivided into fine- and coarse-grained approaches. Fine-grained devices are optimized to implement glue logic and irregular structures like state machines. In contrast to bit-level FPGAs, the area of reconfigurable computing stresses coarse-grained platforms with path widths much wider than one bit. Since datapaths for computation are typically regular, bit-level architectures need a lot of overhead to implement them. Therefore fine-grained architectures are rather unsuitable for forming computational datapaths [Har97] [MHA97].

Coarse-grained reconfigurable architectures are especially designed for reconfigurable computing. Such architectures provide operator level function units and word level datapaths. This chapter will present several coarse-grained reconfigurable architectures. While most approaches are a 2-dimensional arrangement of processing elements (PEs), the RaPiD approach implements a linear datapath of functional units.

The Communication Networks of Configurable Architectures

The fabrics of configurable architectures are an important factor in the memory communication bandwidth. Besides its consequences for routability, and therefore for the exploitation of the reconfigurable resources, the communication infrastructure also limits the data access speed for reconfigurable computing. The routing facilities of configurable architectures can be classified into two groups: local interconnect and long range interconnect.

Long range interconnect allows data to be routed over long distances without considerable delay. It therefore adds flexibility to a configurable architecture, but requires many more hardware resources than local interconnect. Long distances may usually also be bridged by a chain of local connections. In this context it has been shown that the different purposes of configurable architectures of different granularity also influence the routing resources. It has turned out [HHH00] that long distance interconnect through coarse-grained mesh-connected platforms using local interconnect is not expensive, in contrast to fine-grained mesh-connected FPGAs such as the Algotronix CAL1024 architecture:

The Algotronix CAL1024 FPGA [Kea89] [Alg91a] [Alg91b] [OD95] is a fine-grained rectangular array of 32 by 32 identical cells. The regular architecture also extends across chip boundaries. The simple logic unit of each cell may implement any combinatorial two-bit function or a D type latch. Besides two global signals, which are intended for clock distribution, the CAL1024 has only nearest neighbor connections. As illustrated in the figure "The array structure and nearest neighbor routing facilities of the Algotronix CAL1024 FPGA" [Kea89] [Alg91a] [Alg91b] [OD95], each cell has four (North, East, West, South) nearest neighbor inputs and outputs. Therefore connections between non-neighboring cells must be routed through intermediate cells. The lack of a reasonable number of global routing resources may rapidly lead to routing congestion. As a result, many logic units may remain unused for lack of routing resources. Long range datapath implementation in such a fine-grained architecture is further complicated by the different delays of the individual datapath bits.
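The cost of nearest-neighbor-only routing can be made concrete with a small sketch. It counts the cells that a connection occupies on its way between two cells of the 32 by 32 array, assuming a shortest (Manhattan) route in which every cell between source and destination forwards the signal. The function name and the (row, column) coordinate convention are illustrative assumptions, not part of the CAL1024 documentation.

```python
ARRAY_SIZE = 32  # the CAL1024 is a 32 by 32 array of cells


def intermediate_cells(src, dst):
    """Cells a signal must pass through between src and dst.

    Assumes a shortest route over North/East/West/South links only;
    src and dst are (row, column) tuples (hypothetical convention).
    """
    (r0, c0), (r1, c1) = src, dst
    for v in (r0, c0, r1, c1):
        if not 0 <= v < ARRAY_SIZE:
            raise ValueError("cell coordinate outside the array")
    hops = abs(r1 - r0) + abs(c1 - c0)  # Manhattan distance in links
    return max(hops - 1, 0)             # cells strictly between the endpoints
```

Under these assumptions a corner-to-corner connection passes through 61 cells, each of which spends routing capacity on that single signal.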

The DP-FPGA Architecture

The Datapath-FPGA (DP-FPGA) [CL94] [CL96] is a coarse-grained architecture with a four bit datapath width. The design of the DP-FPGA is very similar to conventional fine-grained FPGAs. The basic logic block (see L in the figure "Overview on the architecture of the DP-FPGA") is built of four identical bit-slices, which share the configuration bits and control signals. The same figure shows the routing architecture of the DP-FPGA connecting the logic blocks. In contrast to other approaches, the DP-FPGA provides separate routing resources for data and control signals. While the routing resources for data are four bit buses, the routing resources for control lines are single bits. This saves configuration bits for data routing and does not waste routing resources on single bit control signals. A third resource is the shift block, which is included to eliminate some of the restrictions associated with the use of bit sharing.

The DP-FPGA is the first approach in which a reconfigurable architecture has been optimized to implement datapaths. The key idea of this approach was to save hardware resources and configuration bits during datapath implementation.

Because the DP-FPGA is directly derived from conventional FPGAs, the implementation of applications is very similar to that for bit-level FPGAs, which is far from the simple compilation known from microprocessors. The design flow of the DP-FPGA still includes a synthesis step and a conventional place and route step. This method generates irregular structures, so it is hard to optimize the data communication.

The KressArray-1

The KressArray-1, also known as rDPA (see [Her95], [HK95], [HKR95b], or [Kre96]), is an array of reconfigurable processing elements, which are called datapath units (DPUs). The figure "The KressArray-1 architecture" illustrates the KressArray-1 architecture.

The KressArray-1 consists of a regular array of identical datapath units (DPUs). Each DPU has two input and two output registers. The operation of the DPUs is data-driven. This means that the operation will be evaluated as soon as all required operands are available.
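The data-driven firing rule can be sketched in a few lines. The class below is a toy model only: it assumes that an operation fires once both input registers hold a value and that firing empties them again. The class and method names are hypothetical, and the two output registers of a real DPU are simplified to a list of results.

```python
class DPU:
    """Toy datapath unit: two input registers, fires when both are filled."""

    def __init__(self, op):
        self.op = op                  # binary operator configured into the unit
        self.inputs = [None, None]    # the two input registers
        self.outputs = []             # simplification of the output registers

    def receive(self, port, value):
        """Store an operand in input register `port` (0 or 1)."""
        if self.inputs[port] is not None:
            raise RuntimeError("input register still occupied")
        self.inputs[port] = value
        self._try_fire()

    def _try_fire(self):
        # Data-driven rule: evaluate as soon as all operands are available.
        if all(v is not None for v in self.inputs):
            a, b = self.inputs
            self.outputs.append(self.op(a, b))
            self.inputs = [None, None]  # registers are free for the next operands
```

For example, a unit configured with an addition produces no result after its first operand arrives and produces the sum as soon as the second one does.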

The KressArray-1 provides two interconnection levels: short lines for local interconnect, and a global bus for long range interconnect. The topology of an interconnection network can be static or dynamic. Static networks are fixed during run-time, whereas dynamic networks can be changed during run-time. While in commercial FPGAs all routing resources are normally static, in the KressArray-1 the local interconnections are static and the global interconnections are dynamic.

The local interconnect of the KressArray-1 is implemented as a unidirectional mesh. Although bidirectional communication is more flexible in implementing expressions, a unidirectional approach has been used for better area efficiency.

The global interconnect provides a connection to each datapath unit (DPU). In the KressArray-1 it is used for I/O of operands from outside into the array, and for propagation of interim results to distant DPUs. To save area, the global interconnect is time-multiplexed, and a scheduling step determines a good usage of this dynamic network. The dynamic interconnect network is implemented by a single I/O bus. The communication is controlled by an external control unit [Her94]. Like the internal communication, the data transfers are data-driven and synchronized by a handshake.

The KressArray-1 can be regarded as the first real coarse-grained reconfigurable array, since it provides reconfigurable resources on the 32 bit word level. It has been implemented as a generalization of the systolic array, where the pipelining is performed by the datapath synthesis system (DPSS, [HK95] [Kre96]). The reconfigurable architecture and the associated mapping software were therefore developed jointly from the start.

All data communication between the KressArray-1 and the data memory is performed using the global data bus. The order of data accesses is scheduled by the software environment (see [Kre96]). The execution time of all mapped application operators is determined and data is scheduled to achieve the best computation performance. The scheduling does not consider the memory interface performance, since the data-driven global bus slows down the data communication to the data memory.

The RaPiD Architecture

RaPiD (Reconfigurable Pipelined Datapaths) [ECF96a] [ECF96b] [ECF97] is a linear array of datapath units, which is configured to form a (mostly) linear computational pipeline. This array of datapath units is divided into identical cells which are replicated to form a complete array. The figure "Basic cell of RaPiD-1" shows the cell used in RaPiD-1, the first version of the RaPiD architecture. This cell comprises an integer multiplier, three ALUs, six general-purpose datapath registers and three local memories. A typical single-chip RaPiD array contains between 8 and 32 of these cells.

The datapath units are interconnected using a set of segmented buses that run the length of the datapath. Each datapath unit uses an n:1 multiplexer to select each of its data inputs from one of the n-2 bus segments in the adjacent tracks. The two additional multiplexer inputs provide a fixed zero and a feedback line, which can be used to clear and hold register values, or to use an ALU as an accumulator. Each datapath unit output includes optional registers to accommodate pipeline delays and a set of tri-state drivers to drive its output onto one or more bus segments.
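The input selection described above can be sketched as follows. The encoding of the select value (bus segments first, then the constant zero, then the feedback line) is an assumption made for illustration; the source only states that n-2 of the n multiplexer inputs are bus segments and that the remaining inputs supply zero and feedback. All names are hypothetical.

```python
def select_input(sel, bus_segments, feedback):
    """n:1 input multiplexer of a datapath unit (illustrative encoding).

    sel in [0, n-3] picks a bus segment, sel == n-2 picks the constant
    zero, and sel == n-1 picks the unit's own feedback value.
    """
    n = len(bus_segments) + 2
    if not 0 <= sel < n:
        raise ValueError("select value out of range")
    if sel < n - 2:
        return bus_segments[sel]
    return 0 if sel == n - 2 else feedback


def accumulate(samples):
    """Use an ALU as an accumulator by routing its result back as an operand."""
    acc = 0
    for x in samples:
        buses = [x]                                        # one segment carries the sample
        a = select_input(0, buses, acc)                    # operand A: bus segment 0
        b = select_input(len(buses) + 1, buses, acc)       # operand B: feedback line
        acc = a + b
    return acc
```

Selecting the zero input instead of the feedback line in the first cycle would clear the accumulator, which corresponds to the "clear and hold" use mentioned in the text.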

The buses in different tracks are segmented into different lengths to make the most efficient use of the connection resources. In some tracks, adjacent bus segments can be connected together by a bus connector as shown in the figure "Basic cell of RaPiD-1". This connection can be programmed in either direction via a unidirectional buffer, or pipelined with up to three register delays, allowing data pipelines to be built in the bus structure itself.

RaPiD as a linear datapath is a completely different approach compared to 2-dimensional meshes of processing elements. This restriction simplifies application mapping but restricts the design space dramatically. Where communication intensive applications can use both dimensions of other approaches, the capacity of the single routing channel of RaPiD is rapidly exhausted.

RaPiD is the first approach with integrated memory cells, which enables multiple concurrent data accesses. Also the busses integrate pipeline registers, which allows a high data transmission speed. Multiple ports to the outside world continue the concept of high data throughput.

The MATRIX Architecture

The MATRIX (Multiple Alu architecture with Reconfigurable Interconnect eXperiment) architecture is an array of identical 8-bit primitive datapath elements overlaid with a configurable network. Each datapath element contains a 256 words by 8 bit memory, an 8 bit ALU and multiply unit, and reduction control logic including a 20 by 8 NOR plane. The figure "MATRIX basic function unit (BFU)" shows the MATRIX basic function unit (BFU), and the surrounding network switch architecture is pictured in the figure "The MATRIX architecture: (a) network switch, (b) nearest neighbor interconnect, and (c) length four bypass interconnect".

The MATRIX network is a hierarchical collection of 8 bit buses. The interconnect distribution resembles traditional FPGA interconnect. Unlike traditional FPGA interconnect, MATRIX has the option to dynamically switch network connections. The network includes nearest neighbor connections (part (b) of the figure "The MATRIX architecture"), length four bypass connections (part (c) of the same figure) and global lines.

A more detailed architecture overview can be found in [DeH96b] or [MD96]. [Mir96] presents implementation details.

The development of MATRIX has been based on extensive research into reconfigurable architectures. MATRIX has been the first 2-dimensional mesh that integrates a large amount of memory into the processing element (BFU). Furthermore, MATRIX introduces to coarse-grained architectures a hierarchical routing architecture as known from FPGAs. The combination of both enables a high device utilization with a high data throughput. The large amount of distributed memory enables massively concurrent data access, but the distribution of the data memory makes application implementation difficult. The large number of different routing resources aggravates the application mapping problem.

The Raw Machine

The Raw machine [WTS97] [Tay99] can be seen as a coarse-grained architecture. It is an array of tiles, where each tile consists of instruction and data memory, an ALU, registers, configurable logic (CL) and a programmable switch that supports both dynamic and static routing (see the figure "The Raw processor: (a) consists of an array of identical tiles. (b) Each tile contains instruction memory (IMEM), data memories (DMEM), an ALU, registers, configurable logic (CL), and a programmable switch with its associated instruction memory (SMEM)."). While this style of switchbox routing is quite flexible, its infinite configuration space involves serious mapping problems. This has been known at the latest since the C.mmp project, which was undertaken by Carnegie-Mellon University in 1971 [SBN82]. In that project 16 processors were connected to 16 memories via a cross-point switch. The project failed to find an algorithm to automatically map applications using the switch.

Each tile of the Raw machine supports multigranular (bit-, byte- and word-level) operations, which can be adapted to a specific application with the configurable logic. Each tile is constructed like a RISC pipeline and is programmed like a microprocessor.

The Raw machine can be placed between a processor array and a coarse-grained reconfigurable mesh. Each tile of the Raw machine has an instruction memory and may process complex operations sequentially. The integration of a data memory enables concurrent data accesses within all tiles. While the distributed memory makes the implementation of applications difficult, the microprocessor-like programming method allows concepts from multi-processor systems to be adopted.

The Pleiades System

The Pleiades architecture (see the figure "The Pleiades heterogeneous architecture", [Rab97] [WZG00]) is composed of a programmable microprocessor and heterogeneous computing elements (referred to as satellites). The architecture template fixes the communication primitives between the microprocessor and the satellites and among the satellites. For each algorithm domain (communication, speech coding, video coding), an architecture instance can be created (with known satellite types and numbers).

To reduce overhead in terms of instruction fetch and global control, the architecture utilizes distributed control and configuration. To achieve distributed control, each satellite is equipped with an interface that enables it to exchange data streams with other satellites efficiently, without the help of a global controller. The communication mechanism between each satellite is dataflow driven. The control means available to the programmer are basic satellite configurations to specify the kind of operation to be performed by the satellite, and configurations for the reconfigurable interconnect to build a cluster of satellites. All configuration registers are part of the processor's memory map and configuration codes are memory writes from the processor's point of view.

The Pleiades reconfigurable architecture aims at low energy consumption by providing a computational platform with mixed programming granularity. The satellites can be microprocessors, FPGAs or coarse-grained reconfigurable dataflow elements.

Pleiades is a heterogeneous architecture, which allows both fine- and coarse-grained reconfigurable architectures to be integrated as satellites into a computing system. The main control comes from a microprocessor, which simplifies application implementation. The availability of fine- and coarse-grained architectures allows each part of an application to be mapped to the best suited target architecture. The heterogeneous architecture also allows memories to be integrated as satellites, which may be accessed via the communication network. The capability of the satellites to communicate independently makes concurrent data accesses possible and has the potential for high data throughput. But because of the complex design space created by the heterogeneous architecture, the application mapping problem is not yet completely solved.

The KressArray-3

The KressArray-III prototype (see [HBH97d], [HHH99a], or [HHH00]) is illustrated in part (b) of the figure "KressArray communication architecture by examples". It consists of a mesh of processing elements (PE), also called rDPUs (reconfigurable Datapath Units), which are connected to their four nearest neighbors by two bidirectional links with a datapath width of up to 32 bit. Here "bidirectional" means that a direction is selected at configuration time, i.e. it is fixed at run time. Nearest neighbor connections are used to transfer operands to and from an rDPU while also routing other data through it (part (e) of the figure), as well as for routing through only (part (d) of the figure). The large number of nearest neighbor connections goes back to physical design ideas [HHH00]. Besides the nearest neighbor links, a background communication architecture (called back buses) with one or more global buses and/or bus segments, i.e. semi-global buses, provides additional communication resources.

Figure: KressArray communication architecture by examples: (a) 4 reconfigurable nearest neighbor ports (rNN ports), (b) 8 rNN ports, (c) 10 rNN ports, (d) reconfigurable Data Path Unit (rDPU), use for routing only; (e) rDPU use for function and routing, (f) 2 back buses per row, (g) segmented single buses per column, (h) 2 per column, 3 per row, and (i) different function sets in alternating columns. Some vertical torus structure examples: (j) no shift vertical; (k) vertical shift right (to next column), (l) shift left (to previous column), and (m) double torus.

Currently the functionality of each PE allows all integer operators provided by the programming language C to be implemented. The software environment [HHH99a] [HHH00], however, also supports rDPUs with other operator repertoires, e.g. ones specialized for accelerator usage in a particular application area. Execution inside PEs is transport-triggered, i.e. it starts as soon as all operands needed are available. The data to be processed and the results to be stored can be transferred to and from the array in two different ways: over the global bus and over rDPU ports at the edges of the array.

MorphoSys Reconfigurable Cell Array

The architecture of the MorphoSys system [LSL99] comprises five components: the core processor, the Reconfigurable Cell (RC) array, the context memory, the frame buffer and a DMA controller. The figure "The MorphoSys architecture" gives an overview of this architecture.

While the core processor executes sequential tasks and controls data transfers between the programmable hardware and the data memory, the reconfigurable hardware is dedicated to exploiting the parallelism available in the application's algorithm. The processor is a RISC processor augmented with specific instructions for controlling other MorphoSys components. These special instructions fall into two categories: DMA instructions and RC array instructions. DMA instructions initiate data transfers between the main memory and the frame buffer, and context loading from the main memory into the context memory.

The frame buffer is organized in two sets, each of two banks (bank A and B) of 128 16-bit words. A 128-bit operand bus connects the frame buffer with the RC array. All RCs of a row share a 16-bit part of the operand bus. Data transfers between the frame buffer and the RC array can only be performed in interleaved mode, i.e. data accesses can only be performed by alternating between bank A and bank B. While this method speeds up access to consecutive data, it restricts random data access.
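The relation between the bus width and the row slices is simple arithmetic: 128 bits divided into 16-bit parts yields eight slices, one per row of RCs. A minimal sketch of this partitioning follows; which slice is wired to which row is not stated in the source, so the assignment of the least significant slice to row 0 is an assumption, as is the function name.

```python
OPERAND_BUS_BITS = 128
SLICE_BITS = 16
NUM_ROWS = OPERAND_BUS_BITS // SLICE_BITS  # 8 row slices


def split_operand_bus(word):
    """Split one 128-bit operand bus word into eight 16-bit row slices.

    Assumption: slice 0 (bits 15..0) goes to row 0, slice 7 to row 7.
    """
    if not 0 <= word < (1 << OPERAND_BUS_BITS):
        raise ValueError("word does not fit on the operand bus")
    mask = (1 << SLICE_BITS) - 1
    return [(word >> (SLICE_BITS * row)) & mask for row in range(NUM_ROWS)]
```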

MorphoSys contains an 8 by 8 matrix of reconfigurable cells (RC). An important feature of the RC array is its three layer interconnect network, which is depicted in the figure "MorphoSys interconnection layers in the reconfigurable cell array". Part (a) of that figure shows the nearest neighbor layer, which connects the RCs in a two-dimensional mesh. Thus, each RC can access data from any of its row/column neighbors. Part (a) also illustrates the second layer, which provides complete row and column connectivity within a quadrant. Therefore, each RC can access data from any other RC in its row/column in the same quadrant. Part (b) illustrates the third layer, which supports inter-quadrant connectivity. It consists of buses that run along the entire length of rows and columns, crossing the quadrant borders.
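The three layers can be summarized as a small decision function that, for two cells, names the lowest layer offering a direct connection. This is a sketch under stated assumptions: quadrants are taken to be 4 by 4 (a quarter of the 8 by 8 array), the nearest neighbor mesh is assumed to continue across quadrant borders, and the function name and (row, column) convention are hypothetical.

```python
ARRAY_SIZE = 8
QUADRANT_SIZE = 4  # assumption: an 8 by 8 array split into four 4 by 4 quadrants


def connecting_layer(a, b):
    """Lowest interconnect layer that directly links cells a and b.

    Returns 1 (nearest neighbor), 2 (same row/column within a quadrant),
    3 (same row/column across quadrants), or None if the cells are
    identical or share neither a row nor a column.
    """
    (ra, ca), (rb, cb) = a, b
    for v in (ra, ca, rb, cb):
        if not 0 <= v < ARRAY_SIZE:
            raise ValueError("cell coordinate outside the array")
    if a == b or (ra != rb and ca != cb):
        return None
    if abs(ra - rb) + abs(ca - cb) == 1:
        return 1
    same_quadrant = (ra // QUADRANT_SIZE == rb // QUADRANT_SIZE
                     and ca // QUADRANT_SIZE == cb // QUADRANT_SIZE)
    return 2 if same_quadrant else 3
```

Cells that share neither a row nor a column have no direct connection on any layer and need an intermediate RC.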

The MorphoSys reconfigurable cell (RC) is the basic programmable element. As the figure "MorphoSys reconfigurable cell (RC) architecture" shows, each RC comprises five components: the ALU-Multiplier, the shift unit, the input multiplexers, a register file with four 16 bit registers and the context register. The ALU-Multiplier has four data input ports. Two 16-bit ports receive data from the input multiplexers, one 32-bit port takes data from the output register and a 12-bit port takes an immediate value from the context word. In addition to standard arithmetic and logical operations, the ALU-Multiplier can perform a multiply-accumulate operation in a single cycle.

The input multiplexers select one of several inputs for the ALU-Multiplier. Multiplexer MUX A selects one input from: (1) the four nearest neighbors in the RC array, (2) other RCs in the same row/column within the same RC array quadrant, (3) the operand bus, or (4) the internal register file. Multiplexer MUX B selects one input from: (1) three of the nearest neighbors, (2) the operand bus, or (3) the register file.

The context register provides control signals for the RC components through the context word. The bits of the context word directly control the input multiplexers, the ALU-Multiplier and the shift unit. The context word determines the destination of the result, which can be a register file and/or the third layer connection. The context word also has a field for an immediate operand value.

The CHESS Array

CHESS is a reconfigurable arithmetic array developed by HP Labs [MSK99]. It consists of 4-bit ALUs, and the connections are 4-bit buses. In contrast to other arithmetic-oriented reconfigurable architectures, CHESS is a chessboard style array (see the figure "The CHESS array: (a) floorplan with embedded RAMs, and (b) detailed floorplan with nearest neighbor wiring"). Each ALU is adjacent to four switchboxes, and each switchbox is adjacent to four ALUs (see the same figure). This allows powerful local connectivity, with each ALU having input and output buses on all four sides, and being able to send data to or receive data from any of the eight surrounding ALUs.

At run time, any switchbox in a CHESS array may be used as a 16 word by 4 bit memory. In this mode all of the switches are disconnected, although buses running over the switchbox can still be used. If large numbers of switchboxes are used as RAM, the routing capability of the array is reduced. To avoid this problem, the array design also supports embedded block RAMs. These 256 word by 8 bit memories are distributed through the array as shown in part (a) of the figure "The CHESS array: (a) floorplan with embedded RAMs, and (b) detailed floorplan with nearest neighbor wiring".
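A short calculation illustrates why larger memories are better served by the block RAMs. It counts how many 16 word by 4 bit switchbox memories would be needed to match a given capacity, assuming they are simply tiled in depth and width and ignoring any extra logic for address decoding and output selection. The function name is hypothetical.

```python
import math

SWITCHBOX_WORDS = 16  # depth of one switchbox used as RAM
SWITCHBOX_BITS = 4    # width of one switchbox used as RAM


def switchboxes_needed(words, bits):
    """Switchboxes required to hold `words` entries of `bits` bits each.

    Assumes plain tiling in depth and width; decoding overhead is ignored.
    """
    if words <= 0 or bits <= 0:
        raise ValueError("memory dimensions must be positive")
    return math.ceil(words / SWITCHBOX_WORDS) * math.ceil(bits / SWITCHBOX_BITS)
```

Under these assumptions, replacing a single 256 word by 8 bit block RAM would take 16 x 2 = 32 switchboxes out of the routing fabric.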

The fundamental computational unit is a 4 bit ALU with a primary set of 16 instructions. This makes the ALUs suitable for cascading to useful media widths, or for nibble-serial implementations. ALU instructions can be either static or dynamic. Static instructions are stored as part of the configuration. Dynamic instructions are generated by user circuitry that is connected to the instruction input of the ALU. This allows instructions to be changed on a cycle-by-cycle basis, to support predicated execution, to execute sequential "engines", to implement specialized processors, or to give the effect of fine-grained reconfiguration.
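Cascading 4 bit ALUs to a wider word can be illustrated with an addition in which each nibble is handled by one ALU and the carry is passed to the next. This is a behavioral sketch only; it assumes a ripple-carry chain and does not model the actual CHESS instruction set or carry wiring. Function names are hypothetical.

```python
def alu4_add(a, b, carry_in):
    """One 4 bit ALU performing an add: returns (4 bit sum, carry out)."""
    total = a + b + carry_in
    return total & 0xF, total >> 4


def cascaded_add(x, y, width_bits=16):
    """Add two unsigned words using width_bits / 4 cascaded 4 bit ALUs."""
    if width_bits <= 0 or width_bits % 4:
        raise ValueError("width must be a positive multiple of 4 bits")
    if not (0 <= x < (1 << width_bits) and 0 <= y < (1 << width_bits)):
        raise ValueError("operand does not fit into the word width")
    result, carry = 0, 0
    for i in range(width_bits // 4):      # least significant nibble first
        a = (x >> (4 * i)) & 0xF
        b = (y >> (4 * i)) & 0xF
        s, carry = alu4_add(a, b, carry)  # carry ripples to the next ALU
        result |= s << (4 * i)
    return result, carry
```

A nibble-serial implementation would reuse a single ALU for the loop iterations over successive cycles instead of instantiating one ALU per nibble.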

The design of the CHESS array has been optimized to reduce communication delay. Similar to the KressArray the CHESS array favors local interconnect between processing elements. But the CHESS array has no global bus. Long range interconnect is achieved by connecting multiple bus segments similar to the routing of FPGAs, but longer buses may be pipelined in CHESS eliminating long delays.

Similar to other approaches CHESS integrates multiple embedded memories to enable concurrent data access. These memories may be accessed via the memory interface from outside.

The DReAM Array

The Dynamically Reconfigurable Architecture for Mobile Systems (DReAM, [BG00] [BGA00a] [BGA00b]) is an array of coarse-grained Reconfigurable Processing Units (RPUs). It is implemented using a 0.35 μm CMOS standard cell process by Mietec/Alcatel. The DReAM architecture connects all RPUs with local and global communication structures (see part (a) of the figure "The DReAM hardware structure: (a) architectural overview, and (b) the reconfigurable processing unit (RPU)"). While the local routing structures are nearest neighbor connections between RPUs, the global interconnects are long range buses, segmented by switching boxes. The routing facilities also support dynamic routing.

The RPU structure (part (b) of the figure "The DReAM hardware structure") has been optimized for the requirements of mobile communication systems [BGA00b]. The RPU is designed to perform arithmetic data manipulations. As shown in that figure the RPU consists of:

The dual port RAMs are used as look-up table, when performing multiplication operations. Further the RAMs may also be used as FIFO or as data memory to store intermediate results. Each RAP unit integrates two barrel shifters and an ALU.

Conclusions

The table "Summary of the technical details of the different coarse-grained reconfigurable architectures" summarizes the characteristics of the presented coarse-grained reconfigurable architectures. These architectures have been presented for two reasons. They are the most likely target platform (see also [Har97] [MHA97]) for high speed reconfigurable computing machines. Since such architectures provide high computation power, they are also well suited as the target platform for address computations. Therefore this thesis will use this technology to develop a generator for application specific data sequencers. But the quality of the support for address computations varies from architecture to architecture:

The embedded memories of many approaches enable a high degree of concurrent data accesses. But the distributed memories are rather small and are not suited to storing application data. Furthermore, propagating the addresses computed by a single address generator to these memories would be quite difficult because of the large amount of routing resources needed and the propagation delays.

As shown in the table "Summary of the technical details of the different coarse-grained reconfigurable architectures", the presented coarse-grained reconfigurable architectures follow different approaches to implement the communication infrastructure:

Table 1. Summary of the technical details of the different coarse-grained reconfigurable architectures.

| Name | Introduced in | Source of Information | Architecture | Granularity | Fabrics | Embedded Memory |
|---|---|---|---|---|---|---|
| DP-FPGA | 1994 | [CL94] | 2d mesh | 1 and 4 bit | 4 bit connections, FPGA style | no |
| KressArray-1 | 1995 | [HKR95b] [Kre96] | 2d mesh | 32 bit | nearest neighbor, global bus | no |
| RaPiD | 1996 | [ECF96a] [ECF97] | 1d array | 16 bit | segmented buses | yes |
| MATRIX | 1996 | [DeH96b] | 2d mesh | 8 bit | nearest neighbor, length four and global lines | yes |
| Raw Machine | 1997 | [WTS97] | 2d mesh | multi-granular | nearest neighbor switched connections | yes |
| Pleiades | 1997 | [Rab97] | 2d mesh | multi-granular | communication network | yes |
| KressArray-3 | 1997 | [HBH97d] [HHH00] | 2d mesh | 32 bit | nearest neighbor, long lines, global bus | no |
| MorphoSys | 1999 | [LSL99] | 2d mesh | 16 bit | nearest neighbor, length two and three, and global lines | no (footnote 5) |
| CHESS | 1999 | [MSK99] | 2d mesh, chessboard style | 4 bit | nearest neighbor | yes |
| DReAM | 2000 | [BGA00b] | 2d mesh | 16 bit | nearest neighbor, segmented buses | yes |

Switchbox Routing

Switchbox routing is implemented in the DP-FPGA (see "The DP-FPGA Architecture"), the Raw machine (see "The Raw Machine"), and Pleiades (see "The Pleiades System"). As already discussed in "The Raw Machine", switchbox routing is very flexible, but opens an infinite design space for application mapping. This is already known from the C.mmp project, which was undertaken by Carnegie-Mellon University in 1971 [SBN82].

Multi-Length Wiring

Multi-length wiring is implemented by MATRIX (see "The MATRIX Architecture") and MorphoSys (see "MorphoSys Reconfigurable Cell Array"). While it is flexible, it opens a large design space, which results in a lack of application development support. A further disadvantage is the chip area overhead of the long range connections.

Routing Channel

The RaPiD architecture (See The RaPiD Architecture) is a linear array of datapath units, which uses segmentable busses as a kind of routing resource. The simplicity of this architecture and its limited design space support automatic application mapping, but the limited capacity of the routing channel restricts that design space further.

Nearest Neighbor Connections Only

The CHESS array provides nearest neighbor routing resources only. As known from the Algotronix CAL1024, having only nearest neighbor connections may lead to routing congestion (see introduction of this chapter). Therefore long distance routing may be expensive in systems with nearest neighbor connections only. But this rule does not necessarily hold for coarse grain platforms. In coarse grain reconfigurable systems it is feasible to provide rich routing support even through cells that are already assigned to functions. In such systems the restricted design space eases application mapping.
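To make the cost argument concrete, the following Python sketch lists the cells a single connection passes through in a 2d mesh that offers nearest neighbor links only. It is an illustration under stated assumptions: the function name is hypothetical, and the dimension-ordered route (first along x, then along y) is chosen here for simplicity and is not taken from any of the cited architectures.

```python
def route_through_cells(src, dst):
    """Return the cells visited by an x-then-y route between two mesh cells.

    Every step uses one nearest neighbor link, so the number of links equals
    the Manhattan distance between source and destination.
    """
    (x, y), (tx, ty) = src, dst
    path = [(x, y)]
    while x != tx:                      # walk along the row first
        x += 1 if tx > x else -1
        path.append((x, y))
    while y != ty:                      # then walk along the column
        y += 1 if ty > y else -1
        path.append((x, y))
    return path


path = route_through_cells((0, 0), (3, 2))
links_used = len(path) - 1              # 5 nearest neighbor links
cells_passed = len(path) - 2            # 4 intermediate cells are crossed
```

In a fine grain array each crossed cell is typically lost for logic, whereas the text above argues that a coarse grain cell can route through while still performing its own function.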

Nearest Neighbor Connections with Global Backup-Bus

Nearest neighbor connections with a global backup-bus are implemented in the KressArray. It has turned out that a simple local routing scheme without much long range routing is sufficiently powerful for high speed communication in coarse-grained reconfigurable architectures. Furthermore it eases application mapping See [HHH00]. Routing congestion, as known from architectures with nearest neighbor connections only, is avoided by the use of a global backup-bus.

Therefore in this thesis the KressArray-3 with no embedded memory will be used as a basis to develop a reconfigurable device / memory infrastructure to alleviate the memory communication problem. The KressArray-3 with the Xplorer Tool See [HHH00] provides very high flexibility in designing array communication architectures, which also supports a rich choice of possible array memory interface architectures. It has been a goal of this thesis to exploit this flexibility for very powerful memory interfacing with special address generator support. This has been a challenge and a chance to find a better solution than known before. To cope with the memory bandwidth problem associated with the use of the KressArray, there are two degrees of freedom: influencing the KressArray design, and developing a memory interface architecture for it.

Memory Technologies

The memory bandwidth problem outlined in the introduction may be targeted on two different levels:

As established in the introduction, memory architectures developed for conventional computers are ill-suited to reconfigurable accelerators. Therefore this thesis will develop a novel memory organization for reconfigurable computing machines. But for high throughput, conventional computers as well as reconfigurable accelerators have to rely on current memory devices. Therefore the first part of this chapter focuses on state of the art memories and how they try to accelerate memory accesses with an intelligent built-in interface. The second part of this chapter discusses ways to interface memories to reconfigurable platforms.

Current Memory Devices Suitable to Form Large Data Memories

In this section the most recent memory implementations and their access optimization techniques are introduced. Here only Dynamic Random Access Memories (DRAM, See [Bae94]) are considered. Currently DRAMs are the only memory technology that implements large data memories at a suitable size and price. Other memory technologies like dedicated memories for graphic adapters or simple Static Random Access Memories (SRAMs, See [Pri96]) are not considered, because these technologies are too expensive and their integration density is too low.

The main difference between the presented memory devices is their interface. The internal memory block is quite similar for all kinds of DRAM. Its integration and speed mainly depend on current semiconductor technology. A good discussion of this topic is given in See [II99].

All presented DRAM types support burst accesses. A burst access is a single access to multiple consecutive memory locations: a burst of length n accesses n consecutive memory locations. Usually such memories integrate a counter, which is set to the initial address and then generates the consecutive addresses. This technique allows multiple memory locations to be accessed by applying only one address. Burst accesses were invented to fill complete, fixed width cache lines in a microprocessor system within one memory access.
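The counter mechanism can be sketched in a few lines of Python. This is a minimal illustration, not a model of any particular device: the function name is hypothetical, and wrap-around at burst or page boundaries, which real devices have to handle, is ignored.

```python
def burst_addresses(start, length):
    """Yield the addresses touched by one burst of the given length.

    Only `start` is supplied from outside; all further addresses are
    produced by the internal counter.
    """
    counter = start
    for _ in range(length):
        yield counter
        counter += 1        # the on-chip counter generates the next address


print(list(burst_addresses(0x100, 4)))   # [256, 257, 258, 259]
```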

Synchronous DRAM

The first DRAMs which operated synchronously to the system clock were Synchronous DRAMs (SDRAM, See [Pri96]). SDRAMs support burst read accesses with a programmable burst length of 1, 2, 4, 8 words or full page. Since programming the burst length is costly, it is done only once. SDRAMs are organized in two or four banks, which allows precharge to be partially hidden and permits interleaving between the banks (see See [Pri96] for detailed information).
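Why interleaving helps, and when it does not, can be illustrated with a small Python sketch. It rests on two simplifying assumptions that are not taken from the data sheets: addresses are mapped to banks by low-order interleaving, and every access needs a fresh precharge, so two back-to-back accesses to the same bank cannot overlap. The function names are hypothetical.

```python
def bank_of(address, num_banks):
    """Low-order interleaving: consecutive addresses rotate through the banks."""
    return address % num_banks


def same_bank_back_to_back(addresses, num_banks):
    """Count consecutive access pairs that hit the same bank.

    Under the assumptions above these are the accesses whose precharge
    cannot be hidden behind an access to another bank.
    """
    banks = [bank_of(a, num_banks) for a in addresses]
    return sum(1 for prev, cur in zip(banks, banks[1:]) if prev == cur)


sequential = [0, 1, 2, 3, 4, 5, 6, 7]
strided = [0, 4, 8, 12, 16, 20, 24, 28]        # stride equal to the bank count
print(same_bank_back_to_back(sequential, 4))    # 0
print(same_bank_back_to_back(strided, 4))       # 7
```

A sequential scan keeps all four banks busy in turn, while a stride equal to the number of banks defeats the interleaving completely. With more banks, fewer strides fall into this worst case.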

Multibank DRAM

Multibank DRAM (MDRAM, see also See Multibank DRAM Technology) has been developed from SDRAM by MoSys See [MoS96] and was fabricated under licence by Siemens Semiconductor See [Sie96] See [Sie97]. MDRAM integrates 32 or 36 banks, which allows activation and precharge of banks to be hidden completely and permits interleaving between all banks. Data is transmitted on a 16 bit bus at both edges of the clock signal, which doubles the memory interface bandwidth.

A very interesting feature is the variable length burst, i.e. the burst length is not fixed and can be different for each read and write access. Therefore MDRAM is not programmed with a specific burst length like SDRAM but has a command bus, on which a stop signal is sent. While this feature makes the MDRAM more flexible, it makes memory control quite complicated.
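The practical difference between a pre-programmed and a variable burst length can be sketched as follows. The Python functions below are hypothetical and deliberately simplified: a fixed-length burst is assumed to always run to completion, start-address alignment is ignored, and the upper limit of 32 words for a variable burst is taken from the 1-32 word range listed for MDRAM in the DRAM summary table.

```python
def fixed_burst_cost(run_length, burst_length):
    """Bursts needed and surplus words moved for a fixed burst length."""
    bursts = -(-run_length // burst_length)          # ceiling division
    surplus = bursts * burst_length - run_length     # words nobody asked for
    return bursts, surplus


def variable_burst_cost(run_length, max_burst=32):
    """Bursts needed when each burst can be stopped after any word."""
    bursts = -(-run_length // max_burst)
    return bursts, 0                                 # no surplus words


print(fixed_burst_cost(11, 8))     # (2, 5)
print(variable_burst_cost(11))     # (1, 0)
```

For a run of 11 words, a burst length fixed at 8 needs two bursts and moves five unneeded words, whereas a variable length burst covers the run exactly. The gain grows when run lengths vary from access to access, as is common for data sequencing in accelerators.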

Double Data Rate Synchronous DRAM

The Double Data Rate Synchronous DRAM (DDR-SDRAM, See [All98] See [Jed99]) has also been developed from SDRAM to compete with Rambus technology. The double data rate technology transmits data on both edges of the clock signal. DDR-SDRAM technology is cheaper than Rambus DRAM and expected to be of similar performance. DDR-SDRAM integrates four internal banks to improve hiding of activation and precharging compared to SDRAMs. It supports 2, 4, or 8 word read and write bursts. The burst length is programmed only once at power-up.

Rambus DRAM

Rambus DRAM (RDRAM, See [Ram99a] See [War99]) was developed from the traditional DRAM, but the architecture has been streamlined and optimized to yield higher performance. Data is read in packets of 16 bytes at a very high clock speed. The RDRAM modules are only 16 bit wide, but they transmit data at both edges of a 400 MHz clock. Therefore the RDRAM chips have to be placed very close to the CPU to reduce radio noise.

Conclusions

The table "Summary of the technical details of the different DRAM types" summarizes the technical details of the presented DRAM types. All of the above DRAM technologies provide enough capacity to implement a sufficiently large data memory. Because SDRAM is the precursor of both MDRAM and DDR-SDRAM, the latter two have been further optimized and generally provide better bandwidth. The main reasons for this are the increased number of internal banks and the double data rate interface. Since the burst length of DDR-SDRAM is pre-programmed, it is less flexible than MDRAM. While the pre-programmable burst length is good for the hierarchic memory of microprocessors, it limits the flexibility of reconfigurable accelerators.

The Rambus technology hides the memory device from the processor in such a way that multiple memory requests may be issued at a time and carried out later by the Rambus controller. This procedure is unsuitable for generic data sequencing. Here a closer coupling between the memory device and the data sequencer is desired to perform optimized accesses.

Table: Summary of the technical details of the different DRAM types.

| | SDRAM | RDRAM | MDRAM | DDR-SDRAM |
|---|---|---|---|---|
| Introduced in | 1992 | 1993 | 1996 | 1998 |
| Technology Generation Presented in This Table | 1999 | 1999 | 1997 | 1999 |
| Source of Information | See [Ibm99a] | See [Ram99b] | See [Sie97] | See [Ibm99b] |
| Voltage | 3.3 V | 2.5 V | 3.3 V | 2.5 V |
| Internal Banks | 2 / 4 | 32 | 32 / 36 | 4 |
| Burst Type | read burst 1, 2, 4, 8, full page | packet (16 bytes) oriented transfers | read and write 1-32 words (=32 bit) | read and write 2, 4, 8 |
| Data Interface | 16 bit | 16 / 18 bit double data | 16 bit double data | 8 bit double data |
| Interface Speed | up to 133 MHz | up to 400 MHz | up to 166 MHz | up to 143 MHz |
| Peak Performance (footnote 6) | 266 MB/s | 1.6 GB/s | 664 MB/s | 572 MB/s (footnote 7) |
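The peak performance figures for SDRAM, RDRAM, and MDRAM follow directly from the data interface width and the interface speed. The Python sketch below reproduces them. It is a plausibility check rather than a device model: the function name is hypothetical, and 1 MB is assumed to mean 10^6 bytes, which is what the table values imply.

```python
def peak_bandwidth_mb_s(bus_bits, clock_mhz, double_data_rate):
    """Peak transfer rate in MB/s for a given bus width and clock."""
    transfers_per_cycle = 2 if double_data_rate else 1   # both clock edges
    return bus_bits / 8 * clock_mhz * transfers_per_cycle


sdram = peak_bandwidth_mb_s(16, 133, double_data_rate=False)   # 266.0
rdram = peak_bandwidth_mb_s(16, 400, double_data_rate=True)    # 1600.0
mdram = peak_bandwidth_mb_s(16, 166, double_data_rate=True)    # 664.0
```

These are upper bounds only. They are reached when bank activation and precharge are fully hidden and bursts are long enough to keep the interface busy.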

The Data Memory Organization of Reconfigurable Systems

Memory bandwidth has been a focus of computer design for more than a decade. Processor speed rises much faster than memory access speed (see See Processor-memory performance gap (taken from [PAC97])., taken from See [PAC97]). For accelerators this discrepancy is even more serious. Therefore this section focuses on the memory organization of reconfigurable systems. Several methods will be discussed and their characteristics will be shown.

Since data access is an important topic in reconfigurable computing, the way data memory is attached to the reconfigurable datapath has a big influence on the system performance See [HBH97c]. Many reconfigurable computers follow the approach of one or several memory banks outside the reconfigurable device (see See Different possibilities to attach data memory to a reconfigurable architecture: (a) external memory, (b) internal memory block, (c) reconfigurable architecture surrounded by internal memory, and (d) embedded memory.). In that way large data memories can be implemented with commercial devices. The disadvantage is that data is transmitted across chip boundaries: long signal lines limit the interface speed, and pinout restrictions of the devices limit the amount of data transferred in parallel. Reconfigurable devices with internal data memory and integrated data sequencing eliminate the off-chip communication bottleneck. The device pins are not needed for data access and the communication datapaths are rather short. Increasing device integration and IP-based design methods support this approach.

Internal memory may be divided into on-chip memory blocks and "logic in memory", which is a fine/medium grain merging of memory and logic (e.g. IRAM See [KP98] See [KAP97] See [PAC97]). While logic in memory may also be used for reconfigurable logic (e.g. see See [KST99]), hardly any systematic method for efficient application development is known. At the moment such platforms are not suited for complex applications.

Currently all architectures with internal data memory struggle with limited memory resources, but the increasing integration density is removing this problem. There are several ways to integrate data memory into the reconfigurable architecture. Many commercial FPGA devices allow configurable resources to be allocated as memory (see See Different possibilities to attach data memory to a reconfigurable architecture: (a) external memory, (b) internal memory block, (c) reconfigurable architecture surrounded by internal memory, and (d) embedded memory.). Compared to external data memory, this already gives a faster memory interface. But this approach has poor area efficiency, and it is limited to a few internal memory banks. In contrast, architectures with embedded memory are much more area efficient. Small SRAM blocks are regularly integrated into the reconfigurable architecture (see See Different possibilities to attach data memory to a reconfigurable architecture: (a) external memory, (b) internal memory block, (c) reconfigurable architecture surrounded by internal memory, and (d) embedded memory.). This is implemented in commercial devices like the Altera FLEX10k CPLD See [Alt98], the Vantis VF1 See [ACS99] or a new FPGA architecture by Actel See [KBK99] as well as in coarse-grained approaches such as CHESS and others (see See Coarse-grained Configurable Architectures). All these devices scale well because of the regular distribution of the memory blocks. But the distribution of memories makes the reconfigurable architecture irregular for application mapping, which may lower the device utilization. A further great disadvantage of distributed memory is that it is hardly accessible from outside. Therefore distributed memory is well suited to store intermediate results, but making it accessible from the outside world for data input or output consumes many resources. This is also the case if a larger memory has to be formed from several embedded blocks.

Another way to integrate memory blocks into a reconfigurable device is to surround the reconfigurable media by memories (see See Different possibilities to attach data memory to a reconfigurable architecture: (a) external memory, (b) internal memory block, (c) reconfigurable architecture surrounded by internal memory, and (d) embedded memory.). While this restricts the scalability, the reconfigurable architecture remains regular, which eases application mapping. A further advantage of this approach is that the memory banks are accessible from outside without occupying reconfigurable routing resources. An example for this approach is the Xilinx Virtex FPGA See [Xil99] (see See Reconfigurable Architecture Surrounded by Memory).

The following subsections examine the data memory implementation of existing reconfigurable computing systems.

Reconfigurable System With External Memory

Reconfigurable systems with external memory differ in the type of utilized memory and in the number of memory blocks. A simple solution for data memory is SRAM. It is easy to handle (e.g. no refresh, precharge etc.) and has short access times. The disadvantages of SRAM are the price and the device size.

To implement large data memories, necessary for many applications, DRAM is better suited than SRAM. It is cheap and available in smaller devices. The disadvantage is that DRAM needs a sophisticated memory controller to generate refresh cycles and to perform the more costly accesses (i.e. bank activation, precharge and so on). Due to the well developed DRAM technology some reconfigurable systems already benefit from access optimization techniques: e.g. PRISM-II (See The PRISM-II System) performs accesses in burst mode, or Riley-2 (See Riley-2) uses Burst Enhanced Data Out (BEDO) DRAM See [Mic95a] See [Mic95b].

A popular hardware level access optimization in reconfigurable systems is concurrent access to parallel memory. Several independent memory banks are attached to the reconfigurable datapath in the expectation that data can be cleverly distributed and accessed concurrently. Unfortunately this very often leads to inconsistencies or a large routing overhead. This method can be applied to SRAM-based systems as well as to DRAM-based ones. As a result reconfigurable systems with external data memories can be grouped in the following classes:

Single SRAM
Parallel SRAM

In some examples of this class, each reconfigurable device of the architecture is connected to only one SRAM but the reconfigurable datapath is formed by multiple reconfigurable devices interconnected with each other. Therefore multiple SRAMs are connected in parallel to the reconfigurable datapath.

Single DRAM
Parallel DRAM

Reconfigurable Architecture With Virtual Internal Memory

Reconfigurable architectures with a single or a few irregular internal data memories are conventional FPGA architectures with reconfigurable resources allocated as memory. Usually the memory of multiple look-up tables is combined into larger memory blocks. Many commercial FPGA architectures integrate this feature. Some examples are the Xilinx XC4000, Spartan and Virtex families See [Xil99]. Because of the way this kind of internal data memory is implemented, it is very area inefficient. This approach is only used for small memories, e.g. to store intermediate results.

Reconfigurable Architecture With Embedded Memory

Many reconfigurable architectures also integrate real SRAM blocks into the same device. Usually such memories are spread regularly over the architecture. Since the memories are integrated in between reconfigurable media, they are also rather small.

Most of the coarse-grained architectures presented in See Coarse-grained Configurable Architectures (RaPiD, MATRIX, Raw and CHESS) integrate this kind of internal memory, but also some fine-grained architectures. The remaining part of this section sketches two commercial FPGAs with embedded SRAMs.

The Altera FLEX10k CPLD Family

The Altera FLEX10k CPLD family is a fine grained architecture (see See [Alt98]). The CPLD has SRAM cells (embedded array block, EAB, see See Altera FLEX 10k CPLD family device block diagram with embedded array blocks (EAB).), which can be used as memory as well as function generators. Each EAB provides 512x8 bits.

The Actel ES Family

The Actel ES FPGA family presented in See [KBK99] is a fine-grained architecture with embedded SRAMs. Each B16x16 tile (See The Actel ES FPGA: (a) overview on a 4-tile device, and (b) B8x8 utilities with embedded SRAM.) has 2 blocks of 512x9 bit SRAM (See The Actel ES FPGA: (a) overview on a 4-tile device, and (b) B8x8 utilities with embedded SRAM.).

Reconfigurable Architecture Surrounded by Memory

This class of reconfigurable devices with internal memory uses a regular block of reconfigurable media, which is surrounded by memory. In this thesis a KressArray architecture surrounded by memory blocks is considered (see See The Data Sequencer Mapped to the KressArray). A fine-grained commercial FPGA following this approach is the Xilinx Virtex family:

The Xilinx Virtex FPGA Family

The Virtex FPGA See [Xil99] shown in See Xilinx Virtex series [Xil99]: (a) architecture overview, and (b) BlockRAM size of different devices. comprises two major configurable elements: input/output blocks (IOBs) and configurable logic blocks (CLBs), which can also be allocated as memory (see See Reconfigurable Architecture With Virtual Internal Memory). The VersaRing I/O interface provides routing resources around the periphery of the device. The architecture also includes clock DLLs (delay-locked loop) for clock distribution delay compensation and clock domain control.

On both sides of the CLB array, the Virtex architecture has dedicated Block RAMs of 4096 bits (BRAMs). BRAM memory blocks are organized in columns. All Virtex devices contain two such columns, one along each vertical edge. These columns extend the full height of the chip. Each BRAM memory block is four CLBs high, and consequently, a Virtex device 64 CLBs high (see XCV1000 in See Xilinx Virtex series [Xil99]: (a) architecture overview, and (b) BlockRAM size of different devices.b) contains 16 memory blocks per column, and has a total of 32 blocks.
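The block count follows from simple arithmetic, sketched below in Python. The constants are the figures given in the text (four CLBs per block, two columns, 4096 bits per block). The function name is hypothetical, and the total capacity is derived here rather than quoted from the data sheet.

```python
CLBS_PER_BRAM = 4       # each BRAM block is four CLBs high
BRAM_COLUMNS = 2        # one column along each vertical edge
BITS_PER_BRAM = 4096    # dedicated Block RAM size


def block_ram_summary(clb_rows):
    """Return (blocks per column, total blocks, total bits) for an array height."""
    per_column = clb_rows // CLBS_PER_BRAM
    total_blocks = per_column * BRAM_COLUMNS
    return per_column, total_blocks, total_blocks * BITS_PER_BRAM


print(block_ram_summary(64))   # (16, 32, 131072)
```

Because the number of blocks grows only with the height of the array, not with its area, the ratio of memory to logic falls as the devices get larger.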

Conclusions

The number and variety of useful memory products on the market is growing. Therefore there is a good chance to solve the more difficult memory bandwidth problem for powerful accelerators. As semiconductor technology continuously increases the integration density, sooner or later memories will be placed as IP cores on the same chip as the reconfigurable platform.

Currently most existing reconfigurable accelerators are based on the external data memory concept. The limited number of off-chip pins and the reduced speed of off-chip communication limit the data throughput. This leads to a discrepancy between data throughput and memory bandwidth. Therefore such accelerators struggle with a bottleneck similar to the one known from von Neumann systems.

The main reason for these implementations is that most commercial reconfigurable devices still integrate only a little internal memory. But with the increasing chip density internal data memory becomes feasible. The trend to IP cores and systems on chip (SoC) confirms the use of internal data memory. Internal data memory implementations are not limited by device pins and achieve higher access speed because of the on-chip communication. When internal memory blocks are used, their number, organization, and integration style influence the memory bandwidth and the efficiency of application mapping support. HDL support is often area inefficient and complicated to use if internal memory blocks are utilized. But this depends on the mixture of memory and logic:

Many FPGAs support allocating reconfigurable resources as internal data memory. This method complicates application design and results in inefficient placement and routing, yet it provides only very little memory. Therefore this approach is not suited to implement data memories of feasible size for reconfigurable computing.

The embedded memories can be seen as a coarse grain logic in memory approach, which brings the memories into the reconfigurable datapath. But the mixture of logic and memory makes the implementation of a global control difficult. Programming from outside and combining multiple banks into a larger memory are also complicated. This places high demands on the associated application mapping software and results in suboptimal mappings.

A reconfigurable architecture surrounded by data memory on the same chip avoids a mixture of logic and memory. Currently it is the most promising approach to internal data memory. Longer communication paths on chip are compensated by easy access from outside and the ability to form larger memories by combining multiple banks. Application mapping is simplified, because the reconfigurable cells are not fragmented by the memory blocks.

Earlier Address Generators

The figure "A simple instruction sequencer (software-driven)" pictures a simple address generator. In procedural processors such an address generator is usually a part of the instruction sequencer. For complex address calculations such machines must use the ALU. In principle address calculation by software requires memory cycles to access constants and variables as well as instruction fetches. Consequently inadequate address generators burden the memory interface with additional load and slow down computations.

In See [HHW90] a grid-based design rule check (DRC) has been published, where the MoM-1 (see See The Address Generator of the Map-oriented Machine 1) with a dedicated address generator achieved a speed-up of more than 2000 compared to a VAX-11/750. The reason for this high acceleration has been the high address calculation load (90% of the CPU time) of the VAX-11/750. Such examples call for steps to reduce the addressing overhead in order to speed up the memory communication. Since the latency of data accesses often determines the computation time (See [PAC97] includes a study on this topic), address generators have the potential to reduce the computation time significantly. Hardware support for address generation may be quite effective in exploiting parallelism of sequencing operations.

This chapter gives an overview on specialized address generator implementations. The presented approaches focus on the address generation overhead in the following topics:

The Structured Memory Access Machine

The Structured Memory Access (SMA) machine See [PD83] is based on sequential software like conventional von Neumann machines, but it also features configuration style programming before execution. Since referencing data structures is the prime cause of address calculation overhead, the SMA machine reduces this overhead with special hardware to generate data structure references. Therefore the SMA machine (See Overview on the Structured Memory Access (SMA) machine.) is divided into two processors: a computational processor (CP) and a memory access processor (MAP, see See Memory Access Processor (MAP) of the Structured Memory Access (SMA) Machine.). The CP is strictly used for the computational process, while the MAP is responsible for the access process, i.e. generating all addresses for data and instructions. For computations, the CP receives an entire instruction block, which is held in an internal buffer. The CP may execute in loop mode by iterating over one or more blocks of instructions in its buffer.

Blocks are obtained from the analysis of programs. Instructions of a program are divided into blocks such that blocks are entered only at the first instruction and execution always proceeds sequentially to the last instruction. A block may have one, two or more successor blocks. The control structure of a program is well identified by a graph in which nodes are blocks and arcs point to successor blocks.
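The source does not state how the program analysis is carried out. As an illustration, the Python sketch below uses the standard "leader" method known from compiler construction, which yields blocks with exactly the stated property (entry only at the first instruction, sequential execution to the last). The instruction format, a pair of opcode and tuple of branch targets, is a hypothetical one chosen for the example.

```python
def split_into_blocks(program):
    """Partition a program into blocks of instruction indices.

    program: list of (opcode, targets); `targets` holds the instruction
    indices the instruction may branch to and is empty for non-branches.
    """
    if not program:
        return []
    leaders = {0}                                  # the first instruction starts a block
    for i, (_, targets) in enumerate(program):
        if targets:
            leaders.update(targets)                # every branch target starts a block
            if i + 1 < len(program):
                leaders.add(i + 1)                 # so does the instruction after a branch
    starts = sorted(leaders)
    ends = starts[1:] + [len(program)]
    return [list(range(s, e)) for s, e in zip(starts, ends)]


loop = [
    ("load", ()),
    ("add", ()),
    ("branch_if", (0, 3)),    # back to the loop head or on to instruction 3
    ("store", ()),
    ("halt", ()),
]
print(split_into_blocks(loop))   # [[0, 1, 2], [3, 4]]
```

The first block is its own successor (the loop) and also leads to the second block, which corresponds to a node with two outgoing arcs in the graph described above.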

Memory accesses inside blocks can be distinguished into absolute accesses always accessing the same data and accesses relative to some index value. The SMA machine implements the function of index registers by using a hardware stack. This stack tracks the active indices of inner loops during program execution, and all data structure references are made by using a subset of these index values. To reduce the access process input, tables in the SMA processor are used to store the base address of each data structure and other information necessary to generate an entire address from indices. These tables must be loaded before any instruction, which uses them, is executed. The tables are loaded once at the beginning of a program execution.
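How a complete address might be formed from a table entry and a subset of the active indices is sketched below in Python. The table layout is an assumption: the source only says that the tables hold the base address and "other information", so the stride fields and all names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructureEntry:
    base: int        # base address of the data structure
    levels: tuple    # index stack levels used (1 = outermost loop)
    strides: tuple   # address step per unit of the corresponding index


def operand_address(entry, index_stack):
    """Combine the base address with the selected index values."""
    offset = sum(stride * index_stack[level - 1]
                 for level, stride in zip(entry.levels, entry.strides))
    return entry.base + offset


# A row-major matrix with 4 columns at address 1000, indexed by loop levels 1 and 2.
matrix_a = StructureEntry(base=1000, levels=(1, 2), strides=(4, 1))
index_stack = [2, 3]                               # current i = 2, j = 3
print(operand_address(matrix_a, index_stack))      # 1011
```

Because the entry is loaded once before execution, the instruction only has to name the table entry. The base address and strides never have to be fetched from memory inside the loop.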

The MAP, as shown in See Memory Access Processor (MAP) of the Structured Memory Access (SMA) Machine., has an internal Operand-Instruction Buffer (OIB) to hold its instructions and the operand specifications of the CP instructions. The MAP can also operate in a loop mode fashion. Operation of the MAP is, to a great extent, independent of the CP. When the MAP begins receiving instructions, it forwards the MAP instructions and the operand specification portions of CP instructions to its OIB. The opcode and the register tag portions of CP instructions are forwarded to the CP instruction buffer. The MAP immediately begins generation of operand addresses. The operand addresses are then placed on the read or write queue of the outstanding memory requests. Write data is produced in the same order as the corresponding write addresses. Thus when write data is produced, it is paired with the appropriate queued write address. As soon as a read request is serviced, the operand returned by that request is forwarded to the CP or to the MAP tables. With such a scheme, reads are performed early, writes are done late, the CP concentrates on the useful calculations of a program, and the MAP is left with the important, but overhead-related, generation of addresses.
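The pairing of late write data with early write addresses relies only on both being produced in the same order. A first-in first-out queue is therefore sufficient, as the following Python sketch shows. The class and method names are hypothetical, and memory is modelled as a plain dictionary.

```python
from collections import deque


class WriteQueue:
    """Queue of write addresses that are waiting for their data."""

    def __init__(self):
        self._addresses = deque()

    def issue_address(self, address):
        """Called by the address side as soon as an address is generated."""
        self._addresses.append(address)

    def data_ready(self, value, memory):
        """Called by the compute side; pairs the value with the oldest address."""
        address = self._addresses.popleft()
        memory[address] = value
        return address


memory = {}
queue = WriteQueue()
queue.issue_address(10)            # addresses run ahead of the computation
queue.issue_address(11)
queue.data_ready("a", memory)      # results arrive later, in the same order
queue.data_ready("b", memory)
print(memory)                      # {10: 'a', 11: 'b'}
```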

Data Types

The SMA implementation distinguishes among four types of operands: (1) immediate operands, (2) scalar operands, (3) data structure operands, and (4) index operands. Immediate operands are data whose values are embedded in an instruction. An index operand is one of the current indices found on the index stack. The index operand is used only to read a current index value from the index stack and transfer its value to the CP. An index operand differs from a scalar operand primarily in that the index operand originates from the MAP while the scalar operand originates from the memory. The operand type may be specified in a subfield of an instruction's operand field or it may be implicitly associated with a particular instruction. Additionally, an indirect addressing mode is provided specifically for use in the calling of subroutines and in the accessing of data items from structures such as linked lists. As with operand type specification, indirect addressing may be specified in a variety of ways.

At some time it may be desirable to distinguish explicitly among several types of data structures. Instead of having a data structure operand, one may wish to have an array operand, a linked list operand, a binary tree operand, etc. For each operand type, some special accessing mechanisms would be provided to improve the speed with which an operand address is generated. Accessing mechanisms as implemented allow the instruction code to reference structured data simply by pointing to the mechanism, which then references the next data in the established pattern.

Scalar Data

Scalar data is treated in the manner of a vector rather than as a set of disassociated items. The specification for a scalar operand includes a specification of a MAP base register and a displacement into a scalar data area in memory. The MAP can have more than one scalar base register to aid in the accessing of local and global variables, such as during subroutine calls. Such a base register can be used as the argument pointer during a subroutine call.

Index Operations

The SMA's memory access processor has special mechanisms to track indices for data structure computations. These indices are used to generate the addresses for specific items of the data structure to be referenced. An index is specified by its current value, final value, step-size and indexing level. When the index is first established, the current value is equal to its initial value. At any time, several indices may be active; and the level, or nesting, of these indices is dictated by the time at which they were instantiated, or set up. In the SMA, the current value, final value, and step-size of an index are kept on a LIFO (last in first out) stack structure known as the index stack (IS). Each stack position is numbered sequentially, with the bottom of the stack numbered level 1. This convention provides a convenient way of referring to the current value of any active index because the bottom of stack entry corresponds to the outermost level of nesting, i.e. level 1. Stack continuation in memory can be provided for overflow.

When a setup index instruction is executed by the MAP, the initial value, final value, and step-size are pushed onto the stack. To change the current value of an index, an increment index instruction is used. This instruction must specify three items: the level of the index which is being incremented, and two initial addresses of blocks which are the targets of a branch outcome. If the current value of the index is less than the final value, control is transferred to the first block which is specified. If, on the other hand, the current value equals or exceeds the final value, control is transferred to the second specified block.

By checking the index value early, during each increment index instruction, and by having the branching information available, the next instruction can start being accessed while the CP is still performing the final computations of a loop. Furthermore, no guess is made about which direction an index-based branch will take, thus no time is wasted in fetching potentially unnecessary blocks of instructions from the memory.

When the current value of the index has reached its final value, that index should be at the top-of-stack and it is removed (popped) from the stack. Two other methods of removing indices from the stack are (1) the "remove index" instruction which removes the highest level current index from the top of the stack and (2) a "clear all indices" instruction which removes all indices from the stack.
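The index stack operations described in this section can be sketched as a small Python class. This is an interpretation, not the SMA implementation. In particular it assumes that the increment index instruction compares the current value with the final value before stepping, so that the loop body also runs for the final value, and that an index is popped as soon as the exit branch is taken. The class, method, and block names are hypothetical.

```python
class IndexStack:
    """LIFO stack of loop indices; level 1 is the bottom (outermost loop)."""

    def __init__(self):
        self._entries = []

    def setup(self, initial, final, step):
        """Setup index: push a new index whose current value is the initial value."""
        self._entries.append({"current": initial, "final": final, "step": step})

    def current(self, level):
        return self._entries[level - 1]["current"]

    def increment(self, level, loop_block, exit_block):
        """Increment index: step and branch, or pop the finished index."""
        entry = self._entries[level - 1]
        if entry["current"] < entry["final"]:
            entry["current"] += entry["step"]
            return loop_block
        if level != len(self._entries):
            raise ValueError("a finished index must be at the top of the stack")
        self._entries.pop()
        return exit_block

    def remove_index(self):
        self._entries.pop()            # "remove index": drop the top entry

    def clear_all(self):
        self._entries.clear()          # "clear all indices"

    def depth(self):
        return len(self._entries)


def visited_pairs():
    """Run a two-level loop nest (i = 1..2, j = 1..3) on the index stack."""
    stack, visited = IndexStack(), []
    stack.setup(1, 2, 1)                                   # level 1
    while True:
        stack.setup(1, 3, 1)                               # level 2
        while True:
            visited.append((stack.current(1), stack.current(2)))
            if stack.increment(2, "inner", "after_inner") == "after_inner":
                break
        if stack.increment(1, "outer", "done") == "done":
            break
    return visited, stack.depth()
```

Calling `visited_pairs()` walks through all six (i, j) combinations in row order and leaves the stack empty. The branch target is known as soon as the increment is evaluated, which is the property the text uses to fetch the next block early.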

To save loading of index instruction parameters from memory, the MAP is loaded with a set of templates for these values at the start of program execution. A template is a specification of the values needed to initialize an index on the index stack. Templates are loaded into an index template table. When an index is set up in the IS, the IS is loaded directly from the index template table. For a particular program, the number of distinct templates could be fairly small. For example, analysis of a Gaussian elimination program shows that 995 dynamic index setups are required, representing 16 static index setups, but only 8 templates are needed. Each index activated with a particular initial specification can use the same entry from the index template table. Even if the number of templates exceeds the table size, judicious reloading limits overhead.
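The saving from templates comes from the fact that several static index setups share the same parameters. The Python sketch below builds a template table by removing duplicates. The function name and the example values are hypothetical and are not the data of the Gaussian elimination program mentioned above.

```python
def build_template_table(static_setups):
    """Return (template table, template id for each static setup).

    static_setups: list of (initial, final, step) tuples, one per setup
    index instruction in the program text.
    """
    table, ids = [], []
    for spec in static_setups:
        if spec not in table:
            table.append(spec)           # first use: allocate a table entry
        ids.append(table.index(spec))    # every use refers to the shared entry
    return table, ids


setups = [(1, 10, 1), (1, 10, 1), (2, 10, 1), (1, 10, 1)]
table, ids = build_template_table(setups)
print(table)   # [(1, 10, 1), (2, 10, 1)]
print(ids)     # [0, 0, 1, 0]
```

Each setup index instruction then only needs to carry a small template id instead of three full parameters.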

The Address Generator of the Map-oriented Machine 1

The Map-oriented Machine 1 (MoM-1) was initially called PISA machine (pixel-oriented system for image analysis) See [HHH84a] See [HHH84b] See [Hir85]. It is a pure image processing machine and has been primarily designed to implement a grid-based design rule check (DRC, See [HHW90]). As already mentioned in the introduction of this chapter, it achieved a speed-up of more than 2000 compared to a VAX-11/750. The reason for this high acceleration has been the high address calculation load (90% of the CPU time) of the VAX-11/750. The complete MoM-1 architecture will be introduced in See The Map-oriented Machine 1.

The address generator of the MoM-1 is called move control unit (MCU). It is an application specific generic address generator, which automatically calculates addresses for a 2-dimensional data memory in two steps. Since it is configured before execution time, the MCU needs no instructions at runtime. Therefore the only memory cycles at runtime are data accesses.

On the first address generation stage, the MCU generates independent x- and y-addresses to access the 2-dimensional data memory. It is designed to perform row-wise data accesses across the memory. The number of possible address manipulations is restricted to:

A scan window concept is implemented by the second address generation step. Based on the current address it may generate up to four different address-values for each dimension:

 

Since there is an address manipulation unit for each (x- and y-) address, any combination of the above variations can be performed on both addresses at the same time. This way the MoM-1 generates simple video scans (see See The Video Scan) and accesses a 3 by 3 scan window (see See The 2-stage address generation.) in a fixed order. All actions of the MCU are controlled by a finite state machine (FSM). The overall architecture of the MCU is pictured in See The Move Control Unit (MCU) of the Map-oriented Machine 1.

A Synthesis Method for Address Generators

D. Grant, P. Denyer and I. Finlay describe in See [GDF89] a synthesis method for application specific address generators. Their method is dedicated to applications where the sequence of storage and retrieval for particular blocks of data is strongly patterned, i.e. data is arranged in one of the following ways:

In the presented approach the necessary address patterns are generated either directly from a dedicated counter, or via circuit transformations (bit shuffling and combinatorial logic operations) applied to a counter output. See Generic address generation architectures. shows a generic model for this scheme, in which a counter is used to provide a consecutive address sequence, which is modified as necessary by an offset, δ, and transform, T.

The offset is additive and simply accounts for an arbitrary delay in commencing the sequence of read operations after the commencement of write operations. This offset function may be avoided if the read sequence does not overlap the write sequence, in which case the counter may simply be reset to commence the read access; or the offset can be achieved by using a second counter started, or reset, to ensure read sequence synchronization. This arrangement is shown in See Generic address generation architectures. and can be beneficial for high speed applications.
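
The generic model of a counter modified by an offset δ and a transform T can be illustrated as follows. This is a minimal sketch: the function names are hypothetical, bit reversal is chosen only as one example of a bit shuffling transform, and it is assumed that the offset is applied before the transform.

```python
def bit_reverse(value, width):
    """Example transform T: reverse the lowest `width` bits of `value`."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def address_sequence(length, width, offset=0, transform=None):
    """Counter output, shifted by an additive offset, then passed through T."""
    mask = (1 << width) - 1
    for count in range(length):
        addr = (count + offset) & mask  # assumption: offset is added before T
        yield transform(addr, width) if transform else addr
```

With `transform=bit_reverse` and a width of 3 bits, the consecutive counter values are turned into the bit-reversed order typical for FFT data; with only an offset, the sequence is a delayed copy of the counter.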

The class of addressed problems can be characterized by partial regularity in the retrieval sequence. In particular this approach exploits redundancies that are present whenever patterns occur whose length is some power of two. Details of the synthesis method and an example can be found in See [GDF89].

The Address Generator of the Map-oriented Machine 2

The Map-oriented Machine 2 (MoM-2) has been developed from the MoM-1 (See Memory Access Processor (MAP) of the Structured Memory Access (SMA) Machine.). The innovation in comparison with the MoM-1 is the 2-level address generation. On the lower level a window based access to the memory is performed. The window location is determined on the upper level, where the slider method See [HHW90] has been used for address generation for the first time. This address generation method makes the MoM-2 address generator very flexible.

The complete MoM-2 architecture See [Hir91] will be introduced in See The Map-oriented Machine 2. The MoM-2 integrates a data sequencer for address generation to access a 2-dimensionally organized data memory. A detailed explanation of the data sequencer of the MoM-2 is given in See [Sch90]. The data sequencer of the MoM-2 (See The data sequencer of the Map-oriented Machine 2.) consists of a Task Manager, a JumpGenerator, and a Single Step Control Unit (SSCU), which operate in a pipelined fashion. The data sequencer is configured by parameters and needs no instructions to generate data accesses. Therefore neither an instruction memory nor memory cycles to fetch instructions are necessary. The Task Manager holds all parameters necessary for address generation and controls the JumpGenerator and the SSCU by changing their configuration.

A basic MoM-2 computation step includes the following actions:

The JumpGenerator of the MoM-2

The data sequencer of the MoM-2 is based on a single JumpGenerator (see See The JumpGenerator of the Map-oriented Machine 2.) which computes the base location of the cache frame in the data memory. The x- and y-addresses are generated generically on the basis of a few parameters, which are explained in detail in See [Hir91]. For this purpose it utilizes two identical 1-dimensional address generators which are synchronized by a special control logic.

Each 1-dimensional address generator holds three identical steppers (see See Map-oriented Machine 2 Stepper Architecture.). Each stepper is programmed with an initial value, which is modified by a positive or negative number in each step. The stepper runs in a loop and generates addresses until a maximum value is reached. While the Base- and Limit-Stepper operate on parameters given by the Task Manager, the Address Stepper obtains its initial and maximum values from the Base- and Limit-Stepper.
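
One plausible reading of the interaction between the three steppers is sketched below. The exact coupling is defined in See [Hir91]; here it is merely assumed that each pair of values delivered by the Base- and Limit-Stepper defines one run of the Address Stepper. All function and parameter names are hypothetical.

```python
def stepper(initial, step, maximum):
    """One stepper: start at `initial` and add `step` until `maximum` is passed."""
    value = initial
    while (step > 0 and value <= maximum) or (step < 0 and value >= maximum):
        yield value
        value += step


def address_generator_1d(base, limit, address_step):
    """Assumed coupling of Base-, Limit- and Address-Stepper.

    `base` and `limit` are (initial, step, maximum) triples, as they might be
    supplied by the Task Manager. Every (base, limit) pair bounds one run of
    the Address Stepper.
    """
    for low, high in zip(stepper(*base), stepper(*limit)):
        yield from stepper(low, address_step, high)
```

For example, `address_generator_1d((0, 1, 2), (3, 1, 5), 1)` produces three runs of four addresses each, with the address range sliding by one position per run.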

The JumpGenerator may generate basic 2-dimensional access patterns without interaction of the next higher control instance. It has no fixed step-width, and the loop boundaries can also be set arbitrarily. For complex access patterns the control of the Task Manager is required. For instance, an access pattern like the JPEG zig-zag scan requires frequent exchange of the parameters.

To describe the access pattern generation of the JumpGenerator in See [HHW90] the slider model has been proposed. This model visualizes the generic address generation process based on a small parameter set. The slider model will be explained in detail in See Address Generation with the Slider Method.

The Single Step Control Unit of the MoM-2

The Single Step Control Unit (SSCU) controls the updates of a scan cache of four by four data words on the basis of the base address provided by the JumpGenerator. The size of the scan cache may be reduced by skipping complete rows or columns.

The updates of the Scan Cache between two steps may affect the complete cache or may be optimized with a shift register structure. The optimization takes place if subsequent positions differ by only one step in horizontal or vertical direction. In that case only the new row or column is updated, while those overlapping with the previous scan cache position are shifted to their proper location. Only at the beginning and the end of loops in the JumpGenerator is a full cache update performed. This method saves many memory accesses.
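
The saving achieved by the shift register optimization can be estimated with a small model. This is a sketch under the assumptions stated in the text (a four by four scan cache, a full update at the start of a loop, a single row or column update after a one-step move); the function names are hypothetical.

```python
def words_to_fetch(prev, new, rows=4, cols=4):
    """Memory reads needed to move the scan cache from `prev` to `new` (x, y)."""
    if prev is None:
        return rows * cols  # full update at the beginning of a loop
    dx, dy = new[0] - prev[0], new[1] - prev[1]
    if abs(dx) == 1 and dy == 0:
        return rows  # only one new column, the rest is shifted
    if dx == 0 and abs(dy) == 1:
        return cols  # only one new row, the rest is shifted
    return rows * cols  # any other move: full update


def total_fetches(positions, rows=4, cols=4):
    """Sum of memory reads over a sequence of scan cache positions."""
    total, prev = 0, None
    for pos in positions:
        total += words_to_fetch(prev, pos, rows, cols)
        prev = pos
    return total
```

For a horizontal scan over eight positions this model needs 16 + 7 * 4 = 44 reads, compared to 8 * 16 = 128 reads without the optimization.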

Internally the SSCU utilizes two 2-bit counters to generate the memory addresses. These counters are controlled by an FSM based on the programming from the Task Sequencer. See The Single Step Control Unit of the Map-oriented Machine 2 for scan window generation. shows the basic architecture of the SSCU.

The Video Signal Processor

In See [KOD91] an application specific address generation unit (AGU) for video signal processing is described. The AGU has been designed to generate image memory addresses in a digital signal processor (DSP). In this thesis this DSP is referred to as video signal processor (VSP). This approach is in contrast to conventional DSP architectures, which are mostly optimized to perform a large number of high-speed multiply-accumulate operations.

The AGU implements some ideas from the MoM-2 address generator. It implements a 2-level address generation with window based memory access. Compared to the MoM-2 it is less flexible, because the high level address generation does not implement the complete slider method. The AGU is based on an application domain specific address generation model derived from the slider method.

The Overall Architecture of the DSP

See The video signal processor (VSP) as a basis for the utilization of the address generation unit (AGU). shows the overall architecture of the DSP which includes the AGU proposed in See [KOD91]. An important architectural feature is the employment of AGUs that are independent of the execution unit (EU).

The three AGUs are implemented to calculate the address for external image memory. The number of AGUs, three, is obtained by analyzing how many image memories are used in video signal processing. The constitution of the AGUs independent of the EU is effective, because the EU may concentrate on calculations for actual data, and therefore, the AGUs and the EU can operate in parallel.

Addressing Modes of the AGU

Seventeen addressing modes are prepared in the AGU (See VSP address generation unit (AGU) addressing modes. and See VSP address generation unit (AGU) sub modes of neighborhood search. show the addressing modes). They were defined by analyzing image processing algorithms. The relationship between the analyzed typical image processing algorithms and their associated addressing modes. Seven initial parameters, e.g., the start address, the end address, the increment value, and the aspect ratio of the image plane, have to be specified before execution.

The 2-dimensional raster scan mode (See VSP address generation unit (AGU) addressing modes.(1)) is similar to the TV raster scan method, and it is used in most image processing algorithms. In this 2-dimensional raster scan mode, image data are accessed from the left edge to the right edge along the scanning line. When scanning reaches the right edge, it returns to the left edge of the line just below the previous one and begins to scan again.

The block scan mode (See VSP address generation unit (AGU) addressing modes.(2)) is suited for spatial filtering. In the block scan mode, the local area kernel itself slides like in a 2-dimensional raster scan while the image data are scanned within the local kernel.
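
The two scan modes described so far can be expressed as simple address generators. The following is a minimal sketch, not the AGU implementation: the function names are hypothetical, the image is assumed to be stored row by row starting at address 0, and the kernel is assumed to move in steps of one pixel and to stay inside the image.

```python
def raster_scan(width, height):
    """2-dimensional raster scan: left to right, then the next line below."""
    for y in range(height):
        for x in range(width):
            yield y * width + x


def block_scan(width, height, kernel_w, kernel_h):
    """Block scan: the kernel slides in raster order; inside the kernel the
    data are scanned in raster order as well."""
    for ky in range(height - kernel_h + 1):
        for kx in range(width - kernel_w + 1):
            for y in range(kernel_h):
                for x in range(kernel_w):
                    yield (ky + y) * width + (kx + x)
```

For a 3 by 3 image and a 2 by 2 kernel, `block_scan` visits four kernel positions and emits four addresses for each of them, which is the access pattern needed by a spatial filter.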

The neighborhood search mode (See VSP address generation unit (AGU) addressing modes.(3)) is used in algorithms, like labeling, region segmentation, and border following. In the neighborhood search mode, the AGU successively generates the neighbor addresses of an address which is generated by other methods such as the indirect access mode. Eight variations to generate these neighbor addresses are prepared in the neighborhood search mode.

See VSP address generation unit (AGU) sub modes of neighborhood search. shows the eight variations of the neighborhood search mode. In See VSP address generation unit (AGU) sub modes of neighborhood search., the number in each grid means the accessing order. For example, in the NAC8 mode, the eight neighboring points are accessed according to the accessing order. This accessing method was employed by considering the adjacent definition in image processing. Furthermore, it is effective to use from NRS8U to NRS4L modes combined with the 2-dimensional raster scan mode in some particular image processing applications.

The 2-dimensional indirect access mode (See VSP address generation unit (AGU) addressing modes.(4)) is used when the AGU outputs an irregular address which is calculated by using the arithmetic unit in the EU. In this mode, the AGU adds the offset values to the results of the arithmetic units, and generates the address of the external image memory.

The 1-dimensional raster scan mode and the 1-dimensional indirect access mode (See VSP address generation unit (AGU) addressing modes.(5),(6)) are 1-dimensional versions of the 2-dimensional raster scan mode and the 2-dimensional indirect access mode, respectively.

In addition to these modes, the FFT mode and the affine transformation mode are prepared as special modes. In the FFT mode, not only the addresses of a 1-dimensional FFT but also those of a 2-dimensional FFT can be generated.

The Architecture of the AGU

See The VSP address generation unit (AGU) architecture. shows the constitution of the AGU. The AGU consists of 32 bit counters, 32 bit adders, barrel shifters, a bit reverse circuit and a neighborhood access decoder. It is structured as a three stage pipeline. Barrel shifters are used for changing the aspect ratio of a 2-dimensional image plane. Each barrel shifter can perform both left and right 24 bit shifts, and therefore the aspect ratio of the image plane can be changed from a 1 bit x 16M bit narrow figure to a 16M bit x 1 bit wide figure. A bit reverse circuit which can adjust the bit reverse width is used in the FFT addressing mode. The neighborhood access decoder generates the relative distance between a central point and neighborhood points. The neighborhood addresses are generated by adding the distance to the address of the central point which is assigned from the internal data bus according to the appointed neighborhood access mode described in See VSP address generation unit (AGU) sub modes of neighborhood search..
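
The role of the barrel shifter and of the neighborhood access decoder can be illustrated as follows. This is a sketch under assumptions: the row length of the image plane is taken to be a power of two selected by the shift amount, and the order of the eight neighbours is hypothetical, since the real accessing order is defined by the figure referenced above. All names are illustrative.

```python
# Hypothetical accessing order of the eight neighbours as (dx, dy) pairs.
EIGHT_NEIGHBOURS = [(-1, -1), (0, -1), (1, -1), (-1, 0),
                    (1, 0), (-1, 1), (0, 1), (1, 1)]


def linear_address(x, y, shift):
    """Map (x, y) to a linear address for a row length of 2**shift words."""
    if not 0 <= shift <= 24:
        raise ValueError("shift amount outside the 24 bit shifter range")
    return (y << shift) + x


def neighbour_addresses(center, shift, deltas=EIGHT_NEIGHBOURS):
    """Add the relative distance of every neighbour to the central address."""
    return [center + (dy << shift) + dx for dx, dy in deltas]
```

Changing only `shift` changes the aspect ratio of the image plane without touching the rest of the address computation.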

The selectors in See The VSP address generation unit (AGU) architecture. are controlled according to the appointed addressing mode. The maximum value, minimum value, or increment value, of the counter is set by initial parameters.

The Address Generator of the Map-oriented Machine 3

The MoM-3 is the third machine in the series of map-oriented machines. It was developed further from the MoM-2. It operates on 2-dimensional memory and is based on generic principles. The address generation method of the MoM-3 data sequencer is the same as for the MoM-2. Therefore the range of supported address patterns is also the same. But the MoM-3 data sequencer architecture has been extended to support multiple (up to 7) access patterns at the same time. The lack of this capability has been a great drawback of the MoM-1 and MoM-2, which are limited to applications operating on only one data set.

See The Data Sequencer of the Map-oriented Machine 3. shows the data sequencer of the MoM-3. All components of the MoM-3 including the data sequencer are communicating via a single bus called MoMbus which is derived from the VMEbus See [Pet89]. The data sequencer of the MoM-3 consists of an Instruction Sequencer (IS, See [Zip94]) and up to 7 Generic Address Generators (GAG, See [Web93] See [Sch95]). Since there is only one bus for data and address communication, the GAGs are operating according to the Round Robin method. That way the GAGs synchronize address generation. As a result of this method the data sequencer has to be set-up for an application by connecting or disconnecting GAGs physically, because each connected GAG has to provide an address in each loop.
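
The Round Robin operation of the GAGs on the single bus can be modelled as an interleaving of address streams. This is a minimal sketch with hypothetical names, in which each GAG is represented simply by a sequence of addresses.

```python
def round_robin(gags):
    """Interleave the address streams of the connected GAGs.

    In every loop each connected GAG provides exactly one address, in a
    fixed order. Yields (gag_number, address) pairs as they would appear
    on the shared bus.
    """
    if not 1 <= len(gags) <= 7:
        raise ValueError("between 1 and 7 GAGs can be connected")
    for bus_round in zip(*gags):
        for gag_number, address in enumerate(bus_round):
            yield gag_number, address
```

The model also shows why unused GAGs have to be disconnected: a GAG without a useful address stream would still occupy its slot in every loop.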

The GAG is a single device, which computes a sequence of addresses from a set of parameters. The parameters are programmed by the Instruction Sequencer. To support efficient access to structured data, a GAG operates in a two-stage pipeline. The first stage computes handle positions for a so-called scan window, which represents a neighbourhood of data elements around the handle position. Therefore this stage is called Handle Position Generator (HPG). A handle position consists of two 16-bit values for the two dimensions of the data memory. The sequence of handle positions describes how the corresponding scan window moves across the data memory (See Block diagram and operation illustration of the MoM-3 generic address generator.). Such a sequence of handle positions is called scan pattern.

The second pipeline stage computes a sequence of offsets to the handle positions, to obtain the effective memory addresses for the computations. Therefore this stage is called Memory Address Generator (MAG). The offsets may be programmed to appear in arbitrary order, because the memory addresses are computed from simple RISC-like instructions (See Block diagram and operation illustration of the MoM-3 generic address generator.).

The Handle Position Generator

The Handle Position Generator (HPG) has been derived from the MoM-2 JumpGenerator with some improvements but it is still based on the same address generation method. The programming model of the memory is a two-dimensional map. Therefore the HPG consists of two identical parts, one for each dimension (See Handle position generator of the MoM-3 generic address generator.). The so-called 1-D Address Generators operate in parallel and synchronize through a trigger logic, which preserves the symmetry of the design. The programmer decides, which (if any) 1-D Address Generator operates as master, triggering the other 1-D Address Generator to perform a step. The trigger logic routes the trigger signals according to the programmer's specification. Furthermore, it evaluates all conditions, which require a knowledge of the state of the complete system (e.g. when a scan pattern terminates).

The one-dimensional positions produced by each of the 1-D Address Generators are checked against segment limits, before they are presented to the Memory Address Generator as valid handle positions. This serves as a kind of memory protection scheme, which is especially useful, if the handle positions are computed under control of the data manipulating devices, dependent on the data processing results. The segment check units make sure, that all handle positions of a GAG remain within a programmer-defined orthogonal bounding box in the two-dimensional memory map. By providing the maximum and minimum offsets that occur in the Memory Address Generator program, the segment checks can even evaluate whether the bounding-box would be left during the generation of memory addresses and invalidate such a handle position before it is passed on in the pipeline. The actual address computations are done by the 1-D Address Generators.
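
The segment check can be summarized by a single predicate. The sketch below uses hypothetical names and assumes that the bounding box and the offset range are given as inclusive minimum and maximum values.

```python
def handle_is_valid(handle, box, offset_range=((0, 0), (0, 0))):
    """Check a handle position against an orthogonal bounding box.

    handle:       (x, y) handle position
    box:          (x_min, y_min, x_max, y_max) bounding box, inclusive
    offset_range: ((dx_min, dx_max), (dy_min, dy_max)), the extreme offsets
                  used by the Memory Address Generator program
    """
    x, y = handle
    x_min, y_min, x_max, y_max = box
    (dx_min, dx_max), (dy_min, dy_max) = offset_range
    return (x_min <= x + dx_min and x + dx_max <= x_max and
            y_min <= y + dy_min and y + dy_max <= y_max)
```

With the offset range supplied, a handle position on the border of the box is already rejected if any address generated around it would leave the box.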

The Memory Address Generator

In contrast to the HPG the Memory Address Generator (MAG) is a great improvement in flexibility of the MoM-3 compared to the MoM-2. The MAG resembles a kind of RISC processor with a special purpose instruction set and on-chip instruction memory. The only task that has to be performed by the MAG, is to provide an arbitrary sequence of offsets to be added to the handle position to perform memory accesses in a local neighbourhood around the scan window handle position. The most efficient way to support arbitrary offset sequences, is to make the offsets directly programmable in a memory that is scanned linearly. At every new handle position, the memory is scanned again from the beginning. But since the length of the offset sequence varies from application to application, an escape mechanism has to be programmed into the offset memory, to signal the end of an offset sequence. Interpreting the offsets as read or write instructions and the escape code as branch to the beginning, the MAG can be programmed with a small and simple instruction set. Additional instruction codes have to be introduced to support pipelining in the data processing devices with the requirement of conditionally executed read or write operations at the beginning and at the end of loops, to fill the pipeline, and to flush results remaining in the pipeline.

The range of offsets may be -32 to +31 in both dimensions of the data memory. The memory to store the instructions (address instruction RAM, AIR) has 256 entries, so that a maximum of 253 references to the data memory can be made from a single handle position. At least three instructions are overhead, which have to be inserted to accept a new handle position from the Handle Position Generator, to signal the start of computations to the data processing devices, and to jump to the beginning at the end of an offset sequence.
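
The linear scan of the offset memory with an escape code can be sketched as follows. The instruction encoding shown here is purely hypothetical (tuples instead of 16-bit words), and of the three overhead instructions only the jump back to the beginning is modelled.

```python
AIR_SIZE = 256                   # entries of the address instruction RAM
OFFSET_MIN, OFFSET_MAX = -32, 31


def run_offset_program(program, handle):
    """Scan the offset program once for one handle position.

    program: list of ("read" | "write", dx, dy) entries, ended by ("jump",)
    handle:  (x, y) handle position from the Handle Position Generator
    Returns the list of (operation, x, y) accesses that are generated.
    """
    if len(program) > AIR_SIZE:
        raise ValueError("program does not fit into the instruction RAM")
    hx, hy = handle
    accesses = []
    for instruction in program:
        if instruction[0] == "jump":  # escape code: end of offset sequence
            break
        op, dx, dy = instruction
        if not (OFFSET_MIN <= dx <= OFFSET_MAX and OFFSET_MIN <= dy <= OFFSET_MAX):
            raise ValueError("offset outside the range -32 to +31")
        accesses.append((op, hx + dx, hy + dy))
    return accesses
```

Calling the function once per handle position corresponds to scanning the memory again from the beginning at every new handle position.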

After the instructions have been fetched and decoded, the offsets have to be added to the current handle position. A number of steps are performed with each of the resulting addresses, to transform them into physical memory addresses. These tasks are completely hardwired, because they have to be applied to each address. The first is an address modification to allow for cyclic addressing. For each dimension, a CycleMask word is used to keep selected bits of the address from changing, and a CyclePattern word provides default values for the masked bits. If the CycleMask is partitioned into a leading block of 16 - k zeroes and a trailing block of k ones, for example, the resulting addresses automatically wrap around at the 2^k boundary, because the values of the higher (16 - k) bits are masked. The inputs to the following step are still two 16-bit address words, one for each dimension. To be able to access a conventional linear memory, the address parts have to be combined to a linear memory address. The address parts of the two dimensions may be combined to a real linear memory address in four ways: one row of two-dimensional memory (x dimension) may consume an address space of 10, 12, 14, or 16 bits. This allows the "size" of the data memory to be adjusted to the size of the processed data, reducing wastage of address space. After the concatenation of the two address parts to a linear address, a 32-bit base address is added, to obtain the effective memory address. The base address typically is the starting address of the data array referenced by this GAG. Finally, the bus interface handles the protocol for the actual data transfers between the data memory and the data manipulating devices, using the effective memory address to access the data memory. A block diagram (See Memory address generator of the MoM-3 generic address generator.) illustrates these tasks, as they are performed in the Memory Address Generator. The AIRport register has been introduced for two reasons.
First, the instructions in the Address Instruction RAM (AIR) are only 16 bits wide, so that the AIRport register serves as an interface register, to transfer two instructions with one configuration data transfer. Second, to the processor which downloads the parameters, the Address Instruction RAM is hidden behind the AIRport register and consumes only a single address. The programming model corresponds to a shift register, allowing a complete instruction stream to be written to the same AIRport address during configuration. This can be done efficiently using block transfers. This concept improves the scalability of the GAG, because different sizes of the Address Instruction RAM do not require a different layout of configuration register addresses, but only a different length of the block transfers. This is true for the number of instructions that can be stored as well as for the format of the instructions (32-bit instructions would allow for a larger offset range, for example). The resulting changes to the download software can easily be parameterized, whereas the processor interface remains unchanged for all these variations of Memory Address Generators.
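
The hardwired address transformation steps (cyclic addressing, concatenation, base address addition) can be sketched as follows. The function names are hypothetical. It is assumed that a one in the CycleMask lets the address bit pass, that a zero replaces it by the CyclePattern bit, that the x part fits into the selected row width, and that the result is truncated to 32 bits.

```python
def apply_cycle(addr, cycle_mask, cycle_pattern):
    """Cyclic addressing for one dimension (16-bit words)."""
    passed = addr & cycle_mask                          # bits that may change
    fixed = cycle_pattern & ~cycle_mask & 0xFFFF        # default values for masked bits
    return passed | fixed


def effective_address(x, y, x_bits, base,
                      mask_x=0xFFFF, pattern_x=0,
                      mask_y=0xFFFF, pattern_y=0):
    """Combine a two-dimensional address into an effective memory address."""
    if x_bits not in (10, 12, 14, 16):
        raise ValueError("a row may consume 10, 12, 14, or 16 address bits")
    x = apply_cycle(x & 0xFFFF, mask_x, pattern_x)
    y = apply_cycle(y & 0xFFFF, mask_y, pattern_y)
    linear = (y << x_bits) | (x & ((1 << x_bits) - 1))  # concatenate y and x
    return (base + linear) & 0xFFFFFFFF                 # add 32-bit base address
```

With `cycle_mask = 0x000F` (k = 4) the address 0x13 is mapped to 0x03, which is the wrap-around at the 2^4 boundary described above.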

Comments on the MoM-3 Address Generator

The Map-oriented Machine 3 (MoM-3) address generator has been developed from the MoM-2 address generator (See The Address Generator of the Map-oriented Machine 2). As in the MoM-2, a 2-level address generation using the slider method See [HHW90] is performed. Additionally the MoM-3 utilizes multiple address generators operating in parallel. This accelerates address generation and enables multiple scan windows operating in parallel. Further the MoM-3 address generator supports a flexible low level sequencing, which is needed for the data scheduling performed by the DPSS See [Kre96]. The overall MoM-3 architecture See [Rei99] will be introduced in See The Map-oriented Machine 3. A detailed explanation of the data sequencer of the MoM-3 is given in See [Web93] or See [HR95]. The research presented in this thesis was started with the completion of the MoM-3 implementations. Therefore the MoM-3 data sequencer can be seen as the predecessor of this work. For this reason some detailed critique of the MoM-3 data sequencing is given in this subsection.

The address generation method is optimized to generate data accesses to 2-dimensional data sets. While the address generation hardware is optimized to generate such access sequences, it doesn't exploit 2-dimensional memory organization for access optimizations like e.g. interleaved memory. The only performed memory access optimization is the cache like exploitation of scan window overlappings.

The main address generation is performed by the GAGs. Because the GAGs can't change their parameter set during runtime, they can only perform two nested loops per dimension. To perform more complex scan patterns (e.g. the zig-zag enumeration step of the JPEG image compression method), the GAGs must be stopped and reprogrammed by the software based Instruction Sequencer. Some complex address calculations are completely performed by the Instruction Sequencer. This inflexibility of the GAG also means that regular scan windows can't be read generically by the GAG, but must also be read by the MAG. This causes a high configuration overhead for such applications.

After address generation the 2-dimensional address is mapped to the 1-dimensional address space of the linear physical memory. Instead of using a flexible mapping scheme like e.g. the VSP in See The Video Signal Processor, a multiplexer selects between four fixed mappings (see Memory Address Generator at See The Memory Address Generator). While the MoM-3 implementation of this 4 to 1 multiplexer needs more hardware than an optimized barrel shifter, this mapping scheme also wastes a lot of data memory for many applications.

The bus oriented implementation of the MoM-3 causes a serious bus bottleneck. All addresses and data must be transferred via the single MoMbus and all data is stored in the same memory (see See The Map-oriented Machine 3). Since the generic address generation method is quite fast, only one GAG is really working at a time and the others are waiting with a bus request.

While parallel GAGs could logically work concurrently, this coarse grain parallelism is also often impossible because of the data dependencies of the application. Usually the data needed at a time is determined by a single scan window and not by multiple scan windows. As a result most applications don't exploit the possible parallelism of multiple scan windows.

In fact the MoM-3 data sequencer has a big hardware overhead, since only one GAG is working at a time. This overhead could be alleviated for some applications by having multiple datapaths for address and data as briefly considered in See [HKR95a]. In some cases parallelism could be exploited.

In See [Rei99] a GAG architecture with exchangeable programming has been briefly considered for the MoM-3. This architecture would avoid the hardware overhead of the implemented MoM-3 by programming a single GAG with the parameters for several video scans. During execution the suggested GAG would switch between the configurations.

The Texas Instruments TMS320C54x DSP

In this section a commercial digital signal processor (DSP) is introduced as an example and its data address generation method is explained. While it performs software based sequencing, it provides a certain hardware support to avoid too much address computation overhead.

The Texas Instruments TMS320C54x devices are fixed-point DSPs in the TMS320 family See [TI97]. The '54x meets the specific needs of real-time embedded applications, such as telecommunications. It combines an advanced Harvard architecture (with one program memory bus, three data memory buses, and four address buses), a CPU with application specific hardware logic, on-chip memory, on-chip peripherals, and a highly specialized instruction set.

TMS320C54x Key Features

The CPU (see See The Texas Instruments TMS320C54x DSP architectural overview.) is an advanced multibus architecture with one program bus, three data buses, and four address buses. It includes a 40-bit arithmetic logic unit (ALU), including a 40-bit barrel shifter and two independent 40-bit accumulators. Additionally a 17-bit x 17-bit parallel multiplier is coupled to a 40-bit dedicated adder to perform a non-pipelined single-cycle multiply/accumulate (MAC) operation. Furthermore the CPU includes a compare, select, store unit (CSSU) to perform add/compare selection operations as needed in the Viterbi operator. An exponent encoder is integrated to compute the exponent of a 40-bit accumulator value in a single cycle.

To perform data sequencing two dedicated address generators are available, including eight auxiliary registers and two auxiliary register arithmetic units.

Separate program and data spaces allow simultaneous access to program instructions and data, providing a high degree of parallelism. For example, three reads and one write can be performed in a single cycle. Instructions with parallel store and application-specific instructions fully utilize this architecture. In addition, data can be transferred between data and program spaces. Such parallelism supports a powerful set of arithmetic, logic, and bit-manipulation operations that can all be performed in a single machine cycle. Also, control mechanisms to manage interrupts, repeated operations, and function calling are included.

Memory Organization

The TMS320C54x has 192K words x 16-bit addressable memory space (64K-words program, 64K-words data, and 64K-words I/O). Further it integrates several types of on-chip memory:

  • 2K, 20K or 32K-words program ROM,
  • 0, 8K or 16K-words program/data ROM,
  • 5K, 6K, 8K, or 10K-words dual-access RAM (DARAM), and
  • 0 or 24K-words single-access RAM (SARAM).

The internal RAMs can be configured as data memory or as program/data memory. The DARAM can be accessed twice per machine cycle, so the CPU can read from and write to a single block in the same cycle.

Data Addressing

The TMS320C54x has an independent data-address generation logic (DAGEN, see also See The Texas Instruments TMS320C54x DSP architectural overview. and See [TI97]). The CPU offers seven basic data addressing modes. During the execution of instructions using direct, indirect, or memory-mapped register addressing, the DAGEN computes the addresses of data-memory operands. The basic addressing modes are:

  • Immediate addressing uses the instruction to encode a fixed value.
  • Absolute addressing uses the instruction to encode a fixed address.
  • Accumulator addressing uses the accumulator A to access a location in program memory as data.
  • Direct addressing uses seven bits of the instruction to encode the lower seven bits of an address. The seven bits are used with the data page pointer (DP) or the stack pointer (SP) to determine the actual memory address.
  • Indirect addressing uses the auxiliary registers to access memory.
  • Memory-mapped register addressing uses the memory-mapped registers without modifying either the current DP value or the current SP value.
  • Stack addressing manages adding and removing items from the system stack.

In the following sections the three addressing modes performed by the DAGEN are explained in detail.

The Direct Data Addressing

In direct addressing mode (see See TMS320C54x direct addressing block diagram.), the instruction contains the lower seven bits of the data-memory address (dma). The 7-bit dma is an address offset that is combined with a base address, either the data-page pointer (DP) or the stack pointer (SP), to form a 16-bit data-memory address. Using this form of addressing, any of 128 locations can be accessed in random order without changing the DP or the SP.

Either DP or SP can be combined with the dma offset to generate the actual address. The compiler mode bit (CPL), located in status register ST1, selects which method is used to generate the address:

  • When CPL = 0, the dma field is concatenated with the 9-bit DP field to form the 16-bit data-memory address.
  • When CPL = 1, the dma field is added (positive offset) to SP to form the 16-bit data-memory address.
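
The two ways of forming the address can be sketched in a few lines of Python. This is only an illustrative model of the description above: the function name `direct_address` and the wrap-around of the SP-relative sum at 16 bits are assumptions, not taken from [TI97].

```python
def direct_address(dma: int, cpl: int, dp: int = 0, sp: int = 0) -> int:
    """Form a 16-bit data-memory address from a 7-bit dma offset."""
    if not 0 <= dma <= 0x7F:
        raise ValueError("dma must fit in seven bits")
    if cpl == 0:
        # CPL = 0: the 9-bit DP field supplies the upper bits,
        # the 7-bit dma supplies the lower bits (concatenation).
        return ((dp & 0x1FF) << 7) | dma
    # CPL = 1: dma is added as a positive offset to SP.
    # Wrapping at 16 bits is an assumption of this sketch.
    return (sp + dma) & 0xFFFF
```

With DP = 3 and dma = 5 the concatenation selects word 5 of data page 3, while with CPL = 1 the same dma addresses the fifth word above the stack pointer.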

The Indirect Data Addressing

In indirect addressing, any location in the 64K-word data space can be accessed via a 16-bit address contained in an auxiliary register. The TMS320C54x has eight 16-bit auxiliary registers (AR0-AR7). Indirect addressing is used mainly when there is a need to step through sequential locations in memory in fixed-size steps.

When memory is addressed with indirect addressing, the auxiliary register and the address can be optionally modified by a decrement, an increment, an offset, or an index. Special modes offer circular and bit-reversed addressing. A circular buffer size register (BK) is used with circular addressing. The AR0 register is used for indexed and bit-reversed addressing modes in addition to being used to point to memory as the other auxiliary registers do.
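
The two special modes can be illustrated with a simplified Python model. It is a sketch under stated assumptions, not the device's exact update rule: the explicit `base` argument of `circular_step` and all function names are hypothetical, and bit-reversed addressing is modelled as an addition of AR0 whose carry propagates from the most significant towards the least significant bit.

```python
def bit_reverse(value: int, bits: int = 16) -> int:
    """Mirror the lowest `bits` bits of value."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def bit_reversed_step(ar: int, ar0: int, bits: int = 16) -> int:
    """Add AR0 to AR with reverse carry propagation (MSB towards LSB)."""
    mask = (1 << bits) - 1
    total = (bit_reverse(ar, bits) + bit_reverse(ar0, bits)) & mask
    return bit_reverse(total, bits)


def bit_reversed_sequence(start: int, ar0: int, count: int) -> list:
    """Collect `count` addresses produced by repeated bit-reversed steps."""
    addresses, ar = [], start
    for _ in range(count):
        addresses.append(ar)
        ar = bit_reversed_step(ar, ar0)
    return addresses


def circular_step(ar: int, step: int, base: int, bk: int) -> int:
    """Advance AR by step inside a buffer of bk words starting at base."""
    # The explicit base is an assumption; the text only names the BK register.
    return base + (ar - base + step) % bk
```

For an 8-point FFT with AR0 set to half the FFT size, `bit_reversed_sequence(0, 4, 8)` visits the offsets in the order 0, 4, 2, 6, 1, 5, 3, 7.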

Indirect addressing is flexible enough not only to read or write a single 16-bit data operand from memory with one instruction, but also to access two data-memory locations with one instruction. Accesses of two data-memory locations include reads of two independent memory locations, reads and writes of two consecutive memory locations, and a read of one memory location combined with a write to another memory location.

Two auxiliary register arithmetic units (ARAU0 and ARAU1) operate on the contents of the auxiliary register. The ARAUs perform unsigned, 16-bit auxiliary register arithmetic operations. The auxiliary registers can be:

  • Loaded via an immediate value.
  • Loaded via the data bus by writing to the memory-mapped auxiliary registers.
  • Modified by the indirect addressing field of any instruction that supports indirect addressing.
  • Modified by the modify auxiliary register instruction.
  • Used as loop counters.

See TMS320C54x indirect addressing block diagram for dual data-memory operands. shows the ARAUs used to generate an address in the indirect addressing mode using a single data-memory operand. As the figure shows, the main components used for address generation in indirect addressing are the auxiliary register arithmetic units (ARAU0 and ARAU1) and the auxiliary registers (AR0-AR7).

Dual data-memory operand addressing is used for instructions that perform two reads, or a single read and a parallel store, at the same time. All of these instructions operate in indirect addressing mode only. If the source operand and the destination operand of an instruction with a parallel store point to the same location, the source is read before the destination is written.

The Memory-Mapped Register Data Addressing

Memory-mapped register addressing is used to modify the memory-mapped registers without affecting either the current data-page pointer (DP) value or the current stack-pointer (SP) value. Because DP and SP do not need to be modified in this mode, the overhead for writing to a register is minimal. Memory-mapped register addressing works for both direct and indirect addressing. Addresses are generated by:

  • Forcing the nine most significant bits (MSBs) of data-memory address to 0, regardless of the current value of DP or SP when direct addressing is used.
  • Using the seven LSBs of the current auxiliary register value when indirect addressing is used.
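
Both rules amount to keeping only the seven least significant address bits. A minimal Python sketch, with hypothetical function names, makes this explicit:

```python
def mmr_address_direct(dma: int) -> int:
    """Direct form: the nine MSBs are forced to 0, DP and SP are ignored."""
    return dma & 0x7F


def mmr_address_indirect(ar: int) -> int:
    """Indirect form: only the seven LSBs of the auxiliary register are used."""
    return ar & 0x7F
```

In both cases the resulting address lies in the first 128 words of the data space, independent of the current DP or SP value.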

The Adopt Project

Adopt is a high-level address optimization and hardware realization environment See [MCJ97]. Adopt aims to minimize the cycle, area and power overhead typically present for address generation in Data-Transfer Intensive (DTI) applications. These applications often make use of embedded distributed memories to cope with the increasing bandwidth requirements. Different system-level optimizing alternatives suitable for custom processors are explored to reduce the cost overhead typically present in partitioned architectures. They include target architecture selection, address expression splitting/clustering and global algebraic optimizations, amongst others.

The Target Architecture Styles

An alternative to a fully programmable memory management unit (MMU) architecture is a dedicated architecture that benefits from customized implementation possibilities. To alleviate the interconnection overhead in terms of delay, power and area associated with the routing between the underlying (hierarchical) memory architecture and the MMU, a distributed address unit architecture is then selected.

The customized MMU architecture (cMMU) is based on many small application-specific Address Calculation Units (ACUs) which are controlled by a hierarchical (master/slave, global/local) hardwired controller. In the most straightforward and usually inefficient case, this leads to a direct mapping of each individual array index reference onto dedicated hardware, e.g. See [LMW91]. This strategy can also lead to a large cost overhead if not executed well, although in this case the overhead only depends on the overall contribution of the local cost in the address generators and not on the program ROM and associated decoders.

For some applications it is still desirable to provide some (re)programming capabilities. In this case, a combination of programmable and customized ACUs is required for flexibility.

For indexed addressing, a fixed storage ordering is selected when multidimensional signals are presented. Typically this order is first optimized for size reduction by so called in-place mapping See [GCM97]. Then, an address expression (AE) for every manifest index expression of a signal located in a linearly addressed RAM is generated.

For custom processors, two clearly differing target architecture styles can be identified: incremental address generation unit (iAGU) and custom address calculation unit (cACU) (see See Adopt target architecture styles for custom processors: (a) example application with index function, (b) custom address calculation unit (cACU), and (c) incremental address generation unit (iAGU).).

In the iAGU case, Address Sequences (ASs) are generated and realized as a counter modified by a two- or multi-level logic filter. This style requires an explicit expansion of the AEs, so it can be applied only to manifest (compile-time known) accesses. Counter based architectures require two control lines to obtain the next memory address, and to set the addressing unit to an initial state. These lines are generated from the global/master cMMU controller.

In the cACU case, the AEs are realized as Application-Specific Units (ASUs) with custom arithmetic building blocks selected from a library. Use is made of a subset of the Cathedral-3 methodology for lowly multiplexed custom data-path synthesis See [NGC91]. This architecture style has been first proposed in an environment as an alternative to the iAGU style for the synthesis of address generators See [MCM94] See [MCJ96]. For cACUs, the complete set of iterator states must be supplied by a local controller. Data-dependent values can also be fed into the inputs of the ASU, so both run-time and manifest data dependencies can be supported.
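
The difference between the two styles can be made concrete with a small Python model. It is a sketch only: the example access (a column-wise scan of a 3 x 4 array stored row by row), the class `IncrementalAGU`, and the function `cacu_address` are hypothetical and do not stem from the Adopt publications. The cACU evaluates the address expression from iterator values, while the iAGU replays the expanded sequence from a counter followed by a table that stands in for the logic filter.

```python
ROWS, COLS = 3, 4


def cacu_address(i: int, j: int, base: int = 0) -> int:
    """cACU style: compute the AE from iterators supplied by a controller."""
    return base + COLS * j + i


# Compile-time expansion of the AE (possible only for manifest accesses).
expanded = [cacu_address(i, j) for i in range(COLS) for j in range(ROWS)]


class IncrementalAGU:
    """iAGU style: a counter whose state is mapped to an address."""

    def __init__(self, sequence):
        self._table = list(sequence)  # stands in for the logic filter
        self._count = 0

    def reset(self) -> None:
        """Control line 1: return to the initial state."""
        self._count = 0

    def next(self) -> int:
        """Control line 2: deliver the next memory address."""
        address = self._table[self._count]
        self._count = (self._count + 1) % len(self._table)
        return address
```

Both units deliver the sequence 0, 4, 8, 1, 5, 9, and so on, but only the cACU could accept an iterator value that becomes known at run-time.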

Selection of Incremental Address Generation Units

For custom targets, the least expensive solution in area and power can be to map the AS onto an iAGU rather than a cACU. This decision depends heavily on regularity properties of the target AS See [MKC97], and these properties can very accurately predict the need for extra logic between the counter and the address port bits.

Selection of Custom Address Calculation Units

Many other ASs with irregular access patterns are much more area hungry, depending heavily on the sequence length. In these cases, the same AE can be implemented efficiently using a cACU, thus saving up to one order of magnitude in combinatorial area.

The same trade-off holds for power as for area, since a regular access pattern implemented using a counter has potentially much less toggle activity than if implemented using a cACU. The opposite is true for an irregular pattern.

Mapping of Custom Address Calculation Units

The methodology used for the mapping is not based on performing scheduling first, followed by the traditional functional-unit, mux and register allocation See [FPC90]. In See [MCJ96] it has been shown that it is better to first identify similar clusters of expressions (arithmetic of addresses and conditions) in the algorithm. The goal of this phase is to obtain an optimal partitioning of the cACU in ASUs. After cACU partitioning, further regularity improvement between the shared clusters See [JCM96], aiming to decrease the multiplexing overhead, is possible by applying local scope algebraic transformations. After that, the cACU is mapped by a commercial behavioral synthesis tool.

Synthesis of Incremental Address Generation Units

For iAGUs, AE multiplexing possibilities are also provided, but at the sequence level, by expanding the AE at compile time. In this way an interleaved (time-shared) AS version of the multiplexed AEs can be obtained.

Again, system level similarities (this time over the ASs) are needed to select the best sharing possibilities. Heuristic measures based on first and second order differences between pairs of AS values are sufficient See [MCM94]. The result is a partitioned iAGU architecture in terms of counters and look-up tables.
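
What first and second order differences reveal about an address sequence can be shown with a short Python sketch. It only illustrates the quantities involved; the function names are hypothetical and the actual heuristic of [MCM94] is not reproduced here.

```python
def differences(sequence):
    """Differences between each pair of neighbouring values."""
    return [b - a for a, b in zip(sequence, sequence[1:])]


def difference_profile(sequence):
    """Return the first and second order differences of an AS."""
    first = differences(sequence)
    second = differences(first)
    return first, second
```

A linear AS such as 0, 4, 8, 12 has constant first order and zero second order differences. Under the reading assumed in this sketch, two ASs with identical first order differences differ only by an offset and are therefore natural candidates for sharing a counter.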

After the counter sequence sharing decisions, the characteristics of the counter (i.e., counter modulo value, reset and increment state values, etc.) must still be defined for an optimal synthesis of the iAGU unit. For this purpose, the analysis of the (pseudo) periodicity properties of the target AS See [MKC97] See [MCJ97] is exploited. Also, the optimal assignment of counter to address port-bit, detected during the architecture exploration phase, is finally incorporated See [GD91].

Once the exact definition of the different iAGUs has been obtained, RT-level merging possibilities amongst the resulting iAGUs can still be explored. Again, similarities across shared ASs are used to decide on the best candidates for the merging of counters and/or look-up tables See [MKC97]. However, as opposed to the sharing step, the search space is now restricted to ASs with common state transitions (same loop scope) to reuse the counters. Performing the final bit-level optimization of the merged look-up tables is left to subsequent logic synthesis.

The Intersil HSP45240 Address Sequencer

The Intersil HSP45240 address sequencer See [Int97] See [Int98a] is a commercial stand-alone address generator device, which provides specialized addressing for functions like FFTs, 1-D and 2-D filtering, matrix operations, and image manipulation. The sequencer supports block oriented addressing of large data sets up to 24 bits at clock speeds up to 50 MHz.

Functional Description

The Address Sequencer is a 24-bit programmable address generator. As shown in See Intersil HSP45240 address sequencer block diagram., the sequencer consists of 4 functional blocks: the start circuitry, the sequence generator, the crosspoint switch, and the processor interface. The addresses produced by the sequence generator are input into the crosspoint switch. The crosspoint switch maps 24 bits of address input to a 24-bit output. This allows for addressing schemes like "bit-reverse" addressing for FFTs. A programmable delay block is provided to allow the MSW (most significant word) of the output to be skewed from the LSW (least significant word). This feature may be used to compensate for processor pipeline delay when the sequence generator is configured as two independent 12-bit sequencers. Address Sequencer operation is controlled by values loaded into configuration registers associated with the sequence generator, crosspoint switch, and start circuitry. The configuration registers are loaded through the processor interface.

The Start Circuitry

The Start Circuitry generates the internal START signal which causes the Sequence Generator to initiate an addressing sequence. The START signal is produced by writing the Processor Interface's "Sequencer Start" address (see See [Int97] See [Int98a]), by asserting the STARTIN input, or by the terminal address of a sequence generated under "One-Shot Mode with Restart" (see See The Sequence Generator). A programmable delay from 1 to 31 clocks is provided to delay the initiation of an addressing sequence by delaying the internal START signal.

The Start Circuitry generates the output signal ADDVAL which is asserted when the first valid output address is at the pads. In addition, the Start Circuitry generates the "STARTOUT" signal for multichip synchronization. STARTOUT is only generated when an addressing sequence is started by writing the "Sequencer Start" address of the Processor Interface, or an internal START is generated by reaching the end of an addressing sequence produced by "One-Shot Mode with Restart".

The Sequence Generator

The Sequence Generator is a block oriented address generator. This means that the desired address sequence is subdivided into one or more address blocks, each containing a user-defined number of addresses. User-supplied configuration data determines the number of address blocks and the characteristics of the address sequence to be generated.

As shown in See Intersil HSP45240 sequence generator block., the Sequence Generator is subdivided into the address generation and control sections. The address generation section performs an accumulation based on the output of MUX1 and MUX2. The control section governs the operation of the multiplexers, enables loading of the Block Start Address register, and signals completion of an address sequence.

An address sequence is started when the control section of the Sequence Generator receives the internal START signal from the Start Circuitry. When the START signal is received, the control section multiplexes the contents of the Start Address Register and a "0" to the adder. The result of this summation is the first address in the first block of the address sequence. This value is stored in the Block Start Address register by an enable generated from the control section, and the multiplexers are switched to feed the output of the Holding and Address Increment registers to the adder. Address generation will continue with the Address Increment added to the contents of the Holding Register until the first address block has been completed.

An address block is completed when the number of addresses generated since the beginning of the address block equals the value stored in the Block Size register. When the last address of the block is generated, BLOCKDONE is asserted to signal the end of the address block (see See [Int98b]). On the following CLK, the multiplexers are configured to pass the contents of the Block Start Address and Block Increment registers to the adder which generates the first address of the next address block. An enable from the control section allows this value to update the Block Start Address register, and the multiplexers are switched to feed the Holding and Address Increment registers to the adder for generation of the remaining addresses in the block.

The address sequence is completed when the number of address blocks generated equals the value loaded into the Number of Blocks register. When the final address in the last address block has been generated, DONE and BLOCKDONE are asserted to signal the completion of the address sequence.

The parameters governing address generation are loaded into five 24-bit configuration registers via the Processor Interface. These parameters include the Start Address, the beginning address of the sequence; the Block Size, the number of addresses in the address block; the Address Increment, the increment between addresses in a block; the Number of Blocks, the number of address blocks in a sequence (minimum 1); the Block Increment, the increment between starting addresses of each block. The loading and structure of these registers is detailed in See [Int97] or See [Int98a].
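
The interplay of the five parameters can be summarized by a behavioral Python sketch of one address sequence. It models only the arithmetic described above, not the timing or the DONE and BLOCKDONE signals; the function name `address_sequence` and the wrap-around at 24 bits are assumptions of this sketch.

```python
MASK24 = (1 << 24) - 1


def address_sequence(start, block_size, addr_incr, num_blocks, block_incr):
    """Yield the addresses of one sequence, block by block."""
    block_start = start & MASK24  # Start Address + 0
    for _ in range(num_blocks):
        address = block_start
        for _ in range(block_size):
            yield address
            # Holding register + Address Increment
            address = (address + addr_incr) & MASK24
        # Block Start Address + Block Increment
        block_start = (block_start + block_incr) & MASK24
```

As an example, a 4 x 4 array stored row by row is scanned column by column with `address_sequence(0, 4, 4, 4, 1)`, which starts with the addresses 0, 4, 8, 12, 1, 5.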

Three modes of operation may be selected by loading a 6-bit Mode Control register. The three modes of operation are:

  • One-Shot Mode without Restart: Address generation halts after completion of the user specified address sequence. Address generation will not resume until the internal START signal is generated by the Start Circuitry. When the final address in the final block of the address sequence is generated, both DONE and BLOCKDONE are asserted and the last address is held on OUT0-23 (see See [Int98b]).
  • One-Shot Mode with Restart: This mode is identical to One-Shot Mode without Restart with the exception that the Start Circuitry automatically generates an internal START at the end of the user specified sequence to restart address generation. The end of the address sequence is signaled by the assertion of DONE, BLOCKDONE, and STARTOUT as shown in See [Int98b]. In this mode, the first address of the next sequence immediately follows the last address of the current sequence if start delay is disabled.
  • Continuous Mode: Address generation never terminates. Address generation proceeds based on the Start Address, Address Increment, Block Size, and Block Increment Parameters. The Number of Blocks parameter is ignored, and the DONE signal is never asserted.

The Mode Control register is also used to configure the Sequence Generator for operation as two independent 12-bit address sequencers. In dual sequencer mode, the adder in the sequence generator suppresses the carry from the 12 LSBs (least significant bits) to the 12 MSBs (most significant bits). With the carry suppressed, two independent sequences may be produced. These 12-bit address sequences may be delayed relative to each other by programming the Mode Control register for a delay up to 7 clocks. This feature is useful to compensate for pipeline delay when using dual sequencer mode to generate read/write addressing. The DLYBLK input can be used to halt address generation at the end of any address block within a sequence. In addition, DLYBLK can be used to delay an address sequence from restarting if asserted at the end of the final address block generated under "One-Shot Mode with Restart". See See [Int98b] for the timing relationship of DLYBLK to the end of the address block required to halt address sequencing.
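
The effect of the suppressed carry can be seen in a small Python model of the 24-bit adder. The function name `add24` is hypothetical, and the relative delay between the two halves is not modelled.

```python
def add24(a: int, b: int, dual: bool = False) -> int:
    """24-bit addition, optionally split into two independent 12-bit halves."""
    if not dual:
        return (a + b) & 0xFFFFFF
    # Dual sequencer mode: the carry out of the 12 LSBs is dropped,
    # so it never reaches the 12 MSBs.
    low = ((a & 0xFFF) + (b & 0xFFF)) & 0xFFF
    high = ((a >> 12) + (b >> 12)) & 0xFFF
    return (high << 12) | low
```

Adding 1 to 0x000FFF yields 0x001000 in normal operation, but 0x000000 in dual sequencer mode, because the lower sequence wraps around without disturbing the upper one.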

The Crosspoint Switch

The crosspoint switch is responsible for reordering the address bits output by the sequence generator. The switch allows any of its 24 inputs to be independently connected to any of its 24 outputs. Each switch output can be driven by only one input; however, one input can drive any number of switch outputs. If none of the inputs is mapped to a particular output bit, that output will be "low".

The input to output map is configured through the processor interface. The I/O map is stored in a bank of 24 configuration registers. Each register corresponds to one output bit. The output bit is mapped to the input via a value, 0 to 23, stored in the register. After power-up, the user has the option of configuring the switch in 1:1 mode by using the reset input, "RST". In 1:1 mode the crosspoint switch outputs are in the same order as the input. More details on configuring the switch registers are contained in See [Int97].
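
The mapping performed by the switch can be modelled in Python as a list with one entry per output bit. This is a sketch only: the function name `crosspoint` and the use of `None` for an output without a mapped input are assumptions, since the register encoding of an unmapped output is not described here.

```python
def crosspoint(address: int, output_map) -> int:
    """output_map[k] names the input bit driving output bit k, or None."""
    result = 0
    for out_bit, in_bit in enumerate(output_map):
        # An unmapped output stays low.
        if in_bit is not None and (address >> in_bit) & 1:
            result |= 1 << out_bit
    return result


# 1:1 mode: every output is driven by the input of the same position.
identity = list(range(24))

# Bit-reverse the three LSBs (8-point FFT), all other outputs unmapped.
bit_reverse_8 = [2, 1, 0] + [None] * 21
```

Feeding the linear sequence 0 to 7 through `bit_reverse_8` yields 0, 4, 2, 6, 1, 5, 3, 7, so a plain incrementing sequence generator suffices for bit-reversed FFT addressing.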

Conclusions

In this chapter several address generators have been presented (see See The presented address generators at a glance.). Many commercial examples demonstrate that address generation is an important topic not only in academic research. The development of the Map-oriented Machines 1, 2, and 3 and their theoretical background is partially continued in this thesis. These machines can be seen as the preceding architectures of the Map-oriented Machine with Parallel Data Access presented below (see See The Map-oriented Machine with Parallel Data Access).

  1. The presented address generators at a glance.

| Approach | Source of Information | Address Generator Type | Programmed by | Flexibility | Access Optimization |
|---|---|---|---|---|---|
| SMA machine | See [PD83] | software based sequencing, similar to a microprogrammed approach | instruction stream at run-time | flexible | - |
| MoM-1 | See [Hir85] | application specific address generator | few parameters at load-time | limited repertory | - |
| Synthesis Method by Grant et al. | See [GDF89] | synthesis method for application specific address generators | not programmed, reconfigured at load-time | single access pattern | - |
| MoM-2 | See [Hir91] | address generator | few parameters at load-time | flexible | Scan cache |
| VSP | See [KOD91] | DSP, software based sequencing with multiple address generators | instruction stream at run-time | flexible | concurrent memory access |
| MoM-3 | See [Rei99] | address generator | few parameters at load-time | flexible | Scan cache |
| TMS320C54x | See [TI97] | DSP, software based sequencing with hardware support | instruction stream at run-time | flexible | concurrent memory access |
| Adopt | See [MCJ97] | synthesis method for application specific address generators | not programmed, reconfigured at load-time | single access pattern | - |
| HSP45240 | See [Int97] | address generator device | few parameters at load-time | limited repertory | - |

Hardware support for address generation may be quite effective in exploiting the parallelism of sequencing operations, in reducing the number of memory cycles needed for address generation, and in reducing the configuration code size and thus the reconfiguration time. In the following, the memory bandwidth aspects of the presented address generators are discussed.

The address generator synthesis methods presented in this chapter focus mainly on power and area optimization. Accordingly, such methods are primarily used to reduce the design time of application specific data sequencers. Here the optimization of memory accesses in terms of memory bandwidth is of secondary importance.

Parallelism

All data sequencers presented are designed to operate more or less in parallel to the data manipulations to be performed by an ALU. This is a basic requirement to avoid delays in data manipulations caused by a held-up data supply. As already mentioned, examples have been published See [PAC97] where a state-of-the-art microprocessor stays idle for 75% of the computation time because of outstanding memory requests. If the machine architecture supports concurrent data streams like the VSP (See The Video Signal Processor) and the TMS320C54x (See The Texas Instruments TMS320C54x DSP), the data throughput can be increased significantly. The address generator of the MoM-3 (See The Address Generator of the Map-oriented Machine 3) also provides multiple generic address generators operating in parallel, but the single MoMbus does not support concurrent data accesses.

Reduction of the Number of Memory Cycles by Reducing Computational Overhead

The number of memory cycles may be reduced by a software to hardware migration of address computation. Software level address generation additionally burdens the memory interface with accesses for instructions, constants and variables to perform the address computation. A hardware level address generator would avoid such memory cycles completely. But since all presented address generators are optimized to reduce the latency of the address generation for a specific application domain, they are not general purpose. The MoM-1 (See The Address Generator of the Map-oriented Machine 1) and the VSP (See The Video Signal Processor) support only video scans. The SMA machine (See The Structured Memory Access Machine), the MoM-2 (See The Address Generator of the Map-oriented Machine 2), the MoM-3 (See The Address Generator of the Map-oriented Machine 3), and the HSP45240 (See The Intersil HSP45240 Address Sequencer) use a backup controller or microprocessor to perform address computations and/or to perform frequent reprogramming for complex address sequences. The TMS320C54x (See The Texas Instruments TMS320C54x DSP) DSP uses its ALU for complex address computations. But powerful address generators should avoid switching the address generation to the ALU even for complex address sequences. Therefore, to set up a high performance address generator that is also flexible, a novel concept is needed.

The number of memory cycles may also be reduced by identifying and eliminating multiple accesses to the same data:

Avoiding Multiple Accesses to the Same Data

Specialized data sequencers also have the potential to accelerate memory accesses by access optimizations. The MoM-2 and MoM-3 data sequencers (See The Address Generator of the Map-oriented Machine 2 and See The Address Generator of the Map-oriented Machine 3) perform window based memory accesses, which can be seen as a cache-like optimization. Data accesses are assigned to computation steps. Algorithms that access the same data in successive computation steps are optimized in such a way that this data is read only once and stored in a smart interface. This optimization works only for local data.
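
The saving can be estimated with a strongly simplified Python model: a one-dimensional window that slides over a linear memory by one word per computation step. The function `count_reads` and the use of a bounded buffer as a stand-in for the smart interface are assumptions of this sketch and do not describe the actual MoM hardware, which operates on two-dimensional windows.

```python
from collections import deque


def count_reads(length: int, window: int, reuse: bool) -> int:
    """Count memory reads for a window sliding over `length` words."""
    buffer = deque(maxlen=window)  # stands in for the smart interface
    reads = 0
    for start in range(length - window + 1):
        for address in range(start, start + window):
            if reuse and address in buffer:
                continue  # word is already held locally, no memory cycle
            reads += 1
            if reuse:
                buffer.append(address)
    return reads
```

For 16 words and a window of 3 words, 42 reads are needed without reuse, but only 16 with reuse, since every word is fetched exactly once.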

Other address generators (MoM-1 and HSP45240) do not support any access optimization. The microprocessor examples (SMA machine, VSP, and TMS320C54x) may perform the typical access optimizations known from other microprocessors (e.g. hierarchic memory).

Reduced Configuration Code Size

The generic approach of most of the presented address generators also reduces the reconfiguration load. The presented hardware examples are configured by a few parameters, and addresses are generated according to a generic model.

Reconfigurable Systems using Data Sequencing

This chapter introduces reconfigurable computing systems under consideration of the reconfigurable resource (See Coarse-grained Configurable Architectures), the memory technology (See Memory Technologies), and the sequencing mechanism (See Earlier Address Generators). With respect to See A Novel Data Sequencing Concept and See Data Sequencer use for Higher Memory Bandwidth it deals with memory architectures of reconfigurable systems and the role of the sequencer.

As mentioned in See Conclusions the use of reconfigurable systems reduces the number of memory cycles already by the characteristics of the computing principle. As illustrated in See The inherent advantage of reconfigurable systems in connection with the memory interface load: the time of instruction fetch (graphic taken from [Har00]). (see also See [Har00]) the time of instruction fetch moves for reconfigurable systems from the run-time to the load-time. Reconfigurable systems are programmed before the application is executed. During run-time there is no instruction fetch. While microprocessors handle branching by conditional jumps, reconfigurable systems use multiplexers for branching. Therefore the instruction fetch of reconfigurable systems does not burden the memory interface during run-time.

Reconfigurable systems are classified here according to their data access mechanisms (see See Classification of configurable computers by data sequencing mechanism.). The two major classes are architectures operating on data streams and machines with random memory access. In stream oriented architectures data is received token by token from an external data source. The sequence of data is already fixed. Since in such architectures the data sequencing problem does not occur locally, they are not treated here. Some examples can be found in See [HHV97], See [HR98], and See [KD97].

In reconfigurable systems fetching data from RAM or register files, either a classical sequencer or an address generator is needed:

Software Based Sequencer (SBS)

An SBS performs address calculations based on instructions to be fetched during run-time. It relies on an additional resource (e.g. the ALU of a microprocessor) for address computations.

Therefore the SBS is a processor programmed by software (or firmware) and fetches instructions during runtime. As already mentioned above, an SBS has an inherent address computation overhead (e.g. see See Conclusions). In reconfigurable systems the SBS is usually a separate resource not mapped onto reconfigurable devices. Since reconfigurable systems using an SBS embody a rather traditional method to generate memory accesses, they do not alleviate the memory bandwidth problem. Nevertheless, two approaches using an SBS will be presented in the following because of their memory architecture (PRISM-II, See The PRISM-II System, and Riley-2, See Riley-2).

Address Generator (AG)

The memory access sequence of an AG is fixed before runtime. An AG generates addresses by itself without acquiring other resources.

Therefore the AG includes a dedicated datapath for address generation. The access pattern of an AG may be fixed, configured, or programmed.

While fixed and configured AGs provide only a single address pattern or a limited repertory of addressing patterns, programmed AGs may be quite flexible. Examples are the flexible AGs used in the MoM-2 (See The Map-oriented Machine 2) and MoM-3 (See The Map-oriented Machine 3). Furthermore, both approaches achieve universality by relying on a microprocessor (SBS). But the term "universal" seems too ambitious in this context, since reconfigurable accelerators are commonly dedicated to a specific application domain.

The PRISM-II System

See The PRISM-II hardware platform. gives an overview of the PRISM-II system See [AWG94]. A detailed description of the hardware can be found in See [AW93]. PRISM is a reconfigurable computer architecture that relies on a microprocessor for SBS. Two parallel banks of DRAM are used to store data and instructions for the microprocessor. Furthermore, a burst-mode memory controller performs the memory accesses based on the addresses generated by the microprocessor. While most configurable computers utilize SRAM, PRISM-II uses DRAM. The benefit of DRAM is that it is cheap and available in suitable size and capacity. Furthermore, burst mode memory accesses (see See Current Memory Devices Suitable to Form Large Data Memories) accelerate accesses to successive memory locations.

Riley-2

Riley-2 is also a reconfigurable system based on SBS. It has been developed as a dynamically reconfigurable platform for co-design research See [MCL97]. Riley-2 is a successor to an earlier system, known as Riley See [KMO97]. It is designed to be used in a PC with a PCI interface. See The Riley-2 architecture. shows a block diagram of Riley-2. The design has two main buses, the microprocessor's local bus and the reconfigurable resource bus.

The local bus connects the microprocessor, the shared memory, the reconfigurable resource interface, and the PCI interface. The Intel i960JF microprocessor See [Int99] is an integer RISC core used in many embedded system designs. The shared memory is implemented with a 16M byte Burst Enhanced Data Out (BEDO) DRAM See [Mic95a] See [Mic95b], allowing burst memory accesses (see See Current Memory Devices Suitable to Form Large Data Memories) to accelerate accesses to successive memory locations.

The reconfigurable resource bus connects four reconfigurable resource units, each containing a Xilinx 6216 FPGA See [Xil96] and a 512K byte fast local memory with a 32 bit data bus. During computations the local memory of the reconfigurable resource units is used to store intermediate results. Data sequencing is performed by the microprocessor, addressing the data in the shared memory.

An EPLD Based Transient Recorder for Video Signals

In See [Lar96] an EPLD based low cost transient recorder for video signals is presented. It is used to record and replay video signals. These signals are handled by a video processing unit, constructed in a VHDL environment (references on VHDL: See [Gho99], See [HJ96], See [LSU92]). See EPLD based transient recorder for video signals by L. Larson: (a) overview, and (b) controller EPLD. gives an overview of the overall system architecture. It is an example of a reconfigurable system with a fixed AG.

The transient recorder consists of an 8 bit Flash Analog Digital Converter (FADC), a Video Digital Analog Converter (VDAC), DRAM, and two EPLDs. The two EPLDs implement the whole digital processing of the transient recorder. One of these EPLDs (shifter) implements a 4 x 8 bit wide video shift register and an additional 4 byte data buffer, which together form the datapath of the design. The other EPLD (controller) performs DRAM control, interfacing with the printer port, and generation of the signals that control the data flow. See EPLD based transient recorder for video signals by L. Larson: (a) overview, and (b) controller EPLD. pictures the internal structure of the controller EPLD. Besides a finite state machine (cntlfsm) it integrates an application specific address counter (addcnt) and a DRAM control unit (dramcntl).

The RaPiD Data Sequencing Method

The RaPiD architecture has been described in See The RaPiD Architecture. The basic cell (see See Basic cell of RaPiD-1.) also integrates a memory block to save data needed over several cycles. The memory address is supplied either by a data bus or by a local address generator See [ECF96a]. See RaPiD local memory with address generator. shows the structure of the fixed AG, which is connected to the embedded memory cells. The AG supports only simple linear address sequences.
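A linear address sequence of the kind mentioned above can be described by a start address, a constant stride, and a length. The following is a minimal sketch of such a sequence in software; the names base, stride and count are hypothetical and do not describe the actual parameters or hardware of the RaPiD address generator.

```python
from typing import Iterator


def linear_addresses(base: int, stride: int, count: int) -> Iterator[int]:
    """Yield count addresses, starting at base and advancing by stride."""
    for i in range(count):
        yield base + i * stride


# Example: four addresses starting at 0 with a stride of 2.
addresses = list(linear_addresses(base=0, stride=2, count=4))
```

Any access pattern that cannot be expressed by a single constant stride, such as a zig-zag or a data dependent pattern, is outside of what such a generator can produce.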

The PAR-1

The PAR-1 See [CMF95] is an FPGA-based coprocessor that integrates a configurable AG. It is designed to be connected to a host computer. See The PAR-1: (a) machine overview, (b) sequencing and interface module (SIM), and (c) datapath module (DM) including local memory. shows the basic machine overview. The PAR-1 system consists of Sequencing and Interface Modules (SIM) and Datapath Modules (DM). A combination of one or several modules of both types is connected to a host computer.

Each Datapath Module (DM) is composed of four FPGAs with a total of 512k bytes of memory and a field programmable interconnect component (FPIC, See [OD95]). Each FPGA is connected independently to one memory bank of 128k bytes, which can be addressed by the SIM. See The PAR-1: (a) machine overview, (b) sequencing and interface module (SIM), and (c) datapath module (DM) including local memory. illustrates the DM with its components.

See The PAR-1: (a) machine overview, (b) sequencing and interface module (SIM), and (c) datapath module (DM) including local memory. pictures the Sequencing and Interface Module (SIM). The SIM is composed of two FPGAs, one devoted to data sequencing and the other for interface and main control purposes (e.g. host interface, load configuration, readback of data). The design of the address generator FPGA is application specific and has to be reconfigured for each application. No systematic address generator synthesis is performed. This results in unstructured implementations, which have to be placed and routed individually for each application.

The REACT System

REACT is a reconfigurable system with a configured AG. See The REACT hardware architecture. pictures the hardware architecture presented in See [BKS98]. It consists of two Xilinx XC6264 FPGAs See [Xil96], each with local memory. REACT is connected to a host computer via a PCI bus. For this purpose a PCI interface is implemented with a Xilinx XC4020E See [Xil99]. If REACT is not under control of the host computer, the local memories are addressed by the XC6264 devices. For this, an application specific memory controller and address generator is synthesized by the REACT software environment (see also See [BKS98]). Here reconfigurability is used as an enabler for resource optimization.

The CHAMP Architecture

CHAMP (Configurable Hardware Algorithm Mappable Preprocessor) is a reconfigurable system using a programmable AG. See The CHAMP timing and control unit. shows the timing and control unit of the CHAMP architecture presented in See [Box94]. The CHAMP technology has been developed by Wright Labs AAAT and Lockheed Sanders for preprocessing of images for infrared missile warning (IRMW).

CHAMP consists of eight CHAMP processing elements (PE), each of which consists of two FPGAs and a 16k by 32 bit dual port memory. The processing elements are connected by both a ring network and a global crossbar network. Data sequencing is performed by the timing and control unit. It uses 16 programmable address generators to address the dual port memories of the CHAMP PEs. Due to the parallel address generators and datapaths, up to 16 concurrent data accesses are possible. CHAMP performs 1 BOPS and is capable of performing spatial filtering in real time.

Because little information is available on the implementation of the programmable address generators, no further conclusions are possible.

The Map-oriented Machine 1

The Map-oriented Machine 1 (MoM-1) was initially called PISA machine (pixel-oriented system for image analysis) See [HHH84a] See [HHH84b]. It has been designed to perform operations on pixel data. Operations are performed on a window cache, which holds the part of the memory needed for current computations. The window position is controlled by the Move Control Unit (MCU, see See The Address Generator of the Map-oriented Machine 1). The MCU is a programmable AG, which supports a limited number of access patterns. The entire cache is connected in parallel to the Problem-Oriented Logic Unit (POLU), the hardware implementation of the function set required to compute the current application. The POLU may be a combinatorial or sequential switching network. See Block diagram of the Map-oriented Machine 1. gives an overview of the MoM-1 system.

The Map-oriented Machine 2

The Map-oriented Machine 2 (MoM-2) architecture is described in detail in See [Rei90] or See [Hir91]. As illustrated in See Block diagram of the Map-oriented Machine 2. the MoM-2 utilizes a flexible programmable data sequencing mechanism (Task Manager and Move Manager described in See The Address Generator of the Map-oriented Machine 2) to provide computation data to the reconfigurable ALU (rALU).

The MoM-2 is integrated in a VME See [Pet89] host computer, where both the host computer and the accelerator share the same bus. Like the MoM-1, the MoM-2 utilizes a Data Scan Cache, which can be accessed in parallel by the rALU. The data sequencing between the Memory Board and the Data Scan Cache is performed by the JumpGenerator and the Address Generator (Single Step Control Unit) under control of the Task Manager.

The Map-oriented Machine 3

The Map-oriented Machine 3 (MoM-3) is a reconfigurable accelerator, which uses the KressArray-1 (See [HKR94] See [Kre96]) as a reconfigurable ALU (rALU, See [HRW92]). Its main control mechanism is a microprocessor based instruction sequencer, while data sequencing is performed by flexible programmable generic address generators (GAG, See [HR95], see also See The Address Generator of the Map-oriented Machine 3). See The MoM-3 machine architecture. shows all components and the interface to a host computer.

The MoM-3 is embedded as a general-purpose co-processor in a VMEbus See [Pet89] based workstation (see See The MoM-3 machine architecture.). After setup, the MoM-3 runs independently from the host computer until the complete application is processed. Setup in this case means that the host software has to load the application data into the MoM-3 data memory, load the GAG parameter sets, the rALU configuration code, and the program for the instruction sequencer into the MoM-3 control memory, and initiate execution.

The MoM-3 consists of four major parts (see See The MoM-3 machine architecture.):

In contrast to the one-dimensional von Neumann memory space, the MoM-3 data memory is primarily organized to be two-dimensional by splitting the memory address into an x- and a y-part, like coordinates in a two-dimensional map. It holds the data of the user's programs. The data is distributed in a regular fashion over the data memory according to a mapping scheme (data map). Each scan window of the rALU serves as a window to the data memory and is the processor-to-memory interface of the MoM-3. A scan window holds data words from a local neighborhood as a copy out of data memory. It efficiently supports the exploitation of parallelism within an algorithm. Scan windows are adjustable in size during run time. The data sequencer hardware provides accessing sequences for a controlled scan window movement over the memory space (See Different scan window sizes and an address sequence for a linear scan pattern). Thus the data sequencer represents the main control part of the MoM-3. A generic address generator is able to compute a long sequence of addresses, so-called basic scan patterns, for the data in the data map from a relatively small parameter set. A selection of supported scan patterns is given in See MoM-3 Scan Pattern Examples: (a) Single Steps, (b) Linear Scans, (c) Video Scans, (d) Zig-zag Scan, (e) Spiral Scan, (f) Curve Following (Data Dependent), and (g) Lee Routing (Data Dependent).

All data manipulations are done by the reconfigurable ALU applied to the data in the scan windows. For the MoM-3 a special reconfigurable datapath architecture (KressArray-1, See [HKR94] See [Kre96]) supporting word level has been developed for the evaluation of any arithmetic and logic expression. Each basic cell of the KressArray-1 serves as an operator and is configured with a fixed ALU and a microprogrammed control.
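The two-dimensional memory organization and the generation of a scan pattern from a small parameter set, as described above for the MoM-3 data memory and data sequencer, can be sketched in software as follows. This is only an illustration under stated assumptions: the address width X_BITS, the placement of the y-part in the upper address bits, and all function and parameter names are hypothetical and are not taken from the GAG hardware. Only a video scan is shown; the other scan patterns are not modelled.

```python
from typing import Iterator, List, Tuple

X_BITS = 8  # assumed width of the x-part of an address (illustration only)


def join_address(x: int, y: int) -> int:
    """Pack map coordinates (x, y) into one memory address (y in the upper bits)."""
    return (y << X_BITS) | x


def split_address(addr: int) -> Tuple[int, int]:
    """Recover the (x, y) coordinates from a packed memory address."""
    return addr & ((1 << X_BITS) - 1), addr >> X_BITS


def video_scan(x0: int, y0: int, steps_x: int, steps_y: int,
               dx: int = 1, dy: int = 1) -> Iterator[Tuple[int, int]]:
    """Yield scan window positions line by line from a small parameter set."""
    for row in range(steps_y):
        for col in range(steps_x):
            yield x0 + col * dx, y0 + row * dy


def window_addresses(x: int, y: int, width: int, height: int) -> List[int]:
    """Addresses of all data words covered by a scan window at position (x, y)."""
    return [join_address(x + i, y + j)
            for j in range(height) for i in range(width)]


# Move a 2 x 2 scan window over a 3 x 2 grid of positions.
positions = list(video_scan(x0=0, y0=0, steps_x=3, steps_y=2))
first_window = window_addresses(*positions[0], width=2, height=2)
```

The six parameters of video_scan stand for the relatively small parameter set mentioned in the text; the sequence of window positions, and from it the sequence of memory addresses, follows from them without any further instructions.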

Conclusions

Reconfigurable computing machines have been examined from different viewpoints: how data memory is implemented, how data sequencing is performed, and whether the sequencer's architecture augments memory bandwidth.

In See The Data Memory Organization of Reconfigurable Systems different data memory organizations for reconfigurable computing have been explored. So far, the only way to use large data memories is through external commercial memory devices. Some current approaches already use large data memories with burst mode DRAMs (PRISM-II, see See The PRISM-II System and Riley-2, see See Riley-2). More common is the realization of multiple memory banks (mostly SRAM, but in the case of PRISM-II also DRAM) with concurrent access. External memory is rather unsuitable for obtaining high speed or concurrent memory access: too many chip pins of the reconfigurable devices would be needed for the external memory connection, and this would lower the transmission speed.

For address generators, two kinds of solutions may be distinguished: hardwired address generators and synthesized address generators (mapped into the reconfigurable platform). The hardwired sequencers usually do not utilize any reconfigurability features from the rest of the system. The synthesized address generators, having been mapped onto reconfigurable parts of the system, may feature some optimization with respect to a particular application. None of the systems shows any systematic approach to improving memory bandwidth by a clever sequencing concept other than sequencing by software, nor by synthesized address generators mapped onto the reconfigurable parts. None uses larger RAM banks on board of the reconfigurable chip.

The new accelerator / memory communication concept introduced in See A Novel Data Sequencing Concept and See Data Sequencer use for Higher Memory Bandwidth will utilize the flexibility of multiple synthesized sequencers to substantially improve the memory bandwidth.


1. For information on systolic arrays refer to See [Har94].
2. RISC (reduced instruction set computer) architectures have been introduced in See [Mil88].
3. For more information on direct memory access (DMA) refer to See [Bae94].
4. RISC (reduced instruction set computer) architectures have been introduced in See [Mil88].

5. There is no memory embedded into the reconfigurable cell array. But MorphoSys is a System on Chip (SoC) and has memory blocks on chip.

6. The peak performance is a theoretical value given in the data sheets. Because of refresh cycles and command sequences it is below the interface speed.

7. Performance calculated using two devices to obtain a 16 bit interface, like all other memory types.

8. For information on the fast Fourier transform (FFT) refer to See [GW77].
9. RISC (reduced instruction set computer) architectures have been introduced in See [Mil88].
10. For more information on conventional memory management units (MMU) refer to See [Bae94].
11. For information on fast Fourier transforms (FFT) refer to See [GW77].
12. RISC (reduced instruction set computer) architectures have been introduced in See [Mil88].