With the increasing speed of today's computational devices (microprocessors as well as configurable devices, etc.), the memory interface speed becomes the limiting factor for the overall system throughput [HS96]. Several techniques to alleviate this problem have been developed. Since most of these techniques focus on the locality of instructions in a program [KT96], they cannot be applied to reconfigurable computing machines, which are configured before program execution [BHH98] and whose computation data does not contain any loops.
Since an accelerator for an application (hopefully) has a higher throughput than its software solution, the accelerator makes the memory bandwidth problem worse. This means that memory/accelerator communication architectures are generally more challenging for reconfigurable computing platforms than for procedural (i.e. von Neumann-like) platforms. Acceleration mechanisms are mainly based on:
The result of these acceleration mechanisms is an increased computation speed and data throughput. Because reconfigurable accelerators are able to process data faster than microprocessors, the memory interface must deliver the data at the same rate. Therefore, sequencers or address generators are important means for building memory interfaces or powerful memory architectures that use, e.g., interleaving or burst mode capability.
While most research projects still struggle with general architectural issues, the memory communication problem is underestimated in the area of reconfigurable computing. Usually a straightforward application-specific data sequencer implementation is chosen to address small SRAMs, ignoring that the lack of a common concept causes considerable overhead in several areas.
Even for microprocessors, the memory communication bandwidth problem has become worse with each new technology generation. Examples have been published in which a state-of-the-art microprocessor spends more than 75% of the computation time waiting for busy memory [PAC97]. For accelerators, especially for data-intensive applications like image processing, the memory communication bandwidth problem is considerably more severe than for microprocessors. A novel universal concept must address this topic with several strategies that solve memory communication performance problems while taking the specific characteristics of reconfigurable architectures into account.
Previous MoM architectures (see Reconfigurable Systems using Data Sequencing) are based on external memory devices. But systems on chip (SoC), faster memory architectures, and their availability as IP cores create a new situation. Compared to multi-chip solutions, SoCs provide faster and more plentiful interconnect resources. At present there is an increasing trend to integrate smaller and larger reconfigurable parts in SoCs [Rab00] [MOR] [MAL] [SIL], in particular for network processors [Cra99] [Gil99] [Wal99]. Integrating coarse-grained reconfigurable accelerators with on-chip memory therefore becomes a commercially relevant approach (see [LSL99] [Tri99] [Sid99]), because on-chip memory is expected to drastically shorten the cycle time.
Furthermore, new highly flexible synthesis methods for coarse-grained reconfigurable accelerators [HHH00] support the integration of efficient solutions for the memory bandwidth problem. This is important insofar as completely universal accelerators are an illusion and application-domain-specific solutions are needed. IRAM [KP98] [KAP97] [PAC97] appears to have tried a fine-grained or medium-grained mixture or merging of logic and memory on the same chip. This approach seems less promising, however, because of the lack of good algorithms to map applications onto such a platform. Therefore, this thesis prefers a solution with a clear architectural separation of the processing block from the memory module (also including independent memory banks on the same chip).
This thesis focuses on the memory communication problem of reconfigurable computers based on coarse-grained architectures. It first gives a general overview of the technologies relevant to the presented work. This overview is organized in four chapters. First, several coarse-grained reconfigurable architectures are presented as target platforms for reconfigurable computing systems (see Coarse-grained Configurable Architectures). After that, current memory technologies and their mechanisms to accelerate memory accesses are explained (see Memory Technologies). Interesting address generators are then presented (see Earlier Address Generators), and finally the data sequencing concepts of existing reconfigurable computing machines are discussed (see Reconfigurable Systems using Data Sequencing).
As the basis on which the memory communication bandwidth is improved, the thesis presents an address generation concept for sequencing (see A Novel Data Sequencing Concept). Starting with a classification of memory accesses, special operators for data sequencing are developed, taking into account that they will be used for reconfigurable computing, i.e. that the configuration load has to be minimized. On this basis a new basic access pattern is determined. By further combining the basic pattern, a general model of a generic data sequencer is developed, which uses a parameter stack to provide high flexibility in access sequences. Based on this model, two hardware implementations of generic data sequencers are presented: a mapping to the coarse-grained KressArray-3 (see The Data Sequencer Mapped to the KressArray), and a hardwired version to be used in the Map-oriented Machine with Parallel Data Access (MoM-PDA, see The Map-oriented Machine with Parallel Data Access).
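To make the idea of a parameter-driven address generator more concrete, the following minimal Python sketch generates the linear addresses of a rectangular scan over a row-major memory and processes a list of parameter sets one after another. It is an illustration under stated assumptions only: the function names and the parameters (base, row_pitch, x_count, y_count, x_step, y_step) are hypothetical and are not the operators, the basic access pattern, or the parameter stack format developed in this thesis.

```python
from typing import Iterator, List, Tuple

# Hypothetical parameter set: (base, row_pitch, x_count, y_count, x_step, y_step)
ScanParams = Tuple[int, int, int, int, int, int]


def scan_addresses(base: int, row_pitch: int, x_count: int, y_count: int,
                   x_step: int = 1, y_step: int = 1) -> Iterator[int]:
    """Yield linear addresses of a rectangular scan over a row-major memory.

    base      -- address of the first word of the scan
    row_pitch -- number of words per memory row
    x_count   -- number of accesses per scan line
    y_count   -- number of scan lines
    x_step    -- distance between accesses within a line (in words)
    y_step    -- distance between scan lines (in rows)
    """
    for y in range(y_count):
        for x in range(x_count):
            yield base + (y * y_step) * row_pitch + x * x_step


def run_parameter_list(param_sets: List[ScanParams]) -> List[int]:
    """Concatenate the address sequences of several parameter sets."""
    addresses: List[int] = []
    for params in param_sets:
        addresses.extend(scan_addresses(*params))
    return addresses
```

In this sketch, changing the access sequence only requires new parameter values rather than a new generator, which mirrors the motivation for keeping the configuration load small.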
At the hardware level, a special two-dimensional data memory organization enables the exploitation of speed-up mechanisms like burst mode and concurrent data accesses. The exploitation of these hardware-level optimizations is further enhanced by loop unrolling, modification of storage schemes, and scheduling of the accesses (see Data Sequencer use for Higher Memory Bandwidth).
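As a rough illustration of why a two-dimensional memory organization can allow concurrent accesses, the following Python sketch distributes the words of a 2-D data array over independent banks so that every window of banks_x by banks_y words touches each bank exactly once. This is a generic interleaving scheme chosen for illustration; the function names and the bank layout are assumptions and do not describe the memory organization developed in this thesis.

```python
from typing import List


def bank_of(x: int, y: int, banks_x: int, banks_y: int) -> int:
    """Return the bank index holding the word at 2-D position (x, y).

    Banks are arranged as a banks_x by banks_y grid that is tiled
    periodically over the data array (hypothetical layout).
    """
    return (y % banks_y) * banks_x + (x % banks_x)


def window_banks(x0: int, y0: int, banks_x: int, banks_y: int) -> List[int]:
    """List the banks touched by a banks_x by banks_y window at (x0, y0)."""
    return [bank_of(x0 + dx, y0 + dy, banks_x, banks_y)
            for dy in range(banks_y)
            for dx in range(banks_x)]


def is_conflict_free(x0: int, y0: int, banks_x: int, banks_y: int) -> bool:
    """True if no two words of the window fall into the same bank."""
    banks = window_banks(x0, y0, banks_x, banks_y)
    return len(set(banks)) == len(banks)
```

Because the window has the same extent as the bank grid, its positions cover every residue modulo banks_x and banks_y exactly once, so all words of the window could in principle be fetched in the same cycle regardless of where the window starts.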
See the table of direct influences of previous work for an overview of the technological and theoretical basis for the presented memory communication concept.
Since computational machines based on data sequencing follow a new computing paradigm, which is quite different from the von Neumann paradigm, rethinking is necessary. To assist in programming the data sequencer, an Internet-based development framework, the Xputer Multimedia Development Framework (XMDS), has been developed. It supports the programmer in all development phases with a comprehensive toolset. Besides several ways to enter applications, it provides different tools for program validation and design output. One key feature of the XMDS is the way it is implemented. As Internet-based software, the XMDS is installed on a Web server. The user invokes the system by addressing a specific URL, and the XMDS is then dynamically downloaded to the user's computer.
The XMDS is introduced in A Development Framework for Data Sequencers and is used for the application example in An Application Example.