The data sequencing concept presented in A Novel Data Sequencing Concept is versatile, since it may serve as the basis for various implementations. This chapter focuses on a hardwired data sequencer implementation targeted at building a custom computing machine (CCM) on the basis of field-programmable devices. This type of machine is commonly called an FPGA-based custom computing machine (FCCM). The FCCM should implement a data sequencing mechanism that is independent of the computational datapath. The prototype is called the Map-oriented Machine with Parallel Data Access (MoM-PDA, see The MoM-PDA Overall Architecture). It supports concurrent data access with 2 parallel banks of Multibank DRAM (MDRAM, see Multibank DRAM Technology).
The MoM-PDA is the fourth-generation machine in the series of Map-oriented Machines (see The Address Generator of the Map-oriented Machine 1, The Address Generator of the Map-oriented Machine 2, and The Address Generator of the Map-oriented Machine 3). All implementations of this series are based on the Xputer paradigm (see The Basic Xputer Architecture, [HHW89], [HHS92], or [AHR94]), utilize a data sequencer, and operate on 2-dimensional data memory. While all earlier implementations are generally suited to utilize reconfigurable devices, the MoM-PDA is the first prototype that has been optimized for this task. The most important novel feature of the MoM-PDA, however, is the first comprehensive realization of memory access optimizations: for the first time, the 2-dimensional memory organization is exploited in this respect.
The hardwired data sequencer is based on a parameter stack (see A Stack Mechanism for the Generation of Complex Address Sequences). It provides the complete hardware needed to execute a 2-dimensional video scan. The parameter stack holds multiple video scan descriptions, which supports the generation of nested, meshed, and compound scans. Furthermore, this concept has been extended to a multitasking data sequencer, i.e. the hardwired data sequencer supports the generation of multiple scan patterns and thus the movement of multiple scan windows at the same time.
This chapter is structured as follows. First the MoM-PDA architecture will be introduced. The major focus will be the data sequencer of the MoM-PDA. After the detailed description of the data sequencer, the hardware details of the MoM-PDA will be reported.
The MoM-PDA ([HHH99b], [HHN99a]) is an accelerator that is connected to a host computer via a PCI interface. The most important new feature of the MoM-PDA prototype is concurrent high-speed access to the data [HHH98c]. For this purpose it has 2 parallel banks of Multibank DRAM (MDRAM, see Multibank DRAM Technology). Addresses for the MDRAMs are computed by the Data Sequencer and extended with burst information by the Burst Control Unit (BCU, see Interfacing Multibank DRAM and The Burst Control Unit, or [Bed98]).
Computations are usually performed by the KressArray. For this purpose, data is first routed to a reconfigurable ALU port (rAP, see The Reconfigurable ALU Port, or [Gil98]). The rAP acts as an interface to the memory subsystem and implements some glue logic. For small applications the rAP may also be used to perform computations, in which case the KressArray is not needed.
All components of the MoM-PDA are implemented with field-programmable logic. The figure "The MoM-PDA machine overview" gives an overview of the general machine architecture. The host computer is connected to the data sequencer, the burst control unit (BCU), and the rAP via configuration lines. The KressArray and the MDRAMs are programmed indirectly via the rAP and the BCU, respectively. To enable concurrent data access, all address and data lines are implemented twice. During operation the machine is controlled by the data sequencer. The host only initiates computations and observes them via the status and control lines.
The MoM-PDA prototype board, which contains all elements needed to execute simple applications, is covered in The MoM-PDA Board. A KressArray emulator has been implemented and is described in [Zim99]; it is briefly introduced in The KressArray Emulator. The PCI interface is explained in The PCI Interface Board.
In the following the address generation pipeline of the MoM-PDA will be explained in detail. After that the multitasking capabilities will be elucidated.
The address generation of the MoM-PDA is performed by four units, which form a pipelined datapath with three stages. The pipelined architecture forms an efficient implementation of the address generation datapath, because all units of the address generator operate simultaneously.
According to the concepts introduced in A Novel Data Sequencing Concept, the Handle Position Generator (HPG) generates handle positions on the basis of the slider model (see the figure "Illustration of the slider model [HBH97a] and generated scan sequence for x-component of the video scan in figure 6-5 at page 98"), which has already been proposed in [HHW90]. The handle positions are the basis for the second pipeline stage, where the Scan Window Generator (SWG) performs memory accesses according to the scan window model, applying the hardware-level optimizations described in The Hardware Level Support for Memory Access Optimization.
While the generation of the handle position is independent of the number of parallel memory banks, the SWG has to generate accesses for both banks concurrently. Therefore all datapaths behind the SWG exist twice.
A Memory Mapper (MemM) optimizes the mapping of the 2-dimensional memory to the linear memory banks. After that, the burst information provided by the SWG is used by the BCU to access the MDRAM devices.
The figure "The address generation datapath of the MoM-PDA data sequencer" pictures the address generation datapath. The necessary hardware is distributed to two devices:
This subsection describes the hardware required for handle position generation. The HPG utilizes two identical generic address generators (GAG), one for the x and one for the y direction (see the figure "The address generation datapath of the MoM-PDA data sequencer"). The context switcher shown in the same figure is explained in The MoM-PDA Memory Mapping.
Both GAGs (x and y) provide the complete hardware to generate 1-dimensional video scans. Together, the GAGs form a video scan generator according to the principle explained in The Two-dimensional Video Scan Generator. To generate compound, nested, and meshed scans, the GAGs employ a parameter stack, which works according to the parameter stack method introduced in A Stack Mechanism for the Generation of Complex Address Sequences. The figure "The 1-dimensional Generic Address Generator of the MoM-PDA data sequencer" illustrates the structure of a single GAG.
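As a rough illustration of how two 1-dimensional generators can be combined into a 2-dimensional video scan, the following Python sketch steps an x generator across each line and advances a y generator after each line. The function names, the inclusive end positions, and the choice of x as the fast direction are assumptions made for this example; they do not describe the MoM-PDA hardware interface.

```python
def scan_1d(start, end, step):
    """Yield positions from start towards end (inclusive) in increments of step."""
    if step == 0:
        raise ValueError("step must be non-zero")
    pos = start
    # The sign of the step decides which comparison ends the scan.
    while (pos <= end) if step > 0 else (pos >= end):
        yield pos
        pos += step


def video_scan(x_params, y_params):
    """Combine an x and a y generator into a line-by-line 2-D scan.

    Each parameter tuple is (start, end, step); x is assumed to be the
    fast (inner) direction.
    """
    for y in scan_1d(*y_params):
        for x in scan_1d(*x_params):
            yield (x, y)


# Three handle positions per line, two lines.
handles = list(video_scan((0, 4, 2), (0, 1, 1)))
print(handles)
```

In this software model a nested scan would correspond to starting a second `video_scan` at every handle position of an outer one.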
Since the MoM-PDA data sequencer is implemented with an Altera CPLD (see The Data Sequencer), its design is a compromise between performance requirements and available design space, i.e. the Limit, Base, and Address Steppers have been optimized for their specific purpose. Furthermore, the Handle Position Generator is designed without the step counter proposed in The Two-dimensional Video Scan Generator. If a step counter is needed, the parameters of the steppers are saved to the stack, and one of the steppers itself is used for step counting. While this method is slower, it saves valuable design space of the CPLD. The counter mode of the stepper is described in detail in [Buc99b]. The optimizations of the Base, Limit, and Address Steppers are explained below.
The Base and Limit steppers of the MoM-PDA data sequencer have an identical design, pictured in the figure "The MoM-PDA Base / Limit stepper implementation". All generated addresses are 16-bit values; therefore the complete datapath is 16 bits wide.
The parameters required for the Base or Limit slider implementation are stored in a parameter stack. The MoM-PDA parameter stack has the capacity to store 64 video scans. For a single Base or Limit stepper the parameter stack has four entries:
All values of the parameter stack are initialized by the host computer. For this, the init signal is set high and the multiplexers of the parameter memory hand control of the parameter stack over to the host computer: the Address_bus and the Configuration_Data lines, driven by the PCI interface, are connected to the parameter stack.
During address generation the init signal is low. The data input of the parameter stack is connected to the only read/write parameter, the Slider Position. The stack pointer SP points to the parameter set of the current video scan, and the area signals address the current stack window position.
For relative video scans, e.g. the inner scan of a nested scan, the stepper provides a register for the Scan Pattern Location. During parameter initialization the current address of the outer video scan is stored in this register.
For stepper operation the Slider Position is initialized with the Initial Position. A relative scan is generated by adding the value of the Scan Pattern Location register. For initialization, the BL_sel input of the input multiplexer of the Slider Position register is set to zero.
To re-initialize the stepper with previously used parameters from the parameter stack, the Slider Position register is also loaded from the parameter stack. For this, the BL_sel input of the input multiplexer of the Slider Position register is set to two.
During normal stepper operation the Step Width is added to the current position in each step to generate the next Slider Position. For this, the BL_sel input of the input multiplexer of the Slider Position register is set to one.
The stepper generates new Slider Positions until the Slider Position exceeds the End Value. The figure "The escape unit. Different escape conditions: (a) End>Initial position, (b) End<Initial position, and (c) hardware implementation of the escape unit" shows the two possibilities for the end of address generation. For the MoM-PDA data sequencer the End Value is relative to the Initial Position. Therefore the sign of the End Value is used to determine which relation has to be chosen to detect the End_Signal. The figure "The end detection unit of the MoM-PDA Base- and Limit steppers" gives an overview of the End Detection Unit. The End Value and the Initial Position are added to obtain an absolute value that is compared with the current Slider Position. If the End Value is negative, the addition amounts to a subtraction. Based on the sign of the End Value, a multiplexer selects the output of the "greater than" or the "less than" relation as the End_Signal.
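The behaviour of the Base / Limit stepper described above can be approximated in software. The following Python sketch models the three BL_sel settings of the Slider Position input multiplexer and the sign-controlled end detection. The class and function names are hypothetical, 16-bit wrap-around is ignored, and adding the Scan Pattern Location to the absolute end position of a relative scan is an assumption of this sketch, not something stated in the text.

```python
from dataclasses import dataclass


@dataclass
class StepperParams:
    initial_position: int
    step_width: int
    end_value: int            # relative to initial_position, may be negative
    slider_position: int = 0  # the only parameter written back during a scan


def next_slider(p, bl_sel, scan_pattern_location=0):
    """Model the input multiplexer of the Slider Position register."""
    if bl_sel == 0:   # initialization, optionally relative to an outer scan
        return p.initial_position + scan_pattern_location
    if bl_sel == 1:   # normal operation: advance by the step width
        return p.slider_position + p.step_width
    if bl_sel == 2:   # re-initialization from the value saved on the stack
        return p.slider_position
    raise ValueError("BL_sel must be 0, 1 or 2")


def end_signal(p, scan_pattern_location=0):
    """The sign of the End Value selects the 'less than' or 'greater than' relation."""
    absolute_end = p.initial_position + scan_pattern_location + p.end_value
    if p.end_value < 0:
        return p.slider_position < absolute_end
    return p.slider_position > absolute_end


def run(p, scan_pattern_location=0):
    """Collect all slider positions of one scan.

    The sign of step_width is expected to match the sign of end_value;
    otherwise the end condition is never reached.
    """
    positions = []
    p.slider_position = next_slider(p, 0, scan_pattern_location)
    while not end_signal(p, scan_pattern_location):
        positions.append(p.slider_position)
        p.slider_position = next_slider(p, 1)
    return positions


print(run(StepperParams(10, 2, 6)))     # upward scan
print(run(StepperParams(10, -3, -6)))   # downward scan
```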
For more details on the MoM-PDA Base / Limit stepper please refer to [Buc99b].
The Address stepper has been simplified to save design space of the target CPLD. The parameter stack is the same as for the Base / Limit stepper (see the figure "The 1-dimensional Generic Address Generator of the MoM-PDA data sequencer"). The optimized Address stepper is shown in the figure "The Address stepper of the MoM-PDA data sequencer". Only two values are accessed from the parameter stack:
The Initial Position for the Address stepper is obtained from the Base stepper. The current Base value is not stored in a separate register, since the Base stepper already contains a register for Base. Likewise, the End Value Limit is not stored in a register, since the Limit stepper provides a register for Limit.
For the Address stepper the Initial Position (Base) and the End Value (Limit) are absolute parameters. As in Base and Limit generation, the End Value may be higher or lower than the Initial Position. Therefore the Step Width must be chosen accordingly, i.e. with a positive or negative sign.
The End Detector of the Address stepper uses the sign of the Step Width to select between the "greater than" and the "less than" relation to drive the End_Signal.
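A minimal Python sketch of this simplified end detection is given below. It assumes that Base and Limit are passed in as plain integers and that the Limit itself is still a valid address; the function name is hypothetical.

```python
def address_scan(base, limit, step_width):
    """Yield addresses from base to limit; both are absolute positions."""
    if step_width == 0:
        raise ValueError("step width must be non-zero")
    address = base
    while True:
        # End detector: the sign of the step width selects the relation.
        if step_width > 0:
            ended = address > limit
        else:
            ended = address < limit
        if ended:
            return
        yield address
        address += step_width


print(list(address_scan(3, 9, 3)))
print(list(address_scan(9, 3, -3)))
```

Because Base and Limit are absolute, no addition is needed before the comparison, in contrast to the Base / Limit steppers, where the End Value is relative to the Initial Position.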
Scan window generation is the second stage of the two-level data sequencing. All memory access optimizations are performed at this stage. Therefore the scan window generation of the MoM-PDA data sequencer includes two important steps:
The Principle Data Sequencer Architecture has already introduced a basic scan window generator architecture for one memory bank. The figure "The Scan Window Generator of the MoM-PDA data sequencer" shows the overall structure of the MoM-PDA scan window generator. It consists of the following units:
Each offset generator produces relative addresses that are added to the current handle position. When a new handle position is generated, the offset generator starts generating addresses. Since the accesses inside a scan window are often irregular, they are stored in a look-up table. The figure "The organization of the Offset Memory of the MoM-PDA Scan Window Generator" shows the organization of the look-up table of one offset generator. It has a capacity of 16 different scan windows (Task 0-15) of 512 entries each.
The offset generator (see the figure "The offset Generator of the MoM-PDA Scan Window Generator") selects between the different scan windows on the basis of the signal Task_Nr. This four-bit signal is used as the MSW of the Offset_Memory_Address. The main components of the offset generator are an incrementer and a 16 word by 9 bit register file. Each word of the register file is assigned to a scan window and selected by Task_Nr. The register file holds pointers into the look-up table memory. In this way the current state of each scan window can be saved.
The output of the look-up table (Offset_Memory_Data) may also be used to load the register file with a new value. For this, the signal detect must be set high. All other multiplexers are needed for reconfiguration of the scan window generator by the host computer. The signal init is used to pass control of all address lines and control signals to the host.
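The addressing of the offset memory can be sketched as follows. The Python model below forms the look-up table address from the 4-bit Task_Nr and a per-task 9-bit pointer, and increments the pointer after each access. The class and method names are hypothetical, the wrap-around of the pointer at 512 entries is an assumption, and reloading a pointer from Offset_Memory_Data via the detect signal is not modelled.

```python
ENTRIES_PER_TASK = 512   # addressed by a 9-bit pointer
NUM_TASKS = 16           # selected by the 4-bit Task_Nr


class OffsetGenerator:
    def __init__(self, offset_memory):
        if len(offset_memory) != NUM_TASKS * ENTRIES_PER_TASK:
            raise ValueError("offset memory must hold 16 x 512 entries")
        self.memory = offset_memory
        self.pointers = [0] * NUM_TASKS   # models the 16 word by 9 bit register file

    def address(self, task_nr):
        # Task_Nr forms the upper four bits, the pointer the lower nine bits.
        return (task_nr << 9) | self.pointers[task_nr]

    def next_offset(self, task_nr):
        offset = self.memory[self.address(task_nr)]
        # Incrementer: advance only the pointer of the selected task.
        self.pointers[task_nr] = (self.pointers[task_nr] + 1) % ENTRIES_PER_TASK
        return offset


# Fill the table with its own addresses so the selected entry is visible.
gen = OffsetGenerator(list(range(NUM_TASKS * ENTRIES_PER_TASK)))
first = gen.next_offset(3)    # task 3, entry 0
other = gen.next_offset(0)    # task 0, entry 0
second = gen.next_offset(3)   # task 3 resumes at entry 1
print(first, other, second)
```

Keeping one pointer per task is what allows a scan window to be resumed at the point where it was left when another task was served in between.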
For more details please refer to [Buc99b].
To optimize the exploitation of the available memory resources the MoM-PDA performs an enhanced memory mapping. This mapping is needed for two reasons:
In multitasking systems it is usually not known at design time which applications will be executed at the same time. Instead, the runtime system of a multitasking system schedules the application tasks dynamically. This may fragment the data memory, or cause two applications to use the same address range. In both cases hardware support for re-mapping the application data is needed.
This strategy leads to a separate address space for each data block in the 2-dimensional memory. If the physical memory holds data of several applications, the runtime system has to ensure that no memory violation occurs. For each application a Memory Base address is added and the necessary space is allocated. The required mapping is done by the context switcher, which is located in the HPG (see the figure "The address generation datapath of the MoM-PDA data sequencer"). The Memory Base address simply moves the application data to the allocated location in the data memory. It is determined dynamically by the runtime system. To allow arbitrary movement, the Memory Base address may be positive or negative. The figure "Data of several applications mapped to the parallel memory banks" illustrates the data of three applications mapped to the parallel memory banks.
The figure "Context switcher implementation of the MoM-PDA data sequencer" shows the context switcher hardware. Two 16 word by 16 bit register files hold the x and y Memory Base addresses for all applications. The appropriate Memory Base address is selected by the Task_Nr signal and added to each handle position generated by the HPG.
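The effect of the context switcher can be illustrated with a small Python model. It assumes that negative Memory Base addresses are stored in 16-bit two's complement and that the addition wraps at 16 bits; both are assumptions of this sketch, and the class and method names are hypothetical.

```python
MASK16 = 0xFFFF


class ContextSwitcher:
    def __init__(self):
        # Two 16 word by 16 bit register files, one per direction.
        self.base_x = [0] * 16
        self.base_y = [0] * 16

    def set_base(self, task_nr, x, y):
        # Negative bases are stored in 16-bit two's complement (assumption).
        self.base_x[task_nr] = x & MASK16
        self.base_y[task_nr] = y & MASK16

    def map(self, task_nr, handle_x, handle_y):
        """Add the Memory Base address of the task to a handle position."""
        return ((handle_x + self.base_x[task_nr]) & MASK16,
                (handle_y + self.base_y[task_nr]) & MASK16)


cs = ContextSwitcher()
cs.set_base(2, 100, -8)       # move the data of task 2 right by 100 and up by 8
moved = cs.map(2, 5, 20)
unmoved = cs.map(0, 5, 20)    # task 0 keeps its default base of (0, 0)
print(moved, unmoved)
```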
The x and y parts of the handle position are composed as described in A Mapping Scheme to Solve Data Locality Coherence Problems to avoid memory fragmentation caused by unused MSBs of the x address. The figure "The memory mapper of the MoM-PDA data sequencer" gives an overview of the memory mapper (MemM) implementation. Since two parallel address streams are generated by the SWG, the memory mapping hardware is needed twice. The shift operations are the same for each memory bank. A barrel shifter shifts the y part of the address by between zero and 15 bits. A 15-bit OR gate merges the output of the barrel shifter with the 15 MSBs of the x address. This mapping scheme is an improvement over the Map-oriented Machine 3, where only four fixed address mappings are supported (see the conclusions of The Address Generator of the Map-oriented Machine 3 in Comments on the MoM-3 Address Generator).
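One simplified reading of this mapping is sketched below in Python: the y part is shifted left and merged with the x part by a bitwise OR. The function name is hypothetical, the width of the resulting linear address is not modelled, and the check that x fits below the shifted y part is an assumption added so that the two fields cannot overlap.

```python
def map_address(x, y, shift):
    """Merge x and y into one linear address: y is shifted left, then ORed with x."""
    if not 0 <= shift <= 15:
        raise ValueError("the barrel shifter supports shifts of 0 to 15 bits")
    if x >> shift:
        raise ValueError("x does not fit below the shifted y part")
    return (y << shift) | x


# A data block that is 16 columns wide needs 4 bits for x.
print(map_address(5, 3, 4))
# A block that is 64 columns wide needs 6 bits for x.
print(map_address(0x3F, 2, 6))
```

Choosing the shift to match the number of x bits actually used by the data packs consecutive lines directly behind each other in the linear memory, which is how unused x MSBs are kept from creating gaps.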
To support memory devices with burst options an additional unit has to control the burst operations. The required signals for variable burst lengths are generated by the Burst Control Unit (BCU) based on the burst information provided by the SWG. Because the memories are accessed in parallel, the BCU hardware is instantiated for each memory bank.
This section gives an overview of the characteristics and functionality of the Burst Control Unit (BCU). For a detailed description of the BCU refer to [Bed98]. For implementation details see The Burst Control Unit. A description of the Multibank DRAM (MDRAM) functionality is given in Multibank DRAM Technology.
The MoM-PDA memory is divided into two concurrently accessible sectors (see the figure "Dynamic assignment of scan window positions to memory banks during scan window movement: (a) before scan window movement, and (b) new assignment after scan window movement") in order to implement a parallel memory system as described in The Memory Architecture. Thus, the BCU is connected to two parallel MDRAM buses, and all address and handshake lines to the data sequencer and the rALU must be instantiated twice. The structure of the BCU/MDRAM bus system is shown in the figure "Schematic of the Burst Control Unit / MDRAM bus system of the MoM-PDA".
According to the two bank interleaving scheme, some of the BCU components are instantiated twice and others only once. The components for the MDRAM setup and initialization via the host computer are only active before and after scan window operations. Thus, the memory sectors don't need to be accessed simultaneously for these tasks and these components exist only once. All other parts have two instances, because during scan window operation the sectors must be accessed concurrently.
During scan window operation, every read/write burst is initiated by a burst request signal from the SWG. When generating a burst request, the SWG has to supply a parameter set describing the burst. It contains the start address of the burst, the burst length, and the burst type (read or write). Based on this information the burst split unit decides whether the burst data can be read from or written to a single line of a DRAM bank completely, that is, whether the start address and the end address of the burst (end address = start address + burst length) are located in the same bank.
If this is not the case, the burst is split into two smaller bursts that are executed one after the other in different DRAM banks. To ensure that both parts of the burst are located in different banks, an interleaving scheme is used internally for bank addressing (see the figure "Bank interleave addressing scheme of the Map-oriented Machine with Parallel Data Access (MoM-PDA, see appendix D.1 at page 262)"). The benefit of this scheme is the possibility to use hidden RAS (refer to Multibank DRAM Technology for hidden RAS). If both sub-bursts were located in different rows of the same DRAM bank, the second row could not be activated before the first sub-burst is completed. When the rows are located in different banks, both banks can be activated before the burst starts. The same figure clarifies the addressing scheme. The start address of a burst consists of three parts: bank address (in the MoM-PDA 6 bits), row address (8 bits), and column address (5 bits) in the following order:
Besides the address bits for addressing each parallel bank, another address bit is used for selection of the bank to be addressed when loading the memory via the host computer interface.
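The decision made by the burst split unit can be sketched in Python as follows. The sketch assumes that the 5-bit column address occupies the least significant bits of the burst address, so that one DRAM line holds 32 consecutive addresses, and it treats the end address (start address + burst length) as the first address behind the burst. Both points, as well as the function name and the limit of 32 addresses per burst, are assumptions of this example and not taken from the BCU design.

```python
COLUMN_BITS = 5
ROW_WORDS = 1 << COLUMN_BITS   # 32 column addresses per DRAM line


def split_burst(start, length):
    """Return one or two (start, length) bursts that each stay within one line."""
    if not 1 <= length <= ROW_WORDS:
        raise ValueError("this sketch handles bursts of 1 to 32 addresses only")
    end = start + length                          # end address = start address + burst length
    row_boundary = (start | (ROW_WORDS - 1)) + 1  # first address of the next line
    if end <= row_boundary:
        return [(start, length)]
    first_length = row_boundary - start
    return [(start, first_length), (row_boundary, length - first_length)]


print(split_burst(24, 8))   # fits into one line
print(split_burst(28, 8))   # crosses a line boundary and is split
```

With the bank interleaving described above, the two parts of a split burst fall into different banks, so both lines can be activated before the transfer starts.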
The burst split unit (see the figure "Structure of the MoM-PDA Burst Control Unit") also performs all handshaking with the data sequencer. After receiving a burst request from the data sequencer, the burst split unit forwards the start address and the burst length to the register file, which contains registers and counters for burst control. The burst control state machine is also informed about the burst request. If neither a burst nor a refresh is in progress, the burst control state machine activates the appropriate banks for the requested bursts using the addresses stored in the register file. After that, control is passed on to the read/write logic unit, which mainly performs the handshake with the rALU. While a burst is performed, the register file holds the number of data words transferred so far in order to generate a ready signal when the burst is completed. The ready signal passes control back to the burst control unit, which precharges the banks, again using addresses from the register file.
The burst control state machine is the most complex part of the BCU. Its task is to organize the incoming burst requests in an optimal way. Furthermore, it detects whether two successive bursts access the same row of the same bank. In that case no precharge/activate operation is necessary in between.
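The saving obtained from this detection can be made visible with a small Python sketch that turns a sequence of burst requests into a command list. It is a strongly simplified model: only one row is kept open at a time, and the command names ACTIVATE, BURST, and PRECHARGE as well as the function name are hypothetical labels chosen for this example.

```python
def plan_commands(bursts):
    """bursts: iterable of (bank, row) pairs in request order."""
    commands = []
    open_row = None   # (bank, row) that is currently activated, if any
    for bank, row in bursts:
        if (bank, row) != open_row:
            if open_row is not None:
                commands.append(("PRECHARGE",) + open_row)
            commands.append(("ACTIVATE", bank, row))
            open_row = (bank, row)
        # Same row of the same bank: no precharge/activate in between.
        commands.append(("BURST", bank, row))
    if open_row is not None:
        commands.append(("PRECHARGE",) + open_row)
    return commands


for command in plan_commands([(1, 7), (1, 7), (2, 7)]):
    print(command)
```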
All addresses for the controlled MDRAM are generated either by the register file or by the refresh logic. The address multiplexer selects the currently active source for the addresses. When a burst operation is in progress, data is transferred over the ADQ bus (see the figures "Schematic of the Burst Control Unit / MDRAM bus system of the MoM-PDA" and "Structure of the MoM-PDA Burst Control Unit"), driven either by the rALU or by the MDRAM itself. During data transfer, the address multiplexer, the host computer interface connection, and the reset and bank ID reprogramming unit are disconnected from the BCU-internal part of the ADQ bus.
The refresh logic contains a counter that is initialized with an appropriate value and then decremented by one every clock cycle. When the refresh counter reaches zero, the refresh logic generates a refresh request signal. If no other operation is in progress, the refresh logic receives an OK signal from all other units and a quadruple refresh is performed immediately [Bed98]. Otherwise, the currently running burst operation must be interrupted: the read/write logic unit sends a STOP command to break the burst and a hold signal to the rALU. Only then does it generate the OK signal for the refresh logic. When the refresh operation begins, the refresh counter is re-initialized to its start value.
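The interplay between the refresh counter and a running burst can be sketched with a cycle-based Python model. The class name, the returned action strings, and the one-cycle granularity of the interruption are assumptions of this example; the real timing is documented in [Bed98].

```python
class RefreshLogic:
    def __init__(self, start_value):
        self.start_value = start_value
        self.counter = start_value

    def tick(self, burst_active):
        """Advance one clock cycle and return the action taken in this cycle."""
        if self.counter > 0:
            self.counter -= 1
        if self.counter > 0:
            return "idle"
        # Counter is zero: a refresh request is pending.
        if burst_active:
            # The read/write logic sends STOP to the MDRAM and hold to the
            # rALU before it grants OK; the refresh starts in a later cycle.
            return "interrupt_burst"
        # Refresh begins: re-initialize the counter to its start value.
        self.counter = self.start_value
        return "refresh"


logic = RefreshLogic(3)
actions = [logic.tick(False), logic.tick(False), logic.tick(True), logic.tick(False)]
print(actions)
```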
Reading and writing the MDRAM through the PCI interface is almost the same process as a data access initiated by the data sequencer, except that only bursts of length 1 are performed. The bus interface buffers two 16-bit data words and an address from the host interface. Then a length-one burst request is sent to the burst split unit and the burst is initiated as described above. The read/write logic unit performs the handshake with the bus interface instead of the rALU. The buffered data words are written directly to the BCU-internal ADQ bus.
MDRAM has to be initialized before scan window operations. Setting up includes the following processes:
The setup process is performed by the reset and bank ID reprogramming unit. This unit is started immediately after a hardware reset. When the setup procedure is completed, the init_ready signal will be generated to enable the other units for normal operation.
Because of the stack-based implementation of the HPG, all steppers provide an infrastructure for a quick parameter exchange. This infrastructure can also be exploited to set up a multitasking system. Parameters are changed not only to generate complex scan patterns but also to run multiple scan patterns simultaneously. This novel data sequencer structure handles up to 16 parallel tasks, each consisting of a complex scan pattern. Hence up to 16 scan windows may operate on the data memory concurrently.
The parallel tasks are processed in the way known from multitasking systems. All tasks have the same priority and are stored in the Task-List (see the figure "The task manager of the MoM-PDA data sequencer"). The Task-List is a 16 word by 6 bit memory which holds pointers (SP) to a Scan Parameter List and the Stepper Parameter Stack. The Task-List is processed according to the round-robin method, i.e. in an infinite loop. Each task generates exactly one handle position before the next task is called by the Stepper Control FSM. The Stepper Control FSM sets the signal inc_task to trigger the Mod 16 Counter to increment the Task_nr.
The Stepper Parameter Stack holds all parameters for the HPG (see also the figure "The MoM-PDA Base / Limit stepper implementation"). The Scan Parameter List stores, for each video scan, pointers to other video scans for the generation of compound, nested, and meshed scans. If the HPG indicates with the end_condition signals that another video scan has to be invoked, the Stepper Control FSM loads a new pointer from the Scan Parameter List into the Task-List. When the computation of a task has finished, the Stepper Control FSM resets the running flag of the current task in the Task-List via the task_parameter lines.
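The round-robin processing of the Task-List can be mimicked in Python with one iterator per task. In the sketch below each task delivers exactly one handle position per turn, a modulo-16 counter selects the next slot, and a finished task has its running flag cleared. The function name and the representation of tasks as iterators are assumptions of this example; unlike the hardware, the model stops once no task is running.

```python
NUM_SLOTS = 16


def round_robin(task_list):
    """Yield (task_nr, handle_position) pairs in round-robin order."""
    if len(task_list) > NUM_SLOTS:
        raise ValueError("at most 16 tasks are supported")
    slots = list(task_list) + [None] * (NUM_SLOTS - len(task_list))
    running = [task is not None for task in slots]
    task_nr = 0
    while any(running):
        if running[task_nr]:
            try:
                handle = next(slots[task_nr])
            except StopIteration:
                running[task_nr] = False   # reset the running flag
            else:
                yield task_nr, handle
        task_nr = (task_nr + 1) % NUM_SLOTS   # Mod 16 Counter


task_a = iter([(0, 0), (1, 0), (2, 0)])
task_b = iter([(5, 5)])
schedule = list(round_robin([task_a, task_b]))
print(schedule)
```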
For more details on the multitasking concept please refer to [Buc99b].
This chapter introduces the hardware components of the Map-oriented Machine with Parallel Data Access (MoM-PDA). The concepts of the MoM-PDA data sequencer have been described in The MoM-PDA Data Sequencer. The complete MoM-PDA prototype consists of three printed circuit boards (PCBs):
In the following subsections these boards and the main components utilized on these boards will be described.
To implement the interface to a host computer, the MoM-PDA prototype utilizes a commercial FPGA board. The FPGA of this board is used to implement glue logic to interface the MoM-PDA components. The figure "XC6216 PCI Board from Virtual Computer Corporation" shows the HOT Works PCI Board by Virtual Computer Corporation [VCC] that is used. The FPGA board consists of:
The XC6216, the SRAM, and a set of configuration registers are mapped to the memory space of the host CPU. This allows the host CPU to read or write the SRAM of the board, and to configure, read, or write the user FPGA and the user design. The board memory is organized into two banks (Bank 1 and Bank 2, see the figure "Hot Works Board Architecture") of 128Kx8 SRAMs. The banks have separate address and data buses. Multiplexers and bus switches allow various modes of operation.
Configuration of the XC6216 is done using the east address and west data ports (see the figure "Hot Works Board Architecture"). The control port of the XC4013 is needed to switch the multiplexers and bus switches to the different modes of operation. During read/write operations from PCI to board memory the XC6216 must keep its output ports (east data) in the high-impedance state. The control port is also used to program the clock generator.
The main components of the prototype are placed on the MoM-PDA prototype board (see the figure "MoM-PDA prototype board"). The complete prototype implementation is based on FPGAs. The board contains the Data Sequencer, the reconfigurable ALU Port (rAP), and the Burst Control Unit (BCU). In addition, it carries 2 parallel banks of Siemens MDRAM chips. Each MDRAM chip has a capacity of 1 MB and operates at a clock frequency of 120 MHz.
The MoM-PDA prototype board has 3 connectors:
The concepts of the external data sequencer used in the MoM-PDA have already been described in The MoM-PDA Data Sequencer. This subsection only lists the technical details of the CPLD-based implementation. The target architecture for the implementation is an Altera FLEX 10k100 GC503-4 CPLD.
More details can be found in Reconfigurable Architecture With Embedded Memory, [Bra98], [Buc99a], [Buc99b], and [Alt98]. The figure "Altera FLEX 10k100 device with adaptor board" illustrates the CPLD adaptor developed in [Bra98] with the Altera FLEX 10k100 CPLD chip. "Data sequencer implementation details" lists details of the CPLD implementation of the data sequencer. Technical details of the data sequencer are given in "Data sequencer details", and those of the scan window generator in "Scan window generator details".
This chapter gives a technical description of the hardware components used for the memory system of the MoM-PDA. A description of the MDRAM functionality can be found in Multibank DRAM Technology. For a more detailed description, especially with regard to timing values, package measurements, and maximum ratings, refer to the MDRAM data sheet [Sie97]. MDRAM components are available in different memory sizes and clock frequencies. In the MoM-PDA, an MDRAM type with a memory size of 1 MB (= 256k words) and a clock frequency of 120 MHz is used. A photo of this component is shown in the figure "SIEMENS MDRAM device".
Since all MDRAM types use a supply voltage of 3.3 V and the Xilinx FPGAs use 5 V, a 3.3V/5V transceiver is used for coupling these components. The transceiver is described in [Bed98].
The memory controller of the MoM-PDA, called the Burst Control Unit (BCU), is implemented using the Xilinx XC4000 FPGA family [Xil99]. For details of the implementation refer to [Bed98]. A floor plan of the MDRAM controller in the XC4013e FPGA is given in the figure "Floor plan of the Burst Control Unit on the Xilinx XC4013e chip".
The main task of the reconfigurable ALU Port (rAP) is to form an interface between the memory system and the rALU. Since MDRAMs use the same bus for address and data transfers (see Multibank DRAM Technology), the rAP is directly connected to the memory bus and must observe its restrictions.
The rAP is implemented with a Xilinx XC6200 FPGA (see the figure "Xilinx XC6216 FPGA: (a) device, and (b) die photograph [MV97]" and [Xil96]). Its programmable space is partitioned into two functional units.
Any transmission between the rALU and the MDRAM is performed over a 16-bit bus using handshake signals. The MDRAM has an internal 32-bit memory organization, but data words are split into two 16-bit words. The first half-word is read or written with the rising clock edge, the second half-word with the following falling clock edge of the system clock (SYSCLK, see the figure "Timing of an rALU-interrupted read burst"). For a detailed description of the interface refer to [Bed98].
Unfortunately the XC6200 FPGA [Xil96] is not capable of clocking flip-flops with both the rising and the falling edge. In addition, inverting the SYSCLK signal would lead to clock-skew problems. Therefore, the rAP is clocked by DCLK, a clock signal with twice the frequency of SYSCLK. The rising clock edge of DCLK is used to sample the read/write signal (RS/WS) and the data (DB).
In some situations the rALU may not be fast enough to process all data at the speed required by the memory system. In that case the rAP must interrupt the data transmission. An interrupted read transmission is shown in the figure "Timing of an rALU-interrupted read burst". The rAP must sample RS and DB on every rising edge of DCLK. If RS is active, a transmission is in progress. A 0/1 counter is used to distinguish between the lower half-word and the upper half-word. A write transmission is illustrated in the figure "Timing of an rALU-interrupted write burst". If the rAP samples WS active on the rising edge of DCLK, it must drive DB in the following two clock cycles for a complete 32-bit transfer. In both cases the rALU can assert HOLD to interrupt the transfer, but the current transmission must be finished because the Burst Control Unit samples HOLD on every rising edge of SYSCLK.
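The read path of this interface can be illustrated with a short Python sketch that assembles 32-bit words from the 16-bit values sampled on successive rising DCLK edges. It assumes that the lower half-word is transferred first and that RS is modelled as a simple 0/1 flag per sample; the function name is hypothetical.

```python
def assemble_words(samples):
    """samples: iterable of (rs, db) pairs, one per rising DCLK edge.

    Returns the 32-bit words received; samples with rs == 0 are ignored.
    """
    words = []
    half = 0   # 0/1 counter: 0 = lower half-word, 1 = upper half-word
    low = 0
    for rs, db in samples:
        if not rs:
            continue   # no transmission in progress, e.g. while HOLD is active
        if half == 0:
            low = db & 0xFFFF
        else:
            words.append(((db & 0xFFFF) << 16) | low)
        half ^= 1
    return words


# Two words with an interruption of two DCLK cycles in between.
samples = [(1, 0x1234), (1, 0xABCD), (0, 0), (0, 0), (1, 0x0001), (1, 0x0002)]
words = assemble_words(samples)
print([hex(word) for word in words])
```

Since a transfer that has started is always completed before HOLD takes effect, the interruption in this model can only fall between two complete 32-bit words, never between the two half-words of one word.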
To perform parallel data access two interfaces, as described above, are integrated into the rAP.
The MoM-PDA rALU consists of two parts:
The application-dependent part of the rALU may be implemented in the KressArray (see The KressArray Emulator) or in the rAP. Implementing applications directly in the rAP is only possible for rather simple tasks. Some example applications have been presented in [Gil98] and [HHG98]. The figure "The MoM-PDA rALU organization" shows the rALU organization for an application implemented in the rAP only. If an application is implemented in the KressArray, the rAP implements, besides the MDRAM interface, some glue logic to interface the KressArray. A Pipe Control Unit is not needed in that case.
The MDRAM interface (see the figure "MDRAM interface diagram") is an application-independent device. Its major function is to distribute a maximum of four 16-bit transfers initiated by the Burst Control Unit every SYSCLK cycle (two DCLK cycles) to 64 input flip-flops for rALU-Read operations, and to switch 64 output flip-flops to four 16-bit transfers for rALU-Write operations. The ReadStrobe (RSx) and WriteStrobe (WSx) signals distinguish between read and write operations to or from the two MDRAM banks.
The MDRAM interface control must select between two 16-bit registers to form the rALU-Write operation and distribute the rALU-Read transmission to the input registers. After a rALU-Write transmission is completed, a new set of values is loaded into the output registers.
The figure "The KressArray emulator" shows the KressArray emulator [Zim99]. It can be configured to implement different KressArray architectures and is especially designed for testing applications. In the MoM-PDA environment it may be used to perform computations.