Instruction systolic array (ISA)Application programs 
The standard algorithm for matrix multiplication is the following, here described for nnmatrices
Matrix multiplication algorithm  
Input:  nnmatrices A = (a_{i,j}) and B = (b_{i,j}) 
Output:  nnmatrix C = A·B 
Method: 

The following program shows the first iteration of an ISA program for multiplication of two nnmatrices. The first column of matrix A is input at the left border of the array, the first row of matrix B is input at the upper border.
The first instruction, denoted by a solid arrow symbol, is C,R0:=CW. This instruction reads the contents of the communication register of the western neighbour and writes it simultaneously into the own communication register C and into register R0. In each processor, the contents of register R0 (and later, the contents of R1 and R2) is shown at the top, the contents of the communication register at the bottom. The second instruction is C:=CN. It reads the contents of the communication register of the northern neighbour and writes it into the own communication register.
On completion of the input each processor (i,j) has matrix elements a_{i,1} and b_{1,j} available and can compute the inner product step c_{i,j} := c_{i,j} + a_{i,k}· b_{k,j}, here for k=1.
This is done by the next two instructions. The third instruction R1:=R0*C multiplies the two values a_{i,k} and b_{k, j} and stores the result (here denoted as d_{i,j}). The fourth instruction R2:=R2+R1 accumulates the products a_{i,k}·b_{k,j} in c_{i,j}.
Figure 1: Matrix multiplication step 
In the next iteration step, the second column of matrix A is input at the left border of the array, the second row of matrix B is input at the upper border, and so on. After n iteration steps, the product of the two matrices is accumulated as values c_{i,j} in register C2.
Each iteration step consists of a constant number of instructions, thus the entire matrix multiplication takes Θ(n) steps .
In the programming language Laisa the complete program is expressed as follows:
procedure matrix_product(n : integer); begin(n) for k:=1 to n do begin < set CW, C0; 1*; 1* >; { C,R0:=CW } < set CN, C; 1*; 1* >; { C:=CN } < mult R0, C, R1; 1*; 1* >; { R1:=R0*C } < add R2, R1, R2; 1*; 1* >; { R2:=R2+R1 } end end; 
In the actual implementation of the ISA, a floating point number is represented by 80 bits using 5 registers, each of 16 bit length. Then of course, input of a column or row takes 5 instructions. A floating point multiplication takes 16 instructions, and a floating point addition takes 18 instructions, including a normalization.
Altogether, multiplication of two floating point nnmatrices takes 44n instructions.
The Warshall algorithm computes the transitive closure of the edge relation of a graph with n vertices.
Warshall algorithm  
Input:  adjacency matrix A = (a_{i,j}) of a graph G with n vertices 
Output:  adjacency matrix A^{+} of G^{+} 
Method: 

The Warshall algorithm can be mapped efficiently to the ISA in the following way [Lan 88].
Let a_{i,j} be stored in processor (i,j). In order to perform the basic operation for the first iteration step (k = 1), each processor (i,j) needs the following data items:
The last two data items can be distributed to the processors by a broadcast in horizontal and vertical direction, respectively.
But, however, in the subsequent iteration steps, a_{i,k} has to move to all processors (i,j). This can be done by a broadcast only for those processors (i,j) where j>k, i.e. where the data item has to move to the right. If j<k, the data item has to move to the left, and this cannot be done by a broadcast. A broadcast operation can only be done from left to right with a constant number of instructions.
The same problem occurs for data items a_{k,j} that are needed in all processors of column j.
In order to be able to perform broadcast operations in each iteration step, a ring shift of the data items a_{i,j} in horizontal as well as in vertical direction is applied after each iteration step. This assures that data item a_{i,k} is in the first column when it its needed in all columns, and data item a_{k,j} is in the first row when it is needed in all rows. Then, these data items can be distributed by broadcast operations.
The following program shows how the data items a_{i,j} are permuted among the processors (i,j). After n such ring shift operations, i.e. at the end of the n iterations of the algorithm, data are again in their original places.
Figure 2: Ring shift in horizontal and vertical direction 
The following Laisa program realizes the transitive closure algorithm.
procedure transitive_closure(n : integer); begin(n) for k:=1 to n do begin < set C, R0; 1*; 1* >; { R0:=C } broadcast_row(n, 1*); < set C, R1; 1*; 1* >; { R1:=C } < set R0, C; [1]; 1* >; { C:=R0 } broadcast_col(n, 1*); < and R1, C, R1; 1*; 1* >; { R1:=R1 and C } < or R0, R1, C; 1*; 1* >; { C:=R0 or R1 } ringshift_row(n, 1, 1*); ringshift_col(n, 1, 1*); end end; 
Next: [Program transformations] or 
H.W. Lang FH Flensburg lang@fhflensburg.de Impressum © Created: 17.05.2001 Updated: 11.10.2008 