# Instruction systolic array (ISA)

## Application programs ### Matrix multiplication

The standard algorithm for matrix multiplication is the following, here described for n × n-matrices

 Matrix multiplication algorithm Input: n × n-matrices A = (ai,j) and B = (bi,j) Output: n × n-matrix C = A·B Method: for all (i, j) with i, j {1, ..., n} for k = 1, ..., n ci,j := ci,j  +  ai,k · bk,j

The following program shows the first iteration of an ISA program for multiplication of two n × n-matrices. The first column of matrix A is input at the left border of the array, the first row of matrix B is input at the upper border.

The first instruction, denoted by a solid arrow symbol, is C,R0:=CW. This instruction reads the contents of the communication register of the western neighbour and writes it simultaneously into the own communication register C and into register R0. In each processor, the contents of register R0 (and later, the contents of R1 and R2) is shown at the top, the contents of the communication register at the bottom. The second instruction is C:=CN. It reads the contents of the communication register of the northern neighbour and writes it into the own communication register.

On completion of the input each processor (i,j) has matrix elements ai,1 and b1,j available and can compute the inner product step ci,j := ci,j + ai,k· bk,j, here for k=1.

This is done by the next two instructions. The third instruction R1:=R0*C multiplies the two values ai,k and bk, j and stores the result (here denoted as di,j). The fourth instruction R2:=R2+R1 accumulates the products ai,k·bk,j in ci,j.

 Enable Java to see an animation of this figure Figure 1: Matrix multiplication step

In the next iteration step, the second column of matrix A is input at the left border of the array, the second row of matrix B is input at the upper border, and so on. After n iteration steps, the product of the two matrices is accumulated as values ci,j in register C2.

Each iteration step consists of a constant number of instructions, thus the entire matrix multiplication takes Θ(n) steps .

In the programming language Laisa the complete program is expressed as follows:

 ```procedure matrix_product(n : integer); begin(n) for k:=1 to n do begin < set CW, C0; 1*; 1* >; { C,R0:=CW } < set CN, C; 1*; 1* >; { C:=CN } < mult R0, C, R1; 1*; 1* >; { R1:=R0*C } < add R2, R1, R2; 1*; 1* >; { R2:=R2+R1 } end end; ```

In the actual implementation of the ISA, a floating point number is represented by 80 bits using 5 registers, each of 16 bit length. Then of course, input of a column or row takes 5 instructions. A floating point multiplication takes 16 instructions, and a floating point addition takes 18 instructions, including a normalization.

Altogether, multiplication of two floating point n × n-matrices takes 44n instructions.

### Transitive closure

The Warshall algorithm computes the transitive closure of the edge relation of a graph with n vertices.

 Warshall algorithm Input: adjacency matrix A = (ai,j) of a graph G with n vertices Output: adjacency matrix A+ of G+ Method: for k = 1, ..., n for all (i, j) with i, j {1, ..., n} ai,j := ai,j  or   ai,k and ak,j

The Warshall algorithm can be mapped efficiently to the ISA in the following way [Lan 88].

Let ai,j be stored in processor (i,j). In order to perform the basic operation for the first iteration step (k = 1), each processor (i,j) needs the following data items:

• ai,j  from the processor itself,
• ai,1  from the first column of the same row, and
• a1,j  from the first row of the same column.

The last two data items can be distributed to the processors by a broadcast in horizontal and vertical direction, respectively.

But, however, in the subsequent iteration steps, ai,k has to move to all processors (i,j). This can be done by a broadcast only for those processors (i,j) where j>k, i.e. where the data item has to move to the right. If j<k, the data item has to move to the left, and this cannot be done by a broadcast. A broadcast operation can only be done from left to right with a constant number of instructions.

The same problem occurs for data items ak,j that are needed in all processors of column j.

In order to be able to perform broadcast operations in each iteration step, a ring shift of the data items ai,j in horizontal as well as in vertical direction is applied after each iteration step. This assures that data item ai,k is in the first column when it its needed in all columns, and data item ak,j is in the first row when it is needed in all rows. Then, these data items can be distributed by broadcast operations.

The following program shows how the data items ai,j are permuted among the processors (i,j). After n such ring shift operations, i.e. at the end of the n iterations of the algorithm, data are again in their original places.

 Enable Java to see an animation of this figure Figure 2: Ring shift in horizontal and vertical direction

The following Laisa program realizes the transitive closure algorithm.

 ```procedure transitive_closure(n : integer); begin(n) for k:=1 to n do begin < set C, R0; 1*; 1* >; { R0:=C } broadcast_row(n, 1*); < set C, R1; 1*; 1* >; { R1:=C } < set R0, C; ; 1* >; { C:=R0 } broadcast_col(n, 1*); < and R1, C, R1; 1*; 1* >; { R1:=R1 and C } < or R0, R1, C; 1*; 1* >; { C:=R0 or R1 } ringshift_row(n, 1, 1*); ringshift_col(n, 1, 1*); end end; ```

 Next:   [Program transformations]   or  H.W. Lang   Hochschule Flensburg   lang@hs-flensburg.de   Impressum   Datenschutz   ©   Created: 17.05.2001   Updated: 20.06.2018 