__Reprint from:__ K. Bromley, S.Y. Kung, E. Swartzlander (eds.): Proc. Int. Conf. on Systolic Arrays, San Diego, 295-304 (1988) / Tech. Report Nr. 8718, Institut für Informatik, University of Kiel (1987)

The instruction systolic array (ISA) is an array processor architecture, which is characterized by a systolic flow of instructions (instead of data as in standard systolic arrays). This paper describes how the well-known Warshall algorithm for computing the transitive closure of a directed graph with *n* nodes can be implemented on an *n*×*n*-ISA. For problem sizes *m* ≤ *n* the time complexity of this implementation is in O(*m*). For larger problem sizes a decomposition technique is given to solve the problem in time O(*m*^{3}/*n*^{2}), whereby the maximum speedup of O(*n*^{2}) is attained. Furthermore, a generalization of the transitive closure algorithm is implemented to solve other path problems, such as the shortest path problem.

Finding the transitive closure *R*^{+} of a binary relation *R* over a finite set *M* is fundamental in computing. The well-known Warshall algorithm [13, 10] solves this problem on a sequential machine in O(*n*^{3}) steps, where *n* = |*M*|.

Several algorithms for parallel computers and especially for systolic arrays, which solve the transitive closure problem or related problems in time O(*n*), are known. Guibas, Kung and Thompson [1] give a systolic algorithm for computing the transitive closure. Systolic solutions for the transitive closure and the shortest path problem are presented by Kung, Lo and Lewis [3]. The problem of finding the connected components of a graph is solved by Hambrusch [2] on a mesh-connected processor array. Robert and Trystram [8] and Rote [9] give systolic arrays for the algebraic path problem, which is a generalization of the transitive closure problem.

In this paper, a parallel implementation of the Warshall algorithm on an instruction systolic array (ISA) is presented. The instruction systolic array has been proposed recently in [5] as a parallel processor architecture which is on the one hand suitable for VLSI, since it is a systolic architecture,and which is on the other hand flexible enough for efficiently executing a large variety of different algorithms belonging to quite different problem classes (see e.g. [4, 5, 6, 11, 12] for ISA programs for matrix transposition, matrix multiplication, sorting, etc.).

The transitive closure problem may be generalized to the __algebraic path problem__. We give an ISA program implementing a generalized closure algorithm for solving the algebraic path problem.

Finally, a decomposition technique is given in order to map arbitrarily large problem instances onto a processor array of fixed size.

Let *M* be a finite set and *R* a binary relation over *M*. The __transitive closure__ *R*^{+} of *R* is defined as the smallest transitive relation over *M* that contains *R*.

In a graph theoretic interpretation, we have a directed graph *G* = (*V*, *E*) with a set of vertices *V* and a set of edges *E* ⊆ *V*×*V*. For simplicity we assume *V* = {1, ..., *n*}, where *n* ∊ ℕ. The __transitive closure__ of *G* is the graph *G*^{+} = (*V*, *E*^{+}), where an edge (*i*, *j*) is in *E*^{+} iff there exists a directed path from *i* to *j*, i.e. there exists a sequence of vertices *u*_{0}, ..., *u _{t}* with

The __adjacency matrix__ of a graph *G* is a boolean *n*×*n*-matrix *A* = (*a*_{i, j}) whose non-zero entries represent the edges of *G*: *a*_{i, j} = 1 if (*i*, *j*) ∊ E and *a*_{i, j} = 0 otherwise.

Given the adjacency matrix *A* of a graph *G* the Warshall algorithm computes the adjacency matrix *A*^{+} of the transitive closure of *G* by the following method:

FOR k:=1 TO n DO FOR i:=1 TO n DO FOR j:=1 TO n DO a[i,j]:=a[i,j] OR a[i,k] AND a[k,j]

The algorithm constructs a sequence *A*^{(0)}, ..., *A*^{(n)} of adjacency matrices, where *A*^{(0)} = *A* and each *A*^{(k)} represents all paths of *G* containing no intermediate vertices (excluding the endpoints) higher than *k*. For *k* > 0, *A*^{(k)} is obtained from *A*^{(k-1)} by looking at all pairs of vertices (*i*, *j*) if there is already a path from *i* to *j* represented in *A*^{(k-1)}, or if there are paths represented in *A*^{(k-1)} going from *i* to *k* and from *k* to *j*. If so, then (and only then) there exists a path from *i* to *j* with no intermediate vertices higher than *k*, and *a*^{(k)}_{i, j} is set to 1. Since no path can contain a vertex higher than *n*, *A*^{(n)} = *A*^{+}.

If the closure is to be reflexive, the statement

FOR i:=1 TO n DO a[i,i]:=1

has to be added.

Our computer model is the instruction systolic array (ISA) [5]. In an ISA, the instructions are pumped through a processor array in a systolic way, instead of data as in standard systolic arrays (Figure 1). This allows the execution of different algorithms on the same processor array, in contrast to systolic arrays which are special purpose architectures. Furthermore, the instruction stream is combined with an orthogonal stream of selector bits. This allows a very flexible addressing of subsets of the processors. A processor executes its instruction only if its selector bit is "1", otherwise it executes a "no operation" instruction which leaves all data register contents unchanged.

The basic architecture of the ISA is a mesh-connected *n*×*n* array of identical processors (Figure 2), which is synchronized by a global clock. The processors can execute instructions from some instruction set, the execution time is assumed to be the same for all instructions.

Each processor has some data registers including a designated communication register. The communication register of a processor can be read by its four direct neighbors, but each processor can write only into its own communication register. In order to avoid read/write conflicts, during an instruction cycle reading is done before writing.

For input and output of data the open ended processor links at the boundary of the array are used.

The processors do not have their own control stores but are supplied with instructions from outside. Each processor has an instruction register. At the beginning of each instruction cycle, each processor fetches the instruction from the instruction register of its upper neighbor. Since this is done sychronously, rows of instructions are shifted through the processor array from the top to the bottom. The processors in the top row are supplied with instructions from an external memory.

In an analogous way columns of selector bits are shifted through the processor array from left to right.

An ISA program is a sequence *p*^{(1)}, ..., *p*^{(r)} of n-tuples over the instruction set *I* together with a sequence *s*^{(1)}, ..., *s*^{(r)} of n-tuples over {0,1}. For every *i*, *j* ≤ *n* and *t* ≤ *r* the instruction row *p*^{(t)} enters the *i*-th row of the ISA at time *t* + *i* - 1 and the column of selector bits *s*^{(t)} enters the *j*-th column of the ISA at time *t* + *j* - 1. This means, the instruction executed by processor (*i*, *j*) at time *t* is p_{j}^{(t-i+1)} if s_{i}^{(t-j+1)} = 1 and "no operation" otherwise. Execution of the program stops when the last instruction row has entered the first row of the ISA.

Rather than suggested by this definition, ISA programs may be viewed as "diamond-shaped", consisting of one or more diagonals of instructions and selector bits, respectively. This is the case because each diagonal of instructions is combined with its corresponding diagonal of selector bits while it is shifted through the processor array, as shown in Figure 3. The *i*-th entry in a selector diagonal determines if the instructions of the corresponding instruction diagonal will be executed in row *i* of the ISA or not.

The __time__ to execute an ISA program is defined as its length. The __period__ of an ISA program is defined as the number of diagonals it consists of, excluding no-operation diagonals at the beginning or the end.

Let us first consider some program fragments for the ISA that will later on serve as building blocks for the transitive closure algorithm.

All examples in the following are for the case *n* = 4; the generalization for arbitrary *n* is straightforward.

**Input**

By the program diagonal of Figure 4 a data item is input by processor (2,1) and then routed to processor (2,3). The instruction used has the meaning "read the contents of the communication register of the left neighbor and write it into the own communication register". It is assumed that there are input registers at the upper and the left boundary of the processor array instead of the communication registers of the non-existent processors (0, *j*) and (*i*, 0) (*i*, *j* = 1, ..., *n*). Empty boxes denote the no-operation instruction and the selector bit "0", respectively. By combining program diagonals of this kind a program for input of an *n*×*n* matrix of data items across the left boundary of the ISA is achieved (Figure 5).

**Broadcast**

Figure 6 shows a program fragment for "broadcasting" the contents of the communication register of processor (1,1) to all other processors of that row. It is an interesting fact that, although there is no common bus connecting the processors of a row, this is possible with period 1. Similarly, broadcasting from the first row to the rows below is also possible. However, there is no broadcasting in the opposite directions, i.e. from right to left and from the bottom to the top. Routing data across the array in these directions takes period *n*.

**Ring shift**

Simultaneous execution of a "read right" and "read left" instruction in adjacent processors results in an exchange of the contents of the two communication registers. In the example of Figure 7 processors (1,1) and (1,2) exchange their data.

By the program fragment of Figure 8 a ring shift among the communication register contents of the first row of the ISA is achieved. Again it seems surprising that, although there are no wrap-around connections, such a ring shift can be done with constant period. The analogous program for a vertical ring shift is shown in Figure 9; the meaning of the instruction symbols used so far is summarized in Figure 10.

As in the case of broadcasting, ring shifting in the opposite directions can only be done with period Ω(*n*). Broadcasting and ring shifting are the key operations in the following program for the computation of the transitive closure of an adjacency matrix.

**Transitive closure**

Figure 11 shows the ISA program for computing the transitive closure *A*^{+} of a 4×4 adjacency matrix *A* = (*a*_{i, j}) that is stored in the communication registers of the processors.

By the first instruction diagonal the contents of the communication registers C are stored in data registers A. Then, by the second and third instruction diagonal, the contents of the communication registers of the first row of the processor array, i.e. the *a*_{1, j}, *j* = 1, ..., *n*, are broadcast to the subsequent rows and stored intermediately in data registers V. By instruction diagonal 4 the contents of the A-registers of the first column, i.e. the *a*_{i, 1}, *i* = 1, ..., *n*, are loaded into the communication registers and broadcast to all other columns of the processor array.

Now each processor has *a*_{i, j}, *a*_{i, 1} and *a*_{1, j} available (in its A-, C- and V-register, respectively) and can compute the first iteration step of the Warshall algorithm, namely

a[i,j]:=a[i,j] OR a[i,1] AND a[1,j]

This is done by instruction diagonals 5 and 6.

Before computing the second iteration step, the second row and the second column of *A* (the *a*_{2, j} and the *a*_{i, 2} , *i*, *j* = 1, ..., *n*) have to be distributed to all processors of the array.

To do this, first a ring shift in all rows is performed (instruction diagonals 7 and 8), in order to move matrix column *a*_{i, 2} to the first column of the processor array. This is followed by a ring shift in all columns to move row *a*_{2, j} to the first row of the processor array (instruction diagonals 9 and 10).

Now the array is prepared for computing the second iteration step of the Warshall algorithm using the same program. Altogether, *n* repetitions of instruction diagonals 1 through 10 are necessary to compute the transitive closure *A*^{+}, which will finally be stored in the communication registers of the processor array.

The period of the program is 10*n*, its time is 12*n*-2 (additional *n*-1 rows of no-operation instructions at the end of the program are necessary to forward the last "proper" instruction row to the last row of the ISA).

In order to compute the transitive and reflexive closure *A**, the program of Figure 12 has to precede the above program.

The problem of computing the transitive closure of a graph may be viewed as a special case of the algebraic path problem.

Let *G* = (*V*, *E*) be a graph with *V* = {1, ..., *n*} and *E* ⊆ *V*×*V*. We denote by *E** the set of all strings over *E* and by **P**(*E**) the power set of *E**, i.e. the family of all subsets of *E**.

The structure (**P**(*E**), ∪, ·, Ø, {ε}), where ∪ denotes set union, · set concatenation, Ø the empty set and ε the empty string, forms a __closed semiring__, since it satisfies the axioms as e.g. stated in [1]:

1) | (P(E*), ∪, Ø) is a monoid |

(P(E*), ·, {ε}) is a monoid | |

Ø is an annihilator, i.e. X·Ø = Ø·X = Ø for all X ∊ P(E*) | |

2) | ∪ is commutative and idempotent |

3) | · distributes over ∪ |

4) | The countable union of sets X_{i} in P(E*) exists and is unique. Moreover, associativity, commutativity and idempotence apply to infinite as well as finite unions |

5) | · distributes over countably infinite unions as well as finite ones |

**P**(*E**) forms a __free__ closed semiring over *E*′ = {{*e*} | *e* ∊ *E*}, since each mapping *h* from *E*′ into an arbitrary closed semiring (*S*, +, ·, 0, 1) can be extended to a homomorphism *h*′ from **P**(*E**) into *S*. This extension is unique if we choose *h*′(Ø) = 0 and *h*′({ε}) = 1.

We denote by *P*_{i, j} the set of all paths of length greater than 0 between two vertices *i* and *j* of *G*. The sets *P*_{i, j} may be regarded as elements of **P**(*E**). We have indeed defined paths as finite sequences of vertices, but we can identify an arbitrary path *u*_{0}, ..., *u*_{t} where *u*_{r} ∊ *V*, *t* > 0, with the string (*u*_{0}, *u*_{1})(*u*_{1}, *u*_{2})...(*u*_{t-1}, *u*_{t}) over *E*. The path of length 0 between a vertex *v* and itself is identified with the empty string ε, for all *v* ∊ *V*.

The following algorithm computes the sets *P*_{i, j} for all pairs of vertices *i*, *j*. The sets *P*[0, *i*, *j*] are initialized with {(*i*, *j*)}, if (*i*, *j*) ∊ *E*, and with Ø otherwise. Output are the sets *P*[*n*, *i*, *j*] = *P*_{i, j} .

__Algorithm PATHS__

FOR k:=1 TO n DO FOR ALL i,j IN {1,...,n} DO P[k,i,j]:=P[k-1,i,j] ∪ P[k-1,i,k] · P[k-1,k,k]* · P[k-1,k,j]

The operation * is the Kleene closure, defined by *X** = ∪*X*^{i} (*i* = 0, 1, 2, ...) for all *X* ∊ **P**(*E**).

The algorithm constructs sets *P*[*k*, *i*, *j*] of paths between vertices *i* and *j* with no intermediate vertices higher than *k*. *P*[*k*, *i*, *j*] is obtained from *P*[*k*-1, *i*, *j*] by adding to *P*[*k*-1, *i*, *j*] the set of all paths that go from *i* to *k*, then from *k* to *k* (possibly of length 0), and then from *k* to *j*, where in each of these subpaths no vertex higher than *k*-1 occurs. Obviously, this set is obtained by the concatenation of *P*[*k*-1, *i*, *k*], *P*[*k*-1, *k*, *k*]*, and *P*[*k*-1, *k*, *j*].

If paths of length 0 are to be included, then each *P*[0, *i*, *i*] has to be initialized with {ε, (*i*, *i*)} if (*i*, *i*) ∊ *E* and with {ε} otherwise.

Algorithm PATHS is very similar to Kleene's algorithm for determining the regular set accepted by a finite automaton, to the algorithm for solving path problems given in [1], or to the Warshall algorithm.

In fact, by appropriately choosing a closed semiring (*S*, +, ·, 0, 1) and a mapping *h* from *E*′ into *S*, called __labeling function__, algorithm PATHS can be modified in order to solve a large variety of apparently different problems. This is done by simply initializing the variables *a*[0, *i*, *j*] with *h*({(*i*, *j*)}) if (*i*, *j*) is an edge and with 0 otherwise, and by then reinterpreting the operations ∪, · and * by the corresponding operations +, · and * of *S*. In the following, we write *h*(*i*, *j*) instead of *h*({(*i*, *j*)}).

Thus, the modified algorithm operates on the homomorphic images of the original sets *P*[*k*, *i*, *j*] under the unique homomorphism *h*′ determined by *h* and computes, for all pairs of vertices (*i*, *j*),

__Generalized closure algorithm__

FOR k:=1 TO n DO FOR ALL i,j IN {1,...,n] DO a[k,i,j]:=a[k-1,i,j] + a[k-1,i,k] · a[k-1,k,k]* · a[k-1,k,j]

The problem of determining the values *h*′(*P*_{i, j}) for all *i*, *j* in *V* is termed the __algebraic path problem__.

__Example 1:__

With the closed semiring ({0,1}, OR, AND, 0, 1) and the labeling function *h* with *h*(*i*,*j*) = 1 for all (*i*,*j*) ∊ *E* the algorithm computes the transitive closure of *G*. The term *a*[*k*-1, *k*, *k*]* can be omitted, since 0* = 1* = 1, the neutral element with respect to the AND operation.

__Example 2:__

With the closed semiring (**R**, min, +, ∞, 0), where **R** is the set of the nonnegative real numbers including ∞, and a function *h* that assigns a length to each edge, the algorithm computes the length of the shortest paths between all vertices *i*, *j*. Again the term *a*[*k*-1, *k*, *k*]* can be omitted, since it is always 0.

The ISA program of Figure 13 realizes the generalized closure algorithm, provided the elements of the underlying closed semiring can be represented and the operations +, · and * can be implemented properly.

__Example 3:__

This example will be used in the next section. As closed semiring we take (**B**^{n×n}, +, ·, **O**, **I**), where **B**^{n×n} is the set of all *n*×*n* matrices over the closed semiring ({0,1}, OR, AND, 0, 1), + is boolean matrix addition (elementwise OR in this case) and · is boolean matrix multiplication (with the elementary operations OR and AND). **O** is the matrix with all elements 0 and **I** is the identity matrix whose elements are all 0 except of the elements in the main diagonal which are 1.

The operation * on (**B**^{n×n}, +, ·, **O**, **I**) has been defined as

*A** = **I** + *A* + *A*·*A* + ... for all *A* ∊ **B**^{n×n}

If *A* is considered as an adjacency matrix of a graph *G*, then the right side of the above equation is the sum of the adjacency matrices representing all paths of *G* of length 0, 1, 2, ..., which is nothing else than the transitive and reflexive closure of *A*. An example of a labeling function with elements from **B**^{n×n} will be given in the next section.

We have to consider the case where the problem size, which will be denoted by *m* from now on, does not match the fixed size *n* of the processor array, may it be larger or smaller.

If the problem is smaller, the ISA program can easily be adapted so that it uses a region in the upper left corner of the processor array only (Figure 14). As a consequence, the execution time drops to O(*m*).

If the problem size *m* is larger than *n* the problem has to be partitioned into subproblems of size ≤ *n*. The following method was outlined in [9].

For simplicity, we assume *m* = *cn*, *c* ∊ {2, 3, ...}. Now the vertex set *V* of *G* can be partitioned into *P* = {*V*_{1}, *V*_{2}, ..., *V*_{c}}, where *V*_{1} = {1, ..., *n*}, *V*_{2} = {*n*+1, ..., 2*n*}, ..., *V*_{c} = {(*c*-1)*n*+1, ..., *cn*}.

Consider the quotient graph *G*/*P* = (*P*, *E*/*P*), where *E*/*P* = { (*V _{i}*,

Since the edge labels are taken from a closed semiring, the generalized closure algorithm is applicable. The operations occuring in this case are multiplication, addition and transitive and reflexive closure of boolean *n*×*n* matrices.

**Theorem**

Let^{ }*G*/*P* and *h* be defined as above. Then the matrices *S*[*c*, *i*, *j*] computed by the generalized closure algorithm exactly represent the paths between the vertices of *V _{i}* and those of

__Proof:__ We show directly by induction on *k* that each path in will be represented by an entry in one of the matrices *S*[*c*, *i*, *j*].

Let *u*_{0}, ..., *u _{r}* (r > 1) be a path in

If *r* = 1 then (*u*_{0}, *u*_{1}) is represented by an entry in *S*[1, *i*, *j*].

Thus, by the assignment

S[1,i,j] := S[0,i,j] + S[0,i,1] · S[0,1,1]* · S[0,1,j]

*u*_{0}, ..., *u _{r}* will be represented by an entry in

Now let any path containing only intermediate vertices belonging to *V*_{1} ∪ ... ∪ *V _{k}* be represented by an entry in some

If *r* = 1 or there occurs no vertex belonging to *V _{k}* in

Thus, by the assignment

S[k,i,j] := S[k-1,i,j] + S[k-1,i,k] · S[k-1,k,k]* · S[k-1,k,j]

*u*_{0}, ..., *u _{r}* will be represented by an entry in

Each path in *G* will eventually be represented by an entry in one of the matrices *S*[*c*, *i*, *j*]. Conversely, it can be shown by a backward induction that each entry in the matrices *S*[*c*, *i*, *j*] represents a path of *G*.

The time complexity of the partitioned transitive closure algorithm is in O(*c*^{3}*n*), since the sequential closure algorithm consitsts of *c*^{3} steps, each involving input and output of operands to the ISA, and multiplication, addition and transitive and reflexive closure of *n*×*n* matrices, which can be done on the ISA in time O(*n*) (see [7] for a program for matrix multiplication). Since *c*^{3}*n* = *m*^{3}/*n*^{2}, the optimal speedup of O(*n*^{2}) is attained.

We have presented a program implementing the Warshall algorithm on the instruction systolic array. The simplicity of this implementation emphasizes the practical significance of the ISA as a flexible array processor architecture. Especially the good routing capabilities of the ISA (broadcast and ring shift with constant period) provide for elegant and efficient solutions for the problem of interprocessor communication.

Moreover, we have given an ISA program for solving the algebraic path problem.

Finally, we have shown how the transitive closure problem can be decomposed into subproblems that can be solved on an ISA of fixed size. It should be mentioned that the algebraic path problem can of course be decomposed in an analogous way.

All programs can also very easily be modified in order to run on a simpler variant of the ISA, the single instruction systolic array (SISA) [7].

[1] A.V. Aho, J.E. Hopcroft, J.D. Ullman: The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, MA (1974)

[2] L.J. Guibas, H.T. Kung, C.D. Thompson: Direct VLSI Implementation of Combinatorial Algorithms. Proc. Caltech Conf. on VLSI, Californian Institute of Technology, Pasadena, 509-525 (1979)

[3] S.E. Hambrusch: VLSI Algorithms for the Connected Component Problem. SIAM J. Computing, Vol. 12, Nr. 2 (1983)

[4] S.-Y. Kung, S.C. Lo, P.S. Lewis: Optimal Systolic Design for the Transitive Closure and the Shortest Path Problems. IEEE Trans. on Comput., Vol. C-36, 5, 603-614 (1987)

[5] M. Kunde, H.-W. Lang, M. Schimmler, H. Schmeck, H. Schröder: The Instruction Systolic Array and its Relation to Other Models of Parallel Computers. Proc. Int. Conf. Parallel Computing 85, North-Holland, Amsterdam (1986)

[6] H.-W. Lang: The Instruction Systolic Array - a Parallel Architecture for VLSI. Integration, the VLSI journal 4, 65-74 (1986)

[7] H.-W. Lang: ISA and SISA: Two Variants of a General Purpose Systolic Array Architecture. Proc. Second Int. Conf. on Supercomputing, Vol. 1, 460-465 (1987)

[8] Y. Robert, D. Trystram: Systolic Solution of the Algebraic Path Problem. In: W. Moore, A. McCabe, R. Urquhart (eds.): Systolic Arrays. Adam Hilger, Bristol, 171-180 (1986)

[9] G. Rote: A Systolic Array Algorithm for the Algebraic Path Problem (Shortest Paths; Matrix Inversion). Computing 34, 191-219 (1985)

[10] B. Roy: Transitivite et Connexite. C.R. Acad. Sci. Paris Ser. A-B 249, 216-218 (1959)

[11] H. Schmeck: A Comparison-Based Instruction Systolic Array. In: M. Cosnard, Y. Robert, P. Quinton, M. Tchuente (eds.): Parallel Algorithms and Architectures, North-Holland, Amsterdam, 281-292 (1986)

[12] H. Schröder, E.V. Krishnamurthy: Generalized Matrix Inversion Using Instruction Systolic Arrays. Technical Report, Computer Science Lab., Australian National University, Canberra (1987)

[13] S. Warshall: A Theorem on Boolean Matrices. J. Assoc. Comput. Mach. 9, 11-12 (1962)

[14] U. Zimmermann: Linear and Combinatorial Optimization in Ordered Algebraic Structures. Ann. Discrete. Math. 10, 1-380 (1981)