Sorting on twodimensional processor arraysSorting algorithm of Schnorr and Shamir 
The algorithm 3nsort of Schnorr and Shamir [SchSh 86] is more complicated than e.g. rotatesort or 4way mergesort. However, with a time complexity of 3n + o(n) it is optimal, since 3n is a lower bound for sorting on a twodimensional processor array.
The n×narray is decomposed into vertical slices, horizontal slices, and blocks. A vertical slice is an n×n^{3/4}subarray, a horizontal slice is an n^{3/4}×nsubarray, and a block is an n^{3/4}×n^{3/4}subarray (Figure 1).
 
Figure 1: Vertical slices (a), horizontal slices (b) and blocks (c)  
The entire array as well as subarrays (blocks, slices) are sorted in snakelike order.
The algorithm of Schnorr/Shamir uses the following operation unshuffle.
A kway unshuffle operation is a permutation that corresponds to dealing n cards to k players (k is a divisor of n).
Example: Permutation 4way unshuffle of the numbers 0, ..., 11
0  1  2  3  4  5  6  7  8  9  10  11 
0  4  8  1  5  9  2  6  10  3  7  11 
Player 1 receives cards 0, 4 and 8, player 2 receives cards 1, 5 and 9 etc.
In algorithm 3nsort an n^{1/4}way unshuffle of the elements of the rows is performed.
Algorithm 3nsort  
Input:  unsorted n×narray of data 
Output:  the sorted n×narray 
Method: 

Sorting in alternating direction means to sort each row i with i even into ascending order, and each row i with i odd into descending order (i {0, ..., n1}).
The correctness of 3nsort is shown again (in a slightly different way as in [SchSh 86]) using the 01principle. In the following figures zeroes are drawn white and ones gray.
Definition: A row is called clean, if it consists of zeroes or ones only, otherwise it is called dirty. A dirty row in a sorted array is called incomplete.
Definition: The number of ones in an r × ssubarray is called the weight of the subarray.
In particular, this definition refers to columns and blocks.
Definition: A maximal connected region along the sorting order that starts with a one and ends with a zero is called an unsorted zone.
A sequence containing an unsorted zone starts with some 0's, then comes a mix of 1's and 0's, and then may come 1's.
After sorting the blocks each block has at most one incomplete row (Figure 2a).
The operation n^{1/4}way unshuffle distributes the columns of a block to all n^{1/4} blocks of the same horizontal slice in a roundrobin way. If a block has an incomplete row, some blocks receive a one more from this block than others. Altogether, a block can receive at most n^{1/4} ones more than any other, i.e. the weights of any two blocks in a horizontal slice can differ by at most n^{1/4} (Figure 2b).
After sorting the blocks again each block contains at most one incomplete row (Figure 2c).
 
Figure 2: Situations during the execution of the algorithm  
After sorting the columns each vertical slice has at most n^{1/4} dirty rows (the incomplete rows of its n^{1/4} blocks).
Moreover, the weights of any two vertical slices differ by at most n^{1/4} · n^{1/4} = n^{1/2} (Figure 2d).
After sorting the vertical slices all vertical slices have almost the same number of complete 1rows: the difference can be at most 1. Since vertical slices have a width of n^{3/4}, a weight difference of n^{1/2} can contribute to at most one additional complete 1row.
The region of length n^{1/2} on the snake in each vertical slice that can contain these additional ones is called the critical region (drawn in red in Figure 3a, not in correct scale).
 
Figure 3: Critical regions in vertical slices (a) before Step 6 and (b) after Step 6  
In the entire array there are at most 2 dirty rows (Figure 2e).
In Step 6 the two last dirty rows are sorted in alternating direction. Possibly still unsorted are the at most n^{1/4} · n^{1/2} = n^{3/4} elements from the critical regions (Figure 3b / Figure 2f). These elements form an unsorted zone of length at most n^{3/4}.
In Step 7 the unsorted zone of length at most n^{3/4} is sorted according to the snake.
Due to the 01principle, the algorithm sorts every arbitrary data, because it sorts all 01sequences.
An analysis of the algorithm yields the following:
Step 1:  sorting the blocks  O(n^{3/4})  
Step 2:  unshuffle along the rows  n  
Step 3:  sorting the blocks  O(n^{3/4})  
Step 4:  sorting the columns  n  
Step 5:  sorting the vertical slices  O(n^{3/4})  
Step 6:  sorting the rows in alternating direction  n  
Step 7:  n^{3/4} steps of oddeven transposition sort  O(n^{3/4})_{ }  
altogether:  3n + O(n^{3/4}) 
The blocks can be sorted using any linear sorting algorithm, e.g. 4way mergesort.
In Step 5, the vertical slices can be sorted in time O(n^{3/4}), because they contain a region of only n^{1/4} dirty rows (e.g. by sorting the blocks and subsequently sorting the blocks vertically overlapping by n^{1/4} rows).
The following simulation shows the execution of algorithm 3nsort with a 256×256array of zeroes and ones. For better illustration some operations are shown sequentially that would be performed in parallel on a twodimensional processor array.
The algorithm 3nsort of Schnorr/Shamir for sorting an n×narray has a time complexity of 3n + O(n^{3/4}). It is asymptotically optimal as well as regarding the constant, since 3n is a lower bound for sorting on an n×narray. The algorithm of Schnorr/Shamir is simpler than the algorithm of Thompson/Kung [TK 77] which is also optimal with a time complexity of 3n + O(n^{2/3}·log(n)).
[SchSh 86]  C.P. Schnorr, A. Shamir: An Optimal Sorting Algorithm for MeshConnected Computers. Proc. of the 18th ACM Symposium on Theory of Computing, 255261 (1986)  
[TK 77]  C.D. Thompson, H.T. Kung: Sorting on a MeshConnected Parallel Computer. Communications of the ACM, 20, 4, 263271 (1977) 
Next: 
H.W. Lang Hochschule Flensburg lang@hsflensburg.de Impressum © Created: 23.05.2001 Updated: 05.06.2016 