# Proving 50-Year-Old Sorting Networks Optimal: Part 2

In the previous post I gave an overview of sorting networks, the problem of finding minimal-size sorting networks and the state of the art at the time I started looking into this. Here I will begin describing my method for finding minimal-size sorting networks. With that method I was able to prove for the first time that the smallest known 11 and 12-channel sorting networks are in fact of minimal size.

I wrote up a detailed description including proofs as part of my paper on this. For this blog, I want to give a more concise overview that focuses on the main ideas.

If you haven’t read the previous post yet, I recommend that you do so before continuing here, as I will build on the the method by M. Codish, L. Cruz-Filipe, M. Frank, P. Schneider-Kamp [1] that I described there.

## Partial sorting networks

Previously I mentioned that I came up with my method by combining a computer search—as used before—with the idea behind Van Voorhis’s bound which so far was only used separately.

Before taking a detailed look at Van Voorhis’s bound, I want to revisit the generate-and-prune approach by Codish et al. [1] and present some of the sorting network properties it uses—but in a slightly different way. Initially, I did this to gain a better understanding of how generate-and-prune behaves, but later it turned out to be exactly what I needed to incorporate the idea behind Van Voorhis’s bound into a similar search procedure.

For the generate-and-prune approach, we start building a sorting network from the empty comparator network by appending one comparator at a time. For each sorting network $c$ we encounter, we then use the set of possible zero-one output sequences $\text{outputs}(c)$ to compare candidates and exclude some of them. At no point in the algorithm, though, do we care about what $c$ looks like apart from $\text{outputs}(c)$.^{1}

I thought it would be a good idea to make this explicit. For that I introduced the notion of a *partial* sorting network on a set of zero-one sequences. As we will mostly work with zero-one sequences, I will use the term *sequence set* to refer to a set of such sequences, all having the same length.

Given a sequence set $X$, a comparator network $c$ is a partial sorting network on $X$ if and only if for every sequence $x \in X$, the network $c$ produces a sorted output when applied to $x$.^{2} For the sequence set $B^n$ of all length-$n$ sequences, a partial sorting network on $B^n$ is exactly the same as an $n$-channel sorting network. On the other end, any comparator network is a partial sorting network on the empty set $\emptyset$.

As an example for a partial sorting network in between those extremes consider this network:

The sequence $(0, 0, 1, 0)$ will pass through that network unchanged and is not sorted, so we don’t have a sorting network. But there is a set of 12 sequences, depicted below, that all will be sorted by this network, making it a partial sorting network on that sequence set.

We can also define the minimal size for partial sorting networks. Where $s(n)$ is the size of the smallest $n$-channel sorting network, we use $s(X)$ to denote the size of the smallest partial sorting network on $X$. For example $s(\emptyset) = 0$ and $s(B^n) = s(n)$. We use the same syntax $s(c)$ to refer to the size of a specific sorting network $c$.

So when the argument is an integer, $s(n)$ is the minimal size among $n$-channel sorting networks, when the argument is a sequence set $s(X)$ is the minimal size among partial sorting networks on $X$ and when the argument is a sorting network $s(c)$ is its size.

If we want to check whether a comparator network is a partial sorting network or want to construct partial sorting networks incrementally, it is again useful to compute the sequence set of possible outputs. Now this depends on the sequence set of inputs we consider. Given the inputs $X$ and a comparator network $c$, I use $X^c$ to denote the sequence set of possible outputs of $c$ when restricting the input to the sequences in $X$.

That choice of notation might seem a bit surprising at first, but it mirrors the fact that taking the outputs of one comparator network as inputs for another is the same as concatenating both networks. As we write the operations of a network from left to right, we have ${(X^c)}^d = X^{cd}$ for all sequence sets $X$ and all comparator networks $c$ and $d$.^{3}

## A Useful Recurrence

After defining partial sorting networks and extending our definition of minimal size to them, we can give a very useful recurrence for $s(X)$, the minimal size of a partial sorting network on the sequence set $X$. The following equation holds, as long as $s(X) > 0$, i.e. as long as there is no network using only unconditional exchanges that can sort all sequences in $X$.

$s(X) = 1 + \min_{1 \le i < j \le n} s(X^{[i, j]})$

Here $[i, j]$ ranges over all comparators that can be present in a standard form network. In the previous post we saw that any sorting network can be converted into standard form without changing the number of comparators. The same is almost true for any comparator network. We can orient all comparators $[i, j]$ so that $i < j$, but we cannot always get rid of all unconditional exchanges. What we can do, using the same rewrite rules as before, is to make sure that all comparators precede all unconditional exchanges. We’re going to relax our definition of standard form to allow such trailing unconditional exchanges as long as there is no equivalent comparator network without. This doesn’t change what standard form means when considering sorting networks, but extends its definition to cover all comparator networks and thus also partial sorting networks.

Now let $c$ be a standard form, minimal size partial sorting network on $X$. As conversion to standard form doesn’t change the size, there is such a minimal size network in standard form. We are still assuming $s(X) > 0$, so $c$ has at least one comparator. This allows us to write $c$ as $[k, l]c'$ where $[k, l]$ is the first comparator of $c$ and $c'$ is the remaining suffix.

We also get $s(X) = s(c) = 1 + s(c') = 1 + s(X^{[k, l]})$. To see that this holds, we need to convince ourselves that $c'$ is a minimal size partial sorting network on $X^{[k, l]}$. It is a partial sorting network on $X^{[k, l]}$ as we know that $X^c$ is sorted so $X^{[k, l]c'} = (X^{[k, l]})^{c'}$ is also sorted. If there would be a smaller partial sorting network $d$ on $X^{[k, l]}$ we could prepend $[k, l]$ and construct a partial sorting network on $X$ that is smaller than $c$, but $c$ is of minimal size for $X$, so $c'$ must be of minimal size for $X^{[k, l]}$.

We also have $s(X) \le 1 + s(X^{[i, j]})$ for all $i, j$. Again, for a smaller sorting network $d$ on $X^{[i, j]}$ we could prepend $[i, j]$ and construct a partial sorting network on $X$ that is smaller than $c$. This alone gives us $s(X) \le 1 + \min_{1 \le i < j \le n} s(X^{[i, j]}).$ If we then consider that among the choices for $[i, j]$ we will find $[k, l]$ for which we have $s(X) = 1 + s(X^{[k, l]})$, we can turn the inequality to an equality and get the recurrence from above.

## Recursive Computation

Given such a recurrence, a natural question is to ask whether we can use it to compute $s(X)$. For now we will ignore any practical concerns like run-time or memory consumption, but even then, that recurrence alone doesn’t quite allow us to compute $s(X)$.

There are two obstacles that we need to overcome: First, the recurrence assumes $s(X) > 0$, so we need a test for that. By definition we have $s(X) = 0$ exactly when $X$ can be sorted using only unconditional exchanges or equivalently by applying a fixed permutation to all sequences in $X$. If that is the case, we can find such a permutation by defining the *position weight* $w(i)$ as the number of sequences $x \in X$ for which $x_i = 1$, i.e. the number of sequences that have a one in the given position. Any permutation $\sigma$ which rearranges the positions such that their weights $w(i)$ are non-decreasing will sort all sequences in a set $X$ with $s(X) = 0$. We can apply this procedure to any set $X$, and then check whether it indeed contains only sorted sequences.^{4}

This gives us a base-case for a recursive computation, but there is a second obstacle: we might run into a cycle when applying the recurrence, causing infinite recursion.^{5} Whenever the recursion ends up in a cycle, we could remove the comparators on that cycle and still get the same set of outputs. This means anything that contains a cycle cannot have minimal size, so excluding those choices from the recurrence would not change the result.

It turns out that we can avoid cycles, by excluding so called *redundant* comparators.^{6} A comparator is redundant with respect to a sequence set $X$ if applying it to $X$ either leaves $X$ unchanged or is equivalent to applying an unconditional exchange to $X$. That definition gives us a direct argument for why such a comparator cannot be part of a minimal size partial sorting network: we can either remove it or replace it with the equivalent exchange, reducing the size by one in either case. I’m calling the comparators $[i, j]$ with $i < j$ that are not redundant with respect to $X$ the successors of $X$ and denote them by $\mathcal U(X)$.

This gives us an improved recurrence $s(X) = 1 + \min_{[i, j] \in\, \mathcal U(X)} s(X^{[i, j]}).$ To me it seemed intuitively true that this is enough to avoid any cycles and this was also backed up by experimental evidence from implementing this. My intuition was that any non-redundant comparator affects at least one pair of sequences in such a way that they both get closer to being in the same order. For a more formal statement and proof of this see Definition 39 and Lemma 40 in my paper.

This removes the last obstacle and—in theory—we can now compute $s(X)$ via $s(X) = \begin{cases} 1 + \displaystyle\min_{[i, j] \in\, \mathcal U(X)}^{\mathstrut} s(X^{[i, j]}) & \text{if } s(X) > 0 \\ 0 \phantom{\displaystyle\min_{\mathstrut}^{\mathstrut}} &\text{otherwise.} \end{cases}$

## Outlook

In practice though, using a direct implementation of this recursive formulation is not at all effective at computing $s(X)$. Apart from avoiding redundant comparators, this explores the same tree that we already saw in the last post. The difference is that this does a depth first search instead of a breadth first search, which is the reason why we needed to avoid cycles in the first place. Unlike a depth first search, the breadth first search does not explore candidate networks in order of increasing size, so it cannot terminate as soon as the first complete (partial) sorting network is found, but has to explore the whole tree. Additionally we already saw that the tree contains a lot of redundant subtrees, as exploited in the prune step of the generate-and-prune method.

So what did we gain from doing this and why did I explore this in the first place? The answer to that is: dynamic programming.

The final recursive formula shows that $s(X)$ can be computed from the values of several $s(Y_0), \ldots, S(Y_k)$, where each $Y_i$ is smaller in some sense. In the context of dynamic programming, a problem with that property is said to have *optimal substructure*. In general, dynamic programming is a—not very descriptive—umbrella term for exploiting *overlap* between the subproblems of a problem that has optimal substructure. A well-known application of that idea, arguably the simplest, is memoization, which caches the results of all subproblem computed so far and reuses that result if the same subproblem is encountered again.

Dynamic programming can be much more though: we can change what it means for subproblems to overlap, we do not have to use depth first search, we can adapt the exploration order depending on already computed subproblems, we can compute sound approximations of subproblems and iteratively refine those, etc.

This gives us a framework that allows us to combine the idea behind Van Voorhis’s bound with some of the ideas behind the pruning done by the generate-and-prune method. Exploring this in more detail will be part of the next posts in this series. I’m not sure how much I will cover on this blog and in what time frame—that depends on how much time I can find for writing—but you can already find all the details in the paper. Also, let me know if there is something specific you would like to see covered.

## References

*Journal of Computer and System Sciences*82, 3 (May 2016), 551–563.↩︎↩︎

*The Art of Computer Programming: Volume 3: Sorting and Searching*(2nd ed.). Addison-Wesley.↩︎

We also need to keep track of $c$’s length, but as we generate candidates of stepwise increasing length, that happens implicitly.↩︎

In the last post I used uppercase letters to denote sorting networks, following the paper by Codish et al. From this post on I will use lowercase letters, as I make frequent use of both sorting networks and sequence sets, wanting to distinguish them. Using uppercase letters for sequences sets felt more natural to me.↩︎

This also matches the conventions often used in computational group theory where (semi-)group actions are from the right and written as exponentials.↩︎

If we are concerned about the performance of these checks, we can also use the fact that any sequence set $X$ with $s(X) = 0$ cannot contain two distinct sequences with the same number of ones—any permutation will leave one of the unsorted. There are only $n+1$ possible values for the number of ones in a sequence, so by the pigeonhole principle when $|X| > n + 1$ we have two such sequences and thus $s(X) > 0$. Otherwise, in general, we still need to perform the described test, as there are sequence sets with $|X| \le n + 1$ and $s(X) > 0$. Applying additional constraints to the sequence sets we consider allows us to avoid that check even in that case. In the paper I’m doing this by defining a family of

*well-behaved*sequence sets.↩︎We cannot have infinite recursion without cycles, as there are only a finite number of sequence sets of a given sequence length.↩︎

This is the same notion as described in exercise 51 of [2], just recast in terms of partial sorting networks.↩︎