Blog on Jix' Site
https://jix.one/
Recent content in Blog on Jix' Site, generated by Hugo (gohugo.io). Author: Jannis Harder (me@jix.one).

Refactoring Varisat: 4. Heuristics
https://jix.one/refactoring-varisat-4-heuristics/
Thu, 21 Mar 2019 11:46:30 +0100
<p>This is the fourth post in my <a href="https://jix.one/tags/refactoring-varisat">series about refactoring varisat</a>. In the <a href="https://jix.one/refactoring-varisat-3-cdcl/">last post</a> we saw how conflict driven clause learning works; in this post we’re going to make it fast. To get there we add several smaller features that were already present in varisat 0.1. While some things varisat 0.1 supports are still missing, these are features such as proof generation and incremental solving that don’t affect the solving performance.</p>
<p>This means you can solve some non-trivial instances using varisat 0.2 today. I haven’t made a new release on crates.io yet, but you can install it directly from github (replacing any previous version installed):</p>
<p><code>
cargo install --force --git <a href="https://github.com/jix/varisat">https://github.com/jix/varisat</a>
<br>
varisat some_non_trivial_input_formula.cnf
</code></p>
<p>One part of making the solver faster was low-level optimization of the unit-propagation implementation. I won’t go into the details of that, but you can look at the <a href="https://github.com/jix/varisat/blob/b74767575784b5e93e7f69cc7f96f246634ee65c/varisat/src/prop/long.rs">new code</a>. The bigger part was a set of heuristics that play a big role in making CDCL perform well. Below I will explain each of the heuristics implemented in varisat so far.</p>
<h2 id="branching-variable-selection">Branching: Variable Selection</h2>
<p>Whenever unit propagation finishes while there are still unassigned variables left, we need to make a guess. So far we used the first unassigned variable we found, which is not a very good strategy. It turns out that guesses which quickly produce conflicts help the search, so this is something to optimize for. It’s also important that the strategy is fast to compute.</p>
<p>The heuristic I’ve implemented is called “Variable State Independent Decaying Sum” or VSIDS. The idea is to keep an activity score for each variable which is updated after every conflict. When it’s time to decide on an unassigned variable to branch on, the one with the highest activity score is picked. Originally this heuristic kept scores for each possible literal, i.e. also guessed which polarity to pick. We will see that there is a better heuristic for selecting the polarity, so VSIDS is only used to select a variable.</p>
<p>The updating of activity scores is based on the observation that variables that were involved in recent conflicts are likely to be involved in future conflicts. A variable is involved if it is in the learned clause or was resolved during conflict analysis. There are multiple explanations contributing to this behavior, although I don’t think there is a simple one that can explain how well this strategy works. A part of it certainly is that for a given formula some variables are more likely than others to cause conflicts. On top of that there is an emergent behavior where the SAT solver focuses on a specific part of the search space. By picking active variables, those are more likely to be in future learned clauses, thus more likely to propagate, more likely to be involved in a future conflict and thus more likely to keep a high activity score. Other heuristics described later on also contribute to this.</p>
<p>Initially all activity scores are at zero. On conflict, the scores of the involved variables are incremented by one. This is called bumping. Afterwards the scores of all variables are multiplied by a constant decay factor smaller than 1. This gives more weight to recent bumps.</p>
<p>This is implemented efficiently by doing two things. The first is to keep the unassigned variables in a max-heap ordered by their activity score. This allows us to quickly identify the highest scoring variable. We allow assigned variables in the heap but make sure that no unassigned variable is missing. When we need to select a variable we pop variables until we find an unassigned one. During backtracking we need to re-insert all variables that become unassigned.</p>
<p>The decay operation scales all activities by the same value; this doesn’t change their relative order, so the heap stays valid. When a variable is bumped the relative order can change, so we need to update its position in the heap. This means we need to use a heap that supports updating of keys, which many variants do.</p>
<p>Using a heap to find the maximum score might seem useless when we’re touching every activity score anyway. This is where the second optimization comes in. Instead of multiplying all activity scores we keep a single scaling factor common to all variables. To bump a variable, we add this scaling factor instead of 1. To decay all values we simply divide the common scaling factor by the decay factor. For the heap operations we can ignore the scaling factor as it is the same for all values. This is much faster but will, at some point, overflow the floating point values used. To avoid this, we detect when the scaling factor becomes too large and apply it to all activity scores before resetting it back to 1. This happens rather infrequently.</p>
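<p>As a concrete illustration, here is a minimal sketch of the scaling trick in Rust. The struct name, constants and thresholds are my own illustrative choices, and the heap is omitted entirely, so this is not varisat’s actual implementation:</p>

```rust
/// Minimal sketch of VSIDS activities with a common scaling factor.
struct Vsids {
    activity: Vec<f64>,
    inc: f64,   // amount added on bump; grows as activities "decay"
    decay: f64, // decay factor smaller than 1
}

impl Vsids {
    fn new(vars: usize) -> Self {
        Vsids { activity: vec![0.0; vars], inc: 1.0, decay: 0.95 }
    }

    /// Bump a variable involved in a conflict.
    fn bump(&mut self, var: usize) {
        self.activity[var] += self.inc;
        // Rescale everything before the floats overflow; happens rarely.
        if self.activity[var] > 1e100 {
            for a in &mut self.activity {
                *a *= 1e-100;
            }
            self.inc *= 1e-100;
        }
    }

    /// Decay all activities by growing the common increment instead of
    /// multiplying every stored score.
    fn decay_all(&mut self) {
        self.inc /= self.decay;
    }
}

fn main() {
    let mut v = Vsids::new(3);
    v.bump(0);
    v.decay_all();
    v.bump(1); // this later bump counts more than the earlier one
    assert!(v.activity[1] > v.activity[0]);
}
```

<p>The key point is that `decay_all` is constant time: a bump after a decay adds a larger increment, which is equivalent to having shrunk all earlier scores.</p>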
<p>The <a href="https://github.com/jix/varisat/blob/b74767575784b5e93e7f69cc7f96f246634ee65c/varisat/src/decision/vsids.rs#L19">implementation</a> is a self contained module which gets called from <a href="https://github.com/jix/varisat/blob/b74767575784b5e93e7f69cc7f96f246634ee65c/varisat/src/analyze_conflict.rs#L164">conflict analysis</a>, <a href="https://github.com/jix/varisat/blob/b74767575784b5e93e7f69cc7f96f246634ee65c/varisat/src/prop/assignment.rs#L184">backtracking</a> and of course <a href="https://github.com/jix/varisat/blob/b74767575784b5e93e7f69cc7f96f246634ee65c/varisat/src/decision.rs#L14">when making a decision</a>.</p>
<h2 id="branching-phase-saving">Branching: Phase Saving</h2>
<p>Now we know how to select a decision variable when branching. We still have to choose whether to assign true or false. Here a very simple but also very effective heuristic is used. We always choose the polarity that the variable had when it was last assigned. This is called phase saving.</p>
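<p>A sketch of what phase saving boils down to, with illustrative names rather than varisat’s API. The initial phase is a free choice; defaulting to false is one common option:</p>

```rust
/// Sketch of phase saving: remember the last assigned polarity per
/// variable and reuse it when branching on that variable.
struct Phases {
    saved: Vec<bool>,
}

impl Phases {
    fn new(vars: usize) -> Self {
        // The initial phase before the first assignment is arbitrary.
        Phases { saved: vec![false; vars] }
    }

    /// Called on every assignment, decisions and propagations alike.
    fn save(&mut self, var: usize, value: bool) {
        self.saved[var] = value;
    }

    /// Called when branching on `var`: reuse the last polarity.
    fn decide(&self, var: usize) -> bool {
        self.saved[var]
    }
}

fn main() {
    let mut p = Phases::new(2);
    p.save(0, true);
    // After backtracking unassigns variable 0, branching picks true again.
    assert!(p.decide(0));
}
```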
<p>I don’t have a great explanation for why this is a good strategy. It is sometimes compared to <a href="https://en.wikipedia.org/wiki/Local_search_(optimization)">stochastic local search</a> methods. These are incomplete algorithms for solving SAT that always keep a full assignment and repeatedly flip variables to decrease the number of unsatisfied clauses. I think it is interesting that here a strategy that tries to avoid conflicts is used. My current understanding is that this causes the solver to focus on finding a solution close to the saved assignment while the variable selection ensures that it will quickly learn why the current saved assignment might not work. This is something where I want to run more experiments to better understand the behavior.</p>
<h2 id="restarts">Restarts</h2>
<p>While these heuristics are a lot better than arbitrary choice, they are still not perfect. The runtime of a recursive search like this is very sensitive to the decisions made on the outer levels. For example if we have a satisfiable instance and make a wrong guess for the first decision, we will only undo that decision if we can prove that the guess was wrong. This could take a lot of steps, even if the opposite choice would directly satisfy the formula. Unsatisfiable instances show a similar behavior.</p>
<p>To limit the time spent solving with a potentially bad initial guess, the solver backtracks all the way from time to time. This is called a restart. This isn’t equivalent to starting over: as we keep the learned clauses and the variable activities and use phase saving, a lot of useful information is retained.</p>
<p>Now we need to decide how often such a restart should happen. In varisat I’m using a popular strategy based on the <a href="https://oeis.org/A182105">Luby sequence</a>. The idea here is to use a random variable to describe the time it would take to solve the instance without restarts. With each restart we take a sample of this random variable. If the result is smaller than the time until the next restart we solve the instance. Using successive values of the Luby sequence as times between restarts is good, no matter how the random variable is distributed. Asymptotically, it is at most a constant factor slower than any other strategy that is independent of the distribution and at most a logarithmic factor slower than any other strategy at all.</p>
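<p>One common way to compute the sequence is sketched below, 1-indexed; varisat’s implementation may be organized differently. In a solver the returned value is typically multiplied by a base conflict count to get the length of the next restart interval:</p>

```rust
/// The Luby sequence (1-indexed): 1, 1, 2, 1, 1, 2, 4, 1, 1, 2, ...
fn luby(mut i: u64) -> u64 {
    loop {
        // Find the smallest k with i <= 2^k - 1.
        let mut k = 1;
        while (1u64 << k) - 1 < i {
            k += 1;
        }
        if i == (1u64 << k) - 1 {
            // End of a "full" block: the sequence's new maximum, 2^(k-1).
            return 1 << (k - 1);
        }
        // Otherwise the sequence repeats its own prefix.
        i -= (1 << (k - 1)) - 1;
    }
}

fn main() {
    let first: Vec<u64> = (1u64..=15).map(luby).collect();
    assert_eq!(first, [1, 1, 2, 1, 1, 2, 4, 1, 1, 2, 1, 1, 2, 4, 8]);
}
```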
<p>Now it isn’t quite true that restarts correspond to independent samples of a random variable, but in practice it’s still a good strategy and <a href="https://github.com/jix/varisat/blob/b74767575784b5e93e7f69cc7f96f246634ee65c/varisat/src/schedule/luby.rs">easy to implement</a> as it is a static restart schedule. For some instances though, different restart strategies that are adaptive help a lot, so I will look into this in the future.</p>
<h2 id="clause-database-reduction">Clause Database Reduction</h2>
<p>In the last post I already mentioned that learning more and more clauses causes unit propagation to become slower and slower. To stop it from grinding to a halt we need to remove some learned clauses from time to time. It turns out to be useful to do this rather aggressively. Nevertheless we do want to keep clauses that are useful. This means we need to have a heuristic that allows us to identify those clauses.</p>
<p>Varisat uses two metrics to assess clauses. The first one is clause activity. This works very much like variable activity. It is increased whenever a clause is resolved as part of conflict analysis and decayed after a conflict. It uses the same scaling optimization but doesn’t use a heap.</p>
<p>The second metric used is the glue level, also called literal block distance or LBD. This was introduced by the SAT solver <a href="https://www.labri.fr/perso/lsimon/glucose/">Glucose</a>. The idea is to measure how easily a clause becomes propagating. A first approximation would be the clause’s length. A longer clause requires more literals to be false to propagate the remaining literal. The crucial observation behind glue levels is that often some of these literals are assigned at the same time. To capture this, the literals of the clause are partitioned by the decision level in which they were assigned. The glue level is the number of partitions we get this way.</p>
<p>For each learned clause we store the smallest glue level observed. This is initialized when the clause is added and will be updated whenever a clause is resolved during conflict analysis and happens to have a smaller glue level. Note that Glucose computes glue levels that are one larger than what I’m using. This is because during backtracking the glue level of the newly learned clause changes when the assignment of the UIP flips and the clause goes from unsatisfied to propagating. Glucose uses the glue level during the conflict. I find it easier to reason about glue levels when they are defined for propagating clauses so I used that instead.</p>
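<p>Stripped of solver details, computing a glue level is just counting distinct decision levels among a clause’s literals. A sketch with illustrative types (literals reduced to their variables, levels stored per variable):</p>

```rust
use std::collections::HashSet;

/// Sketch of computing a clause's glue level (LBD): the number of
/// distinct decision levels among its literals. `level[v]` is the
/// decision level at which variable `v` was assigned.
fn glue_level(clause_vars: &[usize], level: &[usize]) -> usize {
    let distinct: HashSet<usize> = clause_vars.iter().map(|&v| level[v]).collect();
    distinct.len()
}

fn main() {
    // Five literals, but assigned across only three decision levels.
    let level = vec![1, 1, 3, 3, 7];
    assert_eq!(glue_level(&[0, 1, 2, 3, 4], &level), 3);
}
```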
<p>The way we use these metrics to reduce the number of learned clauses roughly follows an <a href="https://doi.org/10.1007/978-3-319-24318-4_23">approach described by Chanseok Oh</a>. The glue level is the main criterion for clause database reduction. Learned clauses are partitioned into three tiers depending on their glue level. Clauses with a glue of 2 or lower are the “Core” tier, clauses with a glue from 3 to 6 the “Mid” tier and the remaining ones the “Local” tier. When a clause’s glue level is reduced it can move to a different tier. Clauses in the Core tier are considered useful enough to be kept indefinitely. <a href="https://github.com/jix/varisat/blob/b74767575784b5e93e7f69cc7f96f246634ee65c/varisat/src/clause/reduce.rs#L34">Half of the clauses in the Local tier are removed</a> every 15,000 conflicts. For the Mid tier we keep track of which clauses were involved in a conflict. This is just a single bit per clause that we set at the same time that we’re bumping the clause’s activity. Every 10,000 conflicts all <a href="https://github.com/jix/varisat/blob/b74767575784b5e93e7f69cc7f96f246634ee65c/varisat/src/clause/reduce.rs#L88">clauses from the Mid tier that were not involved in a conflict are demoted</a> to the Local tier.</p>
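<p>The tier split by glue level can be sketched as a simple mapping. The thresholds are the ones stated above; the enum and function names are illustrative:</p>

```rust
/// Sketch of the tier assignment by glue level.
#[derive(Debug, PartialEq)]
enum Tier {
    Core,  // glue <= 2: kept indefinitely
    Mid,   // glue 3..=6: demoted to Local if unused for 10,000 conflicts
    Local, // glue >= 7: half of them removed every 15,000 conflicts
}

fn tier_for_glue(glue: usize) -> Tier {
    match glue {
        0..=2 => Tier::Core,
        3..=6 => Tier::Mid,
        _ => Tier::Local,
    }
}

fn main() {
    assert_eq!(tier_for_glue(2), Tier::Core);
    assert_eq!(tier_for_glue(6), Tier::Mid);
    assert_eq!(tier_for_glue(7), Tier::Local);
}
```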
<p>While reducing the clause database we also make sure to never delete a clause that is currently propagating as this would invalidate the implication graph.</p>
<h2 id="recursive-clause-minimization">Recursive Clause Minimization</h2>
<p>In the last section I argued that a longer clause can be more useful than a shorter one if it is more likely to propagate. In contrast to that, a clause that consists of a subset of literals of another clause is always more useful. Whenever the longer of these propagates the shorter will too, but not the other way around.</p>
<p>Often when learning a clause from a conflict, there is a shorter valid clause that is a subset of the initially computed clause. This happens when setting some of the clause’s literals to false implies that other literals of the clause are false. In that case those implied literals cannot satisfy the clause unless the other literals already do. Thus these literals are redundant and it is safe to remove them.</p>
<p>Recursive clause minimization tries to find such implications in the implication graph after conflict analysis. Apart from the UIP, each literal in the learned clause is tested for redundancy. This is done by performing a search following implications backwards. When a decision literal is hit, the search is aborted and the literal is not redundant. If another literal of the clause is hit, the search doesn’t expand that literal, but continues. When the search finishes without hitting a decision literal the literal we started with is redundant. I think this search is the reason for the name, although the search is implemented with a loop and explicit stack and not recursively.</p>
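<p>A sketch of this redundancy test with an explicit stack. The encoding is heavily simplified compared to a real solver (literals are plain variable indices and reasons are lists of variables), so it only illustrates the shape of the search:</p>

```rust
/// Sketch of the redundancy test from recursive clause minimization.
/// `reason[v]` is `None` for decisions, otherwise the other variables
/// of the clause that propagated `v`. `in_clause[v]` marks variables
/// of the learned clause being minimized.
fn is_redundant(start: usize, reason: &[Option<Vec<usize>>], in_clause: &[bool]) -> bool {
    // A decision has no reason and cannot be redundant.
    let Some(first) = &reason[start] else { return false };
    let mut stack: Vec<usize> = first.clone();
    let mut seen = vec![false; reason.len()];
    while let Some(v) = stack.pop() {
        if in_clause[v] || seen[v] {
            continue; // other clause literals are not expanded further
        }
        seen[v] = true;
        match &reason[v] {
            None => return false, // reached a decision outside the clause
            Some(antecedent) => stack.extend(antecedent.iter().copied()),
        }
    }
    true
}

fn main() {
    // reason[v]: None = decision; Some(vs) = propagated by those assignments.
    let reason = vec![
        None,             // 0: decision, in the clause
        None,             // 1: decision, in the clause
        Some(vec![0, 1]), // 2: implied by 0 and 1, so redundant
        Some(vec![0, 4]), // 3: depends on a decision outside the clause
        None,             // 4: decision, not in the clause
    ];
    let in_clause = vec![true, true, true, true, false];
    assert!(is_redundant(2, &reason, &in_clause));
    assert!(!is_redundant(3, &reason, &in_clause));
}
```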
<p><a href="https://github.com/jix/varisat/blob/b74767575784b5e93e7f69cc7f96f246634ee65c/varisat/src/analyze_conflict.rs#L194">The implementation</a> makes a few optimizations that I won’t go into here, but that I explained in detail in the comments.</p>
<h2 id="unit-clause-simplification">Unit Clause Simplification</h2>
<p>Finally we can <a href="https://github.com/jix/varisat/blob/b74767575784b5e93e7f69cc7f96f246634ee65c/varisat/src/simplify.rs#L14">simplify</a> our formula when there are unit clauses present. Whenever top-level unit propagation finishes, the variables assigned can be removed from the formula. If a clause contains a literal that is assigned false, that literal cannot satisfy the clause anymore and can thus be removed from the clause. If a clause contains a literal that is assigned true, the clause already is satisfied and we delete it as a whole.</p>
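<p>A sketch of this simplification on a naive clause representation; varisat’s actual clause storage works differently, and a real solver would also handle the case where stripping literals empties a clause:</p>

```rust
#[derive(Clone, Copy)]
struct Lit {
    var: usize,
    positive: bool,
}

/// Sketch of top-level simplification for one fixed variable: delete
/// satisfied clauses, then strip the falsified literal everywhere else.
fn simplify(clauses: &mut Vec<Vec<Lit>>, var: usize, value: bool) {
    clauses.retain(|c| !c.iter().any(|l| l.var == var && l.positive == value));
    for c in clauses.iter_mut() {
        c.retain(|l| l.var != var);
    }
}

fn main() {
    let a = |p| Lit { var: 0, positive: p };
    let b = |p| Lit { var: 1, positive: p };
    // (a ∨ b) ∧ (¬a ∨ b), with a fixed to true at the top level.
    let mut clauses = vec![vec![a(true), b(true)], vec![a(false), b(true)]];
    simplify(&mut clauses, 0, true);
    // The first clause is satisfied and removed; ¬a is stripped from the second.
    assert_eq!(clauses.len(), 1);
    assert_eq!(clauses[0].len(), 1);
}
```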
<h2 id="what-s-next">What’s Next?</h2>
<p>To get feature parity with varisat 0.1 I still need to add incremental solving and proof generation. This means there are still one or two posts left to complete this series.</p>
<p>If you don’t want to miss future posts, you can <a href="https://jix.one/index.xml">subscribe to the RSS feed</a> or follow <a href="https://twitter.com/jix_">me on Twitter</a>.</p>

Refactoring Varisat: 3. Conflict Driven Clause Learning
https://jix.one/refactoring-varisat-3-cdcl/
Mon, 18 Mar 2019 20:00:53 +0100
<p>This is the third post in my <a href="https://jix.one/tags/refactoring-varisat">series about refactoring varisat</a>. In this post the new code base turns into a working SAT solver. While you can use the command line tool or the library to solve some small and easy SAT problems now, there is still a lot ahead to gain feature and performance parity with varisat 0.1.</p>
<p>In the <a href="https://jix.one/refactoring-varisat-2-clause-storage-and-unit-propagation/">last post</a> we saw how unit propagation is implemented. When some variables are known, unit propagation allows us to derive the values of further variables or to find a clause that cannot be satisfied. Unit propagation alone isn’t enough though, as there is no guarantee that it makes progress. To continue the search for a satisfying solution after propagating all assignments, it is necessary to make a guess. A natural way to handle this would be recursion and backtracking. This would give us a variant of the <a href="https://en.wikipedia.org/wiki/DPLL_algorithm">DPLL algorithm</a> from which conflict driven clause learning evolved.</p>
<h2 id="dpll">DPLL</h2>
<p>When we combine unit propagation with backtracking we get an algorithm like this:</p>
<pre><code>solve(formula, partial assignment):
    update the partial assignment using unit propagation
    if there was a conflict:
        return UNSAT
    else if the assignment is complete:
        return SAT
    else:
        make a decision l
        (this selects an unassigned literal l using a branching heuristic)
        if solve(formula, partial assignment + l) is SAT:
            return SAT
        return solve(formula, partial assignment + ¬l)
</code></pre>
<p>This is a variant of the <a href="https://en.wikipedia.org/wiki/DPLL_algorithm">DPLL algorithm</a>. The original DPLL algorithm also performs something called pure literal elimination. Pure literal elimination is a process that finds and assigns literals that appear only with one polarity within the not yet satisfied clauses. Such literals can be safely set as they cannot falsify clauses. This isn’t used in the search procedure of CDCL SAT solvers though, as detecting pure literals is too expensive for the advantage it offers.</p>
<p>The first problem is that this algorithm is recursive and copies the partial assignment to allow for backtracking. The recursion can easily overflow the call stack and has a lot of overhead. We can turn this into an iterative algorithm that undoes assignments by adding two explicit stacks. The first stack, called the trail, records all literals assigned. The other stack records which assignments in the trail correspond to the decisions. This decision stack stores the lengths the trail had when making a decision. The length of the trail is also called the depth. The resulting algorithm is this:</p>
<pre><code>solve iterative(formula):
    trail = [], decisions = [], partial assignment = {}
    loop:
        update the partial assignment using unit propagation
        (record assignments in the trail)
        if there was a conflict:
            if the decision stack is empty:
                return UNSAT
            pop the last decision's depth from the decision stack
            remove all assignments past the last decision from the trail and assignment
            invert the polarity of the decision in the assignment
        else if the assignment is complete:
            return SAT
        else:
            push the current depth to the decision stack
            make a decision l
            add l to the assignment and trail
</code></pre>
<p>What I omitted from this description are the watchlists used for unit propagation and described in the <a href="https://jix.one/refactoring-varisat-2-clause-storage-and-unit-propagation/">last post</a>. They are only updated in the call to unit propagation as they are designed to allow backtracking without requiring updates. Undoing the partial assignments is enough.</p>
<p>The <a href="https://github.com/jix/varisat/blob/cb84e091805bf66469e326bd72ba422653b3d4dd/varisat/src/prop/assignment.rs#L42">trail and decision stack</a> are implemented pretty much as described here. The backtracking search, however, can be improved a lot. To see how, we have to take a closer look at why it can be inefficient.</p>
<h2 id="duplicated-work">Duplicated Work</h2>
<p>A backtracking search like this is inefficient as it often duplicates work just to discover the same things over and over again. Let’s consider a formula containing the clauses $C_1 = a \vee b \vee c \vee d$ and $C_2 = a \vee b \vee c \vee \neg d$ as well as a bunch of clauses with variables $x_1, x_2, \ldots, x_n$. At some point the trail and decision stack could look like this:</p>
\begin{equation}
[x_2, \ldots], [\neg b, \ldots], [\neg c, \ldots], [\neg x_3, \ldots], [x_1, \ldots], [\neg a, \ldots, d]
\end{equation}
<p>The brackets group assignments on the same decision level, i.e. the decision stack points to each opening bracket. The ellipses stand for possible propagated assignments not relevant to this example. The final assignment $d$ was propagated by the clause $C_1$ which became unit. This in turn makes the clause $C_2$ false and we have a conflict. Our backtracking search would remove all assignments $[\neg a, \ldots, d]$ of the last decision level and add the negated decision to the then topmost level:</p>
\begin{equation}
[x_2, \ldots], [\neg b, \ldots], [\neg c, \ldots], [\neg x_3, \ldots], [x_1, \ldots, a]
\end{equation}
<p>The problem shows up when the decision $x_1$ leads to a conflict. When the assignments of the topmost level are reverted the assignment $a$ is also reverted. This means the solver is free to try $\neg a$ again:</p>
\begin{equation}
[x_2, \ldots], [\neg b, \ldots], [\neg c, \ldots], [\neg x_3, \ldots, \neg x_1], [\neg a, \ldots, d]
\end{equation}
<p>As $\neg b$ and $\neg c$ are still assigned this again propagates $d$ using $C_1$ and makes $C_2$ false. This causes another backtracking:</p>
\begin{equation}
[x_2, \ldots], [\neg b, \ldots], [\neg c, \ldots], [\neg x_3, \ldots, \neg x_1, a]
\end{equation}
<p>Should we discover that $\neg x_3$ leads to a conflict this could happen a third time leading to:</p>
\begin{equation}
[x_2, \ldots], [\neg b, \ldots], [\neg c, \ldots, x_3, a]
\end{equation}
<p>In this example only two clauses are involved in the repeated sequence of identical propagations. In realistic scenarios this happens regularly with much longer sequences.</p>
<h2 id="non-chronological-backtracking">Non-Chronological Backtracking</h2>
<p>To avoid some of this duplicated work we can introduce non-chronological backtracking (also called <a href="https://en.wikipedia.org/wiki/Backjumping">backjumping</a>). Non-chronological backtracking identifies which decisions led to the conflict and undoes further decisions as long as they are not involved. This is a sound strategy as it results in the same final state as making the conflict causing decision at an earlier point would have.</p>
<p>Let’s consider our example again:</p>
\begin{equation}
[x_2, \ldots], [\neg b, \ldots], [\neg c, \ldots], [\neg x_3, \ldots], [x_1, \ldots], [\neg a, \ldots, d]
\end{equation}
<p>Here the decisions $\neg x_3$ and $x_1$ are not necessary to cause a conflict. This means we can also remove them while backtracking, adding the assignment $a$ to the decision level of $\neg c$:</p>
\begin{equation}
[x_2, \ldots], [\neg b, \ldots], [\neg c, \ldots, a]
\end{equation}
<p>This is the same state normal backtracking would have produced if the decision $\neg a$ was made after $\neg c$ and the trail looked like this:</p>
\begin{equation}
[x_2, \ldots], [\neg b, \ldots], [\neg c, \ldots], [\neg a, \ldots, d]
\end{equation}
<h2 id="implication-graph">Implication Graph</h2>
<p>To implement non-chronological backtracking we need to determine which prefix of decisions is sufficient to cause the current conflict. This is done by maintaining an implication graph during unit propagation. The implication graph is a <a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph">DAG</a> with the assigned literals as nodes. Each assigned literal has the reason for its assignment as incoming edges. Decisions therefore have no incoming edges.</p>
<p>This is <a href="https://github.com/jix/varisat/blob/cb84e091805bf66469e326bd72ba422653b3d4dd/varisat/src/prop/graph.rs#L48">implemented by storing a reference to the propagating clause for each assignment</a>. As a clause propagates when it becomes unit, there is an edge from the negation of each false literal to the single true literal of the clause.</p>
<p>The involved decisions of a conflict are the 0-in-degree predecessors of the negated falsified clause’s literals. Those can be found by a graph search. We will see how this search is implemented after considering another improvement that avoids even more duplicated work.</p>
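<p>A sketch of that graph search over the stored reasons. The encoding is simplified: reasons are given as lists of variables rather than the clause references varisat actually stores:</p>

```rust
use std::collections::HashSet;

/// Sketch of finding the decisions involved in a conflict by searching
/// the implication graph backwards. `reason[v]` is `None` for a decision
/// and otherwise the variables whose assignments propagated `v`.
fn involved_decisions(
    conflict_vars: &[usize],
    reason: &[Option<Vec<usize>>],
) -> HashSet<usize> {
    let mut decisions = HashSet::new();
    let mut seen = vec![false; reason.len()];
    let mut stack: Vec<usize> = conflict_vars.to_vec();
    while let Some(v) = stack.pop() {
        if seen[v] {
            continue;
        }
        seen[v] = true;
        match &reason[v] {
            // No incoming edges: a decision with 0 in-degree.
            None => {
                decisions.insert(v);
            }
            Some(antecedent) => stack.extend(antecedent.iter().copied()),
        }
    }
    decisions
}

fn main() {
    let reason = vec![
        None,             // 0: decision
        None,             // 1: decision, not involved in this conflict
        Some(vec![0]),    // 2: propagated from 0
        Some(vec![0, 2]), // 3: propagated from 0 and 2
    ];
    let involved = involved_decisions(&[2, 3], &reason);
    assert!(involved.contains(&0));
    assert!(!involved.contains(&1));
}
```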
<h2 id="clause-learning">Clause Learning</h2>
<p>Even with non-chronological backtracking the algorithm can rediscover essentially the same conflict. We continue the example where we backtracked to the decision level of $\neg c$ and arrived at:</p>
\begin{equation}
[x_2, \ldots], [\neg b, \ldots], [\neg c, \ldots, a]
\end{equation}
<p>At some point in the future we might get a conflict caused by $x_2$, resulting in a trail of just $[\neg x_2]$. If we continue the search it’s likely that we’ll make the decisions $\neg b$ and $\neg c$ again:</p>
\begin{equation}
[\neg x_2, \ldots], [\neg b, \ldots], [\neg c, \ldots]
\end{equation}
<p>We already know that $\neg b$ and $\neg c$ imply $a$, but during backtracking all the way to the top, the algorithm forgot about that. Instead it will have to try $\neg a$ first to realize that this leads to a conflict.</p>
<p>To avoid this, a new clause is added to the formula after every conflict. This is called conflict driven clause learning. That clause must be implied by the formula and should stop the algorithm from repeatedly making the same decisions. The added clause is also important to maintain our implication graph after backtracking. The way the implication graph is implemented, a clause that is unit under the partial assignment is required for each propagated literal. In the example, without learning a clause, there is no clause that justifies assigning $a$ after backtracking from $\neg a$.</p>
<p>One way to add such a learned clause would be to add a clause consisting of the negation of all involved decisions. This would be $C = b \vee c \vee a$ in the example. This clause then propagates $a$ after backtracking the initial conflict as well as after repeating the decisions $\neg b$ and $\neg c$ at a later point. Such a clause $C$ is implied by the formula $F$, as the conflict means that $F \wedge \neg C = \bot$ which is equivalent to $F \to C$.</p>
<h2 id="first-unique-implication-point">First Unique Implication Point</h2>
<p>In practice there is a way to learn a better clause than the one blocking the involved decisions. We can learn any clause that is implied by the formula and causes unit propagation to assign the negation of the last involved decision given the other involved decisions.</p>
<p>Such a clause can be found by starting with the conflict clause’s literals and successively replacing a literal with the negation of the reason it was assigned false. Such a replacement maintains the fact that assigning all literals of the clause causes a conflict. An alternative justification for this process is that it is equivalent to a <a href="https://en.wikipedia.org/wiki/Resolution_(logic)">resolution</a> of the current clause with the reason clause of an assignment falsifying a literal in the current clause.</p>
<p>This doesn’t specify in which order the literals should be resolved (replaced by their reason) and when to stop. It also doesn’t ensure that the clause is compatible with non-chronological backtracking. To make it compatible with backtracking we need to ensure that there is only one literal of the current decision level left in the clause. This causes the clause to become unit after backtracking. Such a single literal is called unique implication point or UIP. To find a clause that includes a UIP we can simply count the number of literals in the conflicting decision level and stop the resolution at any point where there is only one left.</p>
<p>There still might be multiple possible clauses with a UIP and we still don’t know in what order to resolve literals. A useful heuristic observation here is that each resolution step reduces the set of assignments that could cause the learned clause to propagate. When we replace a literal by the reason it was propagated, we block only that reason, while there might be multiple possible ways for that literal to get propagated. This isn’t an exact argument, but gives an intuition that matches what I’ve seen in practice.</p>
<p>To find a UIP-containing clause that still blocks as much as possible, we stop as soon as we get a UIP. To only do resolving that will get us closer to a UIP, we resolve the literals in the reverse order they were assigned. This finds the first possible UIP, and thus is called 1-UIP. This means we will only resolve literals of the conflicting decision level. The target level for non-chronological backtracking then is the highest level among the literals apart from the UIP.</p>
<p>Using the reverse-chronological order for resolution means we can walk backwards over the trail. We start with the conflict clause and for each literal in the trail we check if the literal is in the clause. If it is, we replace it with its reason. As soon as there is only one literal of the current decision level left, we are done. <a href="https://github.com/jix/varisat/blob/cb84e091805bf66469e326bd72ba422653b3d4dd/varisat/src/analyze_conflict.rs">The implementation</a> uses a bitmap to store which literals of the current decision level are currently present. Whenever a literal of another decision level is reached it is directly added to the learned clause, as we will not resolve it.</p>
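<p>A sketch of the 1-UIP trail walk. The encoding is heavily simplified: literals are plain variable indices, and the returned assignments would be negated to form the actual learned clause. It shows the counting logic, not varisat’s actual layout:</p>

```rust
/// Flag a variable reached during conflict analysis. Conflict-level
/// variables still need resolving; lower-level ones go straight into
/// the learned clause.
fn flag(
    v: usize,
    level: &[usize],
    current: usize,
    flagged: &mut [bool],
    pending: &mut usize,
    learned: &mut Vec<usize>,
) {
    if flagged[v] {
        return;
    }
    flagged[v] = true;
    if level[v] == current {
        *pending += 1;
    } else {
        learned.push(v);
    }
}

/// Sketch of 1-UIP learning by walking the trail backwards.
/// `reason[v]` is `None` for decisions.
fn analyze_1uip(
    conflict: &[usize],
    trail: &[usize], // assignment order, oldest first
    level: &[usize],
    reason: &[Option<Vec<usize>>],
    current: usize,
) -> Vec<usize> {
    let mut flagged = vec![false; level.len()];
    let mut pending = 0;
    let mut learned = Vec::new();
    for &v in conflict {
        flag(v, level, current, &mut flagged, &mut pending, &mut learned);
    }
    for &v in trail.iter().rev() {
        if !flagged[v] || level[v] != current {
            continue;
        }
        pending -= 1;
        if pending == 0 {
            learned.push(v); // the first unique implication point
            return learned;
        }
        // Resolution step: replace v by the reason it was propagated.
        for &u in reason[v].as_ref().expect("decision reached before UIP") {
            flag(u, level, current, &mut flagged, &mut pending, &mut learned);
        }
    }
    unreachable!("a UIP always exists: at worst the decision itself");
}

fn main() {
    let level = vec![1, 2, 2, 2];
    let reason = vec![None, None, Some(vec![1]), Some(vec![0, 2])];
    // Conflict falsified by the assignments of variables 2 and 3.
    let learned = analyze_1uip(&[2, 3], &[0, 1, 2, 3], &level, &reason, 2);
    // Result: variable 0 from a lower level plus the UIP, variable 2.
    assert_eq!(learned, vec![0, 2]);
}
```

<p>Note how resolving variable 3 pulls in the lower-level variable 0 directly, while variable 2 stays pending until it is the last conflict-level assignment left and becomes the UIP.</p>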
<h2 id="api-and-command-line-interface">API and Command Line Interface</h2>
<p>Together with a dummy branching heuristic that always chooses the first unassigned literal, this gives a working SAT solver. To make it usable I also added a <a href="https://github.com/jix/varisat/blob/cb84e091805bf66469e326bd72ba422653b3d4dd/varisat/src/solver.rs">public API</a> and a simple <a href="https://github.com/jix/varisat/blob/cb84e091805bf66469e326bd72ba422653b3d4dd/varisat-cli/src/main.rs">command line interface</a>. To try the current master you can install it from github (replacing any previous version) by running</p>
<p><code>
cargo install --force --git <a href="https://github.com/jix/varisat">https://github.com/jix/varisat</a>
<br>
varisat some_input_formula.cnf
</code></p>
<p>This already includes stuff I’ll describe in a future post.</p>
<h2 id="what-s-next">What’s Next</h2>
<p>A working SAT solver doesn’t mean it’s a SAT solver you’d want to use yet. The dummy branching heuristic is very bad for performance and learning more and more clauses causes unit propagation to become slower and slower. In the next post I’ll go over what’s missing to make it as fast as varisat 0.1 was.</p>
<p>If you don’t want to miss future posts, you can <a href="https://jix.one/index.xml">subscribe to the RSS feed</a> or follow <a href="https://twitter.com/jix_">me on Twitter</a>.</p>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]},
displayAlign: "left",
displayIndent: "2em",
CommonHTML: { linebreaks: { automatic: true } }
});
</script>
<script src='https://jix.one/js/MathJax/MathJax.js?config=TeX-MML-AM_CHTML'></script>

Refactoring Varisat: 2. Clause Storage and Unit Propagation
https://jix.one/refactoring-varisat-2-clause-storage-and-unit-propagation/
Sat, 02 Mar 2019 18:18:04 +0100
<p>This is the second post in my <a href="https://jix.one/tags/refactoring-varisat">series about refactoring varisat</a>. Since the last post I started implementing some of the core data structures and algorithms of a CDCL based SAT solver: clause storage and unit propagation. In this post I will explain how these parts work and the rationale behind some of the decisions I made.</p>
<h2 id="developer-documentation">Developer Documentation</h2>
<p>For the new varisat code base I’m not just documenting the public API but am also adding a lot of developer documentation. <a href="https://jix.github.io/varisat/varisat/">The developer documentation for the latest master</a> is automatically built. While I’m trying to document everything while writing the code, I also plan to continuously revisit and improve the existing documentation. I’m also open to feedback here. If you think some code needs additional documentation or find existing documentation not clear enough, feel free to <a href="https://github.com/jix/varisat/issues/new">file an issue</a> about this.</p>
<h2 id="the-context-struct">The Context Struct</h2>
<p>Internally a SAT solver uses a bunch of data structures that store various pieces of information derived from the formula. To solve the satisfiability problem, different routines use and update different subsets of these structures, deriving further information. All these data structures are bundled in an outer containing structure that I call the <a href="https://github.com/jix/varisat/blob/9f1533b3a3ce465f89bf417b17162a948a7627a0/varisat/src/context.rs#L34">context</a>.</p>
<p>Most routines work with several of these data structures at the same time, and often call sub-routines that work on yet another subset of them. As a result, most SAT solvers I’m aware of implement the majority of their routines as methods of this context structure (although it usually goes by a different name).</p>
<p>This isn’t great for documenting the dependencies between different routines and data structures. If there was a clear hierarchy of data structures and how they’re used, nesting them accordingly would solve this problem. This doesn’t work too well for SAT solvers though. No matter how you organize the data structures, there are always some routines that use data structures far apart. Even if you find a way to mostly avoid this, you never know the requirements of new techniques you might add later.</p>
<p>Nevertheless this is the approach taken by most SAT solvers and the first version of varisat. There is a second problem, though, and it is specific to rust. The borrow checker requires one to be much more careful with passing references to functions. For a solver written in C++ it is always possible to pass a reference to the outer context structure to give a function access to any subset of the contained data structures. In rust this only works if no data is borrowed elsewhere at the same time.</p>
<p>For the first version of varisat I worked around this by passing references to different contained data structures individually. Code written using this workaround becomes hard to read and change. Passing all the extra parameters clutters the code and accessing a new data structure in one place requires coordinated changes in many only slightly related places. The only upside is that the code and data dependencies are clearly documented.</p>
<p>Ideally I’d like to document which structures are accessed by each function in its declaration without repeating that information in every function’s body. This would be enough to statically ensure that rust’s dynamic aliasing invariants are upheld. In a way this would allow partial borrowing, which already works within a function, to work everywhere. While I was thinking about this, I came across <a href="http://smallcultfollowing.com/babysteps/blog/2018/11/01/after-nll-interprocedural-conflicts/">Niko Matsakis’ blog post about this limitation of rust</a>. While it gave me some hope that rust will allow some form of partial borrowing in the future, I needed this sooner.</p>
<p>This is why I came up with the <a href="https://jix.one/introducing-partial_ref/">partial_ref</a> library. It uses macros and trait based meta-programming to emulate the feature I need. The implementation is inspired by the techniques I’ve learned from the <a href="https://crates.io/crates/frunk">frunk</a> crate. With this I can annotate the fields of the <a href="https://github.com/jix/varisat/blob/9f1533b3a3ce465f89bf417b17162a948a7627a0/varisat/src/context.rs#L34">context</a> struct and then declare partial references like this: <code>partial!(Context, mut ClauseAllocP, mut ClauseDbP, BinaryClausesP)</code>. The types ending in <code>P</code> are marker types declaring which contained data structures are borrowed. Re-borrowing a subset of a partial reference can be done by the <code>borrow</code> or <code>split_borrow</code> methods, which infer the required parts while statically checking the borrowing rules. The individual parts can be accessed using the <code>part</code> or <code>split_part</code> method. The <code>split_</code> variants allow simultaneous borrows of non-overlapping or non-mutable partial references. This still requires calls to <code>part</code> or <code>borrow</code> in many places, but avoids repeating the list of used data structures everywhere.</p>
<p>All this happens at compile time. The generated code uses a single pointer to represent a partial reference. This can lead to better code generation than any of the more verbose workarounds, as they can’t guarantee that the individual references point into the same containing structure.</p>
<p>All in all I’m very happy with how using <code>partial_ref</code> turned out so far.</p>
<h2 id="the-clause-allocator">The Clause Allocator</h2>
<p>Varisat is a <a href="https://en.wikipedia.org/wiki/Conflict-Driven_Clause_Learning">conflict driven clause learning</a> (CDCL) based SAT solver. A CDCL based solver works on a Boolean formula in <a href="https://en.wikipedia.org/wiki/Conjunctive_normal_form">conjunctive normal form</a> (CNF), adding and removing clauses, while keeping the formula equisatisfiable to the input formula. As such it needs to store the clauses of the current formula.</p>
<p>A <a href="https://github.com/jix/varisat/blob/9f1533b3a3ce465f89bf417b17162a948a7627a0/varisat/src/clause/alloc.rs#L31">dedicated allocator</a> is used to speed up allocation while at the same time reducing memory usage and fragmentation. Instead of using an individual <code>Vec<Lit></code> for each clause there is a single <code>Vec<LitIdx></code> used as a buffer for storing all clauses. <code>LitIdx</code> is the underlying integer type used to represent literals. For each clause in the buffer, the literals are preceded by a <a href="https://github.com/jix/varisat/blob/9f1533b3a3ce465f89bf417b17162a948a7627a0/varisat/src/clause/header.rs#L25">clause header</a>. The header contains the length of the clause as well as other metadata associated with each clause.</p>
<p>By making use of rust’s <code>#[repr(transparent)]</code> it is possible to safely store both the header and the literals in the same vector. This also allows us to define a <a href="https://github.com/jix/varisat/blob/9f1533b3a3ce465f89bf417b17162a948a7627a0/varisat/src/clause.rs#L25">dynamically sized type for clauses</a> and safely cast slices of our buffer into references of this clause type.</p>
<p>Storing such references in other data structures isn’t feasible though. The context data structure would become self-referential, with all the problems this brings in rust. Another problem is that we couldn’t grow the buffer while references pointing into it exist. Instead we define a new type for long lived <a href="https://github.com/jix/varisat/blob/9f1533b3a3ce465f89bf417b17162a948a7627a0/varisat/src/clause/alloc.rs#L166">clause references</a> which stores an offset into the buffer. This also allows us to use an integer type smaller than a pointer, saving memory in all places where many clauses are referenced.</p>
<p>Whenever a new clause is allocated it is simply appended to the buffer. Clauses are never deleted from the buffer, they are just marked as deleted. Reclaiming the space used by deleted clauses is handled from outside of the clause allocator by creating a new allocator which uses a new buffer and then copying just the non-deleted clauses.</p>
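The layout described above can be sketched in a few lines. This is a simplified model, not varisat’s actual allocator: the real header packs more metadata than just a length and a deleted flag, and the names here are mine.

```rust
// Simplified sketch of the clause allocator: one flat buffer holds, for
// every clause, a one-word header (here just a length plus a deleted
// flag) followed by the clause's literals.
type LitIdx = u32;

const DELETED_BIT: LitIdx = 1 << 31;

/// A clause reference is an offset into the buffer, not a pointer.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct ClauseRef(u32);

#[derive(Default)]
struct ClauseAlloc {
    buffer: Vec<LitIdx>,
}

impl ClauseAlloc {
    /// Allocation is a plain append to the buffer.
    fn add_clause(&mut self, lits: &[LitIdx]) -> ClauseRef {
        let offset = self.buffer.len() as u32;
        self.buffer.push(lits.len() as LitIdx); // header word
        self.buffer.extend_from_slice(lits);
        ClauseRef(offset)
    }

    /// The literals of a clause sit directly after its header.
    fn lits(&self, cref: ClauseRef) -> &[LitIdx] {
        let offset = cref.0 as usize;
        let len = (self.buffer[offset] & !DELETED_BIT) as usize;
        &self.buffer[offset + 1..offset + 1 + len]
    }

    /// Deletion only marks the header; space is reclaimed later by
    /// copying the live clauses into a fresh allocator.
    fn delete_clause(&mut self, cref: ClauseRef) {
        self.buffer[cref.0 as usize] |= DELETED_BIT;
    }

    fn is_deleted(&self, cref: ClauseRef) -> bool {
        self.buffer[cref.0 as usize] & DELETED_BIT != 0
    }
}

fn main() {
    let mut alloc = ClauseAlloc::default();
    let c1 = alloc.add_clause(&[1, 2, 3]);
    let c2 = alloc.add_clause(&[4, 5]);
    assert_eq!(alloc.lits(c1), &[1, 2, 3]);
    alloc.delete_clause(c1);
    assert!(alloc.is_deleted(c1));
    assert_eq!(alloc.lits(c2), &[4, 5]); // other clauses are unaffected
}
```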
<h2 id="the-clause-database">The Clause Database</h2>
<p>The clause allocator provides storage for clauses, but it doesn’t keep track of the allocated clauses. This is done by the <a href="https://github.com/jix/varisat/blob/9f1533b3a3ce465f89bf417b17162a948a7627a0/varisat/src/clause/db.rs#L44">clause database</a>. It stores references to all clauses of the current formula, which are used for <a href="https://github.com/jix/varisat/blob/9f1533b3a3ce465f89bf417b17162a948a7627a0/varisat/src/clause/gc.rs#L11">garbage collection</a>. It also stores a partition of the clauses into four different tiers. Three tiers are for redundant clauses. Redundant clauses are those where the formula with them is equivalent to the formula without them. This is the case for newly learned clauses. The remaining tier consists of the irredundant clauses, whose removal may change the solution set. Initially all clauses of the input formula are considered irredundant, even though some might actually be redundant.</p>
<p>The partition into redundant and irredundant clauses allows us to remove some redundant clauses from time to time, which is needed to keep solving performance and memory use in check. The more clauses are stored, the slower the solving becomes. On the other hand learning new clauses is how a CDCL solver makes progress. Splitting the redundant clauses into three tiers is part of the heuristic used to decide which clauses to remove. I’ll write more about that in a later post.</p>
<h2 id="binary-clauses">Binary Clauses</h2>
<p>Clauses with only two literals, called <a href="https://github.com/jix/varisat/blob/9f1533b3a3ce465f89bf417b17162a948a7627a0/varisat/src/binary.rs#L7">binary clauses</a>, are handled separately from other clauses. A binary clause $x \vee y$ is equivalent to the implications $\neg x \to y$ and $\neg y \to x$. Knowing that one of the literals is false we can <a href="https://github.com/jix/varisat/blob/9f1533b3a3ce465f89bf417b17162a948a7627a0/varisat/src/prop/binary.rs#L13">derive that the other must be true</a>, without looking at any other literals. We also never want to forget binary clauses, as they are useful for making progress and take up little storage. Instead of storing them in the clause allocator, each literal has a list of literals implied through binary clauses. As we won’t forget binary clauses there is no need to remember which are redundant and which aren’t. This results in just a <code>Vec<Lit></code> per literal, where each binary clause results in one entry in each of two vectors. There is no need to store any extra metadata, making the representation very compact. While most solvers special case binary clauses, not all solvers use this approach.</p>
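These per-literal implication lists can be sketched as follows. This is a minimal model, not varisat’s actual code; the names are mine, and literals use the LSB-negation encoding varisat uses (variable index shifted left by one, least significant bit marking negation).

```rust
// Sketch of the binary clause store: for each literal, the list of
// literals it implies. A binary clause (x OR y) is recorded as the two
// implications not-x -> y and not-y -> x.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Lit(u32);

impl Lit {
    fn from_var(var: u32, negated: bool) -> Lit {
        Lit(var << 1 | negated as u32)
    }
    fn negated(self) -> Lit {
        Lit(self.0 ^ 1) // flip the negation bit
    }
    fn code(self) -> usize {
        self.0 as usize
    }
}

struct BinaryClauses {
    /// Indexed by literal code: the literals implied when that literal
    /// is assigned true.
    implied: Vec<Vec<Lit>>,
}

impl BinaryClauses {
    fn new(num_vars: u32) -> BinaryClauses {
        BinaryClauses { implied: vec![Vec::new(); 2 * num_vars as usize] }
    }

    /// Each binary clause becomes one entry in each of two lists.
    fn add_binary_clause(&mut self, x: Lit, y: Lit) {
        self.implied[x.negated().code()].push(y);
        self.implied[y.negated().code()].push(x);
    }

    fn implications(&self, lit: Lit) -> &[Lit] {
        &self.implied[lit.code()]
    }
}

fn main() {
    let x = Lit::from_var(0, false);
    let y = Lit::from_var(1, false);

    let mut bin = BinaryClauses::new(2);
    bin.add_binary_clause(x, y);

    // Once not-x is assigned true (x is false), y follows immediately.
    assert_eq!(bin.implications(x.negated()), &[y]);
    assert_eq!(bin.implications(y.negated()), &[x]);
}
```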
<h2 id="watchlists-and-unit-propagation">Watchlists and Unit Propagation</h2>
<p>We already considered how binary clauses allow us to derive the value of variables given the assignment of other variables. This can be generalized to long clauses. A literal that is false can be removed from a disjunction without changing the result. Thus we can remove false literals to simplify a clause. When all but one literal are removed, the clause becomes a unit clause. To satisfy a unit clause we need to assign the remaining literal. It can also happen that all literals are false and we end up with an empty clause. This means that our assignment is not compatible with the clause and is called a conflict.</p>
<p>As we’re going to repeatedly assign literals and later unassign them when we backtrack, we’re not actually going to remove any literals. Instead we look for clauses where all but one literal are assigned false, so that the remaining literal has to be true to satisfy the clause.</p>
<p>Even though we’re not removing any literals we still say that a clause becomes unit when all but one literal are assigned false. The process of iteratively assigning true to the remaining literal of a clause that became unit is called unit propagation.</p>
<p>SAT solvers spend most of their time doing unit propagation. Therefore it is important to do this as efficiently as possible. A naive way to perform unit propagation would be to iterate through all clauses, and for each clause count the non-false literals. This requires looking at all clauses whenever a single literal is assigned.</p>
<p>As a first improvement we could keep track of which clauses contain which literals. Then, whenever a literal is assigned false, only clauses containing that literal can become unit and we would only have to look at those. The approach taken by CDCL based solvers improves this further.</p>
<p>The algorithm used is based on a simple observation: a clause cannot become unit as long as it has two non-false literals. This might sound like an obvious statement, but nevertheless allows us to speed up unit propagation quite a bit. Instead of tracking all literals of all clauses, we track two non-false literals per unsatisfied clause. These literals are called watched literals. For each literal we keep a list of clauses where it is among the two watched literals. These are called <a href="https://github.com/jix/varisat/blob/9f1533b3a3ce465f89bf417b17162a948a7627a0/varisat/src/prop/watch.rs#L51">watchlists</a>. As long as these literals aren’t assigned false, we don’t care about what happens to the other literals of a clause, as there is no way for that clause to become unit. Only when one of the watched literals is assigned false, we process the clause. Ignoring the conflict case for now, three things can happen: 1) we find the clause has a true literal and thus is satisfied, 2) the clause is unit and we get a new assignment, or 3) we can find a replacement non-false literal for the watched literal that was assigned.</p>
<p>To illustrate this consider this example where $a$ and $b$ are assigned true, $x$ and $y$ are unassigned and $g$ was just assigned false. The watched literals are underlined.</p>
<ol>
<li><p>$\neg b \vee \underline g \vee \underline x \vee a$, finding the true literal $a$, resulting in $\neg b \vee g \vee \underline x \vee \underline a$.</p>
<p>Here the clause is satisfied. Making a true literal a watched literal handles backtracking. This ensures that we’re again watching two non-false literals as soon as the true literal becomes unassigned.</p></li>
<li><p>$\neg a \vee \underline g \vee \neg b \vee \underline {\neg x}$, where the clause becomes unit, implying $\neg x$.</p>
<p>Here the new assignment $\neg x$ is found. The watched literals do not change. This is compatible with backtracking. The clause can only become non-unit when the assignment to $\neg g$, and with it the implied assignment to $\neg x$, is removed. At that point both watched literals are non-false.</p></li>
<li><p>$\neg y \vee \underline g \vee \neg b \vee \underline x$, finding the non-false literal $\neg y$, resulting in $\underline {\neg y} \vee g \vee \neg b \vee \underline x$.</p>
<p>Here we just maintain two watched non-false literals.</p></li>
</ol>
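The three cases can be condensed into a single clause-processing step. This is a simplified model, not varisat’s implementation: it uses plain DIMACS-style <code>i32</code> literals, keeps the watched literals in the first two slots of the clause, and stands in for the watchlist bookkeeping with a return value.

```rust
use std::collections::HashMap;

/// DIMACS-style literals: non-zero integers, sign marks negation.
type Lit = i32;

/// Truth value of a literal under a partial assignment of variables.
fn value(assignment: &HashMap<i32, bool>, lit: Lit) -> Option<bool> {
    assignment.get(&lit.abs()).map(|&v| v == (lit > 0))
}

#[derive(Debug, PartialEq)]
enum Watch {
    Satisfied, // case 1: a true literal was found and is now watched
    NewWatch,  // case 3: a replacement non-false literal was found
    Unit(Lit), // case 2: all other literals false, propagate this one
    Conflict,  // every literal is false
}

/// Process a clause after the watched literal in slot 1 was assigned
/// false. The two watched literals are kept in slots 0 and 1.
fn process_clause(clause: &mut [Lit], assignment: &HashMap<i32, bool>) -> Watch {
    // Scan the non-watched literals for a true or unassigned literal.
    for i in 2..clause.len() {
        match value(assignment, clause[i]) {
            Some(false) => continue,
            found => {
                clause.swap(1, i); // move the replacement into the watch slot
                return if found == Some(true) {
                    Watch::Satisfied
                } else {
                    Watch::NewWatch
                };
            }
        }
    }
    // No replacement found: the other watched literal decides.
    match value(assignment, clause[0]) {
        Some(true) => Watch::Satisfied, // clause already satisfied
        None => Watch::Unit(clause[0]), // clause became unit
        Some(false) => Watch::Conflict,
    }
}

fn main() {
    // Variables from the example above: a = 1, b = 2, g = 3, x = 4, y = 5.
    let mut assignment = HashMap::new();
    assignment.insert(1, true);  // a is true
    assignment.insert(2, true);  // b is true
    assignment.insert(3, false); // g was just assigned false

    // Case 1: -b v g v x v a, watching x and g: the true literal a
    // replaces g as a watched literal.
    let mut c1 = [4, 3, -2, 1];
    assert_eq!(process_clause(&mut c1, &assignment), Watch::Satisfied);
    assert_eq!(c1[..2], [4, 1]);

    // Case 2: -a v g v -b v -x, watching -x and g: the clause becomes
    // unit, implying -x.
    let mut c2 = [-4, 3, -1, -2];
    assert_eq!(process_clause(&mut c2, &assignment), Watch::Unit(-4));

    // Case 3: -y v g v -b v x, watching x and g: the unassigned -y
    // replaces g as a watched literal.
    let mut c3 = [4, 3, -5, -2];
    assert_eq!(process_clause(&mut c3, &assignment), Watch::NewWatch);
    assert_eq!(c3[..2], [4, -5]);
}
```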
<p>There are a few additional optimizations implemented: The watched literals are always moved to the beginning of the clause, so we know which literals are watched without scanning the watchlists. Also the watchlists contain a “blocking literal” for each watched clause, which is just another literal of the clause. The blocking literal allows us to detect some satisfied clauses without accessing the memory used for clause storage.</p>
<p>Currently <a href="https://github.com/jix/varisat/blob/9f1533b3a3ce465f89bf417b17162a948a7627a0/varisat/src/prop/long.rs#L19">the implementation of unit propagation</a> uses only safe abstractions that perform bounds checking. For varisat 0.1 I used unchecked accesses that relied on the correctness of other parts of the solver. This resulted in pervasive unsafe annotations. When this rewrite is fully functional, I plan to carefully benchmark the difference and only remove bounds checks where necessary. Right now I expect that I will end up with unsafe code that manually performs a minimal number of bounds checks. This should get me the performance I want, without relying on global invariants for memory safety.</p>
<h2 id="what-s-next">What’s Next?</h2>
<p>The next steps ahead are implementing conflict analysis and backtracking. I already implemented some supporting code for this that I haven’t written about yet. Together with a dummy branching heuristic this is enough to solve some small formulas, so I also plan to add a public API and a command line interface.</p>
<p>If you don’t want to miss future posts, you can <a href="https://jix.one/index.xml">subscribe to the RSS feed</a> or follow <a href="https://twitter.com/jix_">me on Twitter</a>.</p>
Refactoring Varisat: 1. Basics and Parsing
https://jix.one/refactoring-varisat-1-basics-and-parsing/
Sun, 03 Feb 2019 17:37:42 +0100me@jix.one (Jannis Harder)https://jix.one/refactoring-varisat-1-basics-and-parsing/<p>This is the first post in a <a href="https://jix.one/tags/refactoring-varisat">series of posts</a> I plan to write while refactoring my
SAT solver varisat. In the process of developing varisat into a SAT solver that can <a href="http://sat2018.forsyte.tuwien.ac.at/index.php?cat=results">compete with some well known SAT solvers</a> like minisat or glucose, it accumulated quite a bit of technical debt. Varisat is the first larger project I’ve written in rust and there are a lot of things I’d do differently now. Before I can turn varisat into a solver that competes with the fastest solvers out there, I need to do some refactoring.</p>
<p>My current plan is to start a new project from scratch copying over bits that I want to keep and rewriting parts that I don’t. That’s usually my preferred way to refactor when I plan to change the overall architecture. The new version will be varisat 0.2, hopefully turning into varisat 1.0.</p>
<p>This refactoring also gives me the chance to write this series of posts, which should make it much easier to understand the code base and contribute to varisat. I’m also moving <a href="https://github.com/jix/varisat">varisat to GitHub</a> and stable rust which should make collaboration within the rust open source ecosystem easier.</p>
<p>I also want to use my new library <a href="https://jix.one/introducing-partial_ref/">partial_ref</a> which should result in way less fighting the borrow checker. Currently in varisat there are a lot of functions that take way too many parameters, mostly references to different data structures. This is caused by the borrow checker being not flexible enough across function calls. My partial_ref library offers a workaround that I think will be an improvement compared to the workarounds I’ve been using before.</p>
<h2 id="cnf-formulas">CNF Formulas</h2>
<p>SAT solvers determine whether a Boolean formula can be satisfied. They either find an assignment (also called interpretation) of the formula’s variables so that the formula is true, or produce a proof that this is impossible. Usually SAT solvers require the input to be in <a href="https://en.wikipedia.org/wiki/Conjunctive_normal_form">conjunctive normal form</a> (CNF). This means that the formula is a conjunction (Boolean and) of clauses, where a clause is a disjunction (Boolean or) of literals and a literal is a variable or a negated variable. An assignment satisfies a formula in CNF precisely when at least one literal of each clause is true.</p>
<p>SAT solvers require the input to be in CNF as this is the internal representation used. This isn’t a big restriction though, as it is possible to turn any Boolean formula into an equisatisfiable formula in CNF with <a href="https://en.wikipedia.org/wiki/Tseytin_transformation">only linear overhead by introducing new variables</a>. Equisatisfiable means that either both formulas are satisfiable or both are not. This is a weaker condition than equivalence, which means that exactly the same assignments satisfy both formulas. Here equisatisfiability allows the introduction of new helper variables.</p>
<p>In varisat 0.1, a formula is directly parsed into the internal data structures of the solver. There is no standalone data type representing a CNF formula. Such a data type isn’t very useful inside the solver. More specialized data structures are used there. Nevertheless, I think such a type would be useful when using varisat as a library. It makes it easier to re-use the parser or write other code that processes CNF formulas.</p>
<p>To represent a formula in CNF we need a type for variables and for literals. Variables are indexed using integers and are represented by their index. This is also the encoding used by the <a href="https://www.satcompetition.org/2009/format-benchmarks2009.html">standard CNF file format (DIMACS CNF)</a>. Internally for varisat the first variable has index 0, while in DIMACS CNF the first variable has index 1. For everything user facing the 1-based DIMACS CNF encoding will be used.</p>
<p>A literal is represented by its variable’s index and a flag that tells us whether the literal is negated or not. In DIMACS CNF the flag is represented by negating the variable index. That doesn’t work for the variable with index 0 though. So to avoid that problem, internally literals use the least significant bit as a negation marker, shifting the variable index one bit to the left.</p>
<p>To save on memory usage and bandwidth, literals and variables are stored in 32 bits. This limits the number of variables to $2^{31}$. For now the actual limit in varisat is set quite a bit below $2^{31}$, leaving room for further flags or sentinel values. As far as I know, most SAT solvers do this.</p>
<p>For variables and literals I’m quite happy with the existing code from varisat 0.1, providing the types <code>Var</code> and <code>Lit</code>, so I added it almost <a href="https://github.com/jix/varisat/blob/0369c9fa12ff6d8f4a378a65b58e969cd2cb6c7b/varisat/src/lit.rs">verbatim to the new project</a>.</p>
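The encoding described above can be condensed to a few lines. This is a sketch rather than the actual varisat code linked above; the method names here are mine.

```rust
// Sketch of the literal encoding: the variable index is shifted left
// by one and the least significant bit marks negation.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct Var(u32);

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct Lit(u32);

impl Lit {
    pub fn from_var(var: Var, negative: bool) -> Lit {
        Lit(var.0 << 1 | negative as u32)
    }

    pub fn var(self) -> Var {
        Var(self.0 >> 1)
    }

    pub fn is_negative(self) -> bool {
        self.0 & 1 != 0
    }

    /// Parse a non-zero 1-based DIMACS literal.
    pub fn from_dimacs(number: i32) -> Lit {
        Lit::from_var(Var(number.unsigned_abs() - 1), number < 0)
    }

    /// Convert back to the user facing DIMACS convention.
    pub fn to_dimacs(self) -> i32 {
        let index = self.var().0 as i32 + 1;
        if self.is_negative() { -index } else { index }
    }
}

fn main() {
    // DIMACS variable 3 is internal variable 2; negation is the low bit.
    let lit = Lit::from_dimacs(-3);
    assert_eq!(lit.var(), Var(2));
    assert!(lit.is_negative());
    assert_eq!(lit.to_dimacs(), -3);
    // Variable 0 round-trips fine, which plain sign-negation couldn't do.
    assert_eq!(Lit::from_dimacs(1).to_dimacs(), 1);
}
```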
<p>Equipped with literals we can now implement a type to store a CNF formula. It would be possible to just use a <code>Vec<Vec<Lit>></code>, but that requires an allocation for each clause. Given that formulas with millions of clauses are used in practice, that doesn’t sound so good. Instead I’m going to use a struct with a <code>Vec<Lit></code> containing the literals for all clauses and a <code>Vec<Range<usize>></code> containing the range where each clause’s literals are stored. You can <a href="https://github.com/jix/varisat/blob/0369c9fa12ff6d8f4a378a65b58e969cd2cb6c7b/varisat/src/cnf.rs">see the implementation here</a>.</p>
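The flat storage scheme could look roughly like this. It is a simplified stand-in using plain <code>i32</code> DIMACS literals; the linked implementation uses the solver’s <code>Lit</code> type and differs in detail.

```rust
use std::ops::Range;

/// Sketch of flat CNF storage: the literals of all clauses in one Vec,
/// plus one Range per clause, avoiding a heap allocation per clause.
#[derive(Default)]
pub struct CnfFormula {
    literals: Vec<i32>,
    clause_ranges: Vec<Range<usize>>,
}

impl CnfFormula {
    pub fn add_clause(&mut self, clause: &[i32]) {
        let start = self.literals.len();
        self.literals.extend_from_slice(clause);
        self.clause_ranges.push(start..self.literals.len());
    }

    pub fn len(&self) -> usize {
        self.clause_ranges.len()
    }

    pub fn clause(&self, index: usize) -> &[i32] {
        &self.literals[self.clause_ranges[index].clone()]
    }
}

fn main() {
    let mut formula = CnfFormula::default();
    formula.add_clause(&[1, -2]);
    formula.add_clause(&[2, 3, -1]);
    formula.add_clause(&[]); // empty clauses are representable too

    assert_eq!(formula.len(), 3);
    assert_eq!(formula.clause(0), &[1, -2]);
    assert_eq!(formula.clause(1), &[2, 3, -1]);
    assert!(formula.clause(2).is_empty());
}
```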
<h2 id="parsing">Parsing</h2>
<p>In varisat 0.1 I tried to make the parser as forgiving as possible. It also completely ignored the header line. I think that was a mistake. It added complexity, causing a long standing bug when parsing empty clauses and made it less likely to detect formulas that somehow got truncated. I still don’t require a header, but if one is present and the formula doesn’t match the header, at least a warning is generated.</p>
<p>The <a href="https://github.com/jix/varisat/blob/0369c9fa12ff6d8f4a378a65b58e969cd2cb6c7b/varisat/src/dimacs.rs">parsing code itself</a> is a hand-rolled parser that is largely based on the one in varisat 0.1. You can feed it chunks (byte slices) of input data. While parsing a chunk, clauses are added to a parser internal CNF formula. At any point it is possible to retrieve the clauses parsed so far, clearing the internal CNF formula. This allows for incremental parsing, which is useful for the solver with its own clause database, but also allows for parsing a complete file into a single CNF formula, useful for various utilities. Varisat 0.1 also used incremental parsing, but instead of the caller asking for the clauses parsed so far, it required a callback to process clauses. Combined with error handling that approach wasn’t nice to use.</p>
<p>Also new is some code to write a CNF formula back into a file. This is useful for writing various CNF processing utilities, but even more important for testing.</p>
<h2 id="testing">Testing</h2>
<p>While varisat 0.1 had some tests, my plan is to get much better test coverage for varisat 0.2. I had quite a few bugs that stayed undetected way too long. To make testing easier I’ll be using property based testing for varisat 0.2. I’ve been using property based testing before, for example using <a href="http://hackage.haskell.org/package/QuickCheck">QuickCheck</a> in Haskell or using <a href="https://hypothesis.works/">Hypothesis</a> in Python. Recently I’ve discovered the excellent <a href="https://crates.io/crates/proptest">proptest</a> crate, which is inspired by Hypothesis.</p>
<p>Property based testing changes the focus from individual test cases to more general properties. You specify a set of values, using combinators provided by the library, and some property that should hold for those values. The property is written as normal rust code using assertions. The library will then sample lots of values matching your specification and test them against your property, often finding non-obvious corner cases in the process. In a way it is a hybrid between <a href="https://en.wikipedia.org/wiki/Fuzzing">fuzzing</a> and unit tests. When a counterexample is found, proptest also systematically tries to find simpler counterexamples by shrinking the values. It also saves counterexamples for future regression testing.</p>
<p>The CNF parser and writer are a good <a href="https://github.com/jix/varisat/blob/0369c9fa12ff6d8f4a378a65b58e969cd2cb6c7b/varisat/src/dimacs.rs#L535-L543">example of this</a>. By generating random CNF formulas, writing them, parsing them back and comparing the result, a lot of code paths are exercised and tested with very little effort.</p>
<p>I usually try to write a generic property test first, and then use a code coverage tool to identify what else needs to be tested. In the case of this parser what was left was mostly the checks for invalid syntax and integer overflows.</p>
<h2 id="what-s-next">What’s next?</h2>
<p>The next thing to add is a clause database, probably followed by unit propagation. After that there will be conflict analysis, branching heuristics, learned clause minimization, glue computation, database reduction, database garbage collection, restarts and proof generation, although not in that order. If I’m done with that I’ll be roughly at a point where I am now with varisat 0.1, but hopefully with a much cleaner code base.</p>
<p>I plan to cover the complete refactoring process on this blog. I’m not sure how regular and in what detail, but my goal is that someone not familiar with SAT solver internals can use this series of posts as a starting point for hacking on varisat.</p>
<p>If you don’t want to miss future posts, you can <a href="https://jix.one/index.xml">subscribe to the RSS feed</a> or follow <a href="https://twitter.com/jix_">me on Twitter</a>.</p>
Introducing partial_ref
https://jix.one/introducing-partial_ref/
Mon, 24 Dec 2018 14:07:10 +0100me@jix.one (Jannis Harder)https://jix.one/introducing-partial_ref/<p>Recently there has been some discussion about <a href="http://smallcultfollowing.com/babysteps/blog/2018/11/01/after-nll-interprocedural-conflicts/">interprocedural borrowing conflicts</a> in rust. This is something I’ve been fighting with a lot, especially while working on my SAT solver <a href="//project/varisat">varisat</a>. Around the time Niko Matsakis published his blog post about this, I realized that the existing workarounds I’ve been using in varisat have become a maintenance nightmare. Making simple changes to the code required lots of changes in the boilerplate needed to thread various references to the places where they’re needed.</p>
<!-- more -->
<p>While I didn’t think that a new language feature to solve this would be something I’d be willing to wait for, I decided to sit down and figure out how such a language feature would have to look. I knew that I wanted something that allows for partial borrows across function calls. I also prefer this to work with annotations instead of global inference. While trying to come up with a coherent design that fits neatly into the existing type and trait system, I realized that most of what I wanted can be realized in stable rust today.</p>
<p>Luckily some time ago I came across the <a href="https://crates.io/crates/frunk">frunk</a> crate. From there I learned a trick that I’d call inference driven metaprogramming. Rust requires trait implementations to be unambiguously non-overlapping. The rules for this just consider the implementing type, not any bounds. The trick I’ve learned from frunk is to add an additional type parameter to the trait that would otherwise have overlapping implementations. That type parameter is only used to disambiguate the implementations. As long as there is only one matching implementation once bounds are considered, rust’s powerful type inference will infer that extra type parameter. An example would be frunk’s <a href="https://docs.rs/frunk/0.2.2/frunk/hlist/trait.Plucker.html"><code>Plucker</code></a> trait, where the <code>Index</code> type parameter selects between the otherwise overlapping instances.</p>
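The trick can be sketched without frunk. The names below are mine, modeled on frunk’s <code>Plucker</code>: the <code>Index</code> parameter never appears in a value, it only makes the two otherwise-overlapping impls distinct so the compiler can infer which one applies.

```rust
use std::marker::PhantomData;

// A minimal heterogeneous list.
struct HNil;
struct HCons<H, T> {
    head: H,
    tail: T,
}

// Index markers that disambiguate the otherwise overlapping impls.
struct Here;
struct There<I>(PhantomData<I>);

/// Extract the first value of type `Target`; `Index` is inferred.
trait Plucker<Target, Index> {
    type Remainder;
    fn pluck(self) -> (Target, Self::Remainder);
}

// The target is the head: Index = Here.
impl<Target, Tail> Plucker<Target, Here> for HCons<Target, Tail> {
    type Remainder = Tail;
    fn pluck(self) -> (Target, Tail) {
        (self.head, self.tail)
    }
}

// The target is somewhere in the tail: Index = There<I>.
impl<Target, Head, Tail, I> Plucker<Target, There<I>> for HCons<Head, Tail>
where
    Tail: Plucker<Target, I>,
{
    type Remainder = HCons<Head, Tail::Remainder>;
    fn pluck(self) -> (Target, Self::Remainder) {
        let (target, rest) = self.tail.pluck();
        (target, HCons { head: self.head, tail: rest })
    }
}

fn main() {
    let list = HCons { head: 1u32, tail: HCons { head: "two", tail: HNil } };
    // Only the target type is annotated; the Index parameter
    // (There<Here> in this case) is inferred by the compiler.
    let (s, rest): (&str, _) = list.pluck();
    assert_eq!(s, "two");
    assert_eq!(rest.head, 1u32);
}
```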
<p>Equipped with this, I was able to implement type-level borrow checking logic. Today I’ve released a <a href="https://crates.io/crates/partial_ref">first version of this</a>. The documentation contains a <a href="https://docs.rs/partial_ref/0.1.0/partial_ref/#tutorial">small tutorial</a>. I also documented the <a href="https://github.com/jix/partial_ref/blob/1c201f929f99d363b6cab326e483be72e3f51774/partial_ref/src/lib.rs#L623">type-level borrow checking logic</a>, as the involved types and traits will appear in error messages. While I tried to optimize for readable error messages, I think every trait and type that could be part of one should be documented.</p>
<p>Using the library looks like this (see the tutorial for an explanation):</p>
<pre><code class="language-rust">use partial_ref::*;

part!(pub Neighbors: Vec<Vec<usize>>);
part!(pub Colors: Vec<usize>);
part!(pub Weights: Vec<f32>);

#[derive(PartialRefTarget, Default)]
pub struct Graph {
    #[part = "Neighbors"]
    pub neighbors: Vec<Vec<usize>>,
    #[part = "Colors"]
    pub colors: Vec<usize>,
    #[part = "Weights"]
    pub weights: Vec<f32>,
}

let mut g = Graph::default();
let mut g_ref = g.into_partial_ref_mut();

g_ref.part_mut(Colors).extend(&[0, 1, 0]);
g_ref.part_mut(Weights).extend(&[0.25, 0.5, 0.75]);

g_ref.part_mut(Neighbors).push(vec![1, 2]);
g_ref.part_mut(Neighbors).push(vec![0, 2]);
g_ref.part_mut(Neighbors).push(vec![0, 1]);

pub fn add_color_to_weight(
    mut g: partial!(Graph, mut Weights, Colors),
    index: usize,
) {
    g.part_mut(Weights)[index] += g.part(Colors)[index] as f32;
}

let (neighbors, mut g_ref) = g_ref.split_part_mut(Neighbors);
let (colors, mut g_ref) = g_ref.split_part(Colors);

for (edges, &color) in neighbors.iter_mut().zip(colors.iter()) {
    edges.retain(|&neighbor| colors[neighbor] != color);
    for &neighbor in edges.iter() {
        add_color_to_weight(g_ref.borrow(), neighbor);
    }
}
</code></pre>
<p>I have a bunch of additional features planned, but the next thing I want to do is to refactor my SAT solver to use this library.</p>
<p>I also hope that this library can be used to experiment with partial borrowing to gather experience for a possible future language extension.</p>
Encoding Matrix Rank for SAT Solvers
https://jix.one/encoding-matrix-rank-for-sat-solvers/
Fri, 07 Dec 2018 19:59:25 +0100me@jix.one (Jannis Harder)https://jix.one/encoding-matrix-rank-for-sat-solvers/<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]},
});
</script>
<script src='https://jix.one/js/MathJax/MathJax.js?config=TeX-MML-AM_CHTML'></script>
<p>I’m working on a problem where I want to use a SAT solver to check that a property $P(v_1, \ldots, v_n)$ holds for a bunch of vectors $v_1, \ldots, v_n$, but I don’t care about the basis choice. In other words I want to check whether an arbitrary invertible linear transform $T$ exists so that the transformed vectors have a certain property, i.e. $P(T(v_1), \ldots, T(v_n))$. I solved this by finding an encoding for constraining the rank of a matrix. With that I can simply encode $P(M v_1, \ldots, M v_2)$ where $M$ is a square matrix constrained to have full rank and which therefore is invertible.</p>
<p>There is nothing particularly novel about my encoding, but there are many ways to approach this so I wanted to share my solution.</p>
<p>I assume that encoding field operations is no problem. Currently I’m working in the finite field $\mathbb F_2$, so encoding to propositional logic is trivial. When working in other fields an SMT solver might be more convenient, although other finite fields can be encoded to propositional logic without too much hassle.</p>
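<p>In $\mathbb F_2$, addition is XOR and multiplication is AND, which is exactly why the encoding is trivial. A minimal sketch of mine (plain `bool`s standing in for solver literals; this is an illustration, not solver code):</p>

```rust
// F2 arithmetic maps directly onto propositional logic:
// addition is XOR, multiplication is AND.
fn f2_dot(a: &[bool], b: &[bool]) -> bool {
    a.iter().zip(b).fold(false, |acc, (&x, &y)| acc ^ (x & y))
}

fn main() {
    // (1, 1, 0) . (1, 0, 1) = 1*1 + 1*0 + 0*1 = 1 in F2
    println!("{}", f2_dot(&[true, true, false], &[true, false, true])); // prints "true"
}
```

<p>In an actual encoding each `bool` becomes a literal and each XOR/AND becomes a small clause set or circuit gate.</p>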
<p>When we want to check the rank of a matrix by hand, we’re probably going to use Gaussian elimination to transform the matrix into row echelon form (with arbitrary non-zero leading entries). We then get the rank as the number of non-zero rows. The iterative nature of Gaussian elimination where we are swapping rows and adding multiples of rows to other rows doesn’t lead to a nice encoding. The naive way to encode this would require a copy of the whole matrix for each step of the algorithm.</p>
<p>What we can use instead, is the fact that Gaussian elimination effectively computes an <a href="https://en.wikipedia.org/wiki/LU_decomposition">LU decomposition</a> of a matrix. After performing Gaussian elimination on a matrix $A$, the result will be a matrix $U$ in row echelon form, so that there are matrices $P, L$ with $PA = LU$ where $P$ is a permutation matrix and $L$ is a lower unitriangular matrix (triangular with the diagonal all ones). The permutation matrix $P$ corresponds to the swapping of rows and the matrix $L$ corresponds to adding multiples of a row to rows below it. While Gaussian elimination may swap rows after already adding multiples of a row to it, it will never move a row $i$ above a row $j$ when a multiple of row $j$ was already added to row $i$. That explains why $L$ is still unitriangular even when we swap rows.</p>
<p>You might have noticed that I ignored the details for non-square matrices until now. I will assume that our matrix is wider than tall. As column rank and row rank are the same for a matrix, this is not a restriction: we can transpose a tall matrix into a wide one. For the $PA = LU$ decomposition, with $A$ being an $m \times n$ matrix with $m \le n$ the matrix $P$ will be $m \times m$, the matrix $L$ will be $m \times m$ and $U$ will be $m \times n$.</p>
\begin{align}
\underbrace{\begin{pmatrix}
0 & 0 & 1 \\
0 & 1 & 0 \\
1 & 0 & 0
\end{pmatrix}}_{P}
\cdot
\underbrace{\begin{pmatrix}
0 & 2 & 6 \\
2 & 4 & 6 \\
1 & 1 & 1
\end{pmatrix}}_{A} =
\underbrace{\begin{pmatrix}
1 & 0 & 0 \\
2 & 1 & 0 \\
0 & 1 & 1
\end{pmatrix}}_{L}
\cdot
\underbrace{\begin{pmatrix}
1 & 1 & 1 \\
0 & 2 & 4 \\
0 & 0 & 2
\end{pmatrix}}_{U}
\end{align}
<p>In general the permutation matrix $P$ is not uniquely determined. In iteration $i$ there might be multiple rows $j \ge i$ with the fewest zeros on the left, so different choices will lead to different LU decompositions:</p>
\begin{align}
\underbrace{\begin{pmatrix}
0 & 1 & 0 \\
1 & 0 & 0 \\
0 & 0 & 1
\end{pmatrix}}_{P}
\cdot
\underbrace{\begin{pmatrix}
0 & 2 & 6 \\
2 & 4 & 6 \\
1 & 1 & 1
\end{pmatrix}}_{A} =
\underbrace{\begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
\frac{1}{2} & -\frac{1}{2} & 1
\end{pmatrix}}_{L}
\cdot
\underbrace{\begin{pmatrix}
2 & 4 & 6 \\
0 & 2 & 6 \\
0 & 0 & 1
\end{pmatrix}}_{U}
\end{align}
<p>For full rank matrices, though, once we fix a $P$, the requirement of $L$ being unitriangular and $U$ being in row echelon form completely determines them. For lower rank matrices, some of the entries of $L$ don’t affect the result, so we need to force them to 0 if we want a unique decomposition.</p>
<p>This is already much nicer to encode. The properties of a matrix being unitriangular, a permutation matrix or in row echelon form have straightforward encodings. A matrix product is also easy to encode. Nevertheless we can improve a bit: we can get rid of the permutation matrix and get a uniquely determined decomposition instead.</p>
<p>To do this we need to relax the row-echelon form to something slightly less constrained. We need to perform row swaps to get a non-zero entry in the leftmost position possible. Assuming the matrix is full rank, we could just relax the constraint for the non-zero entry to be the leftmost of all remaining rows and instead take the leftmost non-zero entry in the current row. We still require that all entries below that non-zero entry are zero, but there might be other non-zero entries below <em>and</em> left of it. We wouldn’t even need to select the leftmost non-zero entry, and could instead select any non-zero entry. Choosing the leftmost makes the choice unique and adds a nice symmetry as we require all entries left of and below it to be zero.</p>
<p>This corresponds to running Gaussian elimination on a suitably column-permuted matrix so that we never require row swaps. This works fine in the full rank case, but may not work for lower ranks. To get around that issue we simply allow and skip all-zero rows anywhere in the matrix. A succinct characterization of the resulting matrices is this: The first non-zero entry of each row is the last non-zero entry of its column. I’m not aware of a name for these matrices, if you know what they are called <a href="https://math.stackexchange.com/questions/3030147/name-of-matrices-where-the-first-non-zero-entry-of-each-row-is-the-last-non-zero">please let me know</a>.</p>
<p>With this we can write any square or wide matrix $A$ as $A = LU'$ where $U'$ is of this form and $L$ is lower unitriangular. If $A$ is full rank this is unique, otherwise it is unique up to the entries of $L$ that are multiplied just with zeros. The rank still corresponds to the number of non-zero rows in $U'$, although they may be anywhere in the matrix.</p>
\begin{align}
\underbrace{\begin{pmatrix}
0 & 2 & 6 \\
2 & 4 & 6 \\
1 & 1 & 1
\end{pmatrix}}_{A} =
\underbrace{\begin{pmatrix}
1 & 0 & 0 \\
2 & 1 & 0 \\
\frac{1}{2} & \frac{1}{2} & 1
\end{pmatrix}}_{L}
\cdot
\underbrace{\begin{pmatrix}
0 & 2 & 6 \\
2 & 0 & -6 \\
0 & 0 & 1
\end{pmatrix}}_{U'}
\end{align}
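<p>This relaxed echelon form is easy to check mechanically, which is handy when testing an encoding. A small sketch of mine, using the example matrices above:</p>

```rust
// Check the relaxed row-echelon form: the first non-zero entry of
// each row must be the last non-zero entry of its column.
fn is_relaxed_echelon(m: &[Vec<i32>]) -> bool {
    let rows = m.len();
    for i in 0..rows {
        // find the first non-zero entry of row i (all-zero rows are allowed)
        if let Some(j) = m[i].iter().position(|&e| e != 0) {
            // every entry below it in column j must be zero
            if (i + 1..rows).any(|k| m[k][j] != 0) {
                return false;
            }
        }
    }
    true
}

fn main() {
    // The U' from the example decomposition above.
    let u = vec![vec![0, 2, 6], vec![2, 0, -6], vec![0, 0, 1]];
    println!("{}", is_relaxed_echelon(&u)); // true
    // The original A is not of this form.
    let a = vec![vec![0, 2, 6], vec![2, 4, 6], vec![1, 1, 1]];
    println!("{}", is_relaxed_echelon(&a)); // false
}
```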
<p>Encoding this for a SAT or SMT solver is now straightforward: We encode the required form of $L$ and the required form of $U'$, both simple constraints on non-zero or 1 entries, as well as the number of required non-zero rows of $U'$. The uniqueness of the decomposition ensures that the SAT solver will not spend time exploring the same matrix just expressed differently.</p>
<p><strong>Update:</strong> For full rank matrices you can also just ask for the existence of an inverse matrix, which is even simpler to encode, but has a different runtime behavior. As usual with SAT solvers it’s always worth trying different approaches, as it’s hard to estimate which will be faster for a given problem.</p>Varisat 0.1.3: LRAT Generation and Proof Trimming
https://jix.one/varisat-0.1.3-lrat-generation-and-proof-trimming/
Fri, 14 Sep 2018 14:54:02 +0200me@jix.one (Jannis Harder)https://jix.one/varisat-0.1.3-lrat-generation-and-proof-trimming/<p>I’ve released a new version of my SAT solver <a href="https://jix.one/project/varisat">Varisat</a>. It is now split across two crates: one for <a href="https://crates.io/crates/varisat">library usage</a> and one for <a href="https://crates.io/crates/varisat-cli">command line usage</a>.</p>
<p>The major new features in this release concern the generation of unsatisfiability proofs. Varisat is now able to directly generate proofs in the <a href="https://www.cs.utexas.edu/~marijn/publications/lrat.pdf">LRAT</a> format in addition to the DRAT format. The binary versions of both formats are supported too. Varisat is also able to do on-the-fly proof trimming now. This is similar to running <a href="https://www.cs.utexas.edu/~marijn/drat-trim/">DRAT-trim</a> but processes the proof while the solver runs.</p>
<p>LRAT is an alternative to the DRAT format for unsatisfiability proofs. LRAT proofs are more verbose but faster and easier to check. This is because an LRAT proof contains the propagation steps needed to justify a learned clause, while DRAT requires the checker to rediscover them.</p>
<p>The usual way to generate an LRAT proof is to generate a DRAT proof first. This DRAT proof is then converted to an LRAT proof using DRAT-trim. I figured that it would be much faster to generate the LRAT proof directly from the SAT solver and was <a href="https://www.cs.utexas.edu/~marijn/publications/lrat.pdf#page=6">not convinced that the overhead or complexity of the implementation would be prohibitive</a>.</p>
<p>I still need to do more systematic benchmarking, but preliminary testing gave
promising results. The runtime for direct LRAT generation was often around or less than half the time needed for DRUP generation followed by conversion.</p>
<p>The code I added for direct LRAT generation also made it easy to incorporate a trimming feature similar to DRAT-trim but on the fly. Varisat can buffer a certain number of proof steps, and whenever the buffer is full it removes all steps leading only to deleted and unused clauses. I haven’t compared the effectiveness of this trimming approach to DRAT-trim but the runtime overhead is similar to direct LRAT generation.</p>Introducing Varisat
https://jix.one/introducing-varisat/
Sun, 20 May 2018 15:42:27 +0200me@jix.one (Jannis Harder)https://jix.one/introducing-varisat/<p>I’ve been interested in <a href="https://en.wikipedia.org/wiki/Boolean_satisfiability_problem#Algorithms_for_solving_SAT">SAT solvers</a> for quite some time. These are programs that take a boolean formula and either find a variable assignment that makes the formula true or find a proof that this is impossible. As many difficult problems can be rephrased as the satisfiability of a suitable boolean formula, SAT solvers are incredibly versatile and useful. I’ve recently finished and now released a first version of my SAT solver, <a href="https://crates.io/crates/varisat">Varisat</a>, on crates.io.</p>
<p>Most modern state of the art SAT solvers are based on the <a href="https://en.wikipedia.org/wiki/Conflict-Driven_Clause_Learning">conflict driven clause learning (CDCL)</a> algorithm. With some handwaving this algorithm could be seen as a clever combination of recursive search, backtracking, resolution and local search.</p>
<p>The CDCL algorithm uses a lot of heuristics and can be extended in many ways. This is where different CDCL based solvers take different approaches and where a lot of active research happens.</p>
<p>A few years ago I decided to write my own CDCL based SAT solver. I wanted to get an in depth understanding of the CDCL algorithm and also have a code base I’m familiar with so I can easily experiment with new ideas. I started writing several prototypes. First I used C++ and later I switched to Rust. Earlier this year I decided that my current prototype was good enough to turn into a complete, usable solver. Just in time to enter this year’s <a href="http://sat2018.forsyte.tuwien.ac.at/">SAT competition</a>.</p>
<p>As varisat is in an early stage of development, implementing little beyond the minimum required for a modern CDCL based SAT solver, I don’t expect it to win any prizes in the competition. There are many problem instances that benefit a lot from additional techniques that varisat just doesn’t offer yet. Nevertheless I was pleasantly surprised to find that it already is competitive for some graph coloring instances I needed to solve in the meantime.</p>
<p>Besides turning varisat into a library (command line only for now), I plan to incrementally add more and more of the proven techniques used by state of the art solvers. I also want to try some of my ideas for novel techniques and hope to find the time to write more about working on varisat.</p>Not Even Coppersmith's Attack
https://jix.one/not-even-coppersmiths-attack/
Sat, 23 Dec 2017 18:18:52 +0100me@jix.one (Jannis Harder)https://jix.one/not-even-coppersmiths-attack/<p>Earlier this year, in October, a new widespread cryptography vulnerability was announced.
The <a href="https://crocs.fi.muni.cz/public/papers/rsa_ccs17">initial announcement</a> didn’t contain details
about the vulnerability or much detail on how to attack it (it has since been updated).
It did state the affected systems though: RSA keys generated using smartcards and similar devices that use Infineon’s RSALib.
The announcement came with obfuscated code that would check whether a public key is affected.
Also, the name chosen by the researchers was a small hint on how to attack it: “Return of Coppersmith’s Attack”.</p>
<p>I decided to try and figure out the details before the conference paper describing them would be released.
By the time the paper was released, I had reverse engineered the vulnerability and implemented my own attack, which did not use Coppersmith’s method at all.
This post explains how I figured out what’s wrong with the affected RSA-keys and how I used that information to factor affected 512-bit RSA-keys.</p>
<h2 id="reversing-the-vulnerability">Reversing the Vulnerability</h2>
<p>I started looking at the vulnerability when a friend pointed me to a
deobfuscated version of the detection code:</p>
<blockquote>
<p>So this is the core of the Infineon RSA fail key detector: <a href="https://marcan.st/paste/MOEoh2EH.txt">https://marcan.st/paste/MOEoh2EH.txt</a> - this is very interesting indeed (and a huge fail).</p>
<p>– <a href="https://twitter.com/marcan42/status/921297567664652288">@marcan42</a> on twitter</p>
</blockquote>
<p>At that point the ROCA paper wasn’t published yet.
Figuring out how these keys are generated and how to attack them seemed like a
nice challenge.</p>
<p>The detection code gives a first hint on this.
It takes the public modulus $N$ and reduces it modulo a set of small
primes $\{p_0, p_1, p_2, \ldots, p_m\}$.
For each prime $p_i$ it tests whether the remainder belongs to a set of allowed
remainders $R_i$.
If all remainders are in the corresponding set of allowed remainders, the key
is flagged as vulnerable.</p>
<p>The first few tests are:
\begin{align}
N \bmod 11 &\in \{1, 10\} \\
N \bmod 13 &\in \{1, 3, 4, 9, 10, 12\} \\
N \bmod 17 &\in \{1, 2, 4, 8, 9, 13, 15, 16\} \\
N \bmod 19 &\in \{1, 4, 5, 6, 7, 9, 11, 16, 17\} \\
N \bmod 37 &\in \{1, 10, 26\} \\
\vdots
\end{align}
</p>
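<p>These membership tests are trivial to replay. A sketch of mine covering just the five tests quoted above (the real detector uses the full list of 17 primes, and real moduli are bignums; a `u64` argument is enough for illustration):</p>

```rust
// Replay the first five allowed-remainder tests from the
// deobfuscated detection code.
fn passes_prefix_tests(n: u64) -> bool {
    let tests: [(u64, &[u64]); 5] = [
        (11, &[1, 10]),
        (13, &[1, 3, 4, 9, 10, 12]),
        (17, &[1, 2, 4, 8, 9, 13, 15, 16]),
        (19, &[1, 4, 5, 6, 7, 9, 11, 16, 17]),
        (37, &[1, 10, 26]),
    ];
    tests.iter().all(|&(p, allowed)| allowed.contains(&(n % p)))
}

fn main() {
    println!("{}", passes_prefix_tests(1)); // true: 1 is in every set
    println!("{}", passes_prefix_tests(2)); // false: 2 mod 11 = 2 is not allowed
}
```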
<p>This doesn’t look good.</p>
<p>If you take a random RSA key or large prime and reduce it modulo small primes,
you’re expected to see all non-zero remainders evenly distributed.
You can’t get a zero, as that would mean there is a small prime factor.
For the full list of 17 small primes, only one in
$\prod_i \frac{p_i - 1}{|R_i|} \approx 2^{27.8}$ possible keys has this
property.
<p>While an unintended loss of 27.8 bits of entropy sounds bad as it is, I assumed
that this is only a symptom of whatever went wrong when generating those keys.
While it would be possible to generate an RSA key from a uniform distribution
of keys like this, it would be slower and more complicated than the
straightforward correct way.
You’d also have to deliberately restrict the allowed remainders, which seemed unlikely.</p>
<p>To figure out the flaw in Infineon’s RSALib, let’s first look at properly
generating RSA keys.
[Disclaimer: Don’t use this blog post as reference for implementing this.]
The public modulus $N$ is the product of two large primes $P, Q$ of roughly
equal bit size.
You can constrain the bit size of $P, Q$ and $N$ by uniformly selecting primes
$P$ and $Q$ from a suitable interval $I$.</p>
<p>The easiest way to uniformly select an element with a given property $T$ in an
interval $I$ is rejection sampling:<sup class="footnote-ref" id="fnref:1"><a href="#fn:1">1</a></sup>
Uniformly select <em>any</em> element $x \in I$ (easy), check whether the property
$T(x)$ holds (hopefully easy), restart if it doesn’t.
The average number of iterations rejection sampling needs is inversely
proportional to the probability of a random $x \in I$ having the property.
The prime number theorem tells us that the probability of a random number
smaller than $N$ being prime is $\frac{1}{\log N}$.
For generating primes using rejection sampling this gets us a number of
iterations that grows linearly with the bit size.</p>
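<p>The rejection-sampling loop can be sketched in a few lines. This is my own toy illustration: a tiny LCG stands in for a cryptographic RNG, trial division for a real primality test, and the modulo reduction into the interval is slightly biased — none of this is code from RSALib or any real library:</p>

```rust
// Toy rejection sampling for a prime in [lo, hi): repeatedly draw a
// candidate and accept the first one that passes the primality test.
fn is_prime(n: u64) -> bool {
    if n < 2 { return false; }
    let mut d = 2;
    while d * d <= n {
        if n % d == 0 { return false; }
        d += 1;
    }
    true
}

fn sample_prime(lo: u64, hi: u64, seed: &mut u64) -> u64 {
    loop {
        // LCG step (constants from Knuth's MMIX), then reduce into [lo, hi).
        *seed = seed.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        let candidate = lo + *seed % (hi - lo);
        if is_prime(candidate) {
            return candidate; // accept; otherwise reject and retry
        }
    }
}

fn main() {
    let mut seed = 42;
    let p = sample_prime(1 << 15, 1 << 16, &mut seed);
    println!("{}", p); // some 16-bit prime
    assert!(is_prime(p));
}
```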
<p>This is already quite efficient, but can be optimized.
A simple way to halve the expected number of required iterations is to sample a
uniform odd number instead of any number within $I$.
A further improvement would be to uniformly sample an odd number that is not
divisible by three, but this already isn’t as straightforward anymore.
It’s possible to continue like this by constructing and uniformly sampling more
and more intricate subsets of $I$ that still contain all primes of $I$.
But it is also getting harder to correctly do this, while the possible speedup
is getting smaller and smaller.</p>
<p>This looks like a good place to screw up key generation, so let’s keep that in
mind and look at the RSALib generated keys again.
Assuming $P$ and $Q$ are generated independently and $N \bmod p_i \in R_i$,
there must be a $R’_i$ so that $P \bmod p_i \in R’_i$ and
$Q \bmod p_i \in R’_i$, i.e. $N$ can only be restricted modulo a small prime if
$P$ and $Q$ also are.
As any combination should be possible, we expect
$R_i = \{ab \bmod p_i \mid a, b \in R'_i\}$.
<p>Playing around with the numbers in the detection code quickly shows that
multiplying any two numbers in $R_i$ always results in another number in
$R_i \pmod{p_i}$.
Together with $1 \in R_i$ for all $R_i$, this led me to assume $R'_i = R_i$.
I didn’t rule out other possibilities in general, but for
$R_0 = \{1, 10\}$ nothing else would work.</p>
<p>The next step was to identify what led to the specific sets $R_i$.
We start with some observations:
As $R_i$ doesn’t contain zero, it is a <em>subset</em> of $\zzm{p_i}$, the
multiplicative group of integers modulo $p_i$.
We also discovered that $R_i$ is closed under multiplication modulo $p_i$.
This makes $R_i$ also a <em>subgroup</em> of $\zzm{p_i}$.
As $p_i$ is a prime, $\zzm{p_i}$ is a cyclic group, and thus $R_i$ is also a
cyclic group.
In particular this means there is a generator $a_i$, so that
$R_i = \{a_i^k \mid k \in \mathbb Z\}$.
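<p>This subgroup structure is easy to confirm numerically. A sketch of mine that generates the cyclic subgroup from a candidate generator and checks closure for $R_0 = \{1, 10\}$ modulo $11$:</p>

```rust
// Generate the cyclic subgroup {a^1, a^2, ...} modulo a small prime p.
fn generated_subgroup(a: u64, p: u64) -> Vec<u64> {
    let mut elems = Vec::new();
    let mut x = a % p;
    while !elems.contains(&x) {
        elems.push(x);
        x = x * (a % p) % p;
    }
    elems.sort();
    elems
}

fn main() {
    // 10 generates R_0 = {1, 10} mod 11, since 10^2 = 100 = 1 (mod 11).
    println!("{:?}", generated_subgroup(10, 11)); // [1, 10]
    // 4 generates R_1 = {1, 3, 4, 9, 10, 12} mod 13.
    println!("{:?}", generated_subgroup(4, 13)); // [1, 3, 4, 9, 10, 12]
    // Closure under multiplication follows from a^i * a^j = a^(i+j).
    let r0 = [1u64, 10];
    assert!(r0.iter().all(|a| r0.iter().all(|b| r0.contains(&(a * b % 11)))));
}
```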
<p>This is not exciting yet, but so far we only looked at what happens modulo
individual small primes.
I considered it much more likely that the RSALib code worked modulo the product
$M$ of several small primes.</p>
<p>At that point I was thinking this: If, modulo a small prime $p_i$, all possible
values are generated by an element $a_i$ that is not a generator for the whole
group $\zzm{p_i}$, could it be that, modulo $M = \prod_i p_i$, all possible
values are also generated by a single element $a$, so that $a_i = a \bmod p_i$?</p>
<p>This would be a bad idea, but we already know that someone went ahead with a
bad idea, so concerning our hypothesis it’s a point in favor.
So why is it a bad idea?
Generating a prime candidate $P$ so that $P \bmod M$ is in $\zzm{M}$ sounds like
a good idea.
It would exclude all values that have a common factor with $M$, and thus cannot
be prime, making our candidate more likely to be prime.
So far that’s not a problem.
The problem is sampling from $\zzm{M}$ by raising a single value $a$ to a
random power.
$\zzm{p_i}$ are cyclic groups, as $p_i$ is prime, and thus they do have a
generating element $b_i$.
It’s just not the $a_i$ used.
In general for composite $k$ the group $\zzm{k}$ is not cyclic, i.e. it is not
generated by a single element.
So whatever $a$ they used, it only generates a subgroup $R$ of $\zzm{M}$.
Even worse, that subgroup $R$ would be a lot smaller than
$R_0 \times R_1 \times \ldots \times R_m$, as that group again isn’t cyclic.
The order of $R$ is given by $|R| = \lcm_i |R_i|$.
This can be seen by considering that
$a_i^k \equiv a_i^{k \bmod |R_i|} \pmod{p_i}$ and counting the possible
combinations of $k \bmod |R_i|$ and $k \bmod |R_j|$.</p>
<p>At this point only one in $\frac{|\zzm{M}|}{|R|} \approx 2^{69.95}$ possible
primes could be generated, but we haven’t validated our assumption yet.</p>
<p>Equipped with the test vectors that came with the original detection code, I
searched for a matching generator $a$ modulo a subset of the small primes.
I did this by combining all possible combinations of $a_i$ using the Chinese
remainder theorem (CRT).
I started with a small subset of the small primes, as this was much faster and
could falsify the hypothesis if no match was found.
As soon as $65537$ appeared as a candidate I knew that my guesses were right.
$65537 = 2^{16} + 1$ is a prime larger than our small primes, thus coprime to $M$,
and would be a generator of $\zza{M}$, the <em>additive</em> group of integers modulo
$M$, which <em>is</em> a cyclic group.
Also multiplication with $2^{16} + 1$ can be very fast, especially on 8 and
16-bit microcontrollers.</p>
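<p>We can verify the collapse numerically. Sticking to the five detector primes quoted earlier, the multiplicative order of $65537$ modulo each prime matches the sizes of the sets $R_i$ above, and their lcm is far below their product (my own sketch; the full list up to $167$ behaves the same way):</p>

```rust
// Multiplicative order of a modulo prime p: smallest k > 0 with a^k = 1 (mod p).
fn order(a: u64, p: u64) -> u64 {
    let mut x = a % p;
    let mut k = 1;
    while x != 1 {
        x = x * (a % p) % p;
        k += 1;
    }
    k
}

fn gcd(a: u64, b: u64) -> u64 { if b == 0 { a } else { gcd(b, a % b) } }

fn main() {
    // Orders of 65537 modulo the first few small primes from the detector.
    let primes = [11u64, 13, 17, 19, 37];
    let orders: Vec<u64> = primes.iter().map(|&p| order(65537, p)).collect();
    println!("{:?}", orders); // [2, 6, 8, 9, 3] -- the sizes of the R_i above
    // |R| is the lcm of the individual orders, far below their product.
    let lcm = orders.iter().fold(1u64, |l, &o| l / gcd(l, o) * o);
    let product: u64 = orders.iter().product();
    println!("{} vs {}", lcm, product); // 72 vs 2592
}
```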
<p>Confusing the properties of $\zza{M}$ and $\zzm{M}$ could be an explanation of
why someone inexperienced with cryptography wouldn’t see a problem with this
approach.
It does not explain why someone was allowed to go ahead with their own way of
generating primes or why no one able to spot this mistake reviewed or audited this algorithm,
especially given the intended applications.</p>
<p>We’re not quite done identifying the vulnerability yet.
When looking at the set of small primes used, you can see that some primes are
skipped.
But what if they were only skipped in the deobfuscated and optimized detection
code, because $a$ happened to be a generator for $\zzm{p_i}$?
In fact when marcan published the deobfuscated code he mentioned that he
removed no-op tests.
Even if $a$ is a generator for $\zzm{p_i}$ we shouldn’t discard it, as the
cyclic subgroup $R$ generated by $a$ modulo $M$ is smaller than the product of
the individual subgroups $R_i$.</p>
<p>Using the test vectors I verified that for 512-bit keys the set of small primes
consists of all primes up to $167$.
Recomputing the size of the cyclic subgroup $R$ of $\zzm{M}$ shows that only
one in $\frac{|\zzm{M}|}{|R|} \approx 2^{154.89}$ possible primes can be
generated.
This loses more than half of the expected entropy.</p>
<h2 id="the-attack">The Attack</h2>
<p>Having so much information about the private key can be enough to very quickly
factor it.
Ignoring the kind of information we have, just counting the bits of entropy, it
could be possible to efficiently factor the key using variants of Coppersmith’s
method.
The CA in ROCA also stands for Coppersmith’s attack, but a straightforward
application isn’t possible.
While the entropy of the information we gained from this vulnerability is
enough, it doesn’t have the right form.</p>
<p>Coppersmith’s method is applicable if we know that a factor has the form $c +
kx$ for fixed $c$ and $k$, and $|x| < N^{\frac{1}{4}}$.
This is the case when we know consecutive bits of the binary representation or
know the factor modulo any number of a suitable size.
In our case we only know that the factors have the form $(a^i \bmod M) + Mx$
for small $i$ and $x$.
If we could afford to just bruteforce all possible values for $i$, we could
apply Coppersmith’s method, assuming $|x| < N^{\frac{1}{4}}$ holds.</p>
<p>There are $|R| \approx 2^{61.09}$ possible values for $i$.
So at first, this looks too expensive.
On the other hand, in this case $|x| < \frac{2^{257}}{M} \approx 2^{37.81} \ll N^{\frac{1}{4}}$,
so maybe it is possible to find some trade-off.</p>
<p>We are looking for a way to make $|R|$ and $M$ smaller, without making $M$ too
small.
Luckily this is easy: we can just ignore some of the small primes $p_i$.
This results in a smaller $M'$, just the product of the new primes
$\prod_i p'_i$, and a smaller $R' \subset \zzm{M'}$.
As $|R'|$ depends on the common factors of the $|R'_i|$, it can be a bit
difficult to find an optimal trade-off.</p>
<p>I implemented this attack using the Coppersmith implementation of
<a href="https://pari.math.u-bordeaux.fr/">PARI/GP</a>, but no matter what trade-off I
chose, my estimated runtime was much higher than the published one.
As this is the attack described in the ROCA paper, in retrospect, I think the
Coppersmith implementation I chose was just not optimized enough for this use
case.
In addition to that I might have missed the optimal choice for $M'$, but even a
single invocation of Coppersmith’s method was much slower for me.</p>
<p>This prompted me to try different approaches.
I hit many dead ends, until I came across an older but interesting attack for
factoring RSA keys with partial knowledge of a factor.
The attack was published in 1986 by Rivest and Shamir, the R and S in RSA.
The paper is called <a href="https://link.springer.com/chapter/10.1007/3-540-39805-8_3">“Efficient Factoring Based on Partial Information”</a>.</p>
<p>Compared to Coppersmith’s method it has the downside of needing not only half
the bits of a factor, i.e. $|x| < N^{\frac{1}{4}}$, but two thirds, i.e. $|x| <
N^{\frac{1}{6}}$.
This might seem bad at first, but as $M'$ grows faster than $|R'|$, the bruteforcing
work doesn’t increase as much as going from $N^{\frac{1}{4}}$ to
$N^{\frac{1}{6}}$ might suggest.
Also, I guessed that it would be so much faster than Coppersmith’s method that,
for 512-bit keys, it would more than make up for that.</p>
<p>To understand why, we need to look at how the attack works.
The paper describes a slightly different scenario.
It assumes we know the factors have the form $x + 2^m c$.
This corresponds to knowing the most significant bits of the binary
representation of a factor.
The described attack works without modification for factors of the form $x +
Mc$ as it doesn’t make use of the fact that $2^m$ is a power of two.</p>
<p>If we assume $P = x + M c$ and $Q = y + M d$ we get
\begin{align}
N &= xy + dxM + cyM + cdM^2.
\end{align}
We also assume that $0 \le x \le M$ and $0 \le y \le M$.</p>
<p>Let $t = N - cdM^2$, a constant we know, and we get
\begin{align}
t &= xy + Mdx + Mcy.
\end{align}
The paper then presents a heuristic argument, which is roughly this:
Because $xy$ is much smaller than $t$, $Mdx$ and $Mcy$, it is likely that
replacing $xy$ with $s$ and searching for the solution minimizing $s$ in
\begin{align}
t &= s + Mdx + Mcy
\end{align}
results in a solution to the original equation and thereby in a factorization of $N$.</p>
<p>This is a two-dimensional integer programming instance, i.e. a set of
integer linear constraints (the bounds for $x$ and $y$), an integer linear
objective (minimize $s = t - Mdx - Mcy$) and two integer unknowns ($x$ and
$y$).
It is then noted that integer programming in a fixed number of dimensions can
be solved in polynomial time.</p>
<p>The paper also mentions that a similar approach would work for knowing the
<em>least</em> significant bits of a factor.
This corresponds to $P = c + Mx$ and $Q = d + My$ with $0 \le x \le \sqrt{M}$
and $0 \le y \le \sqrt{M}$, which is exactly what we need.</p>
<p>In this case we get
\begin{align}
N &= cd + dxM + cyM + xyM^2 \\
t &= \frac{N - cd}{M} \\
t &= dx + cy + xyM.
\end{align}
</p>
<p>Again, we’d like to get rid of the $xy$ term, to make it a linear problem.
I did this by working modulo $M$:
\begin{align}
t &\equiv dx + cy \pmod{M}
\end{align}
</p>
<p>Usually for RSA-keys we know an upper bound for $P$ and $Q$, which together
with $N$ also translates to a lower bound.
From this we can compute bounds for $x$ and $y$.</p>
<p>Here I noticed that it is possible to find a solution using an approach more
direct than integer programming.
The solutions to $t \equiv dx + cy \pmod{M}$ form a two-dimensional affine lattice.
To understand how, we need to define lattices first.</p>
<p>Given an $n \times d$ matrix $B$ consisting of $d$ linearly independent column vectors $B = (\mathbf b_0, \mathbf b_1, \ldots, \mathbf b_{d-1})$ the corresponding lattice $L$ is the set of integer linear combinations of these vectors:
\begin{align}
L = \{B \mathbf g \mid \mathbf g \in \mathbb Z^d\}
\end{align}
As this set is closed under negation and addition, it forms a subgroup of $\mathbb R^n$.
Luckily we are working in two dimensions, which makes it easy to visualize lattices:</p>
<p><img src="lattice.svg" alt="Two-dimensional lattice example" class="large-figure">
The green dots are the lattice points and the red vectors are the basis vectors.</p>
<p>The basis vectors for a lattice are not unique, adding an integer multiple of
one basis vector to another generates the same lattice.
This is easy to see, as you can get the original lattice vector by subtracting
the same multiple again, so every integer linear combination of either basis is
also an integer linear combination of the other.
Negating a basis vector or exchanging the position of two vectors also doesn’t
change the generated lattice.
Performing an arbitrary number of those operations is equivalent to
multiplying the basis $B$ with an unimodular matrix $U$, i.e. an integer matrix $U$ with $|\det U| = 1$.
This makes sense as those matrices are exactly the integer matrices which have an integer inverse.</p>
<p><img src="equivalent.svg" alt="Two equivalent bases" class="large-figure"></p>
<p>$\mathbf b_0, \mathbf b_1$ and $\mathbf b'_0, \mathbf b'_1$ define the same lattice: $\mathbf b'_0 = \mathbf b_0 + \mathbf b_1, \mathbf b'_1 = \mathbf b_1 - 2\mathbf b'_0$.</p>
<p>An affine lattice is a lattice with an offset.
Given a basis $B$ and an offset vector $\mathbf o$ it consists of the lattice points
$A = \{B \mathbf g + \mathbf o \mid \mathbf g \in \mathbb Z^d\}$.
This is not a group anymore, but adding an element of $L$ to $A$ gives another
point in $A$ and the difference of two points in $A$ is in $L$.</p>
<p>I claimed that the solutions to $t \equiv dx + cy \pmod{M}$ form an affine lattice.
Assume we have a single known solution $(x_0, y_0)$.
It’s not hard to see that adding multiples of $M$ to $x_0$ or $y_0$ is still a valid solution.
These solutions would form an affine lattice, using the basis vectors $(M, 0)$ and $(0, M)$, but that lattice would not contain all solutions.
We know that $c$ and $d$ are coprime to $M$, otherwise $P$ or $Q$ would have a small factor.
This means that we should have a solution $(x, \frac{t - dx}{c})$ for any value of $x$.
Taking the difference of two solutions with consecutive $x$ gives us a basis vector $\mathbf b_0 = (1, \frac{-d}{c})$.
Together with $\mathbf b_1 = (0, M)$ and $\mathbf o = (0, \frac{t}{c})$ this defines an affine lattice containing all solutions.</p>
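<p>To make the basis construction concrete, here is a toy numeric check of mine (the parameters $m = 101$, $c = 3$, $d = 5$, $t = 17$ are made up for illustration; a real instance uses the large $M'$ from above with bignum arithmetic):</p>

```rust
// Build the affine lattice of solutions to t = d*x + c*y (mod m)
// with basis (1, -d/c mod m), (0, m) and offset (0, t/c mod m).
fn mod_inv(a: i64, m: i64) -> i64 {
    // brute-force inverse, fine for tiny toy moduli
    (1..m).find(|&x| a * x % m == 1).unwrap()
}

fn main() {
    let (m, c, d, t) = (101i64, 3, 5, 17);
    let c_inv = mod_inv(c, m);
    let b0 = (1, (m - d) * c_inv % m); // (1, -d/c mod m)
    let b1 = (0, m);
    let o = (0, t * c_inv % m); // offset (0, t/c mod m)
    // every integer combination o + g0*b0 + g1*b1 solves the congruence
    for (g0, g1) in [(0, 0), (1, 0), (2, -1), (-3, 2)] {
        let x = o.0 + g0 * b0.0 + g1 * b1.0;
        let y = o.1 + g0 * b0.1 + g1 * b1.1;
        assert_eq!(((d * x + c * y) % m + m) % m, t);
    }
    println!("all sampled lattice points solve t = d*x + c*y (mod m)");
}
```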
<p>Given this affine lattice, we’re interested in the lattice points within the region defined by our bounds for $x$ and $y$.
If we can find a lattice point closest to a given arbitrary point, we could compute the closest point to the center of that region.
In general for arbitrary lattice dimensions that problem is NP-complete.
Luckily for two dimensional lattices this is very efficient.</p>
<p>The difficulty in finding a closest lattice point stems from the fact that basis vectors can point in roughly the same or opposite direction.
In fact for our affine lattice of solutions to $t \equiv dx + cy \pmod{M}$, the basis vectors we derived point in almost the same direction.
Let’s, for a moment, assume the opposite would be the case and that the basis vectors are orthogonal.
We could then just represent a point as a non-integer multiple of the basis vectors and individually round the multiples to the nearest integer.
As moving along one basis vector direction doesn’t affect the closest multiple in the other directions, we would get the nearest point.</p>
<p>When the basis vectors aren’t exactly orthogonal but close, it is possible to bound the distance when approximating the nearest point by independent rounding in each basis vector direction.
Consider the two-dimensional case: rounding in the direction of $\mathbf b_0$ moves the point by $\mu_{0,1} = \frac{\sp{\mathbf b_0}{\mathbf b_1}}{\norm{\mathbf b_1}^2}$ times the length of $\mathbf b_1$ in the direction of $\mathbf b_1$.
The value of $\mu_{0,1}$ is the (signed) length of $\mathbf b_0$ projected onto $\mathbf b_1$, divided by the length of $\mathbf b_1$.</p>
<p><img src="projection.svg" alt="Projection example" class="large-figure">
The orange vector is the projection of the blue vector onto the red vector. It is equal to $\mu$ times the red vector.</p>
<p>This is great because we can find an equivalent basis for which both $|\mu_{0,1}| \le \frac{1}{2}$ and $|\mu_{1,0}| \le \frac{1}{2}$.
This is done using the Lagrange-Gauss algorithm, which finds the shortest basis for a two-dimensional lattice.
It works similarly to the Euclidean algorithm for computing the greatest common divisor of two numbers, repeatedly reducing one value by the other.
Let $\lfloor x \rceil$ be the closest integer to $x$.
If $|\mu_{1,0}| > \frac{1}{2}$ the vector $\mathbf b_1 - \lfloor \mu_{1,0} \rceil \mathbf b_0$ is shorter than $\mathbf b_1$ and can replace it.
The same is true with exchanged basis vectors and $\mu_{0,1}$.
Replacing one lattice vector with a shorter one like this can be iterated until neither $|\mu_{0,1}|$ nor $|\mu_{1,0}|$ are greater than $\frac{1}{2}$.
For basis vectors with integer components the number of iterations needed grows logarithmically with the length of the basis vectors, i.e. linear with their bit size.</p>
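<p>A minimal version of this reduction can be sketched in Python (the actual implementation uses custom optimized routines with float approximations instead):</p>

```python
# Sketch of Lagrange-Gauss reduction for a two-dimensional integer
# lattice basis, as described above. Not the optimized routine from
# the actual C++ implementation.

def iround(n, d):
    """Nearest integer to n/d for d > 0, ties rounded up."""
    return (2 * n + d) // (2 * d)

def lagrange_gauss(b0, b1):
    """Reduce (b0, b1) until |mu_{0,1}| and |mu_{1,0}| are at most 1/2."""
    def dot(u, v):
        return u[0] * v[0] + u[1] * v[1]
    while True:
        # Keep b0 as the shorter of the two vectors.
        if dot(b0, b0) > dot(b1, b1):
            b0, b1 = b1, b0
        # Reduce b1 by the nearest integer multiple of b0.
        k = iround(dot(b0, b1), dot(b0, b0))
        if k == 0:
            return b0, b1
        b1 = (b1[0] - k * b0[0], b1[1] - k * b0[1])
```

When it returns, neither basis vector can be shortened by adding an integer multiple of the other, which is exactly the $|\mu| \le \frac{1}{2}$ condition.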
<p>For such a two-dimensional basis, finding a close point by rounding the multiples of the basis vectors results in a very small distance bound.
I haven’t computed the exact bound, but rounding introduces an offset of at most half a basis vector in each basis vector direction.
This would introduce an error of at most a quarter basis vector each, which is enough to cross the midpoint between two multiples, but not enough to go further.
In practice, for the lattices we’re interested in, the point found by rounding happens to also be the closest point.</p>
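<p>The rounding step itself can be sketched as follows (a hedged illustration, not the author's code): express the target minus the offset in the given basis, round each coefficient to the nearest integer, and map back.</p>

```python
# Sketch: approximate the closest lattice point to a target by
# coordinate-wise rounding in the (ideally reduced) basis.
from fractions import Fraction

def round_to_lattice(b0, b1, o, target):
    """Lattice point B*g + o near target, via rounding the exact
    rational coordinates of target - o in the basis (b0, b1)."""
    tx, ty = target[0] - o[0], target[1] - o[1]
    det = b0[0] * b1[1] - b0[1] * b1[0]
    # Solve g0*b0 + g1*b1 = (tx, ty) over the rationals (Cramer's rule),
    # then round each coefficient to the nearest integer.
    g0 = round(Fraction(tx * b1[1] - ty * b1[0], det))
    g1 = round(Fraction(ty * b0[0] - tx * b0[1], det))
    return (g0 * b0[0] + g1 * b1[0] + o[0],
            g0 * b0[1] + g1 * b1[1] + o[1])
```

For an orthogonal toy basis like $(2,0)$ and $(0,3)$ this reduces to rounding each coordinate to the nearest multiple, matching the intuition above.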
<p>With this we can find a solution to $t \equiv dx + cy \pmod{M}$ which is closest to the midpoint of the rectangle defined by the bounds for $x$ and $y$.
Most of the time this solution is the unique solution within bounds that leads to a factoring of $N$ if such a solution exists.
Sometimes, though, when one basis vector is particularly short, there are multiple solutions within bounds.
Luckily it seems that this only happens when the other basis vector is long.
This means that all solutions within bounds lie on a single line.
In that case a solution can be efficiently found or shown to not exist by recursively bisecting the line and checking whether a point that factors $N$ can exist between the endpoints.</p>
<p>This gives us a complete algorithm to check a single candidate.
Together with an optimized value for $M'$ and an outer loop that bruteforces through the $|R'|$ possible guesses, this allows me to break a ROCA-vulnerable 512-bit RSA key in less than $900$ seconds using a single thread on my laptop.
As the outer loop can be trivially parallelized, breaking those keys on a more powerful server with many threads takes less than $30$ seconds.
I’ve also looked at using this approach for 1024-bit keys, but a rough estimate put the runtime far above the runtime of the ROCA attack.
For larger keys it is even worse, so I didn’t pursue that path.</p>
<h2 id="source-code">Source Code</h2>
<p>I’ve decided to release the <a href="https://gitlab.com/jix/neca">source code</a> of my attack implementation.
It’s implemented in C++ and uses <a href="https://gmplib.org/">GMP</a> for most bignum arithmetic, except inside the lattice reduction where custom routines are used.
It includes some low-level optimizations that I’ve glossed over, for example using floats to approximate bignums while keeping track of the accumulated error.</p>
<p>Feel free to contact me if you want to know more about specific parts or about the implementation.
I’ll also be at the <a href="https://events.ccc.de/congress/2017/wiki/index.php/Main_Page">34c3</a> and am happy to have a chat with anyone interested in this or related things.</p>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]},
TeX: {Macros: {
ord: "\\mathop{\\rm ord}",
lcm: "\\mathop{\\rm lcm}",
zzm: ["\\mathbb Z_{#1}^*", 1],
zza: ["\\mathbb Z_{#1}^+", 1],
sp: ["\\langle #1, #2 \\rangle", 2],
norm: ["\\lVert #1 \\rVert", 1]
}}
});
</script>
<script src='https://jix.one/js/MathJax/MathJax.js?config=TeX-MML-AM_CHTML'></script>
<div class="footnotes">
<hr />
<ol>
<li id="fn:1">Rejection sampling also allows for non-uniform source and target distributions, but simplifies to the described algorithm for a uniform source distribution and a target distribution that is uniform among all values of a given property and zero otherwise.
<a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li>
</ol>
</div>Pushing Polygons on the Mega Drive
https://jix.one/pushing-polygons-on-the-mega-drive/
Tue, 16 May 2017 20:37:45 +0200
me@jix.one (Jannis Harder)
<p>This is a write-up of the polygon renderer used for the Mega Drive demo <a href="https://www.pouet.net/prod.php?which=69648">“Overdrive 2”</a> by Titan, released at the <a href="https://2017.revision-party.net/">Revision 2017</a> Demoparty.
As the Mega Drive can only display tilemaps, not bitmaps, and does not have the video memory mapped into the CPU address space, this turned out to be an interesting problem.
If you have not seen the demo yet, I recommend watching it before continuing.
You can find a <a href="https://youtu.be/gWVmPtr9O0g">hardware capture on YouTube</a>:</p>
<div style="position: relative; padding-bottom: 56.25%; padding-top: 0; height: 0; overflow: hidden; margin: 1em 0;">
<iframe src="https://jix.one/youtube/gWVmPtr9O0g" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;" allowfullscreen frameborder="0" title="YouTube Video"></iframe>
</div>
<p>Ironically, the 3D shown in YouTube’s preview screenshot is not done using the renderer described here; the Mega Drive and ship scenes are, though.</p>
<h2 id="3d-renderer-or-video-player">3D Renderer or Video Player?</h2>
<p>After our demo was shown at Revision, people naturally started speculating about how we realized several of the effects.
Some quickly concluded that a complete 3D renderer in fullscreen, with shading and at that framerate, seems implausible, and I would not disagree.
An alternative theory was that it is just streaming the frames (as nametable and patterns) from the cartridge.
That would certainly work and would actually be faster.
It has the huge downside of taking way too much ROM space though.
A back-of-the-envelope calculation,
assuming deduplication of solid-colored tiles and a slightly higher framerate,
ends up at roughly half of our 8MB ROM being used for 3D frames.
That is about ten times as much as the 3D scenes are using now.
Together with the large amount of PCM samples used by Strobe’s awesome soundtrack that would not leave much for the other effects or Alien’s beautiful graphics.
While there are some ways to compress frames a bit further, we decided on something more suitable for the 3D scenes.</p>
<h2 id="vector-animations">Vector Animations</h2>
<p>If a complete 3D renderer is not possible, it might be a good idea to pre-compute parts of it and only do the final steps on the Mega Drive.
That would be rasterization of projected, shaded and clipped polygons.
The data needed to describe a frame at that stage is quite small.
It consists of the polygon vertices in fixed-point screen coordinates and a palette index for the color.
Fixed-point is needed as rounding to integer coordinates looks wobbly when animated.
Choosing a suitable palette for each frame can be done during the preprocessing.
Storing it takes just a few bytes.
This is a good starting point, but now we need to figure out how to make drawing polygons fast.</p>
<p>There are three optimizations used to speed up polygon rendering:</p>
<ol>
<li>Avoiding overdraw,</li>
<li>Drawing from left to right,</li>
<li>Quickly drawing tiles.</li>
</ol>
<p>I will go through them in order and explain the details.</p>
<h2 id="flattening">Flattening</h2>
<p>There are two problems with just a list of projected polygons.
First, the order in which they are drawn matters.
Assuming no intersecting polygons, sorting them from back to front gives the correct result.
But this still leaves us with a second problem: overdraw.
This is not a problem in the sense that the rendering breaks, but rather in that a pixel is wastefully drawn multiple times, discarding the previous value.
Having to draw into a tilemap amplifies this problem.</p>
<p>The solution is to split all polygons that intersect or overlap, throwing away any resulting polygon that is occluded by another polygon.
This leaves us with a partition or, more specifically, a tessellation of the view plane.
As a further optimization adjacent polygons that have the same color can be joined, as their common edge(s) are not visible.
This can result in quite complex, even non-convex, polygons.
Apart from a small exception described in the next section, this is not a problem though.</p>
<p><img src="flattening.svg" alt="Flattening example using two cubes" class="large-figure"></p>
<h2 id="drawing-from-left-to-right">Drawing from Left to Right</h2>
<p>As we now have a tessellation of the view plane, drawing all polygons would compute and draw each edge twice; once for each adjacent polygon.
This can be avoided.
If we draw polygons strictly from left to right, we can use a single table to store all edges that have just one adjacent polygon drawn.
The table is just the x-position of the rightmost drawn edge for each scanline.
I am calling that table the “fringe table”.
It is initialized with all zeros, i.e. the left edge of the view.</p>
<p>There is a small problem though with some non-convex polygons.
If a polygon is U-shaped, it is impossible to draw it strictly from left to right; the enclosed polygon would have to be drawn in between.
Rotated by 90 degrees, i.e. for a C-shaped polygon, this is not a problem though.
This is solved by breaking U-shaped polygons into multiple polygons that do not have gaps in any scanline.
As a small optimization those breaks are preferably inserted on a tile boundary.
Why this is an advantage will become clear later.</p>
<p><img src="breaking.svg" alt="Breaking a U-shaped polygon into two polygons" class="large-figure"></p>
<p>When drawing polygons like this, there is no need to even store the left side of a polygon.
Whenever a polygon is drawn, the left edge is already stored in the fringe table.
So apart from the computation time saved, we also save storage space.</p>
<p>So far we only avoided re-computing the left edges of a polygon; we still need to draw them.
In fact there is no way to avoid drawing the left edges, but we can save time drawing the right edges.
As we know anything beyond the right edges will be overdrawn, we do not need to be exact while drawing.
An easy way to save some time while drawing into a tilemap is to completely fill any tile containing a right edge, leaving it to the adjacent polygon to draw the exact edge.
This brings us to the next optimization.</p>
<h2 id="quickly-drawing-tiles">Quickly Drawing Tiles</h2>
<p>The final challenge to overcome is efficiently drawing to a tilemap.
As a first step the line drawing is decoupled from the handling of the tilemap.
This is done by introducing a copy of the fringe table, called the “outline table”.
In between drawing polygons those two tables are always the same.
The line drawing routine updates the outline table to contain the right side of the polygon.
This is done for all right side edges of the polygon before any actual drawing to the tilemap happens.
Afterwards the polygon to draw is exactly the area delimited to the left by the fringe table and to the right by the outline table.</p>
<p><img src="outline-setup.svg" alt="Fringe and outline table set up for a polygon" class="large-figure"></p>
<p>The line drawing routine also outputs the topmost and bottommost y-coordinates of the polygon.
Those are rounded outwards to the next tile boundary, i.e. a multiple of 8 pixels.
This is safe to do, as the fringe and outline table are identical for scanlines outside of the polygon area,
indicating that nothing should be drawn there.
This allows us to process the polygon tile-row by tile-row without a special case for partial tile-rows.</p>
<p>The tile-rows are processed from top to bottom.
First we compute three x-coordinates for each tile-row.
The leftmost and rightmost value in the fringe as well as the rightmost value in the outline of the tile-row.
Those span the area where the polygon needs to be drawn and also divide the tile-row into two segments.
The left segment contains edges of already drawn polygons while the right segment does not.
To avoid special cases, those values are rounded to tile boundaries too.</p>
<p><img src="segments.svg" alt="Left and right segments of a tile-row" class="large-figure"></p>
<p>Both segments are then processed tile by tile from left to right.
The left segment is drawn before the right segment, but we will first look at the right one, which does not contain left edges or any already drawn polygons.
All tiles in the right segment can be completely filled with the color of the current polygon.
This is legal as it will not draw over existing polygons and will only overshoot to the right.
The overshoot will be fixed by drawing subsequent polygons.
To speed this up even more, we precompute a solid-colored pattern for each of the 16 colors.
This means we can draw an 8x8 tile in the right segment by updating a single word in the nametable to point to the precomputed pattern.</p>
<p>Before this happens, though, the left segment is drawn.
Although we draw it first, we can be sure that every tile of the left segment was a tile of a right segment of a previous polygon.
This might sound counterintuitive, but it is possible as the left segment can consist of zero tiles, which it will for the leftmost polygons.</p>
<p>For each tile of the segment there are two cases we need to consider.
One is that the tile was only drawn as part of a right segment so far,
the other is that it was also part of one or more polygon’s left segments.</p>
<p>In the first case, the nametable entry for the tile points to one of the precomputed solid patterns.
In the second case, the nametable entry points to an individual pattern just for this tile.</p>
<p>If the tile was solid so far, we need change the nametable to point to an individual pattern.
This is done by using a simple bump allocator that allocates a continuous range of tiles.
Having a fixed pattern address for each tile would probably be faster here, but it would also mean that the used patterns are scattered throughout memory.
This is a huge downside on the Mega Drive as the VRAM is not memory mapped.
In fact, while drawing a frame, we are not updating the pattern data or nametable at all, but a shadow copy in work memory.
After we are done, a DMA transfer is used to quickly copy it over to VRAM.
At that point having a compact consecutive memory area containing all patterns saves a lot more cycles than using fixed pattern addresses here.</p>
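<p>The bump allocation scheme can be sketched like this (the names are mine, not from the 68000 source); the point is that all allocated patterns end up in one consecutive region that a single DMA transfer can copy to VRAM:</p>

```python
# Hedged sketch of the bump allocator for individual tile patterns.
# A pattern is 8 lines of 8 pixels; one 32-bit word per line.

class PatternBuffer:
    def __init__(self):
        self.lines = []  # shadow copy in work RAM, later DMA'd to VRAM

    def alloc_pattern(self):
        """Allocate 8 consecutive pattern lines; return the tile index
        to store in the nametable entry."""
        tile = len(self.lines) // 8
        self.lines.extend([0] * 8)  # bump: just grow the buffer
        return tile
```

Because allocation only ever appends, the used patterns stay compact and consecutive, which is what makes the final DMA copy cheap.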
<p>After allocating a pattern, we need to draw it.
A newly allocated pattern always needs to be drawn in two colors.
The color of the previously solid tile and the color of the current polygon.
Each line of the pattern will have the old color on the left and the new color on the right, potentially consisting of just one of those colors.
The fringe table tells us where the polygon edge is, i.e. where the color change must be.</p>
<p>Conveniently, a line of a pattern, consisting of 8 pixels at 4 bits each, perfectly fits into a 32-bit register of the 68000 CPU.
This allows us to apply a mask telling us where each color is supposed to go using a single <code>and.l (a0, d0), d1</code> instruction.
The register <code>d0</code> contains the value coming from the fringe table (multiplied by 4 to do long-word-wise addressing) and <code>d1</code> contains the data to mask.
The register <code>a0</code> points into a special table.
The table looks like this, where each line shows the bits of a 32-bit long word.</p>
<pre><code class="language-plain">11111111111111111111111111111111
11111111111111111111111111111111
... many repetitions ...
11111111111111111111111111111111
11111111111111111111111111111111
00001111111111111111111111111111
00000000111111111111111111111111
00000000000011111111111111111111
00000000000000001111111111111111
00000000000000000000111111111111
00000000000000000000000011111111
00000000000000000000000000001111
00000000000000000000000000000000
00000000000000000000000000000000
... many repetitions ...
00000000000000000000000000000000
00000000000000000000000000000000
</code></pre>
<p>Depending on the x-coordinate of the tile, <code>a0</code> points to a different position in that table.
By padding the table with enough all-one or all-zero words, there is no need to do any clipping to the tile boundary, which greatly speeds up drawing.
Together with a smaller table containing solid colored lines and some bit twiddling this completes the drawing routine for newly allocated patterns.</p>
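<p>The masked update can be modelled in a few lines of Python (an assumed illustration of the scheme, not the actual 68000 code); clamping the index plays the role of the all-one and all-zero padding rows of the table:</p>

```python
# Model of the mask-table lookup plus masked line update. One pattern
# line is 8 pixels at 4 bits each, i.e. exactly one 32-bit long word.

ALL_ONES = 0xFFFFFFFF

def mask_for(x):
    """Mask with the leftmost x pixels zero and the rest one. Clamping
    x stands in for the padding rows, so no clipping is needed."""
    x = max(0, min(8, x))
    return ALL_ONES >> (4 * x)

def draw_line(old_pixels, new_pixels, x):
    """Keep the old pixels left of x, fill with the new color from x on."""
    m = mask_for(x)
    return (new_pixels & m) | (old_pixels & ~m & ALL_ONES)
```

For example, with the edge at pixel 3, the three leftmost pixels keep the old color and the remaining five get the new one.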
<p>In the case of a pattern that was already allocated, the same approach is used.
The only difference is that the mask isn’t used to mask between two colors, but between the old data of a line and the new color.</p>
<p>After completing a row the corresponding entries of the outline table are copied into the fringe table and the next tile row is processed.
When all tile rows are drawn the fringe table is the same as the outline table again, ready for the next polygon to be drawn.</p>
<p><img src="polygon-complete.svg" alt="Completed polygon" class="large-figure"></p>
<h2 id="putting-it-all-together">Putting it All Together</h2>
<p>This concludes the description of our polygon renderer routine.
You can see the Mega Drive implementation in action below.
This animation was captured from an emulator running a patched version of the routine.
The patched version copies the nametable and pattern data into VRAM after every polygon and waits for the next frame.
The palette is updated in the end, resulting in the false colors while drawing is in progress.
The garbage tiles that appear sometimes are nametable entries that were not yet touched in the current frame.
Those entries might point to already allocated and redrawn patterns.</p>
<p><img src="animated.gif" alt="Animation of the renderer drawing a frame polygon by polygon" class="figure"></p>
<p>As an overview for anyone who wants to implement this or a similar routine I’ve summarized it using pseudocode:</p>
<pre><code class="language-plain">routine draw_polygon(right_edges, color):
    foreach edge in right_edges:
        update outline, min_y and max_y using line drawing routine
    round min_y and max_y to tiles
    for each row within min_y ... max_y:
        min_fringe = min(fringe[y] for y in current row)
        max_fringe = max(fringe[y] for y in current row)
        max_outline = max(outline[y] for y in current row)
        round min_fringe, max_fringe and max_outline to tiles
        for each column within min_fringe ... max_fringe:
            column_line_table = line_table + column * 8 entries
            if nametable[column, row] is solid:
                old_color = color of nametable[column, row]
                pattern = nametable[column, row] = alloc_pointer
                increment alloc_pointer
                old_pixels = color_table[old_color]
                new_pixels = color_table[color]
                for y in current row:
                    mask = column_line_table[fringe[y]]
                    pattern[y] =
                        (new_pixels & mask) | (old_pixels & ~mask)
            else:
                pattern = nametable[column, row]
                new_pixels = color_table[color]
                for y in current row:
                    old_pixels = pattern[y]
                    mask = column_line_table[fringe[y]]
                    pattern[y] =
                        (new_pixels & mask) | (old_pixels & ~mask)
        for each column within max_fringe ... max_outline:
            nametable[column, row] = solid pattern for color
        for y in current row:
            fringe[y] = outline[y]
</code></pre>
<p>The actual implementation consists of around 600 lines of 68000 assembly, making some use of macros and repeat statements.
The preprocessor was implemented in <a href="https://www.rust-lang.org/">Rust</a>, which I can highly recommend.
The implementation and fine-tuning took somewhere around 3 weeks plus some evening coding.
Most of it was done last summer.
Coming up with and improving the concept behind this was done on and off over many years, not targeting the Mega Drive in particular.</p>
<p>Working on and releasing Overdrive 2 was an awesome experience.
I want to thank everyone involved for making this possible.</p>