Blog on Jix' Site
https://jix.one/
Refactoring Varisat: 1. Basics and Parsing
https://jix.one/refactoring-varisat-1-basics-and-parsing/
Sun, 03 Feb 2019 17:37:42 +0100 · me@jix.one (Jannis Harder)
<p>This is the first post in a series of posts I plan to write while refactoring my
SAT solver varisat. In the process of developing varisat into a SAT solver that can <a href="http://sat2018.forsyte.tuwien.ac.at/index.php?cat=results">compete with some well known SAT solvers</a> like minisat or glucose, it accumulated quite a bit of technical debt. Varisat is the first larger project I’ve written in rust and there are a lot of things I’d do differently now. Before I can turn varisat into a solver that competes with the fastest solvers out there, I need to do some refactoring.</p>
<p>My current plan is to start a new project from scratch copying over bits that I want to keep and rewriting parts that I don’t. That’s usually my preferred way to refactor when I plan to change the overall architecture. The new version will be varisat 0.2, hopefully turning into varisat 1.0.</p>
<p>This refactoring also gives me the chance to write this series of posts, which should make it much easier to understand the code base and contribute to varisat. I’m also moving <a href="https://github.com/jix/varisat">varisat to GitHub</a> and stable rust which should make collaboration within the rust open source ecosystem easier.</p>
<p>I also want to use my new library <a href="https://jix.one/introducing-partial_ref/">partial_ref</a> which should result in far less fighting with the borrow checker. Currently in varisat there are a lot of functions that take way too many parameters, mostly references to different data structures. This is caused by the borrow checker not being flexible enough across function calls. My partial_ref library offers a workaround that I think will be an improvement compared to the workarounds I’ve been using before.</p>
<h2 id="cnf-formulas">CNF Formulas</h2>
<p>SAT solvers determine whether a Boolean formula can be satisfied. They either find an assignment (also called interpretation) of the formula’s variables so that the formula is true, or produce a proof that this is impossible. Usually SAT solvers require the input to be in <a href="https://en.wikipedia.org/wiki/Conjunctive_normal_form">conjunctive normal form</a> (CNF). This means that the formula is a conjunction (Boolean and) of clauses, where a clause is a disjunction (Boolean or) of literals and a literal is a variable or a negated variable. An assignment satisfies a formula in CNF precisely when at least one literal of each clause is true.</p>
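<p>As a concrete illustration (a toy sketch, not varisat’s API; clauses here are <code>Vec</code>s of DIMACS-style <code>i32</code> literals), checking whether an assignment satisfies a CNF formula mirrors this definition directly:</p>

```rust
// A clause is satisfied if at least one literal is true under the assignment;
// the formula is satisfied if every clause is.
// `assignment[v]` holds the value of the DIMACS variable v + 1.
fn satisfies(clauses: &[Vec<i32>], assignment: &[bool]) -> bool {
    clauses.iter().all(|clause| {
        clause.iter().any(|&lit| {
            let value = assignment[(lit.unsigned_abs() - 1) as usize];
            if lit > 0 { value } else { !value }
        })
    })
}

fn main() {
    // (x1 ∨ ¬x2) ∧ (x2 ∨ x3)
    let formula = vec![vec![1, -2], vec![2, 3]];
    assert!(satisfies(&formula, &[true, true, false]));
    assert!(!satisfies(&formula, &[true, false, false]));
    println!("ok");
}
```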
<p>SAT solvers require the input to be in CNF as this is the internal representation used. This isn’t a big restriction though, as it is possible to turn any Boolean formula into an equisatisfiable formula in CNF with <a href="https://en.wikipedia.org/wiki/Tseytin_transformation">only linear overhead by introducing new variables</a>. Equisatisfiable means that either both formulas are satisfiable or both are not. This is a weaker condition than equivalence, which means that exactly the same assignments satisfy both formulas. Here equisatisfiability allows the introduction of new helper variables.</p>
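<p>For example, the Tseytin clauses for a single AND gate introduce one helper variable for the gate output (a hypothetical sketch using DIMACS-style <code>i32</code> literals, not varisat code):</p>

```rust
// Clauses asserting out <-> (a AND b), each literal a DIMACS-style i32.
fn and_gate(a: i32, b: i32, out: i32) -> Vec<Vec<i32>> {
    vec![
        vec![-a, -b, out], // a ∧ b implies out
        vec![a, -out],     // out implies a
        vec![b, -out],     // out implies b
    ]
}

fn main() {
    let clauses = and_gate(1, 2, 3);
    // Exhaustive check: the clauses hold exactly when v3 == (v1 && v2).
    for bits in 0..8u32 {
        let val = |lit: i32| {
            let v = bits >> (lit.unsigned_abs() - 1) & 1 == 1;
            if lit > 0 { v } else { !v }
        };
        let all_satisfied = clauses.iter().all(|c| c.iter().any(|&l| val(l)));
        let (a, b, out) = (bits & 1 == 1, bits >> 1 & 1 == 1, bits >> 2 & 1 == 1);
        assert_eq!(all_satisfied, out == (a && b));
    }
    println!("ok");
}
```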
<p>In varisat 0.1, a formula is directly parsed into the internal data structures of the solver. There is no standalone data type representing a CNF formula. Such a data type isn’t very useful inside the solver. More specialized data structures are used there. Nevertheless, I think such a type would be useful when using varisat as a library. It makes it easier to re-use the parser or write other code that processes CNF formulas.</p>
<p>To represent a formula in CNF we need a type for variables and for literals. Variables are indexed using integers and are represented by their index. This is also the encoding used by the <a href="https://www.satcompetition.org/2009/format-benchmarks2009.html">standard CNF file format (DIMACS CNF)</a>. Internally for varisat the first variable has index 0, while in DIMACS CNF the first variable has index 1. For everything user facing the 1-based DIMACS CNF encoding will be used.</p>
<p>A literal is represented by its variable’s index and a flag that tells us whether the literal is negated or not. In DIMACS CNF the flag is represented by negating the variable index. That doesn’t work for the variable with index 0 though. So to avoid that problem, internally literals use the least significant bit as a negation marker, shifting the variable index one bit to the left.</p>
<p>To save on memory usage and bandwidth, literals and variables are stored in 32 bits. This limits the number of variables to $2^{31}$. For now the actual limit in varisat is set quite a bit below $2^{31}$, leaving room for further flags or sentinel values. As far as I know, most SAT solvers do this.</p>
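<p>A minimal sketch of this encoding (varisat’s actual <code>Var</code> and <code>Lit</code> types are richer, so the names and methods here are illustrative):</p>

```rust
/// A literal: variable index shifted left by one, LSB set when negated.
#[derive(Copy, Clone, PartialEq, Eq, Debug)]
struct Lit(u32);

impl Lit {
    /// Build from a non-zero DIMACS literal (1-based, sign = negation).
    fn from_dimacs(d: i32) -> Lit {
        let var = d.unsigned_abs() - 1; // 0-based variable index
        Lit(var << 1 | (d < 0) as u32)
    }

    fn var_index(self) -> u32 {
        self.0 >> 1
    }

    fn is_negative(self) -> bool {
        self.0 & 1 != 0
    }

    /// Back to the user-facing 1-based DIMACS encoding.
    fn to_dimacs(self) -> i32 {
        let v = (self.var_index() + 1) as i32;
        if self.is_negative() { -v } else { v }
    }
}

fn main() {
    let lit = Lit::from_dimacs(-3);
    assert_eq!(lit.var_index(), 2);
    assert!(lit.is_negative());
    assert_eq!(lit.to_dimacs(), -3);
    println!("ok");
}
```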
<p>For variables and literals I’m quite happy with the existing code from varisat 0.1, providing the types <code>Var</code> and <code>Lit</code>, so I added it almost <a href="https://github.com/jix/varisat/blob/0369c9fa12ff6d8f4a378a65b58e969cd2cb6c7b/varisat/src/lit.rs">verbatim to the new project</a>.</p>
<p>Equipped with literals we can now implement a type to store a CNF formula. It would be possible to just use a <code>Vec<Vec<Lit>></code>, but that requires an allocation for each clause. Given that formulas with millions of clauses are used in practice, that doesn’t sound so good. Instead I’m going to use a struct with a <code>Vec<Lit></code> containing the literals for all clauses and a <code>Vec<Range<usize>></code> containing the range where each clause’s literals are stored. You can <a href="https://github.com/jix/varisat/blob/0369c9fa12ff6d8f4a378a65b58e969cd2cb6c7b/varisat/src/cnf.rs">see the implementation here</a>.</p>
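<p>Sketched with DIMACS-style <code>i32</code> literals standing in for the actual <code>Lit</code> type (names are illustrative, not varisat’s exact API), the idea looks roughly like this:</p>

```rust
use std::ops::Range;

/// All literals in one buffer, with one Range per clause: two growable
/// allocations for the whole formula instead of one allocation per clause.
#[derive(Default)]
struct CnfFormula {
    literals: Vec<i32>,
    clause_ranges: Vec<Range<usize>>,
}

impl CnfFormula {
    fn add_clause(&mut self, clause: &[i32]) {
        let start = self.literals.len();
        self.literals.extend_from_slice(clause);
        self.clause_ranges.push(start..self.literals.len());
    }

    fn clause(&self, index: usize) -> &[i32] {
        &self.literals[self.clause_ranges[index].clone()]
    }

    fn clause_count(&self) -> usize {
        self.clause_ranges.len()
    }
}

fn main() {
    let mut formula = CnfFormula::default();
    formula.add_clause(&[1, -2]);
    formula.add_clause(&[2, 3, -4]);
    assert_eq!(formula.clause_count(), 2);
    assert_eq!(formula.clause(1), &[2, 3, -4]);
    println!("ok");
}
```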
<h2 id="parsing">Parsing</h2>
<p>In varisat 0.1 I tried to make the parser as forgiving as possible. It also completely ignored the header line. I think that was a mistake. It added complexity, caused a long-standing bug when parsing empty clauses, and made it less likely to detect formulas that somehow got truncated. I still don’t require a header, but if one is present and the formula doesn’t match the header, at least a warning is generated.</p>
<p>The <a href="https://github.com/jix/varisat/blob/0369c9fa12ff6d8f4a378a65b58e969cd2cb6c7b/varisat/src/dimacs.rs">parsing code itself</a> is a hand-rolled parser that is largely based on the one in varisat 0.1. You can feed it chunks (byte slices) of input data. While parsing a chunk, clauses are added to a parser internal CNF formula. At any point it is possible to retrieve the clauses parsed so far, clearing the internal CNF formula. This allows for incremental parsing, which is useful for the solver with its own clause database, but also allows for parsing a complete file into a single CNF formula, useful for various utilities. Varisat 0.1 also used incremental parsing, but instead of the caller asking for the clauses parsed so far, it required a callback to process clauses. Combined with error handling that approach wasn’t nice to use.</p>
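<p>A non-incremental toy version of such a parser (skipping comments and the header rather than checking it, and rejecting truncated clauses; this is a sketch, not varisat’s parser) might look like:</p>

```rust
/// Parse a DIMACS CNF body into clauses of i32 literals.
/// Comment lines ('c') and the 'p' header line are skipped in this sketch.
fn parse_dimacs(input: &str) -> Result<Vec<Vec<i32>>, String> {
    let mut clauses = Vec::new();
    let mut current = Vec::new();
    for line in input.lines() {
        let line = line.trim();
        if line.is_empty() || line.starts_with('c') || line.starts_with('p') {
            continue;
        }
        for token in line.split_whitespace() {
            let lit: i32 = token
                .parse()
                .map_err(|_| format!("invalid literal: {:?}", token))?;
            if lit == 0 {
                // 0 terminates a clause
                clauses.push(std::mem::take(&mut current));
            } else {
                current.push(lit);
            }
        }
    }
    if !current.is_empty() {
        return Err("truncated input: clause missing terminating 0".into());
    }
    Ok(clauses)
}

fn main() {
    let input = "c example\np cnf 3 2\n1 -2 0\n2 3 0\n";
    assert_eq!(parse_dimacs(input).unwrap(), vec![vec![1, -2], vec![2, 3]]);
    assert!(parse_dimacs("1 2").is_err());
    println!("ok");
}
```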
<p>Also new is some code to write a CNF formula back into a file. This is useful for writing various CNF processing utilities, but even more important for testing.</p>
<h2 id="testing">Testing</h2>
<p>While varisat 0.1 had some tests, my plan is to get much better test coverage for varisat 0.2. I had quite a few bugs that stayed undetected way too long. To make testing easier I’ll be using property based testing for varisat 0.2. I’ve been using property based testing before, for example using <a href="http://hackage.haskell.org/package/QuickCheck">QuickCheck</a> in Haskell or using <a href="https://hypothesis.works/">Hypothesis</a> in Python. Recently I’ve discovered the excellent <a href="https://crates.io/crates/proptest">proptest</a> crate, which is inspired by Hypothesis.</p>
<p>Property based testing changes the focus from individual test cases to more general properties. You specify a set of values, using combinators provided by the library, and some property that should hold for those values. The property is written as normal rust code using assertions. The library will then sample lots of values matching your specification and test them against your property, often finding non-obvious corner cases in the process. In a way it is a hybrid between <a href="https://en.wikipedia.org/wiki/Fuzzing">fuzzing</a> and unit tests. When a counterexample is found, proptest also systematically tries to find simpler counterexamples by shrinking the values. It also saves counterexamples for future regression testing.</p>
<p>The CNF parser and writer are a good <a href="https://github.com/jix/varisat/blob/0369c9fa12ff6d8f4a378a65b58e969cd2cb6c7b/varisat/src/dimacs.rs#L535-L543">example for this</a>. By generating random CNF formulas, writing them, parsing them back and comparing the result, a lot of code paths are exercised and tested with very little effort.</p>
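<p>In the same spirit, even without proptest a hand-rolled roundtrip property fits in a few lines; this sketch checks the encode/decode of the literal representation described earlier against pseudo-random inputs from a simple LCG (the constants and bounds are arbitrary):</p>

```rust
// Minimal LCG so the sketch needs no external crates.
fn next(state: &mut u64) -> u64 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    *state
}

// Internal literal code: variable index shifted left, LSB marks negation.
fn encode(dimacs: i32) -> u32 {
    (dimacs.unsigned_abs() - 1) << 1 | (dimacs < 0) as u32
}

fn decode(code: u32) -> i32 {
    let var = ((code >> 1) + 1) as i32;
    if code & 1 != 0 { -var } else { var }
}

fn main() {
    let mut state = 0xdead_beef;
    for _ in 0..10_000 {
        let var = (next(&mut state) % 1_000 + 1) as i32;
        let dimacs = if next(&mut state) & 1 == 0 { var } else { -var };
        // the property: decoding after encoding is the identity
        assert_eq!(decode(encode(dimacs)), dimacs);
    }
    println!("ok");
}
```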
<p>I usually try to write a generic property test first, and then use a code coverage tool to identify what else needs to be tested. In the case of this parser what was left was mostly the checks for invalid syntax and integer overflows.</p>
<h2 id="what-s-next">What’s next?</h2>
<p>The next thing to add is a clause database, probably followed by unit propagation. After that there will be conflict analysis, branching heuristics, learned clause minimization, glue computation, database reduction, database garbage collection, restarts and proof generation, although not in that order. Once I’m done with that, I’ll be roughly at the point where I am now with varisat 0.1, but hopefully with a much cleaner code base.</p>
<p>I plan to cover the complete refactoring process on this blog. I’m not sure how regular and in what detail, but my goal is that someone not familiar with SAT solver internals can use this series of posts as a starting point for hacking on varisat.</p>
<p>If you don’t want to miss future posts, you can <a href="https://jix.one/index.xml">subscribe to the RSS feed</a> or follow <a href="https://twitter.com/jix_">me on Twitter</a>.</p>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]},
});
</script>
<script src='https://jix.one/js/MathJax/MathJax.js?config=TeX-MML-AM_CHTML'></script>
Introducing partial_ref
https://jix.one/introducing-partial_ref/
Mon, 24 Dec 2018 14:07:10 +0100 · me@jix.one (Jannis Harder)
<p>Recently there has been some discussion about <a href="http://smallcultfollowing.com/babysteps/blog/2018/11/01/after-nll-interprocedural-conflicts/">interprocedural borrowing conflicts</a> in rust. This is something I’ve been fighting with a lot, especially while working on my SAT solver <a href="//project/varisat">varisat</a>. Around the time Niko Matsakis published his blog post about this, I realized that the existing workarounds I’ve been using in varisat have become a maintenance nightmare. Making simple changes to the code required lots of changes in the boilerplate needed to thread various references to the places where they’re needed.</p>
<!-- more -->
<p>While I didn’t think that a new language feature to solve this would be something I’d be willing to wait for, I decided to sit down and figure out how such a language feature would have to look. I knew that I wanted something that allows for partial borrows across function calls. I also prefer this to work with annotations instead of global inference. While trying to come up with a coherent design that fits neatly into the existing type and trait system, I realized that most of what I wanted can be realized in stable rust today.</p>
<p>Luckily some time ago I came across the <a href="https://crates.io/crates/frunk">frunk</a> crate. From there I learned a trick that I’d call inference driven metaprogramming. Rust requires trait implementations to be unambiguously non-overlapping. The rules for this just consider the implementing type, not any bounds. The trick I’ve learned from frunk is to add an additional type parameter to the trait that would otherwise have overlapping implementations. That type parameter is only used to disambiguate the implementations. As long as only one implementation applies, this time taking bounds into account, rust’s powerful type inference will infer that extra type parameter. An example would be frunk’s <a href="https://docs.rs/frunk/0.2.2/frunk/hlist/trait.Plucker.html"><code>Plucker</code></a> trait, where the <code>Index</code> type parameter selects between the otherwise overlapping implementations.</p>
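<p>A stripped-down version of the trick (my own illustrative example, not frunk’s actual code): a <code>Get</code> trait over nested pairs, where the extra <code>Index</code> parameter exists only so the two impls don’t overlap, and inference fills it in:</p>

```rust
use std::marker::PhantomData;

// Type-level indices; never constructed at runtime.
#[allow(dead_code)]
struct Here;
#[allow(dead_code)]
struct There<Tail>(PhantomData<Tail>);

/// Fetch a &T from a nested (head, tail) list. `Index` exists only to make
/// the two impls non-overlapping; type inference figures it out.
trait Get<T, Index> {
    fn get(&self) -> &T;
}

// The element is right here at the head.
impl<T, Tail> Get<T, Here> for (T, Tail) {
    fn get(&self) -> &T {
        &self.0
    }
}

// Otherwise recurse into the tail, wrapping the index in There.
impl<T, Head, Tail, Index> Get<T, There<Index>> for (Head, Tail)
where
    Tail: Get<T, Index>,
{
    fn get(&self) -> &T {
        self.1.get()
    }
}

fn main() {
    let list = (1u32, ("hi", (3.5f64, ())));
    let x: &f64 = list.get(); // Index inferred as There<There<Here>>
    assert_eq!(*x, 3.5);
    let s: &&str = list.get(); // Index inferred as There<Here>
    assert_eq!(*s, "hi");
    println!("ok");
}
```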
<p>Equipped with this, I was able to implement type-level borrow checking logic. Today I’ve released a <a href="https://crates.io/crates/partial_ref">first version of this</a>. The documentation contains a <a href="https://docs.rs/partial_ref/0.1.0/partial_ref/#tutorial">small tutorial</a>. I also documented the <a href="https://github.com/jix/partial_ref/blob/1c201f929f99d363b6cab326e483be72e3f51774/partial_ref/src/lib.rs#L623">type-level borrow checking logic</a>, as the involved types and traits will appear in error messages. While I tried to optimize for readable error messages, I think every trait and type that could be part of one should be documented.</p>
<p>Using the library looks like this (see the tutorial for an explanation):</p>
<pre><code class="language-rust">use partial_ref::*;
part!(pub Neighbors: Vec<Vec<usize>>);
part!(pub Colors: Vec<usize>);
part!(pub Weights: Vec<f32>);

#[derive(PartialRefTarget, Default)]
pub struct Graph {
    #[part = "Neighbors"]
    pub neighbors: Vec<Vec<usize>>,
    #[part = "Colors"]
    pub colors: Vec<usize>,
    #[part = "Weights"]
    pub weights: Vec<f32>,
}

let mut g = Graph::default();
let mut g_ref = g.into_partial_ref_mut();

g_ref.part_mut(Colors).extend(&[0, 1, 0]);
g_ref.part_mut(Weights).extend(&[0.25, 0.5, 0.75]);

g_ref.part_mut(Neighbors).push(vec![1, 2]);
g_ref.part_mut(Neighbors).push(vec![0, 2]);
g_ref.part_mut(Neighbors).push(vec![0, 1]);

pub fn add_color_to_weight(
    mut g: partial!(Graph, mut Weights, Colors),
    index: usize,
) {
    g.part_mut(Weights)[index] += g.part(Colors)[index] as f32;
}

let (neighbors, mut g_ref) = g_ref.split_part_mut(Neighbors);
let (colors, mut g_ref) = g_ref.split_part(Colors);

for (edges, &color) in neighbors.iter_mut().zip(colors.iter()) {
    edges.retain(|&neighbor| colors[neighbor] != color);

    for &neighbor in edges.iter() {
        add_color_to_weight(g_ref.borrow(), neighbor);
    }
}
</code></pre>
<p>I have a bunch of additional features planned, but the next thing I want to do is to refactor my SAT solver to use this library.</p>
<p>I also hope that this library can be used to experiment with partial borrowing to gather experience for a possible future language extension.</p>
Encoding Matrix Rank for SAT Solvers
https://jix.one/encoding-matrix-rank-for-sat-solvers/
Fri, 07 Dec 2018 19:59:25 +0100 · me@jix.one (Jannis Harder)
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]},
});
</script>
<script src='https://jix.one/js/MathJax/MathJax.js?config=TeX-MML-AM_CHTML'></script>
<p>I’m working on a problem where I want to use a SAT solver to check that a property $P(v_1, \ldots, v_n)$ holds for a bunch of vectors $v_1, \ldots, v_n$, but I don’t care about the basis choice. In other words I want to check whether an arbitrary invertible linear transform $T$ exists so that the transformed vectors have a certain property, i.e. $P(T(v_1), \ldots, T(v_n))$. I solved this by finding an encoding for constraining the rank of a matrix. With that I can simply encode $P(M v_1, \ldots, M v_n)$ where $M$ is a square matrix constrained to have full rank and which therefore is invertible.</p>
<p>There is nothing particularly novel about my encoding, but there are many ways to approach this so I wanted to share my solution.</p>
<p>I assume that encoding field operations is no problem. Currently I’m working in the finite field $\mathbb F_2$, so encoding to propositional logic is trivial. When working in other fields, using an SMT solver might be more convenient, although other finite fields can be encoded to propositional logic without too much hassle.</p>
<p>When we want to check the rank of a matrix by hand, we’re probably going to use Gaussian elimination to transform the matrix into row echelon form (with arbitrary non-zero leading entries). We then get the rank as the number of non-zero rows. The iterative nature of Gaussian elimination where we are swapping rows and adding multiples of rows to other rows doesn’t lead to a nice encoding. The naive way to encode this would require a copy of the whole matrix for each step of the algorithm.</p>
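<p>For contrast, outside of a SAT encoding this by-hand procedure is easy to program directly; a sketch over $\mathbb F_2$, where each row fits in a <code>u64</code> bitmask and adding one row to another is XOR:</p>

```rust
/// Rank of a matrix over F2; each row is a bitmask of up to 64 columns.
fn rank_f2(mut rows: Vec<u64>, cols: u32) -> usize {
    let mut rank = 0;
    for col in (0..cols).rev() {
        // Find a pivot row with a 1 in this column among the remaining rows.
        if let Some(pivot) = (rank..rows.len()).find(|&i| rows[i] >> col & 1 == 1) {
            rows.swap(rank, pivot);
            // Eliminate the column from every other row (addition is XOR in F2).
            for i in 0..rows.len() {
                if i != rank && rows[i] >> col & 1 == 1 {
                    rows[i] ^= rows[rank];
                }
            }
            rank += 1;
        }
    }
    rank
}

fn main() {
    // Row 0 XOR row 1 equals row 2, so the rank is 2.
    assert_eq!(rank_f2(vec![0b110, 0b011, 0b101], 3), 2);
    // The identity matrix has full rank.
    assert_eq!(rank_f2(vec![0b100, 0b010, 0b001], 3), 3);
    println!("ok");
}
```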
<p>What we can use instead, is the fact that Gaussian elimination effectively computes an <a href="https://en.wikipedia.org/wiki/LU_decomposition">LU decomposition</a> of a matrix. After performing Gaussian elimination on a matrix $A$, the result will be a matrix $U$ in row echelon form, so that there are matrices $P, L$ with $PA = LU$ where $P$ is a permutation matrix and $L$ is a lower unitriangular matrix (triangular with the diagonal all ones). The permutation matrix $P$ corresponds to the swapping of rows and the matrix $L$ corresponds to adding multiples of a row to rows below it. While Gaussian elimination may swap rows after already adding multiples of a row to it, it will never move a row $i$ above a row $j$ when a multiple of row $j$ was already added to row $i$. That explains why $L$ is still unitriangular even when we swap rows.</p>
<p>You might have noticed that I ignored the details for non-square matrices until now. I will assume that our matrix is wider than tall. As column rank and row rank are the same for a matrix, this is not a restriction: we can transpose a tall matrix into a wide one. For the $PA = LU$ decomposition, with $A$ being an $m \times n$ matrix with $m \le n$, the matrix $P$ will be $m \times m$, the matrix $L$ will be $m \times m$ and $U$ will be $m \times n$.</p>
\begin{align}
\underbrace{\begin{pmatrix}
0 & 0 & 1 \\
0 & 1 & 0 \\
1 & 0 & 0
\end{pmatrix}}_{P}
\cdot
\underbrace{\begin{pmatrix}
0 & 2 & 6 \\
2 & 4 & 6 \\
1 & 1 & 1
\end{pmatrix}}_{A} =
\underbrace{\begin{pmatrix}
1 & 0 & 0 \\
2 & 1 & 0 \\
0 & 1 & 1
\end{pmatrix}}_{L}
\cdot
\underbrace{\begin{pmatrix}
1 & 1 & 1 \\
0 & 2 & 4 \\
0 & 0 & 2
\end{pmatrix}}_{U}
\end{align}
<p>In general the permutation matrix $P$ is not uniquely determined. In iteration $i$ there might be multiple rows $j \ge i$ with the fewest zeros on the left, so different choices will lead to different LU decompositions:</p>
\begin{align}
\underbrace{\begin{pmatrix}
0 & 1 & 0 \\
1 & 0 & 0 \\
0 & 0 & 1
\end{pmatrix}}_{P}
\cdot
\underbrace{\begin{pmatrix}
0 & 2 & 6 \\
2 & 4 & 6 \\
1 & 1 & 1
\end{pmatrix}}_{A} =
\underbrace{\begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
\frac{1}{2} & -\frac{1}{2} & 1
\end{pmatrix}}_{L}
\cdot
\underbrace{\begin{pmatrix}
2 & 4 & 6 \\
0 & 2 & 6 \\
0 & 0 & 1
\end{pmatrix}}_{U}
\end{align}
<p>For full rank matrices, after we fix a $P$ though, the requirement of $L$ being unitriangular and $U$ being in row echelon form completely determines them. For lower rank matrices, some of the entries of $L$ don’t affect the result, so we need to force them to 0 if we want a unique decomposition.</p>
<p>This is already much nicer to encode. The properties of a matrix being unitriangular, a permutation matrix, or in row echelon form all have straightforward encodings. A matrix product is also easy to encode. Nevertheless, we can improve a bit, getting rid of the permutation matrix and obtaining a uniquely determined decomposition instead.</p>
<p>To do this we need to relax the row-echelon form to something slightly less constrained. We need to perform row swaps to get a non-zero entry in the leftmost position possible. Assuming the matrix is full rank, we could just relax the constraint for the non-zero entry to be the leftmost of all remaining rows and instead take the rightmost non-zero entry in the current row. We still require that all entries below that non-zero entry are zero, but there might be other non-zero entries below <em>and</em> left of it. We wouldn’t even need to select the leftmost non-zero entry, and could instead select any non-zero entry. Choosing the leftmost makes the choice unique and adds a nice symmetry as we require all entries left of and below of it to be zero.</p>
<p>This corresponds to running Gaussian elimination on a suitably column-permuted matrix so that we never require row swaps. This works fine in the full rank case, but may not work for lower ranks. To get around that issue we simply allow and skip all-zero rows anywhere in the matrix. A succinct characterization of the resulting matrices is this: The first non-zero entry of each row is the last non-zero entry of its column. I’m not aware of a name for these matrices, if you know how they are called <a href="https://math.stackexchange.com/questions/3030147/name-of-matrices-where-the-first-non-zero-entry-of-each-row-is-the-last-non-zero">please let me know</a>.</p>
<p>With this we can write any square or wide matrix $A$ as $A = LU’$ where $U’$ is of this form and $L$ is lower unitriangular. If $A$ is full rank this is unique, otherwise it is unique up to the entries of $L$ that are multiplied just with zeros. The rank still corresponds to the number of non-zero rows in $U’$, although they may be anywhere in the matrix.</p>
\begin{align}
\underbrace{\begin{pmatrix}
0 & 2 & 6 \\
2 & 4 & 6 \\
1 & 1 & 1
\end{pmatrix}}_{A} =
\underbrace{\begin{pmatrix}
1 & 0 & 0 \\
2 & 1 & 0 \\
\frac{1}{2} & \frac{1}{2} & 1
\end{pmatrix}}_{L}
\cdot
\underbrace{\begin{pmatrix}
0 & 2 & 6 \\
2 & 0 & -6 \\
0 & 0 & 1
\end{pmatrix}}_{U'}
\end{align}
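<p>The characterization is also easy to state in code; a small checker (over integer entries, with hypothetical names) that accepts exactly the matrices described above:</p>

```rust
/// Check that the first non-zero entry of each row is the last non-zero
/// entry of its column (all-zero rows are allowed anywhere).
fn is_relaxed_echelon(matrix: &[Vec<i32>]) -> bool {
    matrix.iter().enumerate().all(|(r, row)| {
        match row.iter().position(|&entry| entry != 0) {
            // No row below may have a non-zero entry in the pivot column.
            Some(col) => matrix[r + 1..].iter().all(|below| below[col] == 0),
            None => true, // all-zero row: fine
        }
    })
}

fn main() {
    // The U' matrix from the example above.
    let u = vec![vec![0, 2, 6], vec![2, 0, -6], vec![0, 0, 1]];
    assert!(is_relaxed_echelon(&u));
    // Violates the condition: a non-zero entry below the pivot of row 0.
    let bad = vec![vec![0, 1], vec![1, 1]];
    assert!(!is_relaxed_echelon(&bad));
    println!("ok");
}
```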
<p>Encoding this for a SAT or SMT solver is now straightforward: We encode the required form of $L$ and the required form of $U’$, both simple constraints on non-zero or 1 entries, as well as the required number of non-zero rows of $U’$. The uniqueness of the decomposition ensures that the SAT solver will not spend time exploring the same matrix just expressed differently.</p>
<p><strong>Update:</strong> For full rank matrices you can also just ask for the existence of an inverse matrix, which is even simpler to encode, but has a different runtime behavior. As usual with SAT solvers, it’s always worth trying different approaches, as it’s hard to estimate which will be faster for a given problem.</p>
Varisat 0.1.3: LRAT Generation and Proof Trimming
https://jix.one/varisat-0.1.3-lrat-generation-and-proof-trimming/
Fri, 14 Sep 2018 14:54:02 +0200 · me@jix.one (Jannis Harder)
<p>I’ve released a new version of my SAT solver <a href="https://jix.one/project/varisat">Varisat</a>. It is now split across two crates: one for <a href="https://crates.io/crates/varisat">library usage</a> and one for <a href="https://crates.io/crates/varisat-cli">command line usage</a>.</p>
<p>The major new features in this release concern the generation of unsatisfiability proofs. Varisat is now able to directly generate proofs in the <a href="https://www.cs.utexas.edu/~marijn/publications/lrat.pdf">LRAT</a> format in addition to the DRAT format. The binary versions of both formats are supported too. Varisat is also able to do on-the-fly proof trimming now. This is similar to running <a href="https://www.cs.utexas.edu/~marijn/drat-trim/">DRAT-trim</a> but processes the proof while the solver runs.</p>
<p>LRAT is an alternative to the DRAT format for unsatisfiability proofs. LRAT proofs are more verbose but faster and easier to check. This is because an LRAT proof contains the propagation steps needed to justify a learned clause, while DRAT requires the checker to rediscover them.</p>
<p>The usual way to generate an LRAT proof is to generate a DRAT proof first. This DRAT proof is then converted to an LRAT proof using DRAT-trim. I figured that it would be much faster to generate the LRAT proof directly from the SAT solver and was <a href="https://www.cs.utexas.edu/~marijn/publications/lrat.pdf#page=6">not convinced that the overhead or complexity of the implementation would be prohibitive</a>.</p>
<p>I still need to do more systematic benchmarking, but preliminary testing gave
promising results. The runtime for direct LRAT generation was often around or less than half the time needed for DRUP generation followed by conversion.</p>
<p>The code I added for direct LRAT generation made it also easy to incorporate a trimming feature similar to DRAT-trim but on the fly. Varisat can buffer a certain amount of proof steps and whenever the buffer is full it removes all steps leading only to deleted and unused clauses. I haven’t compared the effectiveness of this trimming approach to DRAT-trim but the runtime overhead is similar to direct LRAT generation.</p>
Introducing Varisat
https://jix.one/introducing-varisat/
Sun, 20 May 2018 15:42:27 +0200 · me@jix.one (Jannis Harder)
<p>I’ve been interested in <a href="https://en.wikipedia.org/wiki/Boolean_satisfiability_problem#Algorithms_for_solving_SAT">SAT solvers</a> for quite some time. These are programs that take a Boolean formula and either find a variable assignment that makes the formula true or find a proof that this is impossible. As many difficult problems can be rephrased as the satisfiability of a suitable Boolean formula, SAT solvers are incredibly versatile and useful. I’ve recently finished and now released a first version of my SAT solver, <a href="https://crates.io/crates/varisat">Varisat</a>, on crates.io.</p>
<p>Most modern state of the art SAT solvers are based on the <a href="https://en.wikipedia.org/wiki/Conflict-Driven_Clause_Learning">conflict driven clause learning (CDCL)</a> algorithm. With some handwaving this algorithm could be seen as a clever combination of recursive search, backtracking, resolution and local search.</p>
<p>The CDCL algorithm uses a lot of heuristics and can be extended in many ways. This is where different CDCL based solvers take different approaches and where a lot of active research happens.</p>
<p>A few years ago I decided to write my own CDCL based SAT solver. I wanted to get an in depth understanding of the CDCL algorithm and also have a code base I’m familiar with so I can easily experiment with new ideas. I started writing several prototypes. First I used C++ and later I switched to Rust. Earlier this year I decided that my current prototype was good enough to turn into a complete, usable solver. Just in time to enter this year’s <a href="http://sat2018.forsyte.tuwien.ac.at/">SAT competition</a>.</p>
<p>As varisat is in an early stage of development, implementing little beyond the minimum
required for a modern CDCL based SAT solver, I don’t expect it to win any prizes in the competition. There are many problem instances that benefit a lot from additional techniques that varisat just doesn’t offer yet. Nevertheless I was pleasantly surprised to find that it already is competitive for some graph coloring instances I needed to solve in the meantime.</p>
<p>Besides turning varisat into a library (command line only for now), I plan to incrementally add more and more of the proven techniques used by state of the art solvers. I also want to try some of my ideas for novel techniques and hope to find the time to write more about working on varisat.</p>
Not Even Coppersmith's Attack
https://jix.one/not-even-coppersmiths-attack/
Sat, 23 Dec 2017 18:18:52 +0100 · me@jix.one (Jannis Harder)
<p>Earlier this year, in October, a new widespread cryptography vulnerability was announced.
The <a href="https://crocs.fi.muni.cz/public/papers/rsa_ccs17">initial announcement</a> didn’t contain details
about the vulnerability or much detail on how to attack it (it has been updated by now).
It did state the affected systems though: RSA keys generated using smartcards and similar devices that use Infineon’s RSALib.
The announcement came with obfuscated code that would check whether a public key is affected.
Also, the name chosen by the researchers was a small hint on how to attack it: “Return of Coppersmith’s Attack”.</p>
<p>I decided to try and figure out the details before the conference paper describing them would be released.
By the time the paper was released, I had reverse engineered the vulnerability and implemented my own attack, which did not use Coppersmith’s method at all.
This post explains how I figured out what’s wrong with the affected RSA-keys and how I used that information to factor affected 512-bit RSA-keys.</p>
<h2 id="reversing-the-vulnerability">Reversing the Vulnerability</h2>
<p>I started looking at the vulnerability, when a friend pointed me to a
deobfuscated version of the detection code:</p>
<blockquote>
<p>So this is the core of the Infineon RSA fail key detector: <a href="https://marcan.st/paste/MOEoh2EH.txt">https://marcan.st/paste/MOEoh2EH.txt</a> - this is very interesting indeed (and a huge fail).</p>
<p>– <a href="https://twitter.com/marcan42/status/921297567664652288">@marcan42</a> on twitter</p>
</blockquote>
<p>At that point the ROCA paper wasn’t published yet.
Figuring out how these keys are generated and how to attack them seemed like a
nice challenge.</p>
<p>The detection code gives a first hint on this.
It takes the public modulus $N$ and reduces it modulo a set of small
primes $\{p_0, p_1, p_2, \ldots, p_{m}\}$.
For each prime $p_i$ it tests whether the remainder belongs to a set of allowed
remainders $R_i$.
If all remainders are in the corresponding set of allowed remainders, the key
is flagged as vulnerable.</p>
<p>The first few tests are:
\begin{align}
N \bmod 11 &\in \{1, 10\} \\
N \bmod 13 &\in \{1, 3, 4, 9, 10, 12\} \\
N \bmod 17 &\in \{1, 2, 4, 8, 9, 13, 15, 16\} \\
N \bmod 19 &\in \{1, 4, 5, 6, 7, 9, 11, 16, 17\} \\
N \bmod 37 &\in \{1, 10, 26\} \\
\vdots
\end{align}
</p>
<p>This doesn’t look good.</p>
<p>If you take a random RSA key or large prime and reduce it modulo small primes,
you’re expected to see all non-zero remainders evenly distributed.
You can’t get a zero, as that would mean there is a small prime factor.
For the full list of 17 small primes, only one of
$\prod_i \frac{p_i - 1}{|R_i|} \approx 2^{27.8}$ possible keys has this
property.</p>
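<p>The fingerprint test itself is tiny; a toy sketch with a <code>u64</code> stand-in for the real big-integer modulus, using only the five residue sets listed above (the real detector checks 17 primes):</p>

```rust
/// Flag n as matching the fingerprint if every residue lies in its allowed set.
fn looks_vulnerable(n: u64) -> bool {
    let tests: [(u64, &[u64]); 5] = [
        (11, &[1, 10]),
        (13, &[1, 3, 4, 9, 10, 12]),
        (17, &[1, 2, 4, 8, 9, 13, 15, 16]),
        (19, &[1, 4, 5, 6, 7, 9, 11, 16, 17]),
        (37, &[1, 10, 26]),
    ];
    tests.iter().all(|&(p, allowed)| allowed.contains(&(n % p)))
}

fn main() {
    assert!(looks_vulnerable(65537)); // all five residues are in their sets
    assert!(!looks_vulnerable(65539)); // 65539 mod 13 == 6, which is not allowed
    println!("ok");
}
```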
<p>While an unintended loss of 27.8 bits of entropy sounds bad as is, I assumed that
this is only a symptom of whatever went wrong when generating those keys.
While it would be possible to generate an RSA key from a uniform distribution
of keys like this, it would be slower and more complicated than the
straightforward correct way.
You’d also have to deliberately restrict the allowed remainders, which seemed unlikely.</p>
<p>To figure out the flaw in Infineon’s RSALib, let’s first look at properly
generating RSA keys.
[Disclaimer: Don’t use this blog post as reference for implementing this.]
The public modulus $N$ is the product of two large primes $P, Q$ of roughly
equal bit size.
You can constrain the bit size of $P, Q$ and $N$ by uniformly selecting primes
$P$ and $Q$ from a suitable interval $I$.</p>
<p>The easiest way to uniformly select an element with a given property $T$ in an
interval $I$ is rejection sampling:<sup class="footnote-ref" id="fnref:1"><a href="#fn:1">1</a></sup>
Uniformly select <em>any</em> element $x \in I$ (easy), check whether the property
$T(x)$ holds (hopefully easy), restart if it doesn’t.
The average number of iterations rejection sampling needs is inversely
proportional to the probability of a random $x \in I$ having the property.
The prime number theorem tells us that the probability of a random number
smaller than $N$ being prime is $\frac{1}{\log N}$.
For generating primes using rejection sampling this gets us a number of
iterations that grows linearly with the bit size.</p>
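<p>A minimal Python sketch of this procedure (not RSALib’s code; <code>is_prime</code> here is a textbook deterministic Miller-Rabin, which is valid for this range):</p>

```python
import random

def is_prime(n):
    # deterministic Miller-Rabin, valid for n < 3.3 * 10**24
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False
    return True

def sample_prime(lo, hi):
    # rejection sampling: draw uniformly from I = [lo, hi), retry until prime
    while True:
        x = random.randrange(lo, hi)
        if is_prime(x):
            return x

p = sample_prime(1 << 31, 1 << 32)
```

Replacing the draw with an odd number (e.g. via <code>x |= 1</code>) would halve the expected number of iterations, which is the first optimization mentioned below.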
<p>This is already quite efficient, but can be optimized.
A simple way to halve the expected number of required iterations is to sample a
uniform odd number instead of any number within $I$.
A further improvement would be to uniformly sample an odd number that is not
divisible by three, but this already isn’t as straightforward anymore.
It’s possible to continue like this by constructing and uniformly sampling more
and more intricate subsets of $I$ that still contain all primes of $I$.
But it is also getting harder to correctly do this, while the possible speedup
is getting smaller and smaller.</p>
<p>This looks like a good place to screw up key generation, so let’s keep that in
mind and look at the RSALib generated keys again.
Assuming $P$ and $Q$ are generated independently and $N \bmod p_i \in R_i$,
there must be an $R’_i$ so that $P \bmod p_i \in R’_i$ and
$Q \bmod p_i \in R’_i$, i.e. $N$ can only be restricted modulo a small prime if
$P$ and $Q$ also are.
As any combination should be possible, we expect
$R_i = \{ab \bmod p_i \mid a, b \in R’_i \}$.</p>
<p>Playing around with the numbers in the detection code quickly shows that
multiplying any two numbers in $R_i$ always results in another number in
$R_i \pmod{p_i}$.
Together with $1 \in R_i$ for all $R_i$, this led me to assume $R’_i = R_i$.
I didn’t rule out other possibilities in general, but for
$R_0 = \{1, 10\}$ nothing else would work.</p>
<p>The next step was to identify what led to the specific sets $R_i$.
We start with some observations:
As $R_i$ doesn’t contain zero, it is a <em>subset</em> of $\zzm{p_i}$, the
multiplicative group of integers modulo $p_i$.
We also discovered that $R_i$ is closed under multiplication modulo $p_i$.
This makes $R_i$ also a <em>subgroup</em> of $\zzm{p_i}$.
As $p_i$ is a prime, $\zzm{p_i}$ is a cyclic group, and thus $R_i$ is also a
cyclic group.
In particular this means there is a generator $a_i$, so that
$R_i = \{a_i^k \mid k \in \mathbb Z \}$
.</p>
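<p>Both observations are easy to check numerically, e.g. for $R_0 = \{1, 10\}$ modulo $11$ (a quick Python sketch):</p>

```python
p = 11
R0 = {1, 10}

# closed under multiplication modulo p ...
products = {a * b % p for a in R0 for b in R0}

# ... and generated by the single element 10
powers = {pow(10, k, p) for k in range(1, len(R0) + 1)}
```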
<p>This is not exciting yet, but so far we only looked at what happens modulo
individual small primes.
I considered it much more likely that the RSALib code worked modulo the product
$M$ of several small primes.</p>
<p>At that point I was thinking this: If, modulo a small prime $p_i$, all possible
values are generated by an element $a_i$ that is not a generator for the whole
group $\zzm{p_i}$, could it be that, modulo $M = \prod_i p_i$, all possible
values are also generated by a single element $a$, so that $a_i = a \bmod p_i$?</p>
<p>This would be a bad idea, but we already know that someone went ahead with a
bad idea, so concerning our hypothesis it’s a point in favor.
So why is it a bad idea?
Generating a prime candidate $P$ so that $P \bmod M$ is in $\zzm{M}$ sounds like
a good idea.
It would exclude all values that have a common factor with $M$, and thus cannot
be prime, making our candidate more likely to be prime.
So far that’s not a problem.
The problem is sampling from $\zzm{M}$ by raising a single value $a$ to a
random power.
$\zzm{p_i}$ are cyclic groups, as $p_i$ is prime, and thus they do have a
generating element $b_i$.
It’s just not the $a_i$ used.
In general for composite $k$ the group $\zzm{k}$ is not cyclic, i.e. it is not
generated by a single element.
So whatever $a$ they used, it only generates a subgroup $R$ of $\zzm{M}$.
Even worse, that subgroup $R$ would be a lot smaller than
$R_0 \times R_1 \times \ldots \times R_m$, as that group again isn’t cyclic.
The order of $R$ is given by $|R| = \lcm_i |R_i|$.
This can be seen by considering that
$a_i^k \equiv a_i^{k \bmod |R_i|} \pmod{p_i}$ and counting the possible
combinations of $k \bmod |R_i|$ and $k \bmod |R_j|$.</p>
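<p>For the five residue sets listed at the top this is easy to check numerically (a Python sketch using the set sizes from above):</p>

```python
from math import lcm, prod

# |R_i| for p_i = 11, 13, 17, 19, 37, read off the residue sets above
sizes = [2, 6, 8, 9, 3]

independent = prod(sizes)  # combinations if the k mod |R_i| were independent
cyclic = lcm(*sizes)       # what a single generator a can actually reach
```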
<p>At this point only one in $\frac{|\zzm{M}|}{|R|} \approx 2^{69.95}$ possible
primes could be generated, but we haven’t validated our assumption yet.</p>
<p>Equipped with the test vectors that came with the original detection code, I
searched for a matching generator $a$ modulo a subset of the small primes.
I did this by combining all possible combinations of $a_i$ using the Chinese
remainder theorem (CRT).
I started with a small subset of the small primes, as this was much faster and
could falsify the hypothesis if no match was found.
As soon as $65537$ appeared as a candidate I knew that my guesses were right.
$65537 = 2^{16} + 1$ is a prime larger than our small primes, thus coprime to $M$,
and would be a generator of $\zza{M}$, the <em>additive</em> group of integers modulo
$M$, which <em>is</em> a cyclic group.
Also multiplication with $2^{16} + 1$ can be very fast, especially on 8 and
16-bit microcontrollers.</p>
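<p>This is easy to verify: the powers of $65537$ modulo each small prime reproduce exactly the residue sets from the tests at the top (Python sketch):</p>

```python
R = {
    11: {1, 10},
    13: {1, 3, 4, 9, 10, 12},
    17: {1, 2, 4, 8, 9, 13, 15, 16},
    19: {1, 4, 5, 6, 7, 9, 11, 16, 17},
    37: {1, 10, 26},
}

a = 65537
# the cyclic subgroup generated by a modulo each small prime p
subgroups = {p: {pow(a, k, p) for k in range(1, p)} for p in R}
```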
<p>Confusing the properties of $\zza{M}$ and $\zzm{M}$ could be an explanation of
why someone inexperienced with cryptography wouldn’t see a problem with this
approach.
It does not explain why someone was allowed to go ahead with their own way of
generating primes or why no one able to spot this mistake reviewed or audited this algorithm,
especially given the intended applications.</p>
<p>We’re not quite done identifying the vulnerability yet.
When looking at the set of small primes used, you can see that some primes are
skipped.
But what if they were only skipped in the deobfuscated and optimized detection
code, because $a$ happened to be a generator for $\zzm{p_i}$?
In fact, when marcan published the deobfuscated code, he mentioned that he
removed no-op tests.
Even if $a$ is a generator for $\zzm{p_i}$ we shouldn’t discard it, as the
cyclic subgroup $R$ generated by $a$ modulo $M$ is smaller than the product of
the individual subgroups $R_i$.</p>
<p>Using the test vectors I verified that for 512-bit keys the set of small primes
consists of all primes up to $167$.
Recomputing the size of the cyclic subgroup $R$ of $\zzm{M}$ shows that only
one in $\frac{|\zzm{M}|}{|R|} \approx 2^{154.89}$ possible primes can be
generated.
This loses more than half of the expected entropy.</p>
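<p>Recomputing these numbers takes only a few lines of Python (a sketch; the modulus here is the product of all primes up to $167$):</p>

```python
from math import lcm, log2, prod

# all primes up to 167 (trial division is fine at this size)
primes = [p for p in range(2, 168) if all(p % q for q in range(2, p))]

def order(a, p):
    # multiplicative order of a modulo p
    a %= p
    k, x = 1, a
    while x != 1:
        x = x * a % p
        k += 1
    return k

phi = prod(p - 1 for p in primes)            # |Z_M^*|
r = lcm(*(order(65537, p) for p in primes))  # |R|, the order of 65537 mod M
loss = log2(phi) - log2(r)                   # lost entropy in bits
```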
<h2 id="the-attack">The Attack</h2>
<p>Having this much information about the private key can be enough to factor
the public modulus very quickly.
Ignoring the kind of information we have, just counting the bits of entropy, it
could be possible to efficiently factor the key using variants of Coppersmith’s
method.
The CA in ROCA also stands for Coppersmith’s attack, but a straightforward
application isn’t possible.
While the entropy of the information we gained from this vulnerability is
enough, it doesn’t have the right form.</p>
<p>Coppersmith’s method is applicable if we know that a factor has the form $c +
kx$ for fixed $c$ and $k$, and $|x| < N^{\frac{1}{4}}$.
This is the case when we know consecutive bits of the binary representation or
know the factor modulo any number of a suitable size.
In our case we only know that the factors have the form $(a^i \bmod M) + Mx$
for some $i$ and a small $x$.
If we could afford to just bruteforce all possible values for $i$, we could
apply Coppersmith’s method, assuming $|x| < N^{\frac{1}{4}}$ holds.</p>
<p>There are $|R| \approx 2^{61.09}$ possible values for $i$.
So at first, this looks too expensive.
On the other hand, in this case $|x| < \frac{2^{257}}{M} \approx 2^{37.81} \ll N^{\frac{1}{4}}$,
so maybe it is possible to find some trade-off.</p>
<p>We are looking for a way to make $|R|$ and $M$ smaller, without making $M$ too
small.
Luckily this is easy: we can just ignore some of the small primes $p_i$.
This results in a smaller $M’$, just the product of the new primes
$\prod_i p’_i$, and a smaller $R’ \subset \zzm{M’}$.
As $|R’|$ depends on the common factors of $|R’_i|$, it can be a bit
difficult to find an optimal trade-off.</p>
<p>I implemented this attack using the Coppersmith implementation of
<a href="https://pari.math.u-bordeaux.fr/">PARI/GP</a>, but no matter what trade-off I
chose, my estimated runtime was much higher than the published one.
As this is the attack described in the ROCA paper, in retrospect, I think the
Coppersmith implementation I chose was just not optimized enough for this use
case.
In addition to that I might have missed the optimal choice for $M’$, but even a
single invocation of Coppersmith’s method was much slower for me.</p>
<p>This prompted me to try different approaches.
I hit many dead ends, until I came across an older but interesting attack for
factoring RSA keys with partial knowledge of a factor.
The attack was published in 1986 by Rivest and Shamir, the R and S in RSA.
The paper is called <a href="https://link.springer.com/chapter/10.1007/3-540-39805-8_3">“Efficient Factoring Based on Partial Information”</a>.</p>
<p>Compared to Coppersmith’s method it has the downside of needing not only half
the bits of a factor, i.e. $|x| < N^{\frac{1}{4}}$, but two thirds, i.e. $|x| <
N^{\frac{1}{3}}$.
This might seem bad at first, but as $M’$ grows faster than $|R’|$, the bruteforcing
work doesn’t increase as much as going from $N^{\frac{1}{4}}$ to
$N^{\frac{1}{3}}$ might suggest.
Also, I guessed that it would be so much faster than Coppersmith’s method,
that, for 512-bit keys, it would more than make up for that.</p>
<p>To understand why, we need to look at how the attack works.
The paper describes a slightly different scenario.
It assumes we know the factors have the form $x + 2^m c$.
This corresponds to knowing the most significant bits of the binary
representation of a factor.
The described attack works without modification for factors of the form $x +
Mc$ as it doesn’t make use of the fact that $2^m$ is a power of two.</p>
<p>If we assume $P = x + M c$ and $Q = y + M d$ we get
\begin{align}
N &= xy + dxM + cyM + cdM^2.
\end{align}
We also assume that $0 \le x \le M$ and $0 \le y \le M$.</p>
<p>Let $t = N - cdM^2$, a constant we know, and we get
\begin{align}
t &= xy + Mdx + Mcy.
\end{align}
The paper then presents a heuristic argument, which is roughly this:
Because $xy$ is much smaller than $t$, $Mdx$, and $Mcy$, it is likely that
replacing $xy$ with $s$ and searching for the solution minimizing $s$ in
\begin{align}
t &= s + Mdx + Mcy
\end{align}
results in a solution to the original equation and thereby in a factorization of $N$.</p>
<p>This is a two-dimensional integer programming instance, i.e. a set of
integer linear constraints (the bounds for $x$ and $y$), an integer linear
objective (minimize $s = t - Mdx - Mcy$) and two integer unknowns ($x$ and
$y$).
It is then noted that integer programming in a fixed number of dimensions can
be solved in polynomial time.</p>
<p>The paper also mentions that a similar approach would work for knowing the
<em>least</em> significant bits of a factor.
This corresponds to $P = c + Mx$ and $Q = d + My$ with $0 \le x \le \sqrt{M}$
and $0 \le y \le \sqrt{M}$, which is exactly what we need.</p>
<p>In this case we get
\begin{align}
N &= cd + dxM + cyM + xyM^2 \\
t &= \frac{N - cd}{M} \\
t &= dx + cy + xyM.
\end{align}
</p>
<p>Again, we’d like to get rid of the $xy$ term, to make it a linear problem.
I did this by working modulo $M$:
\begin{align}
t &\equiv dx + cy \pmod{M}
\end{align}
</p>
<p>Usually for RSA keys we know an upper bound for $P$ and $Q$, which together
with $N$ also translates to a lower bound.
From this we can compute bounds for $x$ and $y$.</p>
<p>Here I noticed that it is possible to find a solution using an approach more
direct than integer programming.
The solutions to $t \equiv dx + cy \pmod{M}$ form a two-dimensional affine lattice.
To understand how we need to define lattices first.</p>
<p>Given an $n \times d$ matrix $B$ consisting of $d$ linearly independent column vectors $B = (\mathbf b_0, \mathbf b_1, \ldots, \mathbf b_{d-1})$, the corresponding lattice $L$ is the set of integer linear combinations of these vectors:
\begin{align}
L = \{B \mathbf g \mid \mathbf g \in \mathbb Z^d\}
\end{align}
As this set is closed under negation and addition, it forms a subgroup of $\mathbb R^n$.
Luckily we are working in two dimensions, which makes it easy to visualize lattices:</p>
<p><img src="lattice.svg" alt="Two-dimensional lattice example" class="large-figure">
The green dots are the lattice points and the red vectors are the basis vectors.</p>
<p>The basis vectors for a lattice are not unique, adding an integer multiple of
one basis vector to another generates the same lattice.
This is easy to see, as you can get the original lattice vector by subtracting
the same multiple again, so every integer linear combination of either basis is
also an integer linear combination of the other.
Negating a basis vector or exchanging the position of two vectors also doesn’t
change the generated lattice.
Performing an arbitrary number of those operations is equivalent to
multiplying the basis $B$ by a unimodular matrix $U$, i.e. an integer matrix $U$ with $|\det U| = 1$.
This makes sense as those matrices are exactly the integer matrices which have an integer inverse.</p>
<p><img src="equivalent.svg" alt="Two equivalent bases" class="large-figure"></p>
<p>$\mathbf b_0, \mathbf b_1$ and $\mathbf b’_0, \mathbf b’_1$ define the same lattice: $\mathbf b’_0 = \mathbf b_0 + \mathbf b_1, \mathbf b’_1 = \mathbf b_1 - 2\mathbf b’_0$.</p>
<p>An affine lattice is a lattice with an offset.
Given a basis $B$ and an offset vector $\mathbf o$ it consists of the lattice points
$A = \{B \mathbf g + \mathbf o \mid \mathbf g \in \mathbb Z^d \}$.
This is not a group anymore, but adding an element of $L$ to $A$ gives another
point in $A$ and the difference of two points in $A$ is in $L$.</p>
<p>I claimed that the solutions to $t \equiv dx + cy \pmod{M}$ form an affine lattice.
Assume we have a single known solution $(x_0, y_0)$.
It’s not hard to see that adding multiples of $M$ to $x_0$ or $y_0$ is still a valid solution.
These solutions would form an affine lattice, using the basis vectors $(M, 0)$ and $(0, M)$, but that lattice would not contain all solutions.
We know that $c$ and $d$ are coprime to $M$, otherwise $P$ or $Q$ would have a small factor.
This means that we have a solution $(x, (t - dx) c^{-1} \bmod M)$ for any value of $x$.
Taking the difference of two solutions with consecutive $x$ gives us a basis vector $\mathbf b_0 = (1, -d c^{-1} \bmod M)$.
Together with $\mathbf b_1 = (0, M)$ and $\mathbf o = (0, t c^{-1} \bmod M)$ this defines an affine lattice containing all solutions.</p>
<p>Given this affine lattice, we’re interested in the lattice points within the region defined by our bounds for $x$ and $y$.
If we can find a lattice point closest to a given arbitrary point, we could compute the closest point to the center of that region.
In general for arbitrary lattice dimensions that problem is NP-complete.
Luckily for two dimensional lattices this is very efficient.</p>
<p>The difficulty in finding a closest lattice point stems from the fact that basis vectors can point in roughly the same or opposite direction.
In fact for our affine lattice of solutions to $t \equiv dx + cy \pmod{M}$, the basis vectors we derived point in almost the same direction.
Let’s, for a moment, assume the opposite is the case and the basis vectors are orthogonal.
We could then just represent a point as a non-integer multiple of the basis vectors and individually round the multiples to the nearest integer.
As moving along one basis vector direction doesn’t affect the closest multiple in the other directions, we would get the nearest point.</p>
<p>When the basis vectors aren’t exactly orthogonal but close, it is possible to bound the distance when approximating the nearest point by independent rounding in each basis vector direction.
Consider the two-dimensional case: rounding in direction of $\mathbf b_0$ moves the point by $\mu_{0,1} = \frac{\sp{\mathbf b_0}{\mathbf b_1}}{\norm{\mathbf b_1}^2}$ times the length of $\mathbf b_1$ in the direction of $\mathbf b_1$.
The value of $\mu_{0,1}$ is the (signed) length of $\mathbf b_0$ projected onto $\mathbf b_1$, divided by the length of $\mathbf b_1$.</p>
<p><img src="projection.svg" alt="Projection example" class="large-figure">
The orange vector is the projection of the blue vector onto the red vector. It is equal to $\mu$ times the red vector.</p>
<p>This is great because we can find an equivalent basis so that $|\mu_{0,1}|$ and $|\mu_{1,0}| \le \frac{1}{2}$.
This is done using the Lagrange-Gauss algorithm, which finds the shortest basis for a two-dimensional lattice.
It works similar to the Euclidean algorithm for computing the greatest common divisor of two numbers by repeatedly reducing one value by the other.
Let $\lfloor x \rceil$ be the closest integer to $x$.
If $|\mu_{0,1}| > \frac{1}{2}$ the vector $\mathbf b_1 - \lfloor \mu_{0,1} \rceil \mathbf b_0$ is shorter than $\mathbf b_1$ and can replace it.
The same is true with exchanged basis vectors and $\mu_{1,0}$.
Replacing one lattice vector with a shorter one like this can be iterated until neither $|\mu_{0,1}|$ nor $|\mu_{1,0}|$ are greater than $\frac{1}{2}$.
For basis vectors with integer components the number of iterations needed grows logarithmically with the length of the basis vectors, i.e. linear with their bit size.</p>
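<p>A Python sketch of the Lagrange-Gauss algorithm, run on a small hypothetical basis:</p>

```python
def gauss_reduce(u, v):
    # repeatedly reduce the longer basis vector by a rounded multiple of the
    # shorter one, much like the Euclidean algorithm reduces numbers
    while True:
        if u[0] ** 2 + u[1] ** 2 > v[0] ** 2 + v[1] ** 2:
            u, v = v, u
        n0 = u[0] ** 2 + u[1] ** 2
        mu = round((u[0] * v[0] + u[1] * v[1]) / n0)
        if mu == 0:  # now |mu_{0,1}| <= 1/2 and |mu_{1,0}| <= 1/2
            return u, v
        v = (v[0] - mu * u[0], v[1] - mu * u[1])

b0, b1 = gauss_reduce((1, 816), (0, 1000))
```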
<p>For such a two-dimensional basis, finding a close point by rounding the multiples of the basis vectors results in a very small distance bound.
I haven’t computed the exact bound, but rounding introduces an offset of at most half a basis vector in each basis vector direction.
This would introduce an error of at most a quarter basis vector each, which is enough to cross the midpoint between two multiples, but not enough to go further.
In practice, for the lattices we’re interested in, the point found by rounding happens to also be the closest point.</p>
<p>With this we can find a solution to $t \equiv dx + cy \pmod{M}$ which is closest to the midpoint of the rectangle defined by the bounds for $x$ and $y$.
Most of the time this solution is the unique solution within bounds that leads to a factoring of $N$ if such a solution exists.
Sometimes, though, when one basis vector is particularly short, there are multiple solutions within bounds.
Luckily it seems that this only happens when the other basis vector is long.
This means that all solutions within bounds lie on a single line.
In that case a solution can be efficiently found or shown to not exist by recursively bisecting the line and checking whether a point that factors $N$ can exist between the endpoints.</p>
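<p>Here is an end-to-end toy version of the procedure in Python. All numbers are hypothetical stand-ins (a 21-bit power of two instead of a primorial $M$, small primes found with a textbook Miller-Rabin), the multiple-solutions case is omitted, and instead of proving the rounded point closest, a small window around it is searched:</p>

```python
def is_prime(n):
    # deterministic Miller-Rabin, valid for n < 3.3 * 10**24
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False
    return True

def next_prime(n):
    while not is_prime(n):
        n += 1
    return n

# --- hypothetical toy instance, not real RSALib parameters ---
M = 1 << 21                # stand-in for the known modulus M
P = next_prime(123456789)  # secret primes, P = c + M*x, Q = d + M*y
Q = next_prime(987654321)
N = P * Q
c, d = P % M, Q % M        # known residues (guessed as a^i mod M in the real attack)
B = 1 << 11                # known bound on x and y

t = (N - c * d) // M       # exact division, since N = c*d (mod M)
cinv = pow(c, -1, M)

# affine lattice of all solutions to t = d*x + c*y (mod M)
b0, b1 = (1, -d * cinv % M), (0, M)
offset = (0, t * cinv % M)

def gauss_reduce(u, v):
    # Lagrange-Gauss reduction: shortest basis of a two-dimensional lattice
    while True:
        if u[0] ** 2 + u[1] ** 2 > v[0] ** 2 + v[1] ** 2:
            u, v = v, u
        n0 = u[0] ** 2 + u[1] ** 2
        mu = round((u[0] * v[0] + u[1] * v[1]) / n0)
        if mu == 0:
            return u, v
        v = (v[0] - mu * u[0], v[1] - mu * u[1])

b0, b1 = gauss_reduce(b0, b1)

# coordinates of (midpoint - offset) in the reduced basis, via Cramer's rule
target = (B // 2 - offset[0], B // 2 - offset[1])
det = b0[0] * b1[1] - b0[1] * b1[0]
u = (target[0] * b1[1] - target[1] * b1[0]) / det
v = (b0[0] * target[1] - b0[1] * target[0]) / det

found = None
for du in range(-64, 65):  # small search window around the rounded point
    for dv in range(-64, 65):
        i, j = round(u) + du, round(v) + dv
        x = offset[0] + i * b0[0] + j * b1[0]
        cand = c + M * x
        if 1 < cand < N and N % cand == 0:
            found = cand
            break
    if found:
        break
```

Here `c` and `d` are simply taken from the secret primes; in the real attack they come from iterating over the possible guesses $a^i \bmod M’$.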
<p>This gives us a complete algorithm to check a single candidate.
Together with an optimized value for $M’$ and an outer loop that bruteforces through the $|R’|$ possible guesses, this allows me to break a ROCA-vulnerable 512-bit RSA key in less than $900$ seconds using a single thread on my laptop.
As the outer loop can be trivially parallelized, breaking those keys on a more powerful server with many threads takes less than $30$ seconds.
I’ve also looked at using this approach for 1024-bit keys, but a rough estimate put the runtime far above the runtime of the ROCA attack.
For larger keys it is even worse, so I didn’t pursue that path.</p>
<h2 id="source-code">Source Code</h2>
<p>I’ve decided to release the <a href="https://gitlab.com/jix/neca">source code</a> of my attack implementation.
It’s implemented in C++ and uses <a href="https://gmplib.org/">GMP</a> for most bignum arithmetic, except inside the lattice reduction where custom routines are used.
It includes some low-level optimizations that I’ve glossed over, for example using floats to approximate bignums while keeping track of the accumulated error.</p>
<p>Feel free to contact me if you want to know more about specific parts or about the implementation.
I’ll also be at the <a href="https://events.ccc.de/congress/2017/wiki/index.php/Main_Page">34c3</a> and am happy to have a chat with anyone interested in this or related things.</p>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]},
TeX: {Macros: {
ord: "\\mathop{\\rm ord}",
lcm: "\\mathop{\\rm lcm}",
zzm: ["\\mathbb Z_{#1}^*", 1],
zza: ["\\mathbb Z_{#1}^+", 1],
sp: ["\\langle #1, #2 \\rangle", 2],
norm: ["\\lVert #1 \\rVert", 1]
}}
});
</script>
<script src='https://jix.one/js/MathJax/MathJax.js?config=TeX-MML-AM_CHTML'></script>
<div class="footnotes">
<hr />
<ol>
<li id="fn:1">Rejection sampling also allows for non-uniform source and target distributions, but simplifies to the described algorithm for a uniform source distribution and a target distribution that is uniform among all values of a given property and zero otherwise.
<a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li>
</ol>
</div>
Pushing Polygons on the Mega Drive
https://jix.one/pushing-polygons-on-the-mega-drive/
Tue, 16 May 2017 20:37:45 +0200
me@jix.one (Jannis Harder)
https://jix.one/pushing-polygons-on-the-mega-drive/
<p>This is a write-up of the polygon renderer used for the Mega Drive demo <a href="https://www.pouet.net/prod.php?which=69648">“Overdrive 2”</a> by Titan, released at the <a href="https://2017.revision-party.net/">Revision 2017</a> Demoparty.
As the Mega Drive can only display tilemaps, not bitmaps, and does not have the video memory mapped into the CPU address space, this turned out to be an interesting problem.
If you have not seen the demo yet, I recommend watching it before continuing.
You can find a <a href="https://youtu.be/gWVmPtr9O0g">hardware capture on YouTube</a>:</p>
<div style="position: relative; padding-bottom: 56.25%; padding-top: 0; height: 0; overflow: hidden; margin: 1em 0;">
<iframe src="https://jix.one/youtube/gWVmPtr9O0g" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;" allowfullscreen frameborder="0" title="YouTube Video"></iframe>
</div>
<p>Ironically, the 3D shown in YouTube’s preview screenshot is not done using the renderer described here; the Mega Drive and ship scenes are, though.</p>
<h2 id="3d-renderer-or-video-player">3D Renderer or Video Player?</h2>
<p>After our demo was shown at Revision, people naturally started speculating about how we realized several of the effects.
Some quickly concluded that a complete 3D renderer running fullscreen, with shading, at that framerate seemed implausible, and I would not disagree.
An alternative theory was that it is just streaming the frames (as nametable and patterns) from the cartridge.
That would certainly work and would actually be faster.
It has the huge downside of taking way too much ROM space though.
A back-of-the-envelope calculation,
assuming deduplication of solid-colored tiles and a slightly higher framerate,
ends up at roughly half of our 8MB ROM being used for 3D frames.
That is about ten times as much as the 3D scenes are using now.
Together with the large amount of PCM samples used by Strobe’s awesome soundtrack that would not leave much for the other effects or Alien’s beautiful graphics.
While there are some ways to compress frames a bit further, we decided on something more suitable for the 3D scenes.</p>
<h2 id="vector-animations">Vector Animations</h2>
<p>If a complete 3D renderer is not possible, it might be a good idea to pre-compute parts of it and only do the final steps on the Mega Drive.
That would be rasterization of projected, shaded and clipped polygons.
The data needed to describe a frame at that stage is quite small.
It consists of the polygon vertices in fixed-point screen coordinates and a palette index for the color.
Fixed-point is needed as rounding to integer coordinates looks wobbly when animated.
Choosing a suitable palette for each frame can be done during the preprocessing.
Storing it takes just a few bytes.
This is a good starting point, but now we need to figure out how to make drawing polygons fast.</p>
<p>There are three optimizations used to speed up polygon rendering:</p>
<ol>
<li>Avoiding overdraw,</li>
<li>Drawing from left to right,</li>
<li>Quickly drawing tiles.</li>
</ol>
<p>I will go through them in order and explain the details.</p>
<h2 id="flattening">Flattening</h2>
<p>There are two problems with just a list of projected polygons.
First, the order in which they are drawn matters.
Assuming no intersecting polygons, sorting them from back to front gives the correct result.
But this still leaves us with a second problem: overdraw.
This is not a problem in the sense that the rendering breaks, but rather in that a pixel is wastefully drawn multiple times, discarding the previous value.
Having to draw into a tilemap amplifies this problem.</p>
<p>The solution is to split all polygons that intersect or overlap, throwing away any resulting polygon that is occluded by another polygon.
This leaves us with a partition or, more specifically, a tessellation of the view plane.
As a further optimization adjacent polygons that have the same color can be joined, as their common edge(s) are not visible.
This can result in quite complex, even non-convex, polygons.
Apart from a small exception described in the next section, this is not a problem though.</p>
<p><img src="flattening.svg" alt="Flattening example using two cubes" class="large-figure"></p>
<h2 id="drawing-from-left-to-right">Drawing from Left to Right</h2>
<p>As we now have a tessellation of the view plane, drawing all polygons would compute and draw each edge twice; once for each adjacent polygon.
This can be avoided.
If we draw polygons strictly from left to right, we can use a single table to store all edges that have just one adjacent polygon drawn.
The table is just the x-position of the rightmost drawn edge for each scanline.
I am calling that table the “fringe table”.
It is initialized with all zeros, i.e. the left edge of the view.</p>
<p>There is a small problem though with some non-convex polygons.
If a polygon is U-shaped it is impossible to draw strictly from left to right; the enclosed polygon would have to go in between.
Rotated by 90 degrees, i.e. a C-shaped polygon, it is not a problem though.
This is solved by breaking U-shaped polygons into multiple polygons that do not have gaps in any scanline.
As a small optimization those breaks are preferably inserted on a tile boundary.
Why this is an advantage will become clear later.</p>
<p><img src="breaking.svg" alt="Breaking a U-shaped polygon into parts without scanline gaps" class="large-figure"></p>
<p>When drawing polygons like this, there is no need to even store the left side of a polygon.
Whenever a polygon is drawn, the left edge is already stored in the fringe table.
So apart from the computation time saved, we also save storage space.</p>
<p>So far we only avoided re-computing the left edges of a polygon; we still need to draw them.
In fact there is no way to avoid drawing the left edges, but we can save time drawing the right edges.
As we know anything beyond the right edges will be overdrawn, we do not need to be exact while drawing.
An easy way to save some time while drawing into a tile map is to completely fill any tile containing a right edge, leaving it to the adjacent polygon to draw the exact edge.
This brings us to the next optimization.</p>
<h2 id="quickly-drawing-tiles">Quickly Drawing Tiles</h2>
<p>The final challenge to overcome is efficiently drawing to a tilemap.
As a first step the line drawing is decoupled from the handling of the tilemap.
This is done by introducing a copy of the fringe table, called the “outline table”.
In between drawing polygons those two tables are always the same.
The line drawing routine updates the outline table to contain the right side of the polygon.
This is done for all right side edges of the polygon before any actual drawing to the tilemap happens.
Afterwards the polygon to draw is exactly the area delimited to the left by the fringe table and to the right by the outline table.</p>
<p><img src="outline-setup.svg" alt="Fringe and outline tables delimiting a polygon" class="large-figure"></p>
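<p>The bookkeeping can be illustrated with a small Python model that ignores tiles and draws single pixels (the two polygons here are hypothetical, each given only by its right edge per scanline):</p>

```python
WIDTH, HEIGHT = 16, 8
screen = [[None] * WIDTH for _ in range(HEIGHT)]
fringe = [0] * HEIGHT  # x of the rightmost drawn edge per scanline
writes = 0

def draw_polygon(color, outline):
    """outline[y] is this polygon's right edge (exclusive) per scanline;
    it equals fringe[y] on scanlines the polygon doesn't touch."""
    global writes
    for y in range(HEIGHT):
        for x in range(fringe[y], outline[y]):
            screen[y][x] = color
            writes += 1
    # after drawing, the outline becomes the new fringe
    for y in range(HEIGHT):
        fringe[y] = outline[y]

# two hypothetical polygons tiling the screen, drawn left to right
draw_polygon(1, [8] * HEIGHT)      # left polygon
draw_polygon(2, [WIDTH] * HEIGHT)  # right polygon
```

Every pixel is written exactly once, because each polygon only fills from the fringe up to its own outline.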
<p>The line drawing routine also outputs the topmost and bottommost y-coordinates of the polygon.
Those are rounded outwards to the next tile boundary, i.e. a multiple of 8 pixels.
This is safe to do, as the fringe and outline table are identical for scanlines outside of the polygon area,
indicating that nothing should be drawn there.
This allows us to process the polygon tile-row by tile-row without a special case for partial tile-rows.</p>
<p>The tile-rows are processed from top to bottom.
First we compute three x-coordinates for each tile-row.
The leftmost and rightmost value in the fringe as well as the rightmost value in the outline of the tile-row.
Those span the area where the polygon needs to be drawn and also divide the tile-row into two segments.
The left segment contains edges of already drawn polygons while the right segment does not.
To avoid special cases, those values are rounded to tile boundaries too.</p>
<p><img src="segments.svg" alt="A tile-row divided into left and right segments" class="large-figure"></p>
<p>Both segments are then processed tile by tile from left to right.
The left segment is drawn before the right segment, but we will first look at the right one, which does not contain left edges or any already drawn polygons.
All tiles in the right segment can be completely filled with the color of the current polygon.
This is legal as it will not draw over existing polygons and will only overshoot to the right.
The overshoot will be fixed by drawing subsequent polygons.
To speed this up even more, we precompute a solid-colored pattern for each of the 16 colors.
This means we can draw an 8x8 tile in the right segment by updating a single word in the nametable to point to the precomputed pattern.</p>
<p>Before this happens, though, the left segment is drawn.
Although we draw it first, we can be sure that every tile of the left segment was a tile of a right segment of a previous polygon.
This might sound counterintuitive, but it is possible as the left segment can consist of zero tiles, which it will for the leftmost polygons.</p>
<p>For each tile of the segment there are two cases we need to consider.
One is that the tile was only drawn as part of a right segment so far,
the other is that it was also part of one or more polygons’ left segments.</p>
<p>In the first case, the nametable entry for the tile points to one of the precomputed solid patterns.
In the second case, the nametable entry points to an individual pattern just for this tile.</p>
<p>If the tile was solid so far, we need to change the nametable to point to an individual pattern.
This is done by using a simple bump allocator that allocates a continuous range of tiles.
Having a fixed pattern address for each tile would probably be faster here, but it would also mean that the used patterns are scattered throughout memory.
This is a huge downside on the Mega Drive as the VRAM is not memory mapped.
In fact, while drawing a frame, we are not updating the pattern data or nametable at all, but a shadow copy in work memory.
After we are done, a DMA transfer is used to quickly copy it over to VRAM.
At that point having a compact consecutive memory area containing all patterns saves a lot more cycles than using fixed pattern addresses here.</p>
<p>After allocating a pattern, we need to draw it.
A newly allocated pattern always needs to be drawn in two colors.
The color of the previously solid tile and the color of the current polygon.
Each line of the pattern will have the old color on the left and the new color on the right, potentially consisting of just one of those colors.
The fringe table tells us where the polygon edge is, i.e. where the color change must be.</p>
<p>Conveniently, a line of a pattern, consisting of 8 pixels, 4 bit each, perfectly fits into a 32 bit register of the 68000 CPU.
This allows us to apply a mask telling us where each color is supposed to go using a single <code>and.l (a0, d0), d1</code> instruction.
The register <code>d0</code> contains the value coming from the fringe table (multiplied by 4 to do long-word-wise addressing) and <code>d1</code> contains the data to mask.
The register <code>a0</code> points into a special table.
The table looks like this, where each line shows the bits of a 32-bit long word.</p>
<pre><code class="language-plain">11111111111111111111111111111111
11111111111111111111111111111111
... many repetitions ...
11111111111111111111111111111111
11111111111111111111111111111111
00001111111111111111111111111111
00000000111111111111111111111111
00000000000011111111111111111111
00000000000000001111111111111111
00000000000000000000111111111111
00000000000000000000000011111111
00000000000000000000000000001111
00000000000000000000000000000000
00000000000000000000000000000000
... many repetitions ...
00000000000000000000000000000000
00000000000000000000000000000000
</code></pre>
<p>Depending on the x-coordinate of the tile <code>a0</code> points to a different position in that table.
By padding the table with enough all-one or all-zero words, there is no need to do any clipping to the tile boundary, which greatly speeds up drawing.
Together with a smaller table containing solid colored lines and some bit twiddling this completes the newly allocated pattern drawing routine.</p>
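The table above can be built programmatically. Here is a sketch in Rust; the padding length is an arbitrary choice for illustration, while the real table just needs enough padding to cover any fringe value that falls outside the tile:

```rust
// Padding entries on each side; assumed value for this sketch.
const PAD: usize = 16;

/// Build the mask table: all-ones padding, the eight per-pixel-column
/// masks (4 bits per pixel, edge moving right one pixel per entry),
/// then all-zeros padding.
fn build_mask_table() -> Vec<u32> {
    // Fringe far to the left of the tile: the whole line is new color.
    let mut table = vec![0xFFFF_FFFFu32; PAD];
    for x in 1..=8u32 {
        // Edge at pixel column x: the leftmost x pixels (high nibbles)
        // keep the old color (mask bits 0), the rest take the new color.
        table.push(if x == 8 { 0 } else { 0xFFFF_FFFFu32 >> (4 * x) });
    }
    // Fringe far to the right of the tile: the whole line keeps the old color.
    table.extend(std::iter::repeat(0u32).take(PAD));
    table
}
```

Indexing at `PAD + fringe_x` then works even when the fringe lies left or right of the current tile, which is exactly why no clipping is needed.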
<p>In the case of a pattern that was already allocated, the same approach is used.
The only difference is that the mask isn’t used to mask between two colors, but between the old data of a line and the new color.</p>
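Both cases thus reduce to the same per-line blend, equivalent to the masking sequence around the <code>and.l</code> instruction. A sketch in Rust, where one <code>u32</code> holds one 8-pixel line at 4 bits per pixel (the function name is mine):

```rust
/// Take the new pixels where the mask bits are set and keep the
/// old pixels elsewhere.
fn blend_line(old_pixels: u32, new_pixels: u32, mask: u32) -> u32 {
    (new_pixels & mask) | (old_pixels & !mask)
}
```

For a freshly allocated pattern, <code>old_pixels</code> is a solid line of the tile’s previous color; for an already allocated pattern it is the line currently stored in the shadow buffer.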
<p>After completing a row the corresponding entries of the outline table are copied into the fringe table and the next tile row is processed.
When all tile rows are drawn the fringe table is the same as the outline table again, ready for the next polygon to be drawn.</p>
<p><img src="polygon-complete.svg" alt="Flattening example using two cubes" class="large-figure"></p>
<h2 id="putting-it-all-together">Putting it All Together</h2>
<p>This concludes the description of our polygon renderer routine.
You can see the Mega Drive implementation in action below.
This animation was captured from an emulator running a patched version of the routine.
The patched version copies the nametable and pattern data into VRAM after every polygon and waits for the next frame.
The palette is updated in the end, resulting in the false colors while drawing is in progress.
The garbage tiles that appear sometimes are nametable entries that were not yet touched in the current frame.
Those entries might point to already allocated and redrawn patterns.</p>
<p><img src="animated.gif" alt="Flattening example using two cubes" class="figure"></p>
<p>As an overview for anyone who wants to implement this or a similar routine I’ve summarized it using pseudocode:</p>
<pre><code class="language-plain">routine draw_polygon(right_edges, color):
    foreach edge in right_edges:
        update outline, min_y and max_y using line drawing routine
    round min_y and max_y to tiles
    for each row within min_y ... max_y:
        min_fringe = min(fringe[y] for y in current row)
        max_fringe = max(fringe[y] for y in current row)
        max_outline = max(outline[y] for y in current row)
        round min_fringe, max_fringe and max_outline to tiles
        for each column within min_fringe ... max_fringe:
            column_line_table = line_table + column * 8 entries
            if nametable[column, row] is solid:
                old_color = color of nametable[column, row]
                pattern = nametable[column, row] = alloc_pointer
                increment alloc_pointer
                old_pixels = color_table[old_color]
                new_pixels = color_table[color]
                for y in current row:
                    mask = column_line_table[fringe[y]]
                    pattern[y] =
                        (new_pixels & mask) | (old_pixels & ~mask)
            else:
                pattern = nametable[column, row]
                new_pixels = color_table[color]
                for y in current row:
                    old_pixels = pattern[y]
                    mask = column_line_table[fringe[y]]
                    pattern[y] =
                        (new_pixels & mask) | (old_pixels & ~mask)
        for each column within max_fringe ... max_outline:
            nametable[column, row] = solid pattern for color
        for y in current row:
            fringe[y] = outline[y]
</code></pre>
<p>The actual implementation consists of around 600 lines of 68000 assembly, making some use of macros and repeat statements.
The preprocessor was implemented in <a href="https://www.rust-lang.org/">Rust</a>, which I can highly recommend.
The implementation and fine-tuning took somewhere around 3 weeks plus some evening coding.
Most of it was done during the last summer.
Coming up with and improving the concept behind this was done on and off over many years, not targeting the Mega Drive in particular.</p>
<p>Working on and releasing Overdrive 2 was an awesome experience.
I want to thank everyone involved for making this possible.</p>