Refactoring Varisat: 1. Basics and Parsing

Posted on February 3rd 2019, tagged refactoring-varisat, rust, sat

This is the first post in a series of posts I plan to write while refactoring my SAT solver varisat. In the progress of developing varisat into a SAT solver that can compete with some well known SAT solvers like minisat or glucose, it accumulated quite a bit of technical debt. Varisat is the first larger project I’ve written in rust and there a lot of things I’d do differently now. Before I can turn varisat into a solver that competes with the fastest solvers out there, I need to do some refactoring.

My current plan is to start a new project from scratch copying over bits that I want to keep and rewriting parts that I don’t. That’s usually my preferred way to refactor when I plan to change the overall architecture. The new version will be varisat 0.2, hopefully turning into varisat 1.0.

This refactoring also gives me the chance to write this series of posts, which should make it much easier to understand the code base and contribute to varisat. I’m also moving varisat to GitHub and stable rust which should make collaboration within the rust open source ecosystem easier.

I also want to use my new library partial_ref which should result in way less fighting the borrow checker. Currently in varisat there are a lot of functions that take way too many parameters, mostly references to different data structures. This is caused by the borrow checker being not flexible enough across function calls. My partial_ref library offers a workaround that I think will be an improvement compared to the workarounds I’ve been using before.

CNF Formulas

SAT solvers determine whether a Boolean formula can be satisfied. They either find an assignment (also called interpretation) of the formula’s variables so that the formula is true, or produce a proof that this is impossible. Usually SAT solvers require the input to be in conjunctive normal form (CNF). This means that the formula is a conjunction (Boolean and) of clauses, where a clause is a disjunction (Boolean or) of literals and a literal is a variable or a negated variable. An assignment satisfies a formula in CNF precisely when at least one literal of each clause is true.

SAT solvers require the input to be in CNF as this is the internal representation used. This isn’t a big restriction though, as it is possible to turn any Boolean formula into an equisatisfiable formula in CNF with only linear overhead by introducing new variables. Equisatisfiable means that either both formulas are satisfiable or both are not. This is weaker condition than equivalence which means that exactly the same assignments satisfy both formulas. Here equisatisfiability allows the introduction of new helper variables.

In varisat 0.1, a formula is directly parsed into the internal data structures of the solver. There is no standalone data type representing a CNF formula. Such a data type isn’t very useful inside the solver. More specialized data structures are used there. Nevertheless, I think such a type would be useful when using varisat as a library. It makes it easier to re-use the parser or write other code that processes CNF formulas.

To represent a formula in CNF we need a type for variables and for literals. Variables are indexed using integers and are represented by their index. This is also the encoding used by the standard CNF file format (DIMACS CNF). Internally for varisat the first variable has index 0, while in DIMACS CNF the first variable has index 1. For everything user facing the 1-based DIMACS CNF encoding will be used.

A literal is represented by their variable’s index and a flag that tells us whether the literal is negated or not. In DIMACS CNF the flag is represented by negating the variable index. That doesn’t work for the variable with index 0 though. So to avoid that problem, internally literals use the least significant bit as a negation marker, shifting the variable index one bit to the left.

To save on memory usage and bandwidth, literals and variables are stored in 32-bit. This limits the number of variables to $2^{31}$ . For now the actual limit in varisat is set quite a bit below $2^{31}$ , leaving room for further flags or sentinel values. As far as I know, most SAT solvers do this.

For variables and literals I’m quite happy with the existing code from varisat 0.1, providing the types Var and Lit, so I added it almost verbatim to the new project.

Equipped with literals we can now implement a type to store a CNF formula. It would be possible to just use a Vec<Vec<Lit>>, but that requires an allocation for each clause. Given that formulas with millions of clauses are used in practice, that doesn’t sound so good. Instead I’m going to use a struct with a Vec<Lit> containing the literals for all clauses and a Vec<Range<usize>> containing the range where each clause’s literals are stored. You can see the implementation here.

Parsing

In varisat 0.1 I tried to make the parser as forgiving as possible. It also completely ignored the header line. I think that was a mistake. It added complexity, causing a long standing bug when parsing empty clauses and made it less likely to detect formulas that somehow got truncated. I still don’t require a header, but if one is present and the formula doesn’t match the header, at least a warning is generated.

The parsing code itself is a hand-rolled parser that is largely based on the one in varisat 0.1. You can feed it chunks (byte slices) of input data. While parsing a chunk, clauses are added to a parser internal CNF formula. At any point it is possible to retrieve the clauses parsed so far, clearing the internal CNF formula. This allows for incremental parsing, which is useful for the solver with its own clause database, but also allows for parsing a complete file into a single CNF formula, useful for various utilities. Varisat 0.1 also used incremental parsing, but instead of the caller asking for the clauses parsed so far, it required a callback to process clauses. Combined with error handling that approach wasn’t nice to use.

Also new is some code to write a CNF formula back into a file. This is useful for writing various CNF processing utilities, but even more important for testing.

Testing

While varisat 0.1 had some tests, my plan is to get much better test coverage for varisat 0.2. I had quite a few bugs that stayed undetected way too long. To make testing easier I’ll be using property based testing for varisat 0.2. I’ve been using property based testing before, for example using QuickCheck in Haskell or using Hypothesis in Python. Recently I’ve discovered the excellent proptest crate, which is inspired by Hypothesis.

Property based testing changes the focus from individual test cases to more general properties. You specify a set of values, using combinators provided by the library, and some property that should hold for those values. The property is written as normal rust code using assertions. The library will then sample lots of values matching your specification and test them against your property, often finding non-obvious corner cases in the process. In a way it is a hybrid between fuzzing and unit tests. When a counterexample is found, propcheck also systematically tries to find simpler counterexamples by shrinking the values. It also saves counterexamples for future regression testing.

The CNF parser and writer are a good example for this. By generating random CNF formulas, writing them, parsing them back and comparing the result, a lot of code paths are exercised and tested with very little effort.

I usually try to write a generic property test first, and then use a code coverage tool to identify what else needs to be tested. In the case of this parser what was left was mostly the checks for invalid syntax and integer overflows.

What’s next?

The next thing to add is a clause database, probably followed by unit propagation. After that there will be conflict analysis, branching heuristics, learned clause minimization, glue computation, database reduction, database garbage collection, restarts and proof generation, although not in that order. If I’m done with that I’ll be roughly at a point where I am now with varisat 0.1, but hopefully with a much cleaner code base.

I plan to cover the complete refactoring process on this blog. I’m not sure how regular and in what detail, but my goal is that someone not familiar with SAT solver internals can use this series of posts as a starting point for hacking on varisat.

If you don’t want to miss future posts, you can subscribe to the RSS feed or follow me on Twitter.