Refactoring Varisat: 4. Heuristics
This is the fourth post in my series about refactoring varisat. In the last post we saw how conflict driven clause learning works, in this post we’re going to make it fast. To get there we add several smaller features that were already present in varisat 0.1. While there are still some things missing that varisat 0.1 supports, these are features like proof generation or incremental solving that don’t affect the solving performance.
This means you can solve some non-trivial instances using varisat 0.2 today. I haven’t made a new release on crates.io yet, but you can install it directly from github (replacing any previous version installed):
cargo install --force --git https://github.com/jix/varisat
One part of making the solver faster were low-level optimizations in the unit-propagation implementation. I won’t go into the details of that, but you can look at the new code. The bigger part were several heuristics that play a big role in making CDCL perform well. Below I will explain each implemented in varisat so far.
Branching: Variable Selection
Whenever unit propagation finishes but there are still unassigned variables left we need to make a guess. So far we used the first unassigned variable we found, which is not a very good strategy. It turns out that it helps to quickly produce conflicts and this is something to optimize for. It’s also important that the strategy is fast to compute.
The heuristic I’ve implemented is called “Variable State Independent Decaying Sum” or VSIDS. The idea is to keep an activity score for each variable which is updated after every conflict. When it’s time to decide on an unassigned variable to branch on, the one with the highest activity score is picked. Originally this heuristic kept scores for each possible literal, i.e. also guessed which polarity to pick. We will see that there is a better heuristic for selecting the polarity, so VSIDS is only used to select a variable.
The updating of activity scores is based on the observation that variables that were involved in recent conflicts are likely to be involved in future conflicts. A variable is involved if it is in the learned clause or was resolved during conflict analysis. There are multiple explanations contributing to this behavior, although I don’t think there is a simple one that can explain how well this strategy works. A part of it certainly is that for a given formula some variables are more likely than others to cause conflicts. On top of that there is an emergent behavior where the SAT solver focuses on a specific part of the search space. By picking active variables, those are more likely to be in future learned clauses, thus more likely to propagate, more likely to be involved in a future conflict and thus more likely to keep a high activity score. Other heuristics described later on also contribute to this.
Initially all activity scores are at zero. On conflict, the scores of the involved variables are incremented by one. This is called bumping. Afterwards the scores of all variables are multiplied by a constant decay factor smaller than 1. This gives more weight to recent bumps.
This is implemented efficiently by doing two things. The first is to keep the unassigned variables in a max-heap ordered by their activity score. This allows us to quickly identify the highest scoring variable. We allow assigned variables in the heap but make sure that no unassigned variable is missing. When we need to select a variable we pop variables until we find an unassigned one. During backtracking we need to re-insert all variables that become unassigned.
The decay operation scales all activities by the same value, this doesn’t change their relative order, so the heap stays valid. When a variable is bumped the relative order can change, so we need to update its position in the heap. This means we need to use a heap that supports updating of keys, which many variants do.
Using a heap to find the maximum score might seem useless when we’re touching every activity score anyway. This is where the second optimization comes in. Instead of multiplying all activity scores we keep a single scaling factor common to all variables. To bump a variable, we add this scaling factor instead of 1. To decay all values we simply divide the common scaling factor by the decay factor. For the heap operations we can ignore the scaling factor as it is the same for all values. This is much faster but will, at some point, overflow the floating point values used. To avoid this, we detect when the scaling factor becomes too large and apply it to all activity scores before resetting it back to 1. This happens rather infrequently.
Branching: Phase Saving
Now we know how to select a decision variable when branching. We still have to chose whether to assign true or false. Here a very simple but also very effective heuristic is used. We always choose the polarity that the variable had when it was last assigned. This is called phase saving.
I don’t have a great explanation for why this is a good strategy. It is sometimes compared to stochastic local search methods. These are incomplete algorithms for solving SAT that always keep a full assignment and repeatedly flip variables to decrease the number of unsatisfied clauses. I think it is interesting that here a strategy that tries to avoid conflicts is used. My current understanding is that this causes the solver to focus on finding a solution close to the saved assignment while the variable selection ensures that it will quickly learn why the current saved assignment might not work. This is something where I want to run more experiments to better understand the behavior.
While these heuristics are a lot better than arbitrary choice, they are still not perfect. The runtime of a recursive search like this is very sensitive to the decisions made on the outer levels. For example if we have a satisfiable instance and make a wrong guess for the first decision, we will only undo that decision if we can prove that the guess was wrong. This could take a lot of steps, even if the opposite choice would directly satisfy the formula. Unsatisfiable instances show a similar behavior.
To limit the time spend solving with a potentially bad initial guess, from time to time the solver backtracks all the way. This is called a restart. This isn’t equivalent to starting over. As we keep the learned clauses, the variable activities and use phase saving there is a lot of useful information kept.
Now we need to decide how often such a restart should happen. In varisat I’m using a popular strategy based on the Luby sequence. The idea here is to use a random variable to describe the time it would take to solve the instance without restarts. With each restart we take a sample of this random variable. If the result is smaller than the time until the next restart we solve the instance. Using successive values of the Luby sequence as times between restarts is good, no matter how the random variable is distributed. Asymptotically, it is at most a constant factor slower than any other strategy that is independent of the distribution and at most a logarithmic factor slower than any other strategy at all.
Now it isn’t quite true that restarts correspond to independent samples of a random variable, but in practice it’s still a good strategy and easy to implement as it is a static restart schedule. For some instances though, different restart strategies that are adaptive help a lot, so I will look into this in the future.
Clause Database Reduction
In the last post I already mentioned that learning more and more clauses causes unit propagation to become slower and slower. To stop it from grinding to a halt we need to remove some learned clauses from time to time. It turns out to be useful to do this rather aggressively. Nevertheless we do want to keep clauses that are useful. This means we need to have a heuristic that allows us to identify those clauses.
Varisat uses two metrics to assess clauses. The first one is clause activity. This works very much like variable activity. It is increased whenever a clause is resolved as part of conflict analysis and decayed after a conflict. It uses the same scaling optimization but doesn’t use a heap.
The second metric used is the glue level also called literal block distance or LBD. This was introduced by the SAT solver Glucose. The idea is to measure how easily a clause becomes propagating. A first approximation would be the clause’s length. A longer clause requires more literals to be false to propagate the remaining literal. The crucial observation behind glue levels is that often some of these literals are assigned at the same time. To capture this, the literals of the clause are partitioned by the decision level in which they were assigned. The glue level is the number of partitions we get this way.
For each learned clause we store the smallest glue level observed. This is initialized when the clause is added and will be updated whenever a clause is resolved during conflict analysis and happens to have a smaller glue level. Note that Glucose computes glue levels that are one larger than what I’m using. This is because during backtracking the glue level of the newly learned clause changes when the assignment of the UIP flips and the clause goes from unsatisfied to propagating. Glucose uses the glue level during the conflict. I find it easier to reason about glue levels when they are defined for propagating clauses so I used that instead.
The way we use these metrics to reduce the number of learned clauses roughly follows an approach described by Chanseok Oh. The glue level is the main criterion for clause database reduction. Learned clauses are partitioned into three tiers depending on their glue level. Clauses with a glue of 2 or lower are the “Core” tier, clauses with a glue from 3 to 6 the “Mid” tier and the remaining ones the “Local” tier. When a clause’s glue level is reduced it can move to a different tier. Clauses in the Core tier are considered useful enough to be kept indefinitely. Half of the clauses in the Local tier are removed every 15,000 conflicts. For the Mid tier we keep track which clauses were involved in a conflict. This is just a single bit per clause that we set at the same time that we’re bumping the clause’s activity. Every 10,000 conflicts all clauses from the Mid tier that were not involved in a conflict are demoted to the Local tier.
While reducing the clause database we also make sure to never delete a clause that is currently propagating as this would invalidate the implication graph.
Recursive Clause Minimization
In the last section I argued that a longer clause can be more useful than a shorter one if it is more likely to propagate. In contrast to that, a clause that consists of a subset of literals of another clause is always more useful. Whenever the longer of these propagates the shorter will too, but not the other way around.
Often when learning a clause from a conflict, there is shorter valid clause that is a subset of the initially computed clause. This happens when setting some of the clause’s literals to false implies that other literals of the clause are false. In that case those implied literals cannot satisfy the clause unless the other literals already do. Thus these literals are redundant and it is safe to remove them.
Recursive clause minimization tries to find such implications in the implication graph after conflict analysis. Apart from the UIP, each literal in the learned clause is tested for redundancy. This is done by performing a search following implications backwards. When a decision literal is hit, the search is aborted and the literal is not redundant. If another literal of the clause is hit, the search doesn’t expand that literal, but continues. When the search finishes without hitting a decision literal the literal we started with is redundant. I think this search is the reason for the name, although the search is implemented with a loop and explicit stack and not recursively.
The implementation makes a few optimizations that I won’t go into here, but that I explained in detail in the comments.
Unit Clause Simplification
Finally we can simplify our formula when there are unit clauses present. Whenever top-level unit propagation finishes, the variables assigned can be removed from the formula. If a clause contains a literal that is assigned false, that cannot satisfy the clause anymore and thus that literal can be removed from the clause. If a clause contains a literal that is assigned true, it already is satisfied and we delete the whole clause.
To get feature parity with varisat 0.1 I still need to add incremental solving and proof generation. This means there are still one or two posts left to complete this series.