Tags: compression, rust, sat

Cnfpack is a compressed file format for CNF formulas that I designed. Converting a DIMACS CNF formula to Cnfpack and back preserves the order of clauses as well as the order of literals within each clause. Comments, optional whitespace, and leading zeros in the DIMACS input are not stored.

On typical CNF formulas, such as those used in the yearly SAT Competitions or found in the Global Benchmark Database, cnfpack reaches much better compression ratios than generic compressors, often by orders of magnitude, while still being very fast during decompression. Compression is much slower, so this is mostly intended for benchmarks that are reused many times.

You can find cnfpack on GitHub and on crates.io. If you have a Rust toolchain including cargo installed and have ~/.cargo/bin in your PATH, you can install cnfpack using `cargo install cnfpack`.

Example Usage

# Download example instance
wget -nv -O bvsub_12973.smt2.cnf.xz "$URL" # (download URL omitted here)
# Decompress `xz` file
xz -dk bvsub_12973.smt2.cnf.xz
# Convert to `cnfpack`
cnfpack bvsub_12973.smt2.cnf bvsub_12973.smt2.cnfpack
# Check file sizes
du -bh bvsub_12973.smt2.{cnf,cnf.xz,cnfpack}
#> 20M  bvsub_12973.smt2.cnf
#> 1.5M bvsub_12973.smt2.cnf.xz
#> 2.2K bvsub_12973.smt2.cnfpack
# Decompress and compute GBD hash to verify the formula
cnfpack -d bvsub_12973.smt2.cnfpack | tail +2 | head -c -1 | tr '\n' ' ' | md5sum
#> 5fb0d1f02c02c6a7fb485707b637d7e4 -
# ^ Matches the hash in the download URL

Note the different units for the file sizes.

Approach Used

Cnfpack works by preprocessing the input, encoding it in binary and compressing the resulting data using zstd. I noticed that typical structured CNF formulas often repeat the same structures many times with different variable IDs, so the preprocessing aims to turn this into literal repetitions.
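To illustrate why this helps, here is a toy experiment in Python. It uses a deliberately simplified delta scheme (delta-encoded absolute IDs plus a separate sign stream; this is not cnfpack's actual encoding) and zlib as a stand-in for zstd, but it shows the effect in kind: a formula built from one gadget repeated with shifted variable IDs compresses far better once the repetition is made literal.

```python
import zlib

# Toy formula: one three-literal gadget repeated 1000 times with
# shifted variable IDs, mimicking the repeated structure of many
# real CNF benchmarks.
clauses = [[i, i + 1, -(i + 2)] for i in range(1, 2999, 3)]

# Plain DIMACS clause text.
raw = "".join(" ".join(map(str, c)) + " 0\n" for c in clauses).encode()

# Simplified preprocessing (not cnfpack's actual scheme): delta-encode
# the absolute IDs, keep the signs as a separate stream, pack in binary.
lits = [lit for c in clauses for lit in c]
ids = [abs(l) for l in lits]
deltas = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]
signs = bytes(l < 0 for l in lits)
packed = b"".join(d.to_bytes(4, "little", signed=True) for d in deltas) + signs

# zlib stands in for zstd here; the delta streams are almost entirely
# literal repetitions, so they compress far better than the raw text.
print(len(zlib.compress(raw, 9)), ">", len(zlib.compress(packed, 9)))
```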

This is done using a delta-encoding where each ID is encoded as a delta to the nearest ID seen within a small moving window. This turns every ID into an (offset, delta) pair. Before compressing with zstd, I store all offsets and all deltas non-interleaved in two separate arrays, with a third array storing the clause lengths. This allows exploiting repetitions that are present in only a subset of those three streams, which in practice saves more than is lost by having to encode three repetitions when all streams repeat.
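The (offset, delta) scheme can be sketched as follows. This is a hypothetical simplification, not cnfpack's actual encoder, but it shows how two clauses with the same shape and shifted IDs map to identical output in all three streams.

```python
def delta_encode(clauses, window=16):
    """Encode each literal as an (offset, delta) pair relative to the
    nearest ID in a small moving window of recently seen literals.
    Hypothetical sketch, not cnfpack's actual on-disk encoding."""
    recent = []                    # moving window of recent literal IDs
    offsets, deltas, lengths = [], [], []
    for clause in clauses:
        lengths.append(len(clause))        # third stream: clause lengths
        for lit in clause:
            if recent:
                # index (0 = most recent) of the window entry closest to lit
                off, nearest = min(enumerate(reversed(recent)),
                                   key=lambda p: abs(lit - p[1]))
                offsets.append(off)
                deltas.append(lit - nearest)
            else:
                offsets.append(0)
                deltas.append(lit)
            recent.append(lit)
            if len(recent) > window:
                recent.pop(0)
    # three separate, non-interleaved streams
    return offsets, deltas, lengths

# Two clauses with the same shape but shifted variable IDs
# produce literally repeating (offset, delta) pairs:
print(delta_encode([[1, 2, 3], [4, 5, 6]]))
# ([0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1], [3, 3])
```

In this sketch the repetition shows up in all three streams at once; on real formulas, storing the streams separately also captures cases where only the lengths or only the offsets repeat.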

As this approach has been effective across a wide range of CNF benchmarks, I would expect this form of preprocessing to also work for other formats that represent graphs or hypergraphs with partially repetitive structures using sequential IDs.