Evaluation data for the CADE submission 3160 (from the 2022 SMT competition)

LRA.xlsx

NRA.xlsx

LIA.xlsx

NIA.xlsx

BV.xlsx

Note that there are minor discrepancies with the results presented on the SMT competition website:

For BV, the website reports Q3B solving 832 instances and Q3B-pBDD solving 753 instances; we count 835 and 754, respectively.
For LIA, the website reports UltimateEliminator solving 228 instances; we count 230.
For NIA, the website reports Vampire solving 63 instances; we count 66.

All other numbers match.

We are investigating why there are these minor differences. Note that none of them concerns YicesQS, Z3, or CVC5, and they do not change the general landscape of the results and our evaluation.

Also note that the website shows a column entitled "CPU Time Score", which is different from the cumulative runtimes we present in our submission. In our submission, we indicate, for each logic and each solver, the total time that the solver took to solve the instances that it did solve. The "CPU Time Score" on the website is the total time that the solver took over all instances used for the logic, including the instances the solver did not solve. Our choice is motivated by the fact that the "CPU Time Score" is often dominated by the number of timeouts; for instance in NRA, YicesQS has spent 97% of the "CPU Time Score" of 6164s on 5 instances that timed out and 3% (165s) on the 94 instances it did solve. If instead of timing out, the solver had crashed, run out of memory, or if it had decided to give up after less than the 20 minute timeout, the "CPU Time Score" could be massively decreased. We therefore believe that score does not have great significance. For instance in NRA, Z3 solves the same number of instances as YicesQS, namely 94 out of 99, in more time (315s instead of YicesQS's 165s), but its "CPU Time Score" of 5004 is lower than YicesQS's 6164s, simply because on 2 instances out of the 5 instances it cannot solve, Z3 chooses to give up the search early instead of trying until time runs out (20 minutes). Therefore using the "CPU Time Score" as a metric gives bonuses to crashes, running out of memory, and giving up early as a gamble about whether using the whole 20 minutes may not help the solver produce an answer.