Consider the SAT problem for an arbitrary CNF C. A partitioning of C is a set of formulas
$$\begin{aligned} C\wedge G_j,\quad j\in \{1,\ldots ,s\} \end{aligned}$$
such that for any \(i,j\) with \(i\ne j\) the formula \(C \wedge G_i \wedge G_j\) is unsatisfiable and
$$\begin{aligned} C\equiv C \wedge G_1 \vee \cdots \vee C \wedge G_s. \end{aligned}$$
(hereinafter “\(\equiv\)” denotes logical equivalence). Clearly, given a partitioning of the original SAT instance, the SAT problems for the formulas \(C \wedge G_j\), \(j\in \{1,\ldots ,s\}\), can be solved independently in parallel.
There exist various partitioning techniques. For example, one can construct \(\{G_j \}_{j=1}^s\) using a scattering procedure, a guiding path solver, a lookahead solver and a number of other techniques described in Hyvärinen (2011). Unfortunately, for these partitioning methods it is hard in the general case to estimate the time required to solve the original problem. On the other hand, a number of papers on SAT-based cryptanalysis of several keystream ciphers used a partitioning method that makes it possible to construct such estimates in quite a natural way. In particular, Eibach et al. (2008), Soos et al. (2009), Soos (2010) and Zaikin and Semenov (2008) used for this purpose the time required to solve a small number of subproblems randomly chosen from the partitioning of the original problem. In this paper we give a strict formal description of this idea within the framework of the Monte Carlo method in its classical form (Metropolis and Ulam 1949). We also focus on some important details of the method that were not considered in previous works.
Consider SAT for an arbitrary CNF C over a set of Boolean variables \(X=\{x_1,\ldots ,x_n\}\). We refer to an arbitrary set \(\tilde{X}=\left\{ x_{i_1},\ldots ,x_{i_d}\right\}\), \(\tilde{X}\subseteq X\), as a decomposition set. Consider a partitioning of C that consists of a set of \(2^d\) formulas
$$\begin{aligned} C \wedge G_j,\quad j\in \{1,\ldots ,2^d\} \end{aligned}$$
where \(G_j\), \(j\in \{1,\ldots ,2^d\}\) are all possible minterms over \(\tilde{X}\). Note that each formula \(G_j\) takes the value true on a single truth assignment \(\left( \alpha _1^j,\ldots ,\alpha _d^j\right) \in \{0,1\}^d\). Therefore, a formula \(C \wedge G_j\) is satisfiable if and only if \(C\left[ \tilde{X} /\left( \alpha _1^j,\ldots ,\alpha _d^j\right) \right]\) is satisfiable. Here \(C\left[ \tilde{X} /\left( \alpha _1^j,\ldots ,\alpha _d^j\right) \right]\) is produced by setting the variables \(x_{i_k}\) to the corresponding values \(\alpha _k^j\), \(k\in \{1,\ldots ,d\}\): \(x_{i_1}=\alpha _1^j,\ldots ,x_{i_d}=\alpha _d^j\). We refer to the set of CNFs
$$\begin{aligned} \Delta _C(\tilde{X})=\left\{ C\left[ \tilde{X}/\left( \alpha _1^j,\ldots ,\alpha _d^j\right) \right] \right\} _{\left( \alpha _1^j,\ldots ,\alpha _d^j\right) \in \{0,1\}^d} \end{aligned}$$
as the decomposition family produced by \(\tilde{X}\). It is easy to see that the decomposition family is a partitioning of the SAT instance C.
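The construction of the decomposition family can be illustrated with a minimal sketch. It assumes a hypothetical representation that is not fixed by the text: a CNF is a list of clauses, each clause a list of non-zero integers in DIMACS style (`k` for \(x_k\), `-k` for its negation). The function names `substitute` and `decomposition_family` and the toy CNF are likewise illustrative.

```python
from itertools import product


def substitute(cnf, assignment):
    """Return C[X~/alpha]: drop satisfied clauses, delete falsified literals.

    `assignment` maps a variable index to 0 or 1. An empty clause in the
    result means the substituted formula is unsatisfiable.
    """
    result = []
    for clause in cnf:
        reduced = []
        satisfied = False
        for lit in clause:
            var = abs(lit)
            if var in assignment:
                if (lit > 0) == bool(assignment[var]):
                    satisfied = True
                    break
                # Otherwise the literal is false under alpha and is dropped.
            else:
                reduced.append(lit)
        if not satisfied:
            result.append(reduced)
    return result


def decomposition_family(cnf, decomposition_set):
    """Yield (alpha, C[X~/alpha]) for all 2^d assignments alpha of X~."""
    for alpha in product((0, 1), repeat=len(decomposition_set)):
        yield alpha, substitute(cnf, dict(zip(decomposition_set, alpha)))


# Toy instance (illustrative assumption): (x1 v x2) & (-x1 v x3) & (-x2 v -x3)
C = [[1, 2], [-1, 3], [-2, -3]]
family = dict(decomposition_family(C, [1, 2]))
```

With the decomposition set \(\{x_1,x_2\}\) the family has \(2^2=4\) members. The member for \((0,0)\) contains an empty clause, so that subproblem is trivially unsatisfiable.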
Let A be some SAT solving algorithm. Hereinafter we presume that A is complete, i.e. it halts on every input. We also presume that A is a non-randomized deterministic algorithm. We denote the total runtime of A on all the SAT instances from \(\Delta _C\left( \tilde{X}\right)\) as \(t_{C,A}\left( \tilde{X}\right)\). Below we suggest a method for estimating \(t_{C,A}\left( \tilde{X}\right)\).
Define the uniform distribution on the set \(\{0,1\}^d\). With each randomly chosen truth assignment \(\left( \alpha _1,\ldots ,\alpha _d\right)\) from \(\{0,1\}^d\) we associate a value \(\xi _{C,A}\left( \alpha _1,\ldots ,\alpha _d\right)\) that is equal to the runtime of A on CNF \(C\left[ \tilde{X} /\left( \alpha _1,\ldots ,\alpha _d\right) \right]\). Let \(\xi ^1,\ldots ,\xi ^Q\) be all the different values that \(\xi _{C,A}\left( \alpha _1,\ldots ,\alpha _d\right)\) takes on all the possible \(\left( \alpha _1,\ldots ,\alpha _d\right) \in \{0,1\}^d\). Below we use the following notation
$$\begin{aligned} \xi _{C,A}\left( \tilde{X}\right) =\left\{ \xi ^1,\ldots ,\xi ^Q\right\} . \end{aligned}$$
(1)
Denote the number of \(\left( \alpha _1,\ldots ,\alpha _d\right)\), such that \(\xi _{C,A}\left( \alpha _1,\ldots ,\alpha _d\right) =\xi ^j\), as \(\sharp \xi ^j\). Associate with (1) the following set
$$\begin{aligned} P\left( \xi _{C,A}\left( \tilde{X}\right) \right) =\left\{ \frac{\sharp \xi ^1}{2^d},\ldots ,\frac{\sharp \xi ^Q}{2^d}\right\} . \end{aligned}$$
We say that the random variable \(\xi _{C,A}\left( \tilde{X}\right)\) has distribution \(P\left( \xi _{C,A}\left( \tilde{X}\right) \right)\). Note that the following equality holds
$$\begin{aligned} t_{C,A}\left( \tilde{X}\right) =\sum \limits _{k=1}^Q \left( \xi ^k\cdot \sharp \xi ^k\right) =2^d\cdot \sum \limits _{k=1}^Q\left( \xi ^k\cdot \frac{\sharp \xi ^k}{2^d}\right) . \end{aligned}$$
Therefore,
$$\begin{aligned} t_{C,A}\left( \tilde{X}\right) =2^d\cdot \mathrm {E}\left[ \xi _{C,A}\left( \tilde{X}\right) \right] . \end{aligned}$$
(2)
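Equality (2) can be checked on a small worked example. The runtimes below are invented purely for illustration (\(d=3\), so eight subproblems). The sketch computes the total runtime directly and through the distribution \(P\left( \xi _{C,A}\left( \tilde{X}\right) \right)\).

```python
from collections import Counter
from fractions import Fraction

# Hypothetical runtimes xi(alpha), one per assignment alpha in {0,1}^3.
d = 3
runtimes = [5, 2, 2, 7, 5, 2, 7, 2]

# t_{C,A}(X~): total runtime over the whole decomposition family.
total = sum(runtimes)

# counts[xi^k] is the number of assignments with runtime xi^k.
counts = Counter(runtimes)

# E[xi] = sum over distinct values xi^k of xi^k * (count / 2^d).
expectation = sum(Fraction(value * count, 2**d) for value, count in counts.items())
```

Here the distinct values are 2, 5 and 7 with multiplicities 4, 2 and 2, so the expected value is \(32/8=4\) and \(2^d\) times it gives back the total of 32.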
To estimate the expected value \(\mathrm {E}\left[ \xi _{C,A}\left( \tilde{X}\right) \right]\) we will use the Monte Carlo method (Metropolis and Ulam 1949). According to this method, a probabilistic experiment that consists of N independent observations of values of an arbitrary random variable \(\xi\) is used to approximately calculate \(\mathrm {E}\left[ \xi \right]\). Let \(\zeta ^1,\ldots ,\zeta ^N\) be the results of the corresponding observations. They can be considered as a single observation of N independent random variables with the same distribution as \(\xi\). If \(\mathrm {E}\left[ \xi \right]\) and \(\mathrm {Var}\left( \xi \right)\) are both finite, then the Central Limit Theorem (Feller 1971) yields the main formula of the Monte Carlo method
$$\begin{aligned} \mathrm {Pr}\left\{ \left| \frac{1}{N}\cdot \sum \limits _{j=1}^N \zeta ^j - \mathrm {E}\left[ \xi \right] \right| <\frac{\delta _\gamma \cdot \sigma }{\sqrt{N}}\right\} =\gamma . \end{aligned}$$
(3)
Here \(\sigma =\sqrt{\mathrm {Var}\left( \xi \right) }\) is the standard deviation and \(\gamma\) is the confidence level, \(\gamma =2\Phi \left( \delta _\gamma \right) -1\), where \(\Phi \left( \cdot \right)\) is the standard normal cumulative distribution function. It means that under the considered assumptions the value
$$\begin{aligned} \frac{1}{N}\cdot \sum \limits _{j=1}^N \zeta ^j \end{aligned}$$
is a good approximation of \(\mathrm {E}\left[ \xi \right]\), when the number of observations N is large enough.
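The half-width \(\delta _\gamma \cdot \sigma /\sqrt{N}\) in (3) can be computed as in the following sketch. It rests on two assumptions that are not part of the text: the unknown \(\sigma\) is replaced by the sample standard deviation, and \(\delta _\gamma\) is taken as the \((1+\gamma )/2\) quantile of the standard normal distribution (the two-sided convention). The exponential test data and the name `monte_carlo_interval` are illustrative only.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def monte_carlo_interval(observations, gamma=0.95):
    """Return (sample mean, half-width) of the two-sided CLT interval.

    Assumptions: sigma is approximated by the sample standard deviation,
    and delta is the (1 + gamma) / 2 quantile of the standard normal law.
    """
    n = len(observations)
    delta = NormalDist().inv_cdf((1 + gamma) / 2)
    return mean(observations), delta * stdev(observations) / math.sqrt(n)


# Illustrative data: exponential "runtimes" whose true mean is 1.
rng = random.Random(0)
sample = [rng.expovariate(1.0) for _ in range(10_000)]
estimate, half_width = monte_carlo_interval(sample)
```

Because the half-width shrinks as \(1/\sqrt{N}\), quadrupling the number of observations halves the width of the interval at the same confidence level.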
Since A is complete, its runtime on every CNF from \(\Delta _C\left( \tilde{X}\right)\) is finite; as \(\xi _{C,A}(\tilde{X})\) takes only finitely many values, its expected value and variance are finite. Since A is deterministic (i.e. it does not use randomization), the runtime on each subproblem is fixed, so all observed values have the same distribution. Because N can be significantly less than \(2^d\), this estimation can be carried out at a preprocessing stage to assess the effectiveness of the considered partitioning.
The process of estimating the value (2) for a given \(\tilde{X}\) is therefore as follows. We randomly choose N truth assignments of the variables from \(\tilde{X}\)
$$\begin{aligned} \alpha ^1=\left( \alpha _1^1,\ldots ,\alpha _d^1\right) ,\ldots ,\alpha ^N=\left( \alpha _1^N,\ldots ,\alpha _d^N\right) . \end{aligned}$$
(4)
Below we refer to (4) as a random sample. Then we consider the values
$$\begin{aligned} \zeta ^j=\xi _{C,A}\left( \alpha ^j\right) ,\quad j=1,\ldots ,N \end{aligned}$$
and calculate the value
$$\begin{aligned} F_{C,A}\left( \tilde{X}\right) =2^d\cdot \left( \frac{1}{N}\cdot \sum \limits _{j=1}^N \zeta ^j\right) . \end{aligned}$$
(5)
If N is large enough, then the value of \(F_{C,A}\left( \tilde{X}\right)\) can be considered a good approximation of (2). That is why one can search for a decomposition set with the minimal value of \(F_{C,A}\left( \cdot \right)\) instead of a decomposition set with the minimal value of (2). Below we refer to the function \(F_{C,A}\left( \cdot \right)\) as the predictive function.
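The computation of (5) can be sketched as follows. The callable `cost` is a hypothetical stand-in for \(\xi _{C,A}\): in a real experiment it would run the solver A on \(C\left[ \tilde{X}/\alpha \right]\) and return its runtime, whereas here a toy deterministic cost is assumed so that the estimate can be compared with the exact total. The names `predictive_function` and `toy_cost` are illustrative.

```python
import random


def predictive_function(cost, d, n_samples, rng):
    """Estimate F = 2^d * (mean cost over a random sample of assignments).

    `cost(alpha)` is a hypothetical stand-in for xi_{C,A}(alpha), e.g. the
    runtime of the solver A on C[X~/alpha] or a deterministic proxy for it.
    """
    # Random sample (4): independent uniform assignments from {0,1}^d.
    sample = [tuple(rng.randint(0, 1) for _ in range(d)) for _ in range(n_samples)]
    return 2**d * sum(cost(alpha) for alpha in sample) / n_samples


def toy_cost(alpha):
    # Toy assumption: a subproblem gets harder with every zero in alpha.
    return 1 + alpha.count(0)


d = 10
# Exact total over all 2^d assignments, feasible only because d is small.
exact = sum(toy_cost(tuple((i >> k) & 1 for k in range(d))) for i in range(2**d))
estimate = predictive_function(toy_cost, d, n_samples=2000, rng=random.Random(1))
```

In this toy setting the exact total is \(2^{10}\cdot (1+5)=6144\), and the estimate uses 2000 samples in place of all 1024 assignments only to illustrate convergence. In practice the method is of interest when N is much smaller than \(2^d\).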