This section calculates the Kullback–Leibler divergence (7) for the two statistical models of interest. With an application of (1) and (3), observe that

$$\begin{aligned} \frac{f_X(t)}{f_Y(t)} = \frac{\alpha \beta ^\alpha }{\lambda } e^{\lambda t} (t+\beta )^{-\alpha -1}, \end{aligned}$$

(10)

from which it follows, by taking the logarithm of (10) and substituting the result into (7), that

$$\begin{aligned} D_{KL}(X || Y) = \log \left( \frac{\alpha \beta ^\alpha }{\lambda }\right) + \lambda \mathbb {E}(X) - (\alpha +1) \mathbb {E}(\log (X+\beta )), \end{aligned}$$

(11)

where \(\mathbb {E}\) denotes the expectation with respect to the distribution of *X*, and the fact that the density of *X* integrates to unity has been used.

The mean of *X* can be shown to be

$$\begin{aligned} \mathbb {E}(X) = \frac{\beta }{\alpha -1}, \end{aligned}$$

(12)

with the proviso that \(\alpha > 1\), while the mean of \(\log (X+\beta )\) is given by

$$\begin{aligned} \mathbb {E}(\log (X+\beta )) = \int _0^\infty \log (t+\beta ) \frac{\alpha \beta ^\alpha }{(t+\beta )^{\alpha +1}}dt. \end{aligned}$$

(13)

Applying the substitution \(u = \log (t+\beta )\), followed by integration by parts, shows that (13) reduces to

$$\begin{aligned} \mathbb {E}(\log (X+\beta )) = \log (\beta ) + \frac{1}{\alpha }. \end{aligned}$$

(14)
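Both moments can be verified numerically. The following Python sketch (assuming the Pareto density \(f_X(t) = \alpha \beta ^\alpha (t+\beta )^{-\alpha -1}\) for \(t \ge 0\) implied by (10), with illustrative parameter values) integrates against this density using composite Simpson's rule:

```python
import math

alpha, beta = 5.0, 0.5

def f_X(t):
    # Pareto density with shape alpha and scale beta.
    return alpha * beta**alpha * (t + beta) ** (-alpha - 1)

def simpson(f, a, b, n):
    # Composite Simpson's rule with n (even) subintervals.
    h = (b - a) / n
    s = f(a) + f(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * f(a + i * h)
    return s * h / 3

# The integrands decay like t**(-alpha), so [0, 100] captures essentially all the mass.
mean_X = simpson(lambda t: t * f_X(t), 0.0, 100.0, 200_000)
mean_log = simpson(lambda t: math.log(t + beta) * f_X(t), 0.0, 100.0, 200_000)

print(mean_X, beta / (alpha - 1))             # (12): both approximately 0.125
print(mean_log, math.log(beta) + 1 / alpha)   # (14): both approximately -0.4931
```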

Substituting (12) and (14) into (11) shows that the Kullback–Leibler divergence reduces to

$$\begin{aligned} D_{KL}(X||Y) = \log \left( \frac{\alpha }{\lambda \beta }\right) + \frac{\lambda \beta }{\alpha -1} - \left( \frac{\alpha +1}{\alpha }\right) . \end{aligned}$$

(15)
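The closed form (15) can be checked against a direct numerical evaluation of (7). The sketch below (assuming the Exponential density \(f_Y(t) = \lambda e^{-\lambda t}\) implied by (10), with illustrative parameter values) agrees with (15) to high accuracy:

```python
import math

def kl_closed(alpha, beta, lam):
    # Closed-form Kullback-Leibler divergence (15).
    return math.log(alpha / (lam * beta)) + lam * beta / (alpha - 1) - (alpha + 1) / alpha

def kl_numeric(alpha, beta, lam, upper=50.0, n=200_000):
    # Direct Simpson integration of f_X * log(f_X / f_Y) over [0, upper];
    # the truncated tail is negligible for these parameter values.
    def integrand(t):
        f_x = alpha * beta**alpha * (t + beta) ** (-alpha - 1)
        f_y = lam * math.exp(-lam * t)
        return f_x * math.log(f_x / f_y)
    h = upper / n
    s = integrand(0.0) + integrand(upper)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * integrand(i * h)
    return s * h / 3

print(kl_closed(5, 0.5, 2.0))   # approximately 0.6594
print(kl_numeric(5, 0.5, 2.0))  # agrees with the closed form
```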

Figures 1 and 2 plot the Kullback–Leibler divergence (15) as a function of \(\lambda\), for a series of Pareto shape and scale parameters. Each figure shows curves for a specified \(\beta\), with \(\alpha \in \{5, 10, 15, 20, 25, 30\}\). Figure 1 covers the cases \(\beta = 0.1\) (left subplot) and \(\beta = 0.5\) (right subplot), while Fig. 2 covers \(\beta = 0.95\) (left subplot) and \(\beta = 10\) (right subplot). These figures show a common structure in the Kullback–Leibler divergence: for each \(\alpha\) and \(\beta\) there exists a \(\lambda\) which minimises (15). It is also interesting to observe the effect \(\beta\) has on the divergence. For a target upper bound of approximately \(10^{-3}\) on the Kullback–Leibler divergence, Fig. 1 (left subplot) shows that for \(\alpha \approx 30\) one must select \(\lambda \approx 300\). For \(\beta = 0.5\), Fig. 1 (right subplot) suggests \(\alpha \approx 30\) and \(\lambda \approx 50\); for \(\beta = 0.95\), Fig. 2 (left subplot) suggests \(\alpha \approx 30\) and \(\lambda \approx 30\); and for \(\beta = 10\), the right subplot of Fig. 2 indicates that \(\alpha \approx 30\) and \(\lambda \approx 3\) are required.

To understand this mathematically, differentiating (15) with respect to \(\lambda\) yields

$$\begin{aligned} \frac{\partial D_{KL}(X||Y)}{\partial \lambda } = -\frac{1}{\lambda } + \frac{\beta }{\alpha -1}, \end{aligned}$$

(16)

which is zero when \(\lambda = \frac{\alpha -1}{\beta }\), where it is necessary to assume \(\alpha > 1\). Differentiating (16) a second time yields \(\lambda ^{-2} > 0\), confirming that this stationary point is a minimum. This explains the phenomenon observed in these plots.
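This stationary point can also be confirmed numerically. The sketch below (using the closed form (15); the parameter pairs are illustrative) evaluates the divergence on a grid of \(\lambda\) values around \(\lambda = (\alpha -1)/\beta\) and checks that the stationary point attains the smallest value:

```python
import math

def kl(alpha, beta, lam):
    # Closed-form Kullback-Leibler divergence (15).
    return math.log(alpha / (lam * beta)) + lam * beta / (alpha - 1) - (alpha + 1) / alpha

for alpha, beta in [(5, 0.1), (10, 0.5), (30, 0.95)]:
    lam_star = (alpha - 1) / beta  # stationary point of (16)
    # Sample lambda on a grid from 0.5*lam_star to 1.5*lam_star, excluding lam_star itself.
    grid = [lam_star * (1 + k / 100) for k in range(-50, 51) if k != 0]
    assert all(kl(alpha, beta, lam) > kl(alpha, beta, lam_star) for lam in grid)
```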

In order to investigate these results further, Figs. 3 and 4 plot a series of Pareto distributions, together with the optimal Exponential approximation. Here optimal is used in the sense that the Kullback–Leibler divergence is minimised through an appropriate selection of the Exponential distribution rate parameter. Figure 3 (left subplot) is for the case where the Pareto scale parameter is \(\beta = 0.1\), with the shape parameter taking the values 5, 15 and 30. It can be observed that as the Pareto shape parameter increases, the optimal Exponential distribution becomes a better fit. This is consistent with the results illustrated in Fig. 1. Figure 3 (right subplot) is for the case where \(\beta = 0.5\), Fig. 4 (left subplot) corresponds to \(\beta = 0.95\) and Fig. 4 (right subplot) is for \(\beta = 10\). Observe in all figures that for \(\alpha = 5\) the approximation is poor, while for \(\alpha = 15\) it has improved significantly. When \(\alpha = 30\) it is very difficult to see a difference between the two distributions.

Returning to the analysis of the minimum achievable divergence, substituting \(\lambda = \frac{\alpha -1}{\beta }\) into (15) shows that the minimum divergence is

$$\begin{aligned} D_{KL}(X||Y)_{\min } = \log \left( 1 + \frac{1}{\alpha -1}\right) - \frac{1}{\alpha }. \end{aligned}$$

(17)

Observe that (17) depends on \(\alpha\) alone. Since \(\log (1+x) \le x\) for any \(x>0\), applying this bound with \(x = \frac{1}{\alpha -1}\) to (17) yields the upper bound

$$\begin{aligned} D_{KL}(X||Y)_{\min } \le \frac{1}{\alpha (\alpha -1)}. \end{aligned}$$

(18)
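Both (17) and (18) are straightforward to verify numerically, and doing so also illustrates that the minimum divergence does not depend on \(\beta\). A short Python check (using the closed form (15); the parameter values are illustrative):

```python
import math

def kl(alpha, beta, lam):
    # Closed-form Kullback-Leibler divergence (15).
    return math.log(alpha / (lam * beta)) + lam * beta / (alpha - 1) - (alpha + 1) / alpha

for alpha in [2, 5, 10, 30]:
    for beta in [0.1, 0.5, 10]:
        d_min = kl(alpha, beta, (alpha - 1) / beta)         # minimum over lambda
        closed = math.log(1 + 1 / (alpha - 1)) - 1 / alpha  # expression (17)
        assert abs(d_min - closed) < 1e-12                   # (17) holds, independent of beta
        assert closed <= 1 / (alpha * (alpha - 1))           # upper bound (18)
```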

Applying (18) to (8) results in

$$\begin{aligned} \Vert F_X - F_Y\Vert _{\infty } \le \sqrt{\frac{2}{\alpha (\alpha -1)}}. \end{aligned}$$

(19)

One can compare the upper bound provided by (19) to that obtained via Stein’s Method, given by the upper bound \(\frac{3}{\alpha }\) in (6). Squaring both bounds shows that (19) improves on (6) whenever \(\frac{2}{\alpha (\alpha -1)} < \frac{9}{\alpha ^2}\), which rearranges to \(\alpha ^2 - \frac{9}{7}\alpha > 0\). This occurs when \(\alpha > \frac{9}{7}\), and since in most cases \(\alpha > 2\), as shown in Weinberg (2011a), it follows that the upper bound attained by the Kullback–Leibler divergence is smaller than that obtained with Stein’s Method.
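The comparison, including the crossover at \(\alpha = \frac{9}{7}\), can be sketched as follows (the sampled values of \(\alpha\) are illustrative):

```python
import math

def kl_bound(alpha):
    # Kolmogorov-distance bound (19), derived from the Kullback-Leibler divergence.
    return math.sqrt(2 / (alpha * (alpha - 1)))

def stein_bound(alpha):
    # Upper bound 3/alpha obtained via Stein's Method in (6).
    return 3 / alpha

# Above the crossover alpha = 9/7 the KL-based bound is tighter ...
for alpha in [1.5, 2, 5, 10, 30]:
    assert kl_bound(alpha) < stein_bound(alpha)

# ... while just below it the ordering reverses.
assert kl_bound(1.2) > stein_bound(1.2)
```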

To illustrate the differences between the upper bounds, Fig. 5 (left subplot) plots the two upper bounds as a function of \(\alpha\). It can be observed that the upper bound (19) is better than that based upon (6).

Using a similar analysis it can be shown that the Stein lower bound, namely \(-\frac{1}{\alpha -1}\), tends to be closer to zero than that obtained by the Kullback–Leibler divergence, as illustrated in the right subplot of Fig. 5.