Kullback–Leibler divergence and the Pareto–Exponential approximation

Recent radar research interests in the Pareto distribution as a model for X-band maritime surveillance radar clutter returns have resulted in analysis of the asymptotic behaviour of this clutter model. In particular, it is of interest to understand when the Pareto distribution is well approximated by an Exponential distribution. The justification for this is that under the latter clutter model assumption, simpler radar detection schemes can be applied. An information theory approach is introduced to investigate the Pareto–Exponential approximation. By analysing the Kullback–Leibler divergence between the two distributions it is possible to not only assess when the approximation is valid, but to determine, for a given Pareto model, the optimal Exponential approximation.

up wind direction, which is generally the most spiky. The wind speed was roughly 9 m/s and the largest wave height was roughly 3 m, so that the sea state was approximately 4. The results of the trial were conclusive evidence that, at a low grazing angle, the Pareto model outperformed the Weibull, Log-Normal and K-distributions. Additionally, the model was compared to mixtures of Weibull and K-distributions, and shown to outperform Weibull mixtures, while having comparable performance to a K-mixture model. Given that the latter is a three to four parameter model, the performance of the two parameter Pareto model was determined to be excellent.
A third validation for the Pareto model has been provided by Defence Science and Technology Group (DSTG) in Australia, based upon data from their Ingara radar. Ingara is an experimental fully polarimetric airborne multi-mode X-band imaging radar developed by DSTG (Stacy and Burgess 1994), which was deployed in a Raytheon Beech 1900C aircraft during a number of trials. A trial was conducted in 2004, in the Southern Ocean near Port Lincoln in South Australia (Stacy et al. 2005). The radar operated with a frequency of 10.1 GHz, with a pulse length of 20 μs, pulse repetition frequency of 300 Hz and LFM transmitted bandwidth of 200 MHz. This permitted a range resolution of 0.75 m. Ingara operated in a circular spotlight mode, surveying the same patch of ocean at all azimuth angles (0°-360°), and over the range of grazing angles 10°-45°. Sea states varied from 2 to 5, while wind speeds varied from 6.1 to 13.2 m/s. The data gathered in this trial were analysed in blocks composed of 1024 range compressed samples of roughly 920 pulses over 5° azimuth angle increments. The Pareto fit to the Ingara clutter was reported initially in Weinberg (2011a), then further analysed in Rosenberg and Bocquet (2013). The inclusion of receiver thermal noise in the Ingara data, together with a Pareto clutter model, has also been reported in Rosenberg and Bocquet (2015). The conclusion from these investigations was that the Pareto distribution also fitted medium to high grazing angle clutter, obtained from an airborne surveillance radar.
These three independent studies confirmed the validity of the Pareto model for X-band maritime surveillance radar clutter, regardless of the radar platform and independent of the grazing angle. Consequently, much effort has been invested in the development of noncoherent detection under a Pareto clutter model assumption (Weinberg 2013a, 2015). The Pareto distribution also fits into the currently accepted framework for clutter models in the complex domain, since it arises as the intensity model of a compound Gaussian distribution with inverse Gamma texture (Weinberg 2011b). As a result of this, coherent radar detection schemes have been analysed extensively, based upon this clutter model assumption (Sangston et al. 2012; Shang and Song 2011; Weinberg 2013b, c). Although the Pareto model has presented radar researchers with a simpler alternative to the Weibull and K-distributions, there is still merit in applying the original detection schemes designed for target detection in Gaussian clutter, or in Exponentially distributed intensity clutter, since in some cases X-band clutter is reasonably approximated by these processes. The validity of such an approximation has been analysed in Weinberg (2012), who investigated the Exponential approximation of a Pareto distribution with Stein's Method. It was shown that relative to DSTG's Ingara radar clutter, in the case of VV-polarisation, the Exponential approximation was valid. This coincided with Pareto fits to the data which resulted in large shape parameters. Stein's Method was used to construct explicit bounds to quantify this observation.
The current paper is concerned with understanding the validity of the Pareto-Exponential approximation, through an analysis of the Kullback-Leibler divergence. This will be shown to not only provide a simpler estimate of the distributional difference, but also will indicate how an optimal Exponential distribution can be selected for any given Pareto model. Numerical comparisons are used to demonstrate the validity of the approach.

Pareto and Exponential distributions
Before proceeding with the analysis of the Kullback-Leibler divergence, a brief overview of the relevant distributions is undertaken. A useful reference which contains details of these distributions is Beaumont (1980). A random variable X has a Pareto distribution with shape and scale parameters α > 0 and β > 0 respectively if its probability density function is
$$ f_X(t) = \frac{\alpha \beta^{\alpha}}{(t+\beta)^{\alpha+1}}, \tag{1} $$
and its cumulative distribution function is
$$ F_X(t) = P(X \le t) = 1 - \left(\frac{\beta}{t+\beta}\right)^{\alpha}, \tag{2} $$
where t ≥ 0 and P denotes probability. Similarly, a random variable Y with shape parameter λ > 0 has an Exponential distribution if its density is given by
$$ f_Y(t) = \lambda e^{-\lambda t}, \tag{3} $$
and its cumulative distribution function by
$$ F_Y(t) = P(Y \le t) = 1 - e^{-\lambda t}, \tag{4} $$
also for t ≥ 0. One of the fundamental differences between these distributions is the existence of moments. For the Pareto distribution, the existence of moments depends on the magnitude of its shape parameter (the kth moment is finite only when α > k), while for the Exponential distribution all moments exist (Beaumont 1980). The problem of interest is to understand when (4) is a reasonable approximation for (2). Since, on the basis of empirical studies such as Weinberg (2011a), this will happen as the Pareto shape parameter increases, it will be assumed that α ≫ 1 throughout without loss of generality.
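The two models can be written down directly as a minimal Python sketch. It assumes the Pareto density αβ^α/(t + β)^(α+1) and the Exponential density λe^(−λt) on t ≥ 0; the function names and the symbol `lam` for the Exponential parameter are illustrative choices, not notation taken from the paper.

```python
import math


def pareto_pdf(t, alpha, beta):
    """Assumed Pareto density alpha * beta^alpha / (t + beta)^(alpha + 1), t >= 0."""
    if t < 0:
        return 0.0
    return alpha * beta**alpha / (t + beta) ** (alpha + 1)


def pareto_cdf(t, alpha, beta):
    """Pareto distribution function 1 - (beta / (t + beta))^alpha, t >= 0."""
    if t < 0:
        return 0.0
    return 1.0 - (beta / (t + beta)) ** alpha


def exponential_pdf(t, lam):
    """Exponential density lam * exp(-lam * t), t >= 0."""
    if t < 0:
        return 0.0
    return lam * math.exp(-lam * t)


def exponential_cdf(t, lam):
    """Exponential distribution function 1 - exp(-lam * t), t >= 0."""
    if t < 0:
        return 0.0
    return 1.0 - math.exp(-lam * t)


def pareto_mean(alpha, beta):
    """Pareto mean beta / (alpha - 1); it is infinite unless alpha > 1."""
    return beta / (alpha - 1) if alpha > 1 else math.inf
```

The `pareto_mean` helper makes the moment restriction concrete: the Exponential mean 1/λ is always finite, whereas the Pareto mean is finite only when the shape parameter exceeds one.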
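The Exponential distribution arises as a limit of the Pareto as the latter's shape parameter increases. This can be seen through the reparameterisation β = α(1 + o_α(1)), where o_α(1) → 0 as α → ∞. Applying this to the complementary distribution function of X yields
$$ P(X > t) = \left(1 + \frac{t}{\beta}\right)^{-\alpha} = \left(1 + \frac{t}{\alpha(1 + o_\alpha(1))}\right)^{-\alpha} \longrightarrow e^{-t} \quad \text{as } \alpha \to \infty, \tag{5} $$
from which it can be concluded that the distribution function of X limits to that of Y, with unit Exponential parameter, as the Pareto shape parameter increases without bound.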
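The limit can be checked numerically. The sketch below assumes the simplest admissible reparameterisation, β = α exactly (so the o(1) term is zero), and measures the largest gap between the Pareto tail probability (1 + t/β)^(−α) and the unit Exponential tail e^(−t) on a finite grid; the grid range and the chosen values of α are arbitrary illustrations.

```python
import math


def pareto_survival(t, alpha, beta):
    """Pareto tail probability P(X > t) = (1 + t / beta)^(-alpha)."""
    return (1.0 + t / beta) ** (-alpha)


def max_gap_to_unit_exponential(alpha, t_max=10.0, steps=1000):
    """Largest |P(X > t) - exp(-t)| on a grid over [0, t_max], with beta = alpha."""
    gap = 0.0
    for i in range(steps + 1):
        t = t_max * i / steps
        gap = max(gap, abs(pareto_survival(t, alpha, alpha) - math.exp(-t)))
    return gap


# Illustrative shape parameters; the gap should shrink as alpha grows.
gaps = {alpha: max_gap_to_unit_exponential(alpha) for alpha in (5.0, 50.0, 500.0)}
for alpha, gap in gaps.items():
    print(f"alpha = {alpha:6.1f}  max gap = {gap:.5f}")
```

The gap shrinks roughly in proportion to 1/α, which is the behaviour the explicit bounds discussed next are designed to capture.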
The limit (5) can be quantified by establishing bounds on the distributional differences. Towards this aim, Stein's Method (Barbour and Chen 2005) can be used to measure the rate of convergence of (2) to a limiting distribution of the form (4). This method starts with a differential equation characterising the Exponential distribution, and bounds on this are then used to measure the rate of convergence. In particular, it is shown in Weinberg (2012) that the two distributions above satisfy the inequality (6), which shows that the rate of convergence is controlled by the Pareto shape parameter. As α increases, the bounds in (6) decrease to zero, implying that the Exponential approximation to the Pareto model is valid for large shape parameters.
The problem with the Stein approach is that the bounds do not suggest a suitable way in which, for a given Pareto model, an appropriate approximating Exponential distribution can be specified. This can be rectified with an application of the Kullback-Leibler divergence as an alternative to analysing distributional approximations.

Information theory
Information theory is concerned with the study of entropy as a measure of uncertainty. It was introduced into the engineering community by Shannon (1948), and has had a profound effect on the understanding and optimisation of data networks (Arndt 2004). In particular, the Kullback-Leibler divergence, introduced in Kullback and Leibler (1951), has found application in signal processing analysis and statistical model fitting (Hulle 2005; Seghouane 2006; Youssef et al. 2016; Wenling and Yingmin 2016).
The Kullback-Leibler divergence is a measure of the information lost when one distribution is approximated by another. Hence, for two random variables X and Y with densities f_X and f_Y, the information lost when Y is used to approximate X is defined to be
$$ D_{KL}(X\|Y) = \int_0^\infty f_X(t) \log\left(\frac{f_X(t)}{f_Y(t)}\right) dt, \tag{7} $$
where it has been assumed that these two random variables have support the nonnegative real line. It can be shown that (7) is the difference between the cross entropy of X and Y, and the entropy of X (Arndt 2004). Since (7) measures the information lost in the approximation of X by Y, it can be used to assess the convergence of these distributions.
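As a rough illustration of the definition, the divergence can be estimated by direct quadrature. The sketch below is a plain trapezoidal rule over a truncated range, assuming the Pareto density αβ^α/(t + β)^(α+1) and the Exponential density λe^(−λt); the truncation point, step count and parameter values are arbitrary choices, and the truncation is only adequate when α is well above 1 so that the neglected tail is negligible.

```python
import math


def log_pareto_pdf(t, alpha, beta):
    """Log of the assumed Pareto density alpha * beta^alpha / (t + beta)^(alpha + 1)."""
    return math.log(alpha) + alpha * math.log(beta) - (alpha + 1) * math.log(t + beta)


def log_exponential_pdf(t, lam):
    """Log of the Exponential density lam * exp(-lam * t)."""
    return math.log(lam) - lam * t


def kl_pareto_exponential_numeric(alpha, beta, lam, upper=100.0, steps=100_000):
    """Trapezoidal estimate of the integral of f_X * log(f_X / f_Y) over [0, upper]."""
    h = upper / steps
    total = 0.0
    for i in range(steps + 1):
        t = i * h
        log_fx = log_pareto_pdf(t, alpha, beta)
        # Work with log densities so the Exponential term cannot underflow.
        integrand = math.exp(log_fx) * (log_fx - log_exponential_pdf(t, lam))
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * integrand
    return total * h


# Illustrative parameters: the same Pareto model against two Exponential candidates.
print(kl_pareto_exponential_numeric(alpha=5.0, beta=1.0, lam=4.0))
print(kl_pareto_exponential_numeric(alpha=5.0, beta=1.0, lam=1.0))
```

Comparing the two printed values shows how strongly the information lost depends on which Exponential distribution is used as the approximation.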
It can be shown that D_KL(X||Y) ≥ 0, a result known as Gibbs' Inequality, which follows from Jensen's Inequality (Arndt 2004). If the two random variables X and Y coincide in distribution then D_KL(X||Y) = 0, and the converse can also be demonstrated to be true. However, it is clear from (7) that the Kullback-Leibler divergence is not symmetric, nor does it satisfy a triangle inequality. Consequently it is neither a metric nor a pseudo-metric, since both require symmetry and the triangle inequality; it is a divergence in the statistical sense. Its value in assessing convergence in distribution follows from the Pinsker-Csiszár Inequality (Pinsker 1964; Csiszár 1967; Kullback 1967). Suppose for the two random variables X and Y their distribution functions are F_X(t) and F_Y(t) respectively, with support the nonnegative real line. Then this inequality, referred to as (8), bounds ‖F_X − F_Y‖_∞ above by a constant multiple of the square root of D_KL(X||Y), where the norm on the left hand side of (8) is the supremum norm over the domain of the distribution functions. Clearly if the Kullback-Leibler divergence is close to zero, the supremum norm inherits this, which implies that the random variables X and Y are close in distribution. Also based upon (8), if a sequence of random variables X_n is such that lim_{n→∞} D_KL(X_n||Y) = 0, for some random variable Y, then the limiting distribution of X_n and the distribution of Y coincide, which can be justified with an application of Lebesgue's Dominated Convergence Theorem.
These results justify using the Kullback-Leibler divergence to measure distributional approximations. It is worth noting that although the triangle inequality is not achievable with this divergence, it is possible to construct a measure which is symmetric. This can be produced by symmetrising (7) over its two arguments, giving the distance (9), which has been utilised in Seghouane (2006). However, as will be shown in the next section, it is sufficient to apply (7) to the problem under investigation.

Kullback-Leibler divergence
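This section calculates the Kullback-Leibler divergence (7) for the two statistical models of interest. With an application of (1) and (3), observe that
$$ \frac{f_X(t)}{f_Y(t)} = \frac{\alpha \beta^{\alpha}}{\lambda} \, \frac{e^{\lambda t}}{(t+\beta)^{\alpha+1}}, \tag{10} $$
from which it follows, by applying logarithms to (10) and substituting the result into (7), that
$$ D_{KL}(X\|Y) = \log\left(\frac{\alpha \beta^{\alpha}}{\lambda}\right) - (\alpha+1)\, E[\log(X+\beta)] + \lambda\, E[X], \tag{11} $$
where E is the statistical mean with respect to the distribution of X, and the fact that the density of X integrates to unity has been applied.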
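The mean of X can be shown to be
$$ E[X] = \frac{\beta}{\alpha-1}, \tag{12} $$
with the proviso that α > 1, while the mean of log(X + β) is given by
$$ E[\log(X+\beta)] = \alpha \beta^{\alpha} \int_0^\infty \frac{\log(t+\beta)}{(t+\beta)^{\alpha+1}}\, dt. \tag{13} $$
By applying the transformation u = log(t + β), followed by integration by parts, it can be shown that (13) reduces to
$$ E[\log(X+\beta)] = \log(\beta) + \frac{1}{\alpha}. \tag{14} $$
An application of (12) and (14) to (11) gives
$$ D_{KL}(X\|Y) = \log\left(\frac{\alpha}{\beta\lambda}\right) + \frac{\lambda\beta}{\alpha-1} - \frac{\alpha+1}{\alpha}. \tag{15} $$

Finally, as shown in the right subplot of Fig. 2, for β = 10 we require α ≈ 30 and λ ≈ 3. To understand this mathematically, differentiating (15) with respect to λ yields
$$ \frac{\partial D_{KL}(X\|Y)}{\partial \lambda} = \frac{\beta}{\alpha-1} - \frac{1}{\lambda}, \tag{16} $$
which is zero when λ = (α − 1)/β, where it is necessary to assume α > 1. Differentiating (16) once more gives 1/λ² > 0, so this stationary point is a minimum. This explains the phenomenon observed in these plots.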
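The minimisation can be illustrated with a short Python sketch. It assumes the closed form that the stated means imply, namely D_KL(X||Y) = log(α/(βλ)) + λβ/(α − 1) − (α + 1)/α, with `lam` standing for the Exponential parameter; the function names are illustrative and the parameter values are those quoted for the right subplot of Fig. 2.

```python
import math


def kl_pareto_exponential(alpha, beta, lam):
    """Assumed closed form: log(alpha/(beta*lam)) + lam*beta/(alpha-1) - (alpha+1)/alpha."""
    if alpha <= 1:
        raise ValueError("the closed form needs alpha > 1 (finite Pareto mean)")
    return math.log(alpha / (beta * lam)) + lam * beta / (alpha - 1) - (alpha + 1) / alpha


def optimal_rate(alpha, beta):
    """Exponential parameter at which the derivative with respect to lam vanishes."""
    if alpha <= 1:
        raise ValueError("the minimiser needs alpha > 1")
    return (alpha - 1) / beta


alpha, beta = 30.0, 10.0
lam_star = optimal_rate(alpha, beta)
d_star = kl_pareto_exponential(alpha, beta, lam_star)
print(f"optimal lam = {lam_star:.3f}, minimum divergence = {d_star:.6f}")

# Moving away from lam_star in either direction should increase the divergence.
for lam in (0.5 * lam_star, lam_star, 2.0 * lam_star):
    print(f"lam = {lam:.3f}  divergence = {kl_pareto_exponential(alpha, beta, lam):.6f}")
```

For β = 10 and α = 30 the minimiser is (α − 1)/β = 2.9, in line with the value λ ≈ 3 read from the plot. Substituting the minimiser back into the closed form leaves log(α/(α − 1)) − 1/α, which no longer depends on β.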
In order to investigate these results further, Figs. 3 and 4 plot a series of Pareto distributions, together with the optimal Exponential approximation. Here optimal is used in the sense that the Kullback-Leibler divergence is minimised with an appropriate selection of the Exponential distribution shape parameter. Figure 3 (left subplot) is for the case where the Pareto scale parameter is β = 0.1, with the shape parameter taking the values 5, 15 and 30. It can be observed that as the Pareto shape parameter increases, the optimal Exponential distribution is a better fit. This is consistent with the results illustrated in Fig. 1.

With an application of some simple analysis one can show that the upper bound based upon (19) improves on that from (6) whenever α² − (9/7)α > 0. This occurs when α > 9/7, and since in most cases α > 2, as shown in Weinberg (2011a), it follows that the upper bound attained by the Kullback-Leibler divergence is smaller than that obtained with Stein's Method.
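The improvement of the fit with increasing shape parameter can also be seen without plots. The following sketch assumes the optimal Exponential parameter is (α − 1)/β and reports the largest gap between the two distribution functions on a finite grid, for β = 0.1 and the three shape parameters used in Fig. 3; the grid extent and resolution are arbitrary choices.

```python
import math


def cdf_gap(alpha, beta, means_covered=20.0, steps=20_000):
    """Largest |F_X(t) - F_Y(t)| on a grid, with Y the assumed optimal Exponential."""
    lam = (alpha - 1) / beta          # assumed KL-optimal Exponential parameter
    t_max = means_covered / lam       # grid spans this many Exponential means
    gap = 0.0
    for i in range(steps + 1):
        t = t_max * i / steps
        pareto_tail = (beta / (t + beta)) ** alpha
        exponential_tail = math.exp(-lam * t)
        # |F_X - F_Y| equals the absolute difference of the tail probabilities.
        gap = max(gap, abs(pareto_tail - exponential_tail))
    return gap


beta = 0.1
gaps = {alpha: cdf_gap(alpha, beta) for alpha in (5.0, 15.0, 30.0)}
for alpha, gap in gaps.items():
    print(f"alpha = {alpha:4.1f}  max CDF gap = {gap:.4f}")
```

Because the gap depends on t only through t/β once the optimal parameter is used, other values of β give the same figures; only the shape parameter controls the quality of the fit.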
To illustrate the differences between the upper bounds, Fig. 5 (left subplot) plots the two upper bounds as a function of α. It can be observed that the upper bound (19) is better than that based upon (6).
Using a similar analysis it can be shown that the Stein lower bound, namely −1/(α − 1), tends to be closer to zero than that obtained by the Kullback-Leibler divergence, as illustrated in the right subplot of Fig. 5.
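To give a feel for the size of a divergence-based upper bound, the sketch below combines two assumptions that are not taken verbatim from the paper: the standard form of Pinsker's inequality, sup|F_X − F_Y| ≤ sqrt(D_KL/2), and the minimum divergence log(α/(α − 1)) − 1/α obtained at the Exponential parameter (α − 1)/β. The constant in the paper's bound (19) may differ, so the numbers are indicative only.

```python
import math


def min_divergence(alpha):
    """Assumed minimum divergence log(alpha / (alpha - 1)) - 1 / alpha, for alpha > 1."""
    if alpha <= 1:
        raise ValueError("alpha must exceed 1")
    # log1p keeps precision when alpha is large and the two terms nearly cancel.
    return math.log1p(1.0 / (alpha - 1.0)) - 1.0 / alpha


def pinsker_upper_bound(alpha):
    """Standard Pinsker bound sqrt(D / 2) on the supremum distance (an assumption here)."""
    return math.sqrt(0.5 * min_divergence(alpha))


for alpha in (2.0, 5.0, 15.0, 30.0):
    bound = pinsker_upper_bound(alpha)
    print(f"alpha = {alpha:4.1f}  bound = {bound:.5f}  1/(2 alpha) = {0.5 / alpha:.5f}")
```

Under these assumptions the bound decays like 1/(2α) for large shape parameters, which is consistent with the approximation becoming acceptable once α is of the order of 30.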

Conclusions
The Kullback-Leibler divergence was used to assess the discrepancy between the Pareto and Exponential distributions, in order to better understand the validity of the Exponential approximation of the Pareto model. It was shown that for any given Pareto model an optimal Exponential approximation exists. This approximation was shown to improve as the Pareto shape parameter increased, for any fixed Pareto scale parameter. This means that, for X-band maritime surveillance radar, in cases where the Pareto shape parameter exceeds 30 it is acceptable to apply detection schemes based upon an Exponential clutter model assumption.