Improving GPU-accelerated adaptive IDW interpolation algorithm using fast kNN search

Mei, Gang; Xu, Nengxiong; Xu, Liangliang

doi:10.1186/s40064-016-3035-2

Research
Open access
Published: 22 August 2016

SpringerPlus volume 5, Article number: 1389 (2016)

Abstract

This paper presents an efficient parallel Adaptive Inverse Distance Weighting (AIDW) interpolation algorithm for modern Graphics Processing Units (GPUs). The presented algorithm improves our previous GPU-accelerated AIDW algorithm by adopting a fast k-nearest neighbors (kNN) search. AIDW needs to find several nearest neighboring data points for each interpolated point in order to determine the power parameter adaptively; the desired prediction value of the interpolated point is then obtained by weighted interpolating using that power parameter. In this work, we develop a fast kNN search approach based on a space-partitioning data structure, the even grid, to improve the previous GPU-accelerated AIDW algorithm. The improved algorithm is composed of two stages: kNN search and weighted interpolating. To evaluate the performance of the improved algorithm, we perform five groups of experimental tests. The experimental results indicate that: (1) the improved algorithm can achieve a speedup of up to 1017 over the corresponding serial algorithm; (2) the improved algorithm is at least two times faster than our previous GPU-accelerated AIDW algorithm; and (3) the use of fast kNN search can significantly improve the computational efficiency of the entire GPU-accelerated AIDW algorithm.

Introduction

Spatial interpolation is a fundamental tool in Geographic Information Systems (GIS). The most frequently used spatial interpolation algorithms include Inverse Distance Weighting (IDW) (Shepard 1968), Kriging (Krige 1951), Discrete Smoothing Interpolation (DSI) (Mallet 1989, 1992), nearest neighbors, etc.; see the comparative survey by Falivene et al. (2010). When these interpolation algorithms are applied to large-scale datasets, the computational cost is in general too high (Huang and Yang 2011). A common and effective solution to this problem is to perform the interpolation in parallel. A number of research efforts have been conducted to parallelize spatial interpolation algorithms on various parallel computing platforms (Shi and Ye 2013).

For example, in order to speed up the Kriging interpolation method, Pesquer et al. (2011) designed an effective solution for parallelizing ordinary Kriging by exploiting MPI (Message Passing Interface) libraries in a High Performance Computing environment, and significantly improved the computational efficiency of the entire process. Similarly, Strzelczyk and Porzycka (2012) presented a new parallel Kriging algorithm to deal with unevenly spaced data. Cheng (2013) proposed an efficient parallel scheme to accelerate the universal Kriging algorithm on the NVIDIA CUDA platform by optimizing the compute-intensive steps of the algorithm, such as matrix–vector multiplication and matrix–matrix multiplication, and achieved a speedup of nearly 18 over the serial program.

Allombert et al. (2014) introduced an efficient out-of-core algorithm that fully benefited from graphics cards acceleration on a desktop computer, and found that it was able to speed up Kriging on the GPU with data four times larger than a classical in-core GPU algorithm, with a limited loss of performances.

To improve the computational efficiency of the most time-consuming steps in ordinary Kriging, i.e., the calculation of the weights and then the prediction of each unknown point, Ravé et al. (2014) investigated strategies for reducing the computational cost by parallelizing suitable operations involved in those steps using general-purpose computing on GPUs and CUDA.

Hu and Shu (2015) proposed an improved coarse-grained parallel algorithm to accelerate ordinary Kriging interpolation in a homogeneously distributed memory system using the MPI model and achieved speedups of up to 20.8. Wei et al. (2015) proposed an algorithm based on the k-d tree method to partition a big dataset into workload-balanced child data groups, and achieved high efficiency when the datasets were divided into an optimal number of child data groups.

The IDW interpolation algorithm has also been parallelized on various platforms. For example, taking a hybrid IDW algorithm for generating DEMs from LiDAR point clouds as an example, Guan and Wu (2010) designed and implemented a parallel algorithm on multi-core platforms that handles about one billion LiDAR points in approximately 12 min. Huraj et al. (2010a, b) accelerated the IDW method on the GPU for predicting the snow cover depth at a desired point.

Xia et al. (2010, 2011) attempted to map the IDW interpolation to the GPU for parallelization and proposed a GPU-based framework for geospatial analysis, which gave rise to a high computational throughput. Huang et al. (2011) explored the implementation of a parallel IDW interpolation algorithm in a Linux cluster-based parallel GIS. Li et al. (2014) developed an IDW interpolation application that uses the Java Virtual Machine (JVM) for its multi-threading functionality.

Mei (2014) developed two GPU implementations (i.e., the tiled version and the CDP version) of the standard IDW interpolation algorithm by utilizing the shared memory and the feature of CUDA Dynamic Parallelism, and found that the tiled version is about 120 and 670 times faster than the CPU version when the power parameter was specified to 2 and 3.0, respectively. Mei and Tian (2016) also evaluated the impact of several data layouts on the efficiency of GPU-accelerated IDW interpolation.

Other efforts have also been carried out to parallelize other interpolation algorithms. For example, Wang et al. (2010) presented a computing scheme to speed up the Projection-Onto-Convex-Sets (POCS) interpolation for 3D irregular seismic data with GPUs. Guan et al. (2011) developed a parallel fast Fourier transform (FFT) based geostatistical areal interpolation algorithm in a homogeneously distributed memory system using the MPI programming model. Huang et al. (2012) employed the k-d tree in nearest neighbors search to accelerate grid interpolation on the GPU. Cuomo et al. (2013) proposed a parallel method based on radial basis functions for surface reconstruction on the GPU.

The Adaptive IDW (AIDW), originally proposed by Lu and Wong (2008), is an improved version of the standard IDW (Shepard 1968). The AIDW calculates the power parameter adaptively according to the spatial distribution pattern of the data points, while in the standard IDW the power parameter is a user-specified constant value. Due to the adaptive determination of the power parameter, the AIDW method can achieve much more accurate prediction results than the standard IDW.

In our previous work (Mei et al. 2015), we designed and implemented a parallel AIDW algorithm on a GPU. We also evaluated the performance of the parallel AIDW method by comparing its efficiency with that of the corresponding serial one, and observed that our GPU-accelerated AIDW algorithm can achieve speedups of up to 400 for one million data points and interpolated points in single precision.

In our previous GPU implementations of the parallel AIDW method, we found that the most computationally intensive step is the k nearest neighbors (kNN) search for each interpolated point. We designed a straightforward method to find the k nearest neighboring data points for each interpolated point within a single thread. Although the GPU implementation using our straightforward kNN search approach achieves satisfactory computational efficiency (for example, the obtained speedups are about 100–400 in single precision), further performance improvement can probably be achieved by optimizing the kNN search.

The task of the kNN search is to find the nearest neighbors to an input query. Previous research efforts conducted on the kNN search are mainly implemented and optimized on the CPU (Sankaranarayanan et al. 2007). Recently, GPU-accelerated implementations have improved performance by utilizing the massively parallel architecture of a single GPU (Garcia et al. 2008; Leite et al. 2012; Pan and Manocha 2012; Liang et al. 2009; Huang and Yang 2011; Beliakov and Li 2012; Komarov et al. 2014; Liu and Wei 2015), multi-GPUs (Kato and Hosino 2012; Arefin et al. 2012), and GPU clusters (Dashti et al. 2013). Among those GPU-accelerated kNN search algorithms, most of them attempt to speed up the brute-force kNN search algorithm; and several of them are designed and optimized using space partitioning data structures such as grid (Leite et al. 2012), RP-tree (Pan and Manocha 2012), VP-tree (Liu and Wei 2015), and k-d tree (Beliakov and Li 2012).

In this paper, we attempt to improve the efficiency of our previous GPU-accelerated AIDW algorithm by adopting a more efficient kNN search approach. The efficient kNN search is performed in a separate stage with the use of a grid data structure. The resulting values of the kNN search are the distances from each interpolated point to its k nearest neighboring data points. Those distances are then passed to another stage of the AIDW to adaptively calculate the power parameter and the expected prediction value (i.e., the weighted average). To evaluate the improved parallel AIDW algorithm, we also compare its efficiency with that of our previous one introduced in Mei et al. (2015).

The rest of this paper is organized as follows. “The AIDW interpolation algorithm” section introduces the background principles of the IDW algorithm, the AIDW algorithm, and the kNN search. “The improved GPU-accelerated AIDW method” section describes the strategies and considerations for improving our previous GPU-accelerated AIDW algorithm. “Implementation details” section presents some implementation details of the improved algorithm. Some comparative experimental tests and analysis are provided in “Results and discussion” section. Finally, “Conclusion” section draws several conclusions.

The AIDW interpolation algorithm

The AIDW, originally proposed by Lu and Wong (2008), is an improved version of the standard IDW (Shepard 1968). The basic and most interesting idea behind the AIDW is as follows. It adaptively determines the distance-decay parameter $\alpha$ according to the spatial pattern of data points in the neighborhood of the interpolated points. In other words, the distance-decay parameter $\alpha$ is no longer a pre-specified constant value but is adaptively adjusted for a specific unknown interpolated point according to the distribution of the nearest neighboring data points.

When predicting the desired values for the interpolated points using AIDW, there are typically two phases: the first one is to determine adaptively the power parameter $\alpha$ according to the spatial pattern of data points; and the second is to perform the weighting average of the values of data points. The second phase is the same as that in the standard IDW.

In AIDW, for each interpolated point, the parameter $\alpha$ can be adaptively determined according to the following steps.

Step 1

Determine the spatial pattern by comparing the observed average nearest neighbor distance with the expected nearest neighbor distance.

1. Calculate the expected nearest neighbor distance $r_{\exp }$ for a random pattern using:
$$\begin{aligned} r_{\exp } =\frac{1}{2\sqrt{n / A} }, \end{aligned}$$
(1)
where n is the number of points in the study area, and A is the area of the study region.
2. Calculate the observed average nearest neighbor distance $r_{obs}$ by averaging the distances from the interpolated point to its k nearest neighboring data points:
$$\begin{aligned} r_{obs} =\frac{1}{k}\sum \limits _{i=1}^k {d_i }, \end{aligned}$$
(2)
where k is the number of nearest neighbor points, and $d_i$ is the distance to the ith nearest neighbor. The k can be specified before interpolating.
3. Obtain the nearest neighbor statistic $R\left( {S_0 } \right)$ by:
$$\begin{aligned} R\left( {S_0 } \right) =\frac{r_{obs} }{r_{\exp } }, \end{aligned}$$
(3)
where $S_{0 }$ is the location of an interpolated point.
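
As an illustration of Eqs. (1)–(3), the following minimal Python sketch computes the nearest neighbor statistic for a single interpolated point. The function and argument names are hypothetical, and the k nearest neighbor distances are assumed to be already available.

```python
import math

def nearest_neighbor_statistic(knn_distances, n_points, area):
    """Return R(S0) = r_obs / r_exp for one interpolated point.

    knn_distances: distances d_1..d_k to the k nearest data points (assumed given).
    n_points:      number of points n in the study area.
    area:          area A of the study region.
    """
    # Eq. (1): expected nearest neighbor distance for a random pattern.
    r_exp = 1.0 / (2.0 * math.sqrt(n_points / area))
    # Eq. (2): observed average distance to the k nearest neighbors.
    r_obs = sum(knn_distances) / len(knn_distances)
    # Eq. (3): nearest neighbor statistic.
    return r_obs / r_exp
```

With n = 100 points in an area A = 100, r_exp is 0.5; neighbor distances of 0.5, 1.0 and 1.5 then give r_obs = 1.0 and a statistic of 2.0.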

Step 2

Normalize the $R({S_0})$ measure to $\mu _R$ such that $\mu _R$ is bounded by 0 and 1 by a fuzzy membership function:

$$\begin{aligned} \mu _R =\left\{ {\begin{array}{ll} 0&{}\quad R\left( {S_0 } \right) \le R_{\min } \\ 0.5-0.5\cos \left[ {\frac{\pi }{R_{\max } }\left( {R\left( {S_0 } \right) -R_{\min } } \right) } \right] &{}\quad R_{\min } \le R\left( {S_0 } \right) \le R_{\max } \\ 1&{}\quad R\left( {S_0 } \right) \ge R_{\max}\end{array}} \right., \end{aligned}$$

(4)

where $R_{\min }$ or $R_{\max }$ refers to a local nearest neighbor statistic value (in general, the $R_{\min }$ and $R_{\max }$ can be set to 0.0 and 2.0, respectively).
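
Eq. (4) can be transcribed directly. The sketch below is a minimal Python version that uses the default bounds 0.0 and 2.0 mentioned above; the function name is hypothetical.

```python
import math

def normalize_statistic(r, r_min=0.0, r_max=2.0):
    """Map the nearest neighbor statistic R(S0) to mu_R in [0, 1] following Eq. (4)."""
    if r <= r_min:
        return 0.0
    if r >= r_max:
        return 1.0
    # Middle branch of Eq. (4), written exactly as stated in the text.
    return 0.5 - 0.5 * math.cos(math.pi / r_max * (r - r_min))
```

With the default bounds, a statistic of 1.0 (observed spacing equal to the expected spacing) maps to 0.5.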

Step 3

Determine the distance-decay parameter $\alpha$ by mapping the $\mu _{R}$ value to a range of $\alpha$ by a triangular membership function that belongs to certain levels or categories of distance-decay value; see Eq. (5).

$$\begin{aligned} \alpha \left( {\mu _R } \right) =\left\{ {{\begin{array}{ll} {\alpha _1 } &{}\quad {{0.0}\le \mu _R \le {0.1}} \\ {\alpha _1 \left[ {1-5\left( {\mu _R -0{.}1} \right) } \right] +5\alpha _2 \left( {\mu _R -0{.}1} \right) } &{}\quad {{0.1}\le \mu _R \le {0.3}} \\ {5\alpha _3 \left( {\mu _R -0{.}3} \right) +\alpha _2 \left[ {1-5\left( {\mu _R -0{.}3} \right) } \right] } &{}\quad {{0.3}\le \mu _R \le 0{.}5} \\ {\alpha _3 \left[ {1-5\left( {\mu _R -0{.5}} \right) } \right] +5\alpha _4 \left( {\mu _R -0{.}5} \right) } &{}\quad {{0.5}\le \mu _R \le {0.7}} \\ {5\alpha _5 \left( {\mu _R -0{.7}} \right) +\alpha _4 \left[ {1-5\left( {\mu _R -0{.7}} \right) } \right] } &{}\quad {{0.7}\le \mu _R \le {0.9}} \\ {\alpha _5 } &{}\quad {{0.9}\le \mu _R \le {1.0}} \\ \end{array} }} \right. , \end{aligned}$$

(5)

where $\alpha _{1}, \alpha _{2}, \alpha _{3}, \alpha _{4}, \alpha _{5}$ are the five assigned levels or categories of distance-decay value.
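
The piecewise-linear mapping of Eq. (5) is sketched below in Python. The five default levels (1.0 to 5.0) are placeholder values chosen only for illustration, since the text does not fix them here; the function name is likewise hypothetical.

```python
def distance_decay(mu, alphas=(1.0, 2.0, 3.0, 4.0, 5.0)):
    """Map mu_R in [0, 1] to the distance-decay parameter alpha following Eq. (5).

    alphas: the five levels alpha_1..alpha_5 (placeholder defaults, not from the paper).
    """
    a1, a2, a3, a4, a5 = alphas
    if mu <= 0.1:
        return a1
    if mu <= 0.3:
        return a1 * (1.0 - 5.0 * (mu - 0.1)) + 5.0 * a2 * (mu - 0.1)
    if mu <= 0.5:
        return 5.0 * a3 * (mu - 0.3) + a2 * (1.0 - 5.0 * (mu - 0.3))
    if mu <= 0.7:
        return a3 * (1.0 - 5.0 * (mu - 0.5)) + 5.0 * a4 * (mu - 0.5)
    if mu <= 0.9:
        return 5.0 * a5 * (mu - 0.7) + a4 * (1.0 - 5.0 * (mu - 0.7))
    return a5
```

Each interior branch interpolates linearly between two adjacent levels, so the mapping is continuous at the breakpoints 0.1, 0.3, 0.5, 0.7 and 0.9.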

After determining the parameter $\alpha$, the desired prediction value of each interpolated point can be obtained via a weighted average. This stage is the same as that in the standard IDW.
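
For reference, the weighted average of the standard IDW uses weights proportional to the inverse distance raised to the power $\alpha$. The following Python sketch shows this for one interpolated point; the function name and the handling of a zero distance (returning the coincident data value) are assumptions made for illustration.

```python
import math

def idw_predict(x, y, data, alpha):
    """Predict the value at (x, y) as an inverse-distance weighted average.

    data:  iterable of (xi, yi, zi) data points.
    alpha: power (distance-decay) parameter for this interpolated point.
    """
    numerator = 0.0
    denominator = 0.0
    for xi, yi, zi in data:
        d = math.hypot(x - xi, y - yi)
        if d == 0.0:
            # Assumption of this sketch: a coincident data point fixes the value.
            return zi
        w = d ** (-alpha)
        numerator += w * zi
        denominator += w
    return numerator / denominator
```

In AIDW the only difference from the standard IDW is that `alpha` is supplied per interpolated point rather than as a single global constant.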

The improved GPU-accelerated AIDW method

This section will briefly introduce the considerations and strategies in the development of the improved GPU-accelerated AIDW interpolation algorithm.

Overview and basic ideas

The basic and most interesting concept behind the AIDW method is as follows. It attempts to determine the power parameter $\alpha$ adaptively according to the spatial distribution pattern around each interpolated point. In the AIDW algorithm, the spatial distribution pattern is taken to be the distribution density of several nearest neighboring data points located around an interpolated point, which can be roughly measured by the average distance from those neighboring data points to the interpolated point.

In our previous work, we presented a straightforward and easy-to-implement algorithm, well suited to GPU parallelization, for finding the k nearest neighboring data points of each interpolated point. Assuming there are n interpolated points and m data points, for each interpolated point we carry out the following steps (Mei et al. 2015):

Step 1 Calculate the first k distances between the first k data points and the interpolated point;
Step 2 Sort the first k distances in ascending order;
Step 3 For each of the remaining $(m-k)$ data points,

1. Calculate the distance dist;
2. Compare the dist with the kth distance: if dist < the kth distance, then replace the kth distance with the dist;
3. Iteratively compare and swap the neighboring two distances from the kth distance to the first distance until all the k distances are newly sorted in ascending order.
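
The three steps above can be sketched serially in Python as follows. This is an illustrative transcription for one interpolated point, not the GPU kernel itself; the function name is hypothetical and at least k data points are assumed.

```python
import math

def knn_distances_global(query, data, k):
    """Return the k smallest distances from `query` to the points in `data`.

    query: (x, y) of the interpolated point.
    data:  list of (x, y) data points, assumed to contain at least k points.
    """
    qx, qy = query
    # Steps 1 and 2: distances to the first k data points, sorted ascending.
    best = sorted(math.hypot(qx - x, qy - y) for x, y in data[:k])
    # Step 3: scan the remaining (m - k) data points.
    for x, y in data[k:]:
        dist = math.hypot(qx - x, qy - y)
        if dist < best[-1]:
            # Replace the kth (largest) distance ...
            best[-1] = dist
            # ... and restore ascending order by swapping neighbors downwards.
            i = k - 1
            while i > 0 and best[i] < best[i - 1]:
                best[i], best[i - 1] = best[i - 1], best[i]
                i -= 1
    return best
```

Only a fixed-size buffer of k distances is kept per interpolated point, which is what makes the procedure easy to run inside a single thread; the cost is that every data point is visited.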

The major advantage of the above algorithm is that it is simple and easy to implement. There is no need to utilize any complex space partitioning data structures such as various types of trees; only arrays for storing distances and coordinates are needed. Also, the desired nearest neighbors are found without invoking a separate sorting or searching routine (e.g., binary search). In general, most sorting algorithms are computationally complex and not suitable for being invoked in their entirety within a single GPU thread.

The most obvious shortcoming of the above algorithm for finding nearest neighboring data points is its computational inefficiency, which is due to the global search for nearest neighbors. In that algorithm, the first k distances are calculated and recorded, and then the distances to the remaining points are calculated and compared with those first k distances. This procedure requires a global search over all data points, which is not computationally optimal. A frequently used optimization strategy is to perform a local search instead, filtering out the data points and distances that do not need to be considered.

In this work, we focus on improving our previous GPU-accelerated AIDW algorithm by using a fast kNN search algorithm. Our considerations and basic ideas behind developing the efficient kNN search algorithm are as follows:

1. Create an even grid to partition the planar region that encloses the projected positions of all data points and interpolated points;
2. Distribute all the data points and interpolated points into the grid and record the locations;
3. Perform a local and fast search within the grid to find the nearest neighboring data points for each interpolated point.

After obtaining the average distance of those neighboring data points, the adaptive power parameter $\alpha$ will be determined according to the average distance. Finally, the desired prediction value for each interpolated point can be obtained via weighting average using the parameter $\alpha$.

In summary, the improved GPU-accelerated AIDW algorithm is mainly composed of two stages: (1) the kNN search and average distances calculation, and (2) the determination of adaptive power parameter and prediction value by weighted interpolating; see Fig. 1.

Stage 1: kNN search

The workflow of the kNN search stage is shown in Fig. 1. This section describes the stage in more detail.

Creating an even grid

The even grid is a simple type of data structure for space partitioning, which is composed of regular cells such as squares or cubes; see the example of a planar grid illustrated in Fig. 2. Compared to other efficient but complex space partitioning data structures such as the k-d tree, the even grid is much easier to create and to search for objects in. In this work, we use a planar even grid to partition all data points to speed up the kNN search via local search.

The building of an even planar grid is straightforward. We first calculate or specify the width of the square cell, then determine the planar rectangular region for partitioning according to the minimum and maximum x and y coordinates of all points, i.e., obtain the length and width of the rectangle. After that, the numbers of rows and columns of the grid can be quite easily determined by dividing the rectangle.
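
A minimal Python sketch of this construction is given below. The function name, the returned dictionary, and the convention of adding one extra row and column (so that points lying exactly on the maximum coordinate still fall inside the grid) are assumptions of the sketch.

```python
def build_grid(points, cell_width):
    """Describe an even planar grid enclosing all `points`.

    points:     list of (x, y) coordinates (data points and interpolated points).
    cell_width: width of each square cell.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)
    # Divide the bounding rectangle by the cell width; the "+ 1" keeps points
    # on the upper boundary inside the last column/row (sketch convention).
    n_cols = int((max_x - min_x) / cell_width) + 1
    n_rows = int((max_y - min_y) / cell_width) + 1
    return {
        "min_x": min_x,
        "min_y": min_y,
        "cell_width": cell_width,
        "n_rows": n_rows,
        "n_cols": n_cols,
    }
```

For a bounding rectangle of 10 by 5 units and a cell width of 2, this gives 6 columns and 3 rows.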

Distributing data points into cells

Distributing a data point means finding the grid cell in which the data point is located. Since each grid cell can be located and recorded using its row and column indices, distributing a data point amounts to obtaining the row and column indices of the cell in which it is located.

This procedure can also be performed quite easily. First, the differences between the coordinates of the data point and the minimum coordinates of all cells are calculated; then the row and column indices are obtained by dividing those differences by the cell width.
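
The index computation can be sketched in Python as follows. The function names and the row-major flattening of (row, column) into a single cell index are assumptions made for illustration; coordinates are assumed to be no smaller than the grid minimum.

```python
def cell_indices(point, min_x, min_y, cell_width):
    """Return (row, col) of the grid cell that contains `point`."""
    x, y = point
    # Offset from the grid origin, divided by the cell width and truncated.
    col = int((x - min_x) / cell_width)
    row = int((y - min_y) / cell_width)
    return row, col

def flat_cell_index(row, col, n_cols):
    """Combine row and column into one integer cell index (row-major order)."""
    return row * n_cols + col
```

For example, with the grid origin at (0, 0) and a cell width of 2, the point (3.5, 4.1) falls in row 2, column 1; in a grid with 6 columns its flat cell index is 13.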

Determining data points in each cell

The most important and basic idea behind utilizing a space partitioning is to perform a local search within local regions rather than a global search. When searching nearest neighbors, it is computationally optimal to first search approximate nearest neighbors within several local cells and then to find the exact nearest neighbors by filtering undesired points.

Since the local search operates within cells, it is necessary to determine which data points are located inside a specific cell. In other words, we need to know the number and the indices of the data points located in the same cell. Moreover, the layout for storing the number and indices should be carefully designed.

For each grid cell, storing the above-mentioned number and indices of the data points located in that cell would, in general, require a dynamically allocated array of integers. In traditional CPU computing, the allocation and operation of dynamic arrays is easy to implement and computationally inexpensive. In GPU computing, however, it is neither easy to implement nor computationally cheap, for two reasons: (1) in GPU computing, programming models such as CUDA cannot support the allocation and operation of dynamic arrays/containers such as vector and list in the C++ STL (Standard Template Library); and (2) allocating a large-enough static array of integers, e.g., int index[1000], within each GPU thread to store the indices of data points is not memory efficient.

Due to the above reasons, we design an optimal layout for storing the number and indices of data points. Our basic idea is as follows. If the indices of those points locating inside the same cell are stored in a continuous segment/piece of integer values, then we only need to know the address of the first point in the segment and the number of points in the same segment (i.e., the size of the segment).

In this case, only two integer values are needed per cell to record the number and the indices of the data points located in that cell. One integer holds the number, and the other records the address of the head/first point of the segment. Both values can be determined very efficiently in parallel.

Before the number and indices of the data points located in the same cell can be determined, those data points must be stored contiguously. Since we have obtained the index of the cell in which each data point is located, sorting all data points by their cell indices in ascending order gathers the data points of each cell into a contiguous segment. This sorting procedure is well suited to parallelization on the GPU.

The number of data points located in the same cell is determined using segmented parallel reduction. As described above, after sorting all data points according to cell indices, all data points are stored in a group of segments; each segment is flagged with the cell index, and contains the indices of the data points located in the same cell. The number of data points located in the same cell can be obtained by performing a reduction for each segment; see Fig. 3a. Similarly, the head index of the first point of each segment can be obtained using segmented parallel scan; see Fig. 3b.
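
The layout described above (sorted point indices plus a count and a head address per cell) can be illustrated with a serial Python sketch. Note that it substitutes a plain counting loop and an exclusive prefix sum for the segmented parallel reduction and scan used on the GPU; the resulting arrays are the same. All names are hypothetical.

```python
def build_cell_segments(cell_ids, n_cells):
    """Group point indices by cell and record each cell's segment.

    cell_ids: flat cell index of every data point.
    n_cells:  total number of grid cells.
    Returns (order, counts, heads):
      order  - point indices sorted by cell index (segments are contiguous),
      counts - number of points in each cell,
      heads  - position in `order` where each cell's segment starts.
    """
    # Sort point indices by cell index so that each cell forms one segment.
    order = sorted(range(len(cell_ids)), key=lambda i: cell_ids[i])
    # Serial stand-in for the per-segment reduction: count points per cell.
    counts = [0] * n_cells
    for c in cell_ids:
        counts[c] += 1
    # Serial stand-in for the scan: exclusive prefix sum of the counts.
    heads = [0] * n_cells
    running = 0
    for c in range(n_cells):
        heads[c] = running
        running += counts[c]
    return order, counts, heads
```

The points of cell c are then `order[heads[c] : heads[c] + counts[c]]`, so each cell is described by exactly two integers regardless of how many points it holds.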

Searching nearest neighbors

In this work, a space-partitioning data structure, the even grid, is employed to enhance the kNN search algorithm. The most important and basic idea behind utilizing the space partitioning is to perform a local search within local regions rather than a global search. This idea is quite effective in practice because the number of points that need to be visited and compared can be significantly reduced, which improves the computational efficiency.

The process of kNN search for each interpolated point can be summarized as follows.

Step 1 Locate the interpolated point in the even grid
Step 2 Determine the level of cell expanding
Step 3 Find the nearest neighbors within the local region
Step 4 Calculate the average distance

Locating each interpolated point in the previously created planar grid is quite straightforward. Since each grid cell can be located and recorded using its row and column indices, locating an interpolated point amounts to obtaining the row and column indices of the cell in which it is located. First, the differences between the coordinates of the interpolated point and the minimum coordinates of all cells are calculated; then the row and column indices are obtained by dividing those differences by the cell width.

Determining the level of cell expanding amounts to determining the region of cells in which the local nearest neighbors search should be carried out; see the three levels of cell expanding in Fig. 2. In kNN search, the number of nearest neighbors, k, is typically pre-specified; and obviously, the number of data points located in the local cells must be larger than the number k. Thus, the level of cell expanding can be determined iteratively by comparing the number of currently found data points with the number k. For example, when k is specified as 15 and there are only ten data points within the first level of local cells, level 1 needs to be expanded to level 2. Similarly, if only 14 data points can be found within the second level of local cells, the level needs to be further expanded to 3. This procedure is repeated until enough data points have been found.

Remark

Note that after the level of cell expanding has been determined iteratively, for example, level 3, the final level of cell expanding needs to be increased by 1, i.e., to level 4. The reason is as follows. Without expanding one additional level, the nearest neighbors found in the initially determined level of local cells may not be the desired exact k nearest neighbors; see the marked data point in Fig. 4. When $k = 10,$ the determined level of cell expanding is 0 (i.e., the yellow region). However, the marked data point is obviously one of the nearest neighbors of the only interpolated point because it is much nearer to the interpolated point than several data points located in the yellow region. This demonstrates that, without expanding one additional level, incorrect/undesired nearest neighboring data points are likely to be found, and several of the expected nearest neighboring data points may not be found.
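
The iterative expansion together with the extra level from the remark can be sketched in Python as follows. Several details are assumptions of the sketch rather than statements from the text: level L is taken to mean the (2L+1) by (2L+1) block of cells centered on the cell of the interpolated point, the search starts at level 0, expansion stops once at least k data points are covered, and the level is capped by the grid size so that the loop always terminates.

```python
def count_in_level(counts, row, col, level):
    """Number of data points in the block of cells around (row, col).

    counts: 2D list, counts[r][c] = number of data points in cell (r, c).
    level:  expansion level; the block spans `level` cells in every direction
            and is clipped at the grid boundary.
    """
    n_rows, n_cols = len(counts), len(counts[0])
    total = 0
    for r in range(max(0, row - level), min(n_rows - 1, row + level) + 1):
        for c in range(max(0, col - level), min(n_cols - 1, col + level) + 1):
            total += counts[r][c]
    return total

def search_level(counts, row, col, k):
    """Expand until at least k points are covered, then add one extra level."""
    max_level = max(len(counts), len(counts[0]))  # cap: block covers the grid
    level = 0
    while level < max_level and count_in_level(counts, row, col, level) < k:
        level += 1
    # Extra level from the remark: candidates in the surrounding ring of cells
    # may be nearer than some points already found.
    return level + 1
```

The candidate set for the subsequent filtering step is then every data point in the cells covered by the returned level.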

The kNN search in the local cells consists, in fact, of finding the exact nearest neighbors by filtering out undesired points. We first allocate an array of size k for storing distances, and initialize all distances to 0. Then, for each of the data points located in the local cells, we calculate the distance dist and compare it with the kth distance; if dist is smaller than the kth distance, we replace the kth distance with dist; after that, we iteratively compare and swap the neighboring two distances from the kth distance to the first distance until all the k distances are newly sorted in ascending order; see Mei et al. (2015) for more details.

After finding the nearest neighbors of each interpolated point, the distances between each nearest neighbor and the interpolated point can be calculated; and finally, the desired average distance can be obtained.

Stage 2: weighted interpolating

According to the principle of the AIDW interpolation algorithm, it is natural for a single GPU thread to be responsible for calculating the prediction value of one interpolated point. For example, assuming there are n interpolated points whose values (such as elevations) need to be predicted, n threads are allocated to calculate the desired prediction values for all n interpolated points concurrently.

In GPU computing, data access on the shared memory is inherently much faster than that on the global memory; therefore, any opportunity to replace global memory access with shared memory access should be exploited. Because the shared memory residing on the GPU is limited per SM (Streaming Multiprocessor), a common optimization strategy called “tiling” is frequently used, which divides the data stored in global memory into pieces called tiles so that each tile fits into the shared memory.

The above-mentioned strategy of “tiling” is also employed to enhance the GPU-accelerated AIDW algorithm. First, the coordinates of data points are loaded from the global memory into the shared memory. Then, all threads within the same thread block are able to concurrently access the coordinates currently residing in the shared memory. With the “tiling” strategy, the number of accesses to the global memory can be substantially reduced, and performance gains are expected.
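
The access pattern of tiling can be illustrated with a serial Python emulation. This is not GPU code: the list `tile` merely stands in for the shared-memory copy of one tile, and the loop over `queries` plays the role of the threads of one thread block. The function name and arguments are hypothetical, and no interpolated point is assumed to coincide with a data point.

```python
import math

def idw_tiled(queries, data, alphas, tile_size):
    """Serially emulate tile-wise accumulation of the weighted average.

    queries:   list of (x, y) interpolated points (one per emulated thread).
    data:      list of (x, y, z) data points.
    alphas:    power parameter of each interpolated point.
    tile_size: number of data points loaded per tile.
    """
    numerators = [0.0] * len(queries)
    denominators = [0.0] * len(queries)
    for start in range(0, len(data), tile_size):
        # One load per tile; on the GPU this would be the copy from global
        # memory into shared memory, done cooperatively by the block.
        tile = data[start:start + tile_size]
        # Every emulated thread reuses the same tile.
        for t, (qx, qy) in enumerate(queries):
            for x, y, z in tile:
                d = math.hypot(qx - x, qy - y)
                w = d ** (-alphas[t])
                numerators[t] += w * z
                denominators[t] += w
    return [n / d for n, d in zip(numerators, denominators)]
```

The result does not depend on the tile size; tiling only changes how often each data point has to be fetched from the slower memory, namely once per block instead of once per thread.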

Implementation details

As introduced in the above section, the improved GPU-accelerated AIDW interpolation algorithm is mainly composed of two stages, i.e., the kNN search stage and the weighted interpolating stage. In this section, we will describe some implementation details on the above two stages.