CudaChain: an alternative algorithm for finding 2D convex hulls on the GPU

This paper presents an alternative GPU-accelerated convex hull algorithm and a novel Sorting-basedPreprocessingApproach (SPA) for planar point sets. The proposed convex hull algorithm termed as CudaChain consists of two stages: (1) two rounds of preprocessing performed on the GPU and (2) the finalization of calculating the expected convex hull on the CPU. Those interior points locating inside a quadrilateral formed by four extreme points are first discarded, and then the remaining points are distributed into several (typically four) sub regions. For each subset of points, they are first sorted in parallel; then the second round of discarding is performed using SPA; and finally a simple chain is formed for the current remaining points. A simple polygon can be easily generated by directly connecting all the chains in sub regions. The expected convex hull of the input points can be finally obtained by calculating the convex hull of the simple polygon. The library Thrust is utilized to realize the parallel sorting, reduction, and partitioning for better efficiency and simplicity. Experimental results show that: (1) SPA can very effectively detect and discard the interior points; and (2) CudaChain achieves 5×–6× speedups over the famous Qhull implementation for 20M points. Electronic supplementary material The online version of this article (doi:10.1186/s40064-016-2284-4) contains supplementary material, which is available to authorized users.


Introduction
The finding of convex hulls is a fundamental issue in computer science, which has been extensively studied for many years.Several classic algorithms have been proposed, including the Graham scan [1], the Jarvis's march [2], the divide-and-conquer algorithm [3], the Andrew's monotone chain [4], the incremental approach [5], and the QuickHull [6].
Recently, to speed up the calculating of convex hulls for large sets of points, several efforts have been carried out to redesign and implement several commonly used CPU-based convex hull algorithms on the GPU.For example, Srikanth, et al. [7] and Srungarapu, et al. [8] parallelized the QuickHull algorithm [9] to accelerate the finding of two dimensional convex hulls.Based on the QuickHull approach, Stein, et al. [10] presented a novel parallel algorithm for computing the convex hull of a set of points in 3D using the CUDA programming model.Tang, et al. [11] developed a CPU-GPU hybrid algorithm to compute the convex hull of points in three or higher dimensional spaces.
Tzeng and Owens [12] presented a framework for accelerating the computing of convex hull in the divide-and-conquer fashion by taking advantage of QuickHull.Similarly, White and Wortman [13] described a pure GPU divide-and-conquer parallel algorithm for computing 3D convex hulls based on the Chan's minimalist 3D convex hull algorithm [14].In [15], a novel algorithm is proposed to compute the convex hull of a point set in R 3 by exploiting the relationship between the Voronoi diagram and the convex hull.In addition, Gao, et al. [16] designed ffHull, a flip algorithm that allows nonrestrictive insertion of many vertices before any flipping of edges and maps well to the massively parallel nature of the modern GPU.
When calculating the convex hull of a set of points, an effective strategy for improving computational efficiency is to discard the interior points that have been exactly determined previously.This strategy is referred to as the preprocessing/preconditioning procedure.The most commonly used preprocessing approach is to form a convex polygon or polyhedron using several determined extreme points first and then discard those points that locate inside the convex polygon or polyhedron; see [10,11].The simplest case in two dimensions is to form a convex quadrilateral using four extreme points with min or max x or y coordinates and then check each point to determine whether it locates inside the quadrilateral; see [17].Another quite recent effort for efficiently discarding interior points was introduced in [18].
In this paper, our objective is to design and implement an efficient and practical convex hull algorithm by exploiting the power of GPU.We make the following two contributions: (1) we propose a novel and effective Sorting-based Preprocessing Approach (SPA) for discarding interior points; (2) we present an efficient GPU-accelerated algorithm for finding the convex hulls of planar point sets based on the algorithm SPA.
The proposed convex hull algorithm consists of two stages: (1) two rounds of preprocessing performed on the GPU and (2) the finalization of calculating the expected convex hull on the CPU.We first discard the interior points that locate inside a quadrilateral formed by four extreme points, and then distribute the remaining points into several (typically four) sub regions.For each subset of points, we first sort them in parallel, then perform the second round of discarding using SPA, and finally form a simple chain for the current remaining points.A simple polygon can be easily generated by directly connecting all the chains in sub regions.We at last obtain the expected convex hull of the input points by calculating the convex hull of the simple polygon using the Melkman algorithm [19].
Our algorithm is implemented by heavily taking advantage of the library Thrust [20] for better efficiency and simplicity.In our implementation, we directly use the very efficient data parallel primitives such as parallel sorting, reduction, and partitioning that are provided by Thrust.The use of the library Thrust makes our implementation simple to develop.
We test our algorithm against the Qhull library [21] on various datasets of different sizes using two machines.Experimental results show that our algorithm achieves 5x ~ 6x speedups over the Qhull implementation for 20M points.Thus, this algorithm is competitive in practical applications for its simplicity and satisfied efficiency performance.

Algorithm Design
The proposed GPU-accelerated algorithm is designed on the basis of the fast convex hull algorithm introduced by Akl and Toussaint [17].The procedure of our algorithm mainly consists of three steps: (1) we first carry out a first round of preprocessing by discarding those points locating inside a quadrilateral formed by four extreme points.This commonly used strategy of preprocessing was described in [17]; (2) we then distribute the remaining points into several (typically four) sub regions, sort the points in the same region according to their coordinates, and perform a novel Sorting-based Preprocessing Approach (SPA) to in further discard interior points for each sub region and form a simple polygon; (3) we finally employ Melkman's algorithm [19] to calculate the convex hull of the simple polygon.The obtained convex hull is exactly the expected convex hull of the input point set.The first and second steps of our algorithm are performed on the GPU, while the third is carried out on the CPU.
More specifically, the procedure of the proposed algorithm is listed as follows: 1) Find four extreme points that have the max or min x and y coordinates by parallel reduction, denote them as P minx , P maxx , P miny , P maxy 2) Determine the distribution of all points in parallel, and discard the points locating inside the quadrilateral formed by P minx , P miny , P maxx , and P maxy 3) Denote the subset of points locating in the four sub regions, i.e., the lower left, lower right, upper right, and upper left as S R1 , S R2 , S R3 , and S R4 , respectively 4) Sort S R1 , S R2 , S R3 , and S R4 separately in parallel; see Table 1 for the orders of sorting 5) Perform the SPA for S R1 , S R2 , S R3 , and S R4 to discard interior points in further, and form a simple chain for the remaining points in each sub region 6) Form a simple polygon by connecting those four chains in counterclockwise (CCW) 7) Find the convex hull of the simple polygon using Melkman's algorithm [19] In the above procedure, the most commonly used strategy for discarding interior points is first carried out (i.e., the step 1 and step 2); and then those remaining points are divided into four subsets.After that, each subset of points is sorted separately.The key step in this procedure is the second round of discarding interior points and the forming of a simple chain for each subset.A simple polygon can be easily created by directly connecting the chains; and the expected convex hull can be found using Melkman's algorithm [19] which is specifically designed for calculating the convex hull of a simple polygon.The strategy of discarding the interior points locating inside a quadrilateral formed by four extreme points is straightforward; see Figure 1.There is no need to describe this strategy in more details.The only remarkable issue is that: to reduce the computational cost, in the process of checking whether a point is interior (i.e., locating in the region R 0 in Figure 1(a)), the distribution of those non-interior points can also be easily determined.For all points, we use the following simple method to determine their distributions: (1) if point P lies on the left side of the directed line P minx P miny , then it falls in the region R 1 ; (2) else if P lies on the left side of the directed line P miny P maxx , then it falls in the region R 2 ; (3) else if P lies on the left side of the directed line P maxx P maxy , then it falls in the region R 3 ; (4) else if P lies on the left side of the directed line P maxy P minx , then it falls in the region R 4 ; (5) else P falls in the region R 0 After the above procedure of determination, all points are distributed into five regions.Those points in the region R 0 are interior ones, and need to be directly discarded in the step, while the remaining points in the other four regions should be taken into consideration for calculating the convex hull.

Step 2 : Second Round of Discarding and Forming Simple Polygon
Figure 2 The rules for discarding interior points of the SPA This section will describe a novel sorting-based preprocessing approach that is specifically applicable to the previously sorted points.We term this method as the SPA.The rules for discarding interior points in those four sub regions, i.e., lower left (R 1 ), lower right (R 2 ), upper right (R 3 ), and upper left (R 4 ), are presented in Figure 2.
The correctness of SPA for each sub region is demonstrated in Figure 3.For the region R 1 , the first point and the last point are the P minx and P miny , respectively; see Figure 3(a).Assuming the point P i has been determined to be non-interior, and now we check the point P according to the relationship between P i and P. Since that all the points in the region R 1 have been sorted in the ascending order of x coordinates, thus x P > x Pi ; and if y Pi is larger than y P , then the point P must be located in the shaded triangular area; and obviously it also falls in the triangle formed by three points P i , P miny , and P minx .Hence, the point P must be an interior one and needs to be discarded.The correctness of SPA for other three regions can also be explained similarly.We also take the forming of the chain in the upper left region (R 3 ) for an example; see Figure 4. Previously, seven points have been sorted in the descending order of x.We first check the point P 1 ; and obviously it is not an interior point according the rule presented in the Figure 2 (c).Similarly, we also find that the point P 2 is an exterior point and needs to be kept.However, the point P 3 is an interior point since its y coordinate is less than that of the point P 2 ; and obviously, the point P 3 locates inside the triangle formed by the first point (i.e., the point P maxx ), the last point (i.e., the point P maxy ), and P 2 .After discarding the point P 3 , the point P 4 can also be discarded, while both the points P 5 and P 6 are not exterior points.However, since that the coordinate y of the point P 7 is less than that of the point P 6 , it also should be removed.The output of the previous step is a simple polygon, which is also an approximate convex hull.To calculate the exact convex hull of the input point set, we select the fast algorithm introduced by Melkman [19] to compute the convex hull of the simple polygon.The convex hull of the simple polygon is the convex hull of the input data set; see Figure 5.

Implementation Details
In this section, we describe more details about the implementation of our algorithm.The implementation has both the CPU side (host) and the GPU side (device) code.The code on the CPU side is developed to compute the convex hull of a simple polygon, which is relatively simple and easy to implement when compared to the code on the GPU side.Thus, we focus on the development of the GPU side code.
The implementation on the GPU side is developed by heavily taking advantage of the library Thrust for better efficiency and simplicity when using the data-parallel algorithm primitives.Currently, several GPU-accelerated libraries have been designed to provide dataparallel algorithm primitives such as parallel scan, parallel sort and parallel reduction.Such libraries include Thrust [20], CUDPP [22], and CUB [23].Since that the library Thrust has been integrated in CUDA since version 4.0, it is very easy and convenient to use Thrust directly in CUDA.Hence, we select the library Thrust rather than the other two libraries.
Figure 6 The implementation of the proposed algorithm (CudaChain)

Performing the First Round of Discarding on the GPU
The first step of discarding the interior points locating inside the quadrilateral formed by four extreme points is to find those points with the min or max x / y coordinates.In sequential programming pattern, a loop over all input points needs to be carried out to find the min or max values.In parallel programming pattern, the finding of min or max values in a vector can be efficiently achieved by performing a parallel reduction.Thrust provides such common data-parallel primitive and several easy-to-use interface functions.We use two functions, i.e., thrust::min_element() and thrust::max_element(), to efficiently find the min and max coordinates of all points in parallel; see lines 11 -14 in Figure 6.
To avoid the transformation between device memory and host memory, we directly obtain the memory addresses of the coordinates of extreme points and all input points that resides on the GPU using the function thrust::raw_pointer_cast(), and then pass them as the launch arguments for the kernel kernelPreprocess; see lines 24 -28 in Figure 6.
We design a kernel, kernelPreprocess, to determine in which region a point falls.Each thread is responsible for calculating the position of a point; and the results are stored in an array d_pos[n].The method for determining the distribution of points is introduced in Section 2.1.1.We use an integer as an indicator of the position.For example, if a point P i locates inside the region R 1 , then the value d_pos[i] is set to 1; and the indicator value of an interior point is 0. All the points that are not in the region R 0 are called exterior or remaining points.A simple example is presented in Figure 7(a).

Performing the Second Round of Discarding on the GPU
Figure 7 The second round of discarding interior points To carry out the second round of discarding interior points, we: (1) perform four parallel partitioning for all points according to their positions, (2) sort the points in each region separately, (3) invoke a kernel for each region to discard the interior points using the method SPA, and (4) perform another parallel partitioning for all exterior points.The first step, parallel partitioning, is carried out to gather those points in the same region together for subsequent procedure of sorting.After partitioning, the points that locate in the same region reside in a consecutive segment; see Figure 7(b).In this case, parallel sorting can be performed for each segment of points.Noticeably, we choose to partition and sort each segment of points in place to minimize the cost of memory space.
After parallel sorting, we design a kernel for each region to discard the interior points using the method SPA.There is only one thread block within the kernel's thread grid.Each thread in the only thread block is responsible for checking consecutive (m + BLOCK_SIZE -1) / BLOCK_SIZE points in the same region, where m is the number of points in a region for being checked, and BLOCK_SIZE represents the number of threads in the only block.In our implementation, we set BLOCK_SZIE to 1024 according to the compute capability of the adopted GPU.After checking and discarding interior points using SPA, some previous exterior points have been determined as interior ones; and their corresponding indicator values are modified to 0; see Figure 7(c).
In our implementation, we only allocate one thread block in the discarding of interior points using the proposed method SPA due to the data dependency issue in the discarding.When checking whether a point in a specific region such as the region R 1 , the y coordinate of the point being checked is compared to that of the current last point of the formed chain; see Figure 2(a).This means the checking for a point, e.g., P i , depends on the checking of the previous point P i-1 .It also means the checking for a set of consecutive points can only be performed in a sequential pattern.However, it is able to first divide a large set of consecutive points into some smaller subsets of consecutive points, and then perform the checking in parallel for each subset of points separately.This solution is in the divide-and-conquer fashion.We adopt this solution; but we cannot determine the optimal size of a subset of points or the num be r of all subse ts.Thus, we de cide to divide a large se t of conse cutive points into BLOCK_SIZE subsets, while each subset contains (m + BLOCK_SIZE -1) / BLOCK_SIZE points; and then, we carry out the checking for all the BLOCK_SIZE subsets in parallel.
The final step is to re-partition and copy the coordinates of the exterior points in current stage for outputting.Noticeably, to preserve the relative order of the sorted points, we use the function thrust::stable_partition() rather than thrust::partition() to compact the exterior points; see lines 60 -64 in Figure 6.After the stable partitioning, the remaining exterior points are stored consecutively and can be easily copied in parallel for being used on the host side (on the CPU); see Figure 7(d).

Results
We have tested our algorithm against the Qhull library [21] on various datasets of different sizes using two machines.The first machine features an Intel i7-3610QM processor (2.30GHz), 6GB of memory and a NVIDIA GeForce GTX660M graphics card.The other machine has an Intel i5-3470 processor (3.20GHz), 8GB of memory and a NVIDIA GeForce GT640 (GDDR5) graphics card.The graphics card GTX 660M has 2GB of RAM and 384 cores; and the GT640 graphics card has 1GB of RAM and 384 cores.We have used the CUDA toolkit version 5.5 on Window 7 Professional for evaluating all the experimental tests.
We have created two groups of datasets for testing.The first group includes 8 sets of randomly distributed points in a square that are generated using the rbox component in Qhull.The second group consists of 10 point sets that are derived from 3D mesh models by projecting the vertices of each 3D model onto the XY plane.These mesh models presented in Figure 8 are directly obtained from the Stanford 3D Scanning Repository 1 and the GIT Large Geometry Models Archive 2 .

Efficiency on the GTX 660M
The running time on the GPU GTX 660.M of two groups of testing data, i.e., the group of randomly distributed point sets and the group of point sets derived from 3D models, is listed in Table 2 and Table 3, respectively.To evaluate the computation load between the GPU side and the CPU side of our algorithm, we count the running time separately for both of the two sides and calculate the workload percentage of the CPU side.
For both the two groups of experimental tests, the experimental results show that: for small size of testing data, the Qhull is faster than the proposed CudaChain, while CudaChain is much faster than Qhull for the large size of testing data.The speedups of CudaChain over Qhull become larger with the increasing of the data size.The speedup is about 3x~4x on average and 5x~6x in the best cases.
The workload percentage of the CPU side is much smaller than that on the GPU side; and it decreases for the group of randomly point sets when the data size increases.In addition, the workload percentage of the CPU side is usually less that 10%, except for the test of the model Happy Buddha.9 The efficiency of CudaChain on the GPU GTX 660M against CPU-based Qhull using the same datasets and the same machine

Efficiency on the GT 640
On the machine with the GPU GT640, the running time of two groups of testing data is listed in Table 4 and Table 5.Similar to those experimental results obtained on the machine with the GTX 660M, for small size of testing data, the Qhull is also faster than our algorithm CudaChain, while CudaChain is much faster than Qhull for the large size of testing data.The speedups of CudaChain over Qhull also become larger with the increasing of the data sizes.
The speedup is about 3x~4x on average and 4x~5x in the best cases; however, for the largest model Lucy, the speedup is 5.2x on the GT 640, while it is 6x on the GTX 660M.
The experimental results obtained on the machine with the GT 640 also indicate that: the workload percentage of the CPU side is much smaller than that of the GPU side; and it decreases for the group of randomly point sets when the data size increases.The behaviors are the same as those on the GTX 660M.Furthermore, the workload percentage of the CPU side is usually also less that 10%, except for the test of the model Happy Buddha.10 The efficiency of CudaChain on the GPU GT 640 against CPU-based Qhull using the same datasets and the same machine There are two rounds of discarding in our algorithm.To evaluate the effectiveness of the proposed preprocessing method SPA, we count the remaining points after each round of discarding, and compare the effectiveness of two rounds of discarding.The results presented in Figure 11 show that SPA can dramatically reduce the number of remaining points and thus improve the overall efficiency of CudaChain.In addition, the effectiveness of discarding interior points by SPA becomes better with the increasing of the data size.

Comparison
We have tested our algorithm CudaChain on two different machines with different GPUs.The efficiency performances of our algorithm on the two machines are almost the same.This result is due to the fact that the two GPUs, i.e., GTX 660M and GT 640, have the similar compute capability.However, the speedups of our algorithm over the implementation Qhull on two machines slightly differ.This behavior is lead by the different efficiency performance of CPU-based Qhull on the two machines.
Compared to other GPU-accelerated convex hull algorithms such as those implemented by parallelizing the QuickHull algorithm on the GPU [7,8,10,12], the main advantage of our algorithm is that it is very easy to implement, which is mainly due to (1) the use of the library Thrust and (2) relatively less data dependencies.The data-parallel primitives such as parallel sorting and parallel reduction provided by Thrust are very efficient and easy to use; we can directly use these primitives in CUDA to realize our implementation without too many efforts.In addition, in our algorithm the only step that has data dependency is the checking and discarding interior points using SPA.Other steps or procedures can be very well mapped to the massively parallel nature of the modern GPU.This feature of having less data dependencies also makes our algorithm simple and easy to implement in practical applications.

Complexity and Correctness
The time complexity in the worst case of the second round of discarding is O(nlogn) due to the sorting of points.Both the first round of discarding and the calculating of convex hull of a simple polygon run in O(n).Thus, the worst case time complexity of the entire algorithm is O(nlogn).
The space requirement of our algorithm is also efficient.Only three arrays for storing all the input points' coordinates and positions need to be allocated on the GPU.The parallel sorting, parallel reduction, and parallel partitioning completely operate on those three arrays in place without needing to explicitly allocate any additional global memory.In addition, to avoid the transformation from device memory and host memory and then back to device memory when invoking user-designed kernels, we directly obtain the memory addresses of those three arrays that resides on the GPU using the function thrust::raw_pointer_cast(), and then pass the addresses as the launch arguments for kernels.
The correctness of our algorithm can obviously be guaranteed.It is clear that: (1) in the calculating of convex hulls, any potential extreme points should not be discarded; and (2) any points that have been identified as interior ones can be discarded.As mentioned several times, there are two rounds of discarding in our algorithm.In the first round of discarding, those points locating in the quadrilateral formed by extreme points are definitely the interior ones and can be discarded.In the second round of discarding, we have proved that the points detected as the interior using the proposed preprocessing method SPA is reasonable; see Figure 3. Thus, this discarding can also be guaranteed to be correct.After two rounds of discarding, all the remaining points are used to calculate the expected convex hull.
In short, our algorithm can be guaranteed to be correct since we (1) only remove recognized interior points and (2) preserve all remaining points to avoid discarding any potential extreme points.

Limitation
The first shortcoming of our algorithm is that: the efficiency of discarding interior points using SPA within a single thread block cannot be guaranteed to be the highest.Probably, to hide memory latency and improve the efficiency, it needs to allocate several thread blocks in the discarding of interior points using the method SPA; and each thread is responsible for checking and discarding interior points for a small subset of consecutive points.However, the optical number of consecutive points that assigned to be checked in each thread to generate the highest efficiency cannot be determined; this is due to (1) the different data size of input points and the number of remaining points after the first round of discarding, and (2) the compute capability of the adopted GPU.Thus, it needs more experimental tests to determine the optical number of points that are assigned to each thread.
Another limitation is that: when all the input points are initially extreme points, e.g., when all points exactly locate on a circle, two rounds of discarding interior points will be wasteful since there are no interior points that will be found and removed.All the input points will be kept and used to calculate the convex hull on the CPU using the Melkman algorithm [19].Hence, the overall execution in this case could be very slow.

Conclusion
We have proposed a novel sorting-based preprocessing approach (SPA) and an efficient algorithm for calculating the convex hulls of planar point sets on the GPU.Our algorithm consists of two rounds of preprocessing procedures performed on the GPU and the finalization of calculating the expected convex hull on the CPU.We have used the library Thrust to realize the parallel sorting, reduction, and partitioning for better efficiency and simplicity.We have tested our algorithm against the Qhull library on various datasets of different sizes using two machines.Experimental results show that our algorithm achieves the speedups of 3x~4x on average and 5x~6x in the best cases.We believe that our algorithm is competitive in practical applications for its simplicity and satisfied efficiency.
When implementing our algorithm, we have heavily taking advantage of the library Thrust.An efficient counterpart of Thrust, CUB [23], has been developed recently.It was reported that CUB is faster than Thrust.We expect to gain a significant increase in overall performance of our algorithm by replacing Thrust with CUB.Future work should therefore include the implementation our algorithm using CUB and the evaluation of efficiency performance.

2. 1 . 1 Figure 1
Figure 1The distribution of points and the first round of discarding

Figure 3
Figure 3 Demonstrations for the correctness of discarding interior points in the method SPA.(a): the region R 1 ; (b) the region R 2 ; (c) the region R 3 ; (d) the region R 4

Figure 4 A
Figure 4 A simple example of forming the chain in the upper left region 2.1.3Step 3: Calculating the Convex Hull of Simple PolygonThe output of the previous step is a simple polygon, which is also an approximate convex hull.To calculate the exact convex hull of the input point set, we select the fast algorithm introduced by Melkman[19] to compute the convex hull of the simple polygon.The convex hull of the simple polygon is the convex hull of the input data set; see Figure5.

Figure 5
The convex hull of a simple polygon

Figure 8
Figure 8 3D mesh models from Stanford 3D Scanning Repository and GIT Large Geometry Models Archive.From the left to the right, the models are: Armadillo, Angel, Skeleton Hand, Dragon, Happy Buddha, Turbine Blade, Vellum Manuscript, Asian Dragon, Thai Statue, and Lucy (a) Randomly distributed points (b) Points derived from 3D models Figure11The effectiveness of two rounds of discarding interior points.(First round of discarding: removing the points inside the quadrilateral formed by four extreme points.Second round of discarding: removing interior points according to the rules of the method SPA)

Table 3
Comparison of running time (/ms) for point sets derived from 3D models on GTX 660M

Table 4
Comparison of running time (/ms) for randomly distributed point sets on GT 640

Table 5
Comparison of running time (/ms) for point sets derived from 3D models on GT 640