Density visualization
Based on test data, this tool measures point density and renders a preliminary density visualization of disease cases. Results depend on the zoom level, so different magnitudes of clustering appear at different scales. The processing service runs a heat map web processing service (WPS) that computes a spatial density map to detect areas with a high concentration of malaria cases. Reported malaria cases were stored in PostGIS and served by GeoServer. When the user changes the zoom level, a request is sent back to the server to recalculate the density map with values adjusted to the new zoom level (Fig. 4).
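One way to picture the zoom dependence is a kernel whose radius is fixed in screen pixels, so that it covers less ground as the user zooms in. The sketch below is only an illustration under that assumption: the function names, the 10-pixel radius, the Gaussian kernel and the Web Mercator tile resolution are not taken from the tool described here.

```python
import math

# Metres per pixel at zoom 0 on the equator for 256-pixel Web Mercator tiles.
EQUATOR_RESOLUTION_M = 156543.03392804097


def ground_radius_m(zoom, radius_px=10):
    """Ground distance (metres) covered by a kernel of radius_px pixels.

    Assumption: the heat map kernel has a fixed pixel radius, so the
    ground radius halves with every additional zoom level.
    """
    return radius_px * EQUATOR_RESOLUTION_M / (2 ** zoom)


def density_at(x, y, points, radius_m):
    """Gaussian-kernel density at (x, y).

    points is a list of (x, y, cases) tuples in projected metres.
    Points farther away than radius_m contribute nothing.
    """
    sigma = radius_m / 3.0  # hypothetical choice: kernel fades out near the radius
    total = 0.0
    for px, py, cases in points:
        d2 = (x - px) ** 2 + (y - py) ** 2
        if d2 <= radius_m ** 2:
            total += cases * math.exp(-d2 / (2.0 * sigma ** 2))
    return total


# Hypothetical case locations: (x, y, reported cases).
cases = [(0.0, 0.0, 2), (300.0, 0.0, 5), (5000.0, 0.0, 1)]
for zoom in (8, 12):
    radius = ground_radius_m(zoom)
    print(zoom, round(radius, 1), density_at(0.0, 0.0, cases, radius))
```

At a low zoom level the kernel is wide and distant cases merge into one hot spot; at a high zoom level only nearby cases contribute, which is why the map has to be recalculated on each zoom change.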
Nearest neighbour analysis
The test compares the observed average distance between nearest neighbouring points with that of a known pattern (Wong 2005) and is used to estimate the spatial proximity among events. The user draws a box around a cluster of points to define the boundary of the study area, then selects a level of significance. The tool calculates the R scale, standard error and Z score by calling a PostGIS function and compares the result with the significance interval to check whether the observed distribution is significantly different from a random pattern. In the example, the Z score lies within the interval (−1.96SE, 1.96SE) at the 5 % significance level, which means that the observed and random patterns are not statistically different (Fig. 5).
A function was built in PostGIS to calculate the sum of nearest distances for the selected points that lie spatially within the area of interest. The final sum of nearest distances is used in the R test to check whether the observation is significantly different from a random distribution.
- St_contain (geomA, geomB): This function selects all geomA (malaria points) that lie spatially inside the bounding box geomB (the box drawn by the user).
- St_sum, St_Distance: These functions are used to calculate the shortest distances between malaria points and to sum up the nearest distances.
With all measured parameters, the tool also calculates the D statistic and the critical value Dα for the K–S test by calling a PostGIS function. In the example, the D statistic is greater than Dα at α = 5 %, which means that the two distributions are significantly different in a statistical sense (Fig. 5a). As shown in the figure, we defined an area in Lam Dong province, and the result was considered to be a non-random pattern at the α = 0.05 level.
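The R scale, standard error and Z score described above can be sketched outside the database as follows. The text does not give the formulas used in the PostGIS function, so this example assumes the standard Clark and Evans form of the nearest neighbour statistic, with planar coordinates and no edge correction; the function name and the sample points are hypothetical.

```python
import math


def nearest_neighbour_test(points, area):
    """Clark-Evans nearest neighbour statistic for (x, y) points.

    area is the area of the user-drawn box, in squared coordinate units.
    Returns the R scale, its standard error and the Z score.
    """
    n = len(points)
    # Sum of the distance from each point to its nearest neighbour.
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        total += min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
    r_obs = total / n                        # observed mean nearest distance
    r_exp = 0.5 / math.sqrt(n / area)        # expected mean under a random pattern
    se = 0.26136 / math.sqrt(n * n / area)   # standard error of r_exp
    return {"R": r_obs / r_exp, "SE": se, "Z": (r_obs - r_exp) / se}


# Hypothetical example: four points on a regular grid inside a 2 x 2 box.
result = nearest_neighbour_test([(0, 0), (1, 0), (0, 1), (1, 1)], area=4.0)
print(result)
# At the 5 % level, |Z| > 1.96 indicates a departure from a random pattern.
print(abs(result["Z"]) > 1.96)
```

An R value below 1 points towards clustering and a value above 1 towards dispersion; the Z score decides whether the departure from 1 is significant at the chosen level.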
K-function
One of the limitations of the nearest neighbour distance method is that it uses only the nearest distance, so it considers only the shortest scales of variation. The K-function provides an estimate of spatial clustering over a wider range of scales. This higher-order analysis is based on all the distances between events in the study area and assumes isotropy over the region. The tool starts with drawing a bounding box to select malaria points and continues with selecting a distance, or spatial lag. For each point in the selected area, a buffer with a radius equal to the spatial lag is drawn, and the sum of the numbers of points within each buffer is calculated.
We built a web-based tool in which users can interactively set an area of interest and a value for the initial distance (or spatial lag) h. We use functions in PostGIS to create a buffer for each point and to calculate the sum of the numbers of points that lie spatially within the buffers. Results are stored in an array for graph display and for the spatial homogeneity test. The final sum of the nearest distances is used in the R test to check whether the observation is significantly different from a random distribution. K(h) is then plotted against different values of h to give a paired comparison between the observed and random patterns.
- St_buffer (geom, spatial lag): This function creates a buffer for each point with a radius defined by the spatial lag.
- St_contain (geomA, geomB): This function selects all geomA (malaria points) that lie spatially inside the bounding box geomB (the box drawn by the user).
- St_sum, St_Distance: These functions are used to calculate the shortest distances between malaria points and to sum up the nearest distances.
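The buffer-and-count procedure above can be sketched as follows. This is a minimal illustration, not the PostGIS implementation: it assumes the common estimator K(h) = A/n² × (number of ordered point pairs within distance h), planar coordinates and no edge correction, and it compares the result with πh², the value expected under a random pattern. The function name and sample points are hypothetical.

```python
import math


def k_function(points, area, lags):
    """Estimate K(h) for each spatial lag h.

    points are (x, y) tuples inside the user-drawn box of the given area.
    Returns a list of (h, observed K, expected K under randomness).
    """
    n = len(points)
    results = []
    for h in lags:
        # For every point, count the other points inside its buffer of radius h.
        count = 0
        for i, (xi, yi) in enumerate(points):
            for j, (xj, yj) in enumerate(points):
                if i != j and math.hypot(xi - xj, yi - yj) <= h:
                    count += 1
        k_obs = area * count / (n * n)
        k_random = math.pi * h * h
        results.append((h, k_obs, k_random))
    return results


# Hypothetical example: two close points and one distant point in a 10 x 10 box.
for h, k_obs, k_random in k_function([(0, 0), (1, 0), (10, 10)], 100.0, [1.5, 3.0]):
    print(h, round(k_obs, 2), round(k_random, 2))
```

Where the observed K(h) lies above πh² the pattern is more clustered than random at that lag, and where it lies below, more dispersed, which is what the plot of K(h) against h conveys.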
Spatial autocorrelation
Neither the nearest neighbour analysis nor the K-function takes the attributes of points into consideration, yet different geographic locations rarely have identical characteristics. The data here are aggregated, so one location might carry information on the malaria incidence of a whole area (in this case, areas can be at commune or district level). The number of cases in an area can therefore be used as the weight of a point in spatial analysis. Spatial autocorrelation considers the effects of both the distances between points and their attributes. Moran's I index is used to measure the proximity of locations together with the similarity of those locations. In this study we developed a PostGIS tool to measure Moran's I, in which the data can be either continuous or count data.
Moran's I coefficient of autocorrelation quantifies the similarity of an outcome variable among points that are defined as spatially related. It measures the proximity of disease points and the similarity of the characteristics of those points (Wong 2005). The tool starts with drawing the study area and selecting a level of significance. It then calculates the E(I) and Z(I) values by spatially processing the distances between points through a call to a PostGIS function. Moran's I is defined in Eq. (1). In this example, Z(I) lies within the significance interval, so the observed distribution is not statistically different from a random pattern.
$$\text{Moran's } I = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (x_{i} - \overline{x})(x_{j} - \overline{x})}{s^{2} \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}}$$
(1)
in which \(s^{2} = \frac{\sum_{i=1}^{n} (x_{i} - \overline{x})^{2}}{n}\), \(x_{i}\) is the attribute value at point i, \(\overline{x}\) is the mean of the attribute values, \(w_{ij}\) is the weight between points i and j, and n = number of points.
- St_contain (geomA, geomB): This function selects all geomA (malaria points) that lie spatially inside the bounding box geomB (the box drawn by the user).
- St_Distance: This function is used to calculate the distances between malaria points. The weight between points i and j is calculated as wij = 1/St_distance (geomi, geomj), and wii is equal to 0.
- St_sum: This function sums up the attribute values of all points.
The expected value is E(I) = −1/(n − 1), where n is the number of points. Clustered, random and dispersed patterns are detected when the calculated Moran's I value is higher than, equal to or smaller than E(I), respectively. In this example, the calculated score of 0.0213 was a bit higher than the expected E(I) of −0.185, so the malaria points showed a slight statistically significant departure from a random pattern.
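The calculation can be illustrated with a short sketch. It uses the conventional form of Moran's I, the sum of w_ij (x_i − x̄)(x_j − x̄) over all pairs divided by s² times the sum of w_ij, with the inverse-distance weights described above (w_ij = 1/distance, w_ii = 0). The function name and sample data are hypothetical, coordinates are assumed planar, and coincident points are assumed absent because they would give a zero distance.

```python
import math


def morans_i(points):
    """Moran's I with inverse-distance weights.

    points is a list of (x, y, value) tuples, where value is the attribute
    (for example the number of cases) at that location.
    Returns (I, E(I)).
    """
    n = len(points)
    mean = sum(v for _, _, v in points) / n
    s2 = sum((v - mean) ** 2 for _, _, v in points) / n
    numerator = 0.0
    weight_sum = 0.0
    for i, (xi, yi, vi) in enumerate(points):
        for j, (xj, yj, vj) in enumerate(points):
            if i == j:
                continue  # w_ii = 0
            w = 1.0 / math.hypot(xi - xj, yi - yj)
            numerator += w * (vi - mean) * (vj - mean)
            weight_sum += w
    expected = -1.0 / (n - 1)
    return numerator / (s2 * weight_sum), expected


# Hypothetical example: four locations on a line with case counts 1, 1, 5, 5.
i_value, expected = morans_i([(0, 0, 1), (1, 0, 1), (2, 0, 5), (3, 0, 5)])
print(round(i_value, 4), round(expected, 4))
print("clustered" if i_value > expected else "dispersed or random")
```

Comparing the returned I with E(I) gives the direction of the pattern; a significance statement additionally needs the variance of I to form Z(I), which this sketch leaves out.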
The choice of geographic scale for clustering analysis affects the results. Objects may be detected as clustered, dispersed or random depending on the scale (in this case, the zoom scale) defined before the analysis. The data were geocoded using reported addresses. However, where correct addresses were not documented, cases were aggregated to communes or even to districts. This effect is known as the modifiable areal unit problem, and it can radically affect the analysis results.
The examples describe three spatial detection tools that compare the observed distribution with a random pattern of points or, more specifically, a random pattern of disease cases. These functionalities are powerful testing tools for conveying messages and supporting decision making. By zooming in and out in combination with the clustering tools, users define the boundary of the study area and see how the analytical results change with zoom scale. Nearest neighbour analysis can be used as a global test of clustering: it measures distances between points and compares these distances with those of a known pattern. It has been extended (in comparison to quadrat statistics) to accommodate second, third and higher order neighbourhood definitions. The results of this analysis vary with spatial scale, because some patterns look clustered at a small scale but seem dispersed at a large scale. If the bounding box is too small, each polygon may contain only a couple of points. On the other hand, if it is too large, each polygon contains many points. This measure does not take into account that points may differ in their characteristics.
The second analysis takes into consideration the distances among points and results in a plot describing where patterns are clustered or dispersed. The third one measures the proximity of locations while considering the effects of locational characteristics. However, this test assumes that the population at risk is evenly distributed within the study area and that correlation is the same in all directions (Moran 1950). This is a drawback of the Moran's I test. All spatial processing for the three analyses is done in PostGIS, and the measured statistics are sent back to the client. Through the web-based pattern detection tools, end users can interactively define (draw) the area for measurement. The tools provide three straightforward measurements of clustering and dispersion that give a preliminary assessment of disease distribution. More spatial tests should be incorporated in the future.