The association in a two-way contingency table through log odds ratio analysis: the case of Sarno river pollution

Camminatiello, Ida; D’Ambra, Antonello; Sarnacchiaro, Pasquale

doi:10.1186/2193-1801-3-384

Methodology
Open access
Published: 28 July 2014

Ida Camminatiello¹,
Antonello D’Ambra¹ &
Pasquale Sarnacchiaro²

SpringerPlus volume 3, Article number: 384 (2014)

Abstract

In this paper we propose a general framework for the analysis of the complete set of log Odds Ratios (ORs) generated by a two-way contingency table. Starting from the RC (M) association model and hypothesizing a Poisson distribution for the counts of the two-way contingency table, we obtain the weighted Log Ratio Analysis, which we extend to the study of log ORs. In particular, we obtain an indirect representation of the log ORs and some synthesis measures. Then, to study the matrix of log ORs, we perform a generalized Singular Value Decomposition that allows us to obtain a direct representation of the log ORs. We also obtain summary measures of association. We consider the matrix of the complete set of ORs because it is linked to the two-way contingency table in terms of variance and it allows us to represent all the ORs on a factorial plane. Finally, a two-way contingency table, which crosses pollution of the Sarno river and sampling points, is analyzed to illustrate the proposed framework.

Introduction

Polycyclic aromatic hydrocarbons (PAHs) are a group of lipophilic contaminants widespread in the environment. This class of compounds has been widely studied (Tolosa et al., 1995; Caricchia et al., 1993) because of its carcinogenic and mutagenic properties (Lehr and Jerima, 1997; Yan, 1985; White 1986).

PAHs are produced by both anthropogenic and natural processes and can be introduced into the environment through various routes. Anthropogenic inputs can originate from the incomplete combustion of organic matter (pyrolytic) and the discharge of crude oil-related material (petrogenic). PAHs can also originate from natural processes such as short-term diagenetic degradation of biogenic precursors (diagenesis). Each source (i.e. pyrolytic, petrogenic and diagenetic) gives rise to characteristic PAH patterns. Currently, the interest in multivariate statistical methodologies for identifying the main sources of PAH pollution and for quantifying the incidence of each source of pollution on total pollution levels, particularly in coastal environments, is increasing (Luo et al., 2006; Bihari et al., 2007; Xu et al., 2007; Sarnacchiaro et al., 2012). This study is part of a large project which has the objective of enhancing the knowledge of pollution in the Sarno River and its environmental impact on the gulf of Naples. This project has attempted to assess the pollution derived from local industries, agriculture and urban impact (Sarnacchiaro et al., 2012). In the present work, we have studied the association between the level of PAH pollution and the sampling points.

The analysis of association for variables placed in an I × J two-way contingency table is a widely discussed topic. In this paper we focus our attention on the Odds Ratios (ORs) as a measure of association. In a two-way contingency table the total number of ORs that can be computed may be too large; for their synthesis, four main alternative or complementary strategies have been proposed. The first consists in the computation of statistical measures (Altham, 1970). The second is based on the construction of a model for the frequencies and studies the ORs through the interaction between the row and column variables. The log-linear model for two-way contingency tables belongs to this class. The third solution is the RC (M) association model (Goodman, 1985), which is more parsimonious than the usual log-linear model (Choulakian, 1988). The fourth strategy takes into consideration the Singular Value Decomposition (SVD) of the matrix containing the basic set of log ORs (de Rooij and Anderson 2007).

In this paper we propose a general framework for the analysis of the complete set of log ORs generated by a two-way contingency table. The road map of the general framework is the following: we start from the RC (M) association model and hypothesize a Poisson distribution for the counts of the two-way contingency table. Then, for the parameter estimation of the RC (M) model, we use an alternative approach based on least squares. The matrix for the estimation of the bilinear part of RC (M) is linked to the matrix used in log-ratio analysis (Greenacre, 2009); moreover, we extend log-ratio analysis (LRA) to the study of log ORs. Then we connect these methodologies with de Rooij and Anderson’s approach and introduce some important properties that give deeper knowledge of the association between variables through ORs. Unlike de Rooij and Anderson, in our approach we consider the matrix of the complete set of ORs because, as we will show, this matrix is linked to the two-way contingency table and it allows us to represent all the ORs on a factorial plane. By comparison, the spanning cell odds ratios are useful when one of the categories defines a control or reference group. In that case, all other categories are described against this reference group. The local odds ratios are useful for ordinal variables when all local odds ratios are larger than or equal to 1 (de Rooij and Anderson 2007).

Materials and methods

The research plan - study area, sampling points

Nicknamed “the most polluted river in Europe”, the Sarno River originates in south-western Italy and has a watershed of about 715 km². An intensive sampling campaign was conducted in the spring of 2008. Surface sediment samples were collected at four locations along the Sarno river (near the source of the river, just before and after the junction with Alveo Comune and at the river mouth) and nine points in the continental shelf around the river mouth (three points, one for each direction North-West, West and South-West, were sampled 50 m from the Sarno River mouth, another three points 150 m away and, finally, another three points 500 m from the river mouth). The collected data were arranged in a two-way contingency table. The row variable is TPAHs (X) with three categories: Low (L), Medium (M), High (H) and column variable is the sampling points (Y) with five categories: Source (S), River (R), 50 m from the Sarno River mouth (50 m), 150 m from the Sarno River mouth (150 m), 500 m from the Sarno River mouth (500 m).

Log-ratio and log odds ratio analysis

Notations

Let N = (n_ij) be a two-way contingency table that cross-classifies n units according to the I row categories and J column categories of the variables X and Y, respectively. Let X_i and Y_j be the i-th and j-th categories of X and Y, and let π_ij be the probability that X = X_i and Y = Y_j. The matrix of proportions is denoted by P = n^{−1}N, with general term p_ij. The marginal relative frequencies of the i-th row and j-th column of P are p_i• and p_•j, and they may be represented in vector or matrix form. In this paper, the vector r (resp. c) has the p_i• (resp. p_•j) as elements, while D_r (resp. D_c) is the diagonal matrix of these quantities.

Let $O R_{ii' jj'} = \frac{n_{ij} n_{i' j'}}{n_{i' j} n_{ij'}} (1 \leq i < i' \leq I; 1 \leq j < j' \leq J)$ be the OR. The complete set of ORs for table N is composed of [I(I − 1)/2] × [J(J − 1)/2] ORs and can be placed in a two-way table, called $S = [s_{\tilde{i} \tilde{j}}]$ , of dimension $\tilde{I} \times \tilde{J}$ , where $\tilde{I} = I (I - 1) / 2$ and $\tilde{J} = J (J - 1) / 2$ .
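The construction of the table of log ORs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 3 × 5 table of counts is hypothetical (it is not the Sarno data), the function name `complete_log_odds_ratios` is ours, and all counts are assumed strictly positive so that the logarithms exist.

```python
from itertools import combinations

import numpy as np

# Hypothetical 3 x 5 table of counts (illustrative only, not the Sarno data).
N = np.array([[20.0, 35.0, 12.0, 10.0, 25.0],
              [5.0, 30.0, 18.0, 15.0, 8.0],
              [2.0, 25.0, 6.0, 5.0, 3.0]])


def complete_log_odds_ratios(table):
    """Return the matrix of all log ORs plus the row and column pair labels."""
    log_n = np.log(np.asarray(table, dtype=float))  # needs strictly positive counts
    n_rows, n_cols = log_n.shape
    row_pairs = list(combinations(range(n_rows), 2))  # all (i, i') with i < i'
    col_pairs = list(combinations(range(n_cols), 2))  # all (j, j') with j < j'
    log_s = np.empty((len(row_pairs), len(col_pairs)))
    for a, (i, i2) in enumerate(row_pairs):
        for b, (j, j2) in enumerate(col_pairs):
            # log OR = log n_ij + log n_i'j' - log n_i'j - log n_ij'
            log_s[a, b] = log_n[i, j] + log_n[i2, j2] - log_n[i2, j] - log_n[i, j2]
    return log_s, row_pairs, col_pairs


log_S, row_pairs, col_pairs = complete_log_odds_ratios(N)
print(log_S.shape)  # (3, 10): I(I-1)/2 = 3 row pairs, J(J-1)/2 = 10 column pairs
```

For a 3 × 5 table the complete set therefore contains 30 log ORs, one for each combination of a row pair and a column pair.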

From association model to log Ratio Analysis

The association models (Goodman, 1985) are widely used to analyse two-way contingency tables. The first proposed version was the RC (1) association model (Goodman, 1979), then it was extended to the RC (M) association model to decompose the symmetric association into M components (Goodman, 1985). If M = min[(I − 1), (J − 1)], this model is called saturated. The RC (M) association model is given by

π_{ij} = α_{i} β_{j} exp (\sum_{m = 1}^{M} φ_{m} μ_{im} ν_{jm})

where μ_im and ν_jm are X_i and Y_j scores on dimension m (standard coordinates), φ_m is a measure of the strength of the association between X and Y, α_i and β_j are the main effects of X and Y, respectively. With respect to the scores, the following constraints are assumed: $\sum_{i = 1}^{I} π_{i •} μ_{im} = \sum_{j = 1}^{J} π_{• j} ν_{jm} = \sum_{i = 1}^{I} π_{i •} μ_{im} μ_{im'} = \sum_{j = 1}^{J} π_{• j} ν_{jm} ν_{jm'} = 0$ , and $\sum_{i = 1}^{I} π_{i •} μ_{im}^{2} = \sum_{j = 1}^{J} π_{• j} ν_{jm}^{2} = 1$ where $π_{i •} = \sum_{j = 1}^{J} π_{ij}$ and $π_{• j} = \sum_{i = 1}^{I} π_{ij}$ .

Assuming the previous constraints and that the distribution of counts within IJ categories is a multinomial distribution with parameters n and π_ij, the parameter estimation is computed by the maximum likelihood method.

An alternative estimation method is based on the least squares procedure. Let N_ij ~ Po(τ_ij), with τ_ij = nπ_ij, be a random variable. Applying the logarithmic transformation, we consider the difference log(N_ij/n) − log(π_ij).

Replacing the random variable with its sample values and considering the RC (M) association model we have

\log (p_{ij}) = \log ({\hat{α}}_{i}) + \log ({\hat{β}}_{j}) + \sum_{m = 1}^{M} λ_{m} u_{im} v_{jm}

Replacing the probabilities with the observed frequencies, and taking into account the constraints and the condition $\sum_{i = 1}^{I} \log ({\hat{α}}_{i}) p_{i •} = 0$ , we estimate the parameters log(β_j) and log(α_i) as follows: $log ({\hat{β}}_{j}) = \sum_{i = 1}^{I} p_{i •} log (p_{ij})$ and $log ({\hat{α}}_{i}) = \sum_{j = 1}^{J} p_{• j} log (p_{ij}) - \sum_{i = 1}^{I} \sum_{j = 1}^{J} p_{i •} p_{• j} log (p_{ij})$ .

The estimation of the bilinear part is obtained through the least squares method (D’Ambra, 1988; Escoufier and Junga 1986), minimizing the quantity

\min [\sum_{i = 1}^{I} \sum_{j = 1}^{J} p_{i •} p_{• j} {(a_{ij} - \sum_{m = 1}^{M} λ_{m} u_{im} v_{jm})}^{2}]

where $\begin{array}{l} a_{ij} = log (p_{ij}) - log ({\hat{β}}_{j}) - log ({\hat{α}}_{i}) = log (p_{ij}) - \sum_{i}^{I} p_{i •} log (p_{ij}) - \\ \sum_{j}^{J} p_{• j} log (p_{ij}) + \sum_{i}^{I} p_{i •} \sum_{j}^{J} p_{• j} log (p_{ij}) \end{array}$ .
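The main-effect estimates and the matrix A = (a_ij) can be computed directly from the proportions. The following is a minimal sketch under assumptions: the table of counts is hypothetical (not the Sarno data) and the variable names are ours.

```python
import numpy as np

# Hypothetical 3 x 5 table of counts (illustrative only, not the Sarno data).
N = np.array([[20.0, 35.0, 12.0, 10.0, 25.0],
              [5.0, 30.0, 18.0, 15.0, 8.0],
              [2.0, 25.0, 6.0, 5.0, 3.0]])

P = N / N.sum()
r = P.sum(axis=1)  # row masses p_i.
c = P.sum(axis=0)  # column masses p_.j
log_P = np.log(P)

log_beta = r @ log_P                 # log(beta_j) = sum_i p_i. log p_ij
grand_mean = r @ log_P @ c           # sum_i sum_j p_i. p_.j log p_ij
log_alpha = log_P @ c - grand_mean   # log(alpha_i)

# a_ij = log p_ij - log(beta_j) - log(alpha_i): the weighted double-centred logs.
A = log_P - log_alpha[:, None] - log_beta[None, :]

print(np.allclose(r @ A, 0.0), np.allclose(A @ c, 0.0))  # weighted margins vanish
```

The weighted row and column means of A are zero, which is the sense in which a_ij plays the role of an interaction residual.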

We have noted that a_ij is equivalent to the residual of the two-way analysis of variance. The same matrix A = (a_ij), used in RC (M) association model, is analysed in Log-ratio analysis (Greenacre, 2009). Greenacre, starting from Correspondence Analysis (CA) and using Box-Cox transformation of $p_{ij}^{α}$ (with α → 0) applied a SVD on the following matrix

\begin{array}{l} Z = D_{r}^{1 / 2} (I - 1 r^{T}) L (D_{r}^{- 1 / 2} P D_{c}^{1 / 2}) {(I - 1 c^{T})}^{T} D_{c}^{1 / 2} \\ = D_{r}^{1 / 2} (I - 1 r^{T}) L (N) {(I - 1 c^{T})}^{T} D_{c}^{1 / 2} \\ = D_{r}^{1 / 2} A D_{c}^{1 / 2} \end{array}

where L denotes the logarithmic transformation. A comparison among CA, weighted LRA and RC (M), based on their different centring systems, has been carried out (Greenacre and Lewi 2009). By analogy, the weighting system of weighted LRA is the same as that of CA. This choice can be justified more rigorously as follows.

Considering N_ij, when τ_ij → + ∞, $\frac{(N_{ij} - τ_{ij})}{\sqrt{τ_{ij}}}$ is a random variable with standard normal distribution^a. If we consider the random variable $\sqrt{τ_{ij}} [log (N_{ij}) - log (τ_{ij})] = \sqrt{τ_{ij}} log [1 + \frac{N_{ij} - τ_{ij}}{τ_{ij}}]$ and apply the Taylor series, $\frac{N_{ij} - τ_{ij}}{τ_{ij}}$ provides a useful approximation to $log [1 + \frac{N_{ij} - τ_{ij}}{τ_{ij}}]$ when $|\frac{N_{ij} - τ_{ij}}{τ_{ij}}| < 1$ , since $log [1 + \frac{N_{ij} - τ_{ij}}{τ_{ij}}] = \sum_{t = 1}^{\infty} \frac{{(- 1)}^{t + 1}}{t} {[\frac{N_{ij} - τ_{ij}}{τ_{ij}}]}^{t} \approx \frac{N_{ij} - τ_{ij}}{τ_{ij}}$ . Observing that $\sqrt{τ_{ij}} \frac{N_{ij} - τ_{ij}}{τ_{ij}} = \frac{N_{ij} - τ_{ij}}{\sqrt{τ_{ij}}} \sim N (0, 1)$ , it follows that $\sqrt{τ_{ij}} [log (N_{ij}) - log (τ_{ij})] = \sqrt{τ_{ij}} log [1 + \frac{N_{ij} - τ_{ij}}{τ_{ij}}] \sim N (0, 1)$ approximately, so that E[log(N_ij)] ≅ log(τ_ij) and Var[log(N_ij)] ≅ 1/τ_ij. Under the independence hypothesis τ_ij can be estimated by np_i•p_•j, justifying the weighting system based on the row and column marginal totals of P.
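The approximation Var[log(N_ij)] ≅ 1/τ_ij can be checked by simulation. The sketch below is illustrative only: the value τ = 400, the sample size and the seed are arbitrary choices of ours, with τ taken large enough that zero counts are practically impossible.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice
tau = 400.0                     # hypothetical Poisson mean, large so that N > 0

samples = rng.poisson(tau, size=200_000)
log_samples = np.log(samples)

# E[log N] should be close to log(tau), and Var[log N] close to 1 / tau.
mean_gap = log_samples.mean() - np.log(tau)
var_ratio = log_samples.var() * tau

print(f"E[log N] - log(tau) = {mean_gap:.5f}")
print(f"Var[log N] * tau    = {var_ratio:.4f}")
```

A ratio close to 1 in the second line supports weighting each squared residual by an estimate of τ_ij, that is, by p_i•p_•j up to the constant n.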

Given the foregoing, the association between the categories of the X and Y variables can be studied by performing an SVD of the matrix Z, double-centred with respect to p_i• and p_•j: $Z = UΛ V^{T} = \sum_{m = 1}^{M} u_{m} λ_{m} v_{m}^{T}$ , where M = rank(Z) = min[(I − 1), (J − 1)] and U^TU = V^TV = I; u_m is the m-th column of U, v_m is the m-th column of V, and the singular values down the diagonal of Λ are in descending order λ₁ ≥ λ₂ ≥ … ≥ λ_M. The total variance in weighted LRA can be written in terms of the complete set of the log ORs

tr (Z^{T} Z) = \sum_{i < i'} \sum_{j < j'} p_{i •} p_{i' •} p_{• j} p_{• j'} {[log O R_{ii' jj'}]}^{2} .

The principal and standard coordinates for rows and columns are computed as follows:

$F = D_{r}^{- 1 / 2} UΛ$ , $G = D_{c}^{- 1 / 2} VΛ$ , $\tilde{F} = D_{r}^{- 1 / 2} U$ , $\tilde{G} = D_{c}^{- 1 / 2} V$ .
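The decomposition, the coordinates and the total-variance identity can be verified numerically. This is a sketch under assumptions: the table of counts is hypothetical (not the Sarno data), the variable names are ours, and NumPy's `np.linalg.svd` stands in for whatever SVD routine is actually used.

```python
from itertools import combinations

import numpy as np

# Hypothetical 3 x 5 table of counts (illustrative only, not the Sarno data).
N = np.array([[20.0, 35.0, 12.0, 10.0, 25.0],
              [5.0, 30.0, 18.0, 15.0, 8.0],
              [2.0, 25.0, 6.0, 5.0, 3.0]])
P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
log_P = np.log(P)

# Weighted double-centring gives A; then Z = D_r^{1/2} A D_c^{1/2}.
A = log_P - (log_P @ c)[:, None] - (r @ log_P)[None, :] + r @ log_P @ c
Z = np.sqrt(r)[:, None] * A * np.sqrt(c)[None, :]

U, lam, Vt = np.linalg.svd(Z, full_matrices=False)
M = min(N.shape) - 1                 # rank of Z: min(I - 1, J - 1)
U, lam, V = U[:, :M], lam[:M], Vt[:M].T

F_std = U / np.sqrt(r)[:, None]      # standard row coordinates
G_std = V / np.sqrt(c)[:, None]      # standard column coordinates
F = F_std * lam                      # principal row coordinates
G = G_std * lam                      # principal column coordinates

# Total variance versus the weighted sum of squared log ORs over i < i', j < j'.
total_variance = np.trace(Z.T @ Z)
weighted_log_or_sum = sum(
    r[i] * r[i2] * c[j] * c[j2]
    * (log_P[i, j] + log_P[i2, j2] - log_P[i2, j] - log_P[i, j2]) ** 2
    for i, i2 in combinations(range(N.shape[0]), 2)
    for j, j2 in combinations(range(N.shape[1]), 2)
)
print(np.isclose(total_variance, weighted_log_or_sum))
```

The log ORs are computed here from P rather than N; the two give the same values because the factor n cancels in every ratio.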

Setting r = (1/I)1 and c = (1/J)1 we obtain the unweighted Log Ratio Analysis (Aitchison 1990)^b:

Z^{U} = {(IJ)}^{- \frac{1}{2}} (I - (1 / I) 1 1^{T}) L (N) (I - (1 / J) 1 1^{T})

The unweighted LRA decomposes Altham’s measure for Q = 2 on the complete set of ORs; in fact:

tr ({(Z^{U})}^{T} Z^{U}) = {(\frac{1}{\tilde{I} \tilde{J}} {\sum_{\tilde{i}} \sum_{\tilde{j}} [log s_{\tilde{i} \tilde{j}}]}^{Q})}^{1 / 2}

Altham’s index is an association measure based on the ORs. It can also be defined on the local odds ratios and on the spanning cell odds ratios. For a detailed discussion of this measure, see Edwardes and Baltzan (2000).

Weighted LRA properties

The weighted LRA preserves the fundamental properties of classical CA: coordinate properties, distance measures, a reconstitution formula and the rank of the decomposed matrix. It is a powerful tool for analysing compositional data (Aitchison and Greenacre 2002). Here we apply the weighted LRA to the specific case of log ORs. As in RC (M) models (de Rooij and Heiser 2005), the row and column coordinates of the weighted LRA satisfy the following two important properties:

O R_{ii' jj'} = exp (\sum_{m = 1}^{M} λ_{m} ({\tilde{f}}_{im} - {\tilde{f}}_{i' m}) ({\tilde{g}}_{jm} - {\tilde{g}}_{j' m}))

(1)

O R_{ii' jj'} = exp (\frac{1}{2} d^{2} (f_{i}, g_{j'}) + \frac{1}{2} d^{2} (f_{i'}, g_{j}) - \frac{1}{2} d^{2} (f_{i}, g_{j}) - \frac{1}{2} d^{2} (f_{i'}, g_{j'}))

(2)

where d²(f_i, g_j) is the squared Euclidean distance between the points with coordinates f_i and g_j on the M dimensions. Thanks to these properties, the factorial representation of the weighted LRA can be explained both in terms of the inner product rule (type I) and of the distance rule (type II). For a type I representation, at least one coordinate set should be drawn using vectors, and the points of the other set projected on these vectors to represent the relationship. For type II, the categories of both sets can be represented by points in Euclidean space, with the distance between the points describing the relationship between the categories of the two sets. These properties make it possible to visualize the categories and the log ORs in the same factorial plane. Unfortunately, these important properties do not hold for unweighted LRA.
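Both rules can be checked numerically for any single OR. The sketch below rests on assumptions of ours: the table of counts is hypothetical (not the Sarno data), the chosen categories are arbitrary, and for the distance rule both sets of coordinates are scaled by the square roots of the singular values, so that their inner product reproduces the sum in formula (1).

```python
import numpy as np

# Hypothetical 3 x 5 table of counts (illustrative only, not the Sarno data).
N = np.array([[20.0, 35.0, 12.0, 10.0, 25.0],
              [5.0, 30.0, 18.0, 15.0, 8.0],
              [2.0, 25.0, 6.0, 5.0, 3.0]])
P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
log_P = np.log(P)
A = log_P - (log_P @ c)[:, None] - (r @ log_P)[None, :] + r @ log_P @ c
Z = np.sqrt(r)[:, None] * A * np.sqrt(c)[None, :]

U, lam, Vt = np.linalg.svd(Z, full_matrices=False)
M = min(N.shape) - 1
F_std = U[:, :M] / np.sqrt(r)[:, None]    # standard row coordinates
G_std = Vt[:M].T / np.sqrt(c)[:, None]    # standard column coordinates
lam = lam[:M]

i, i2, j, j2 = 0, 2, 1, 4                 # arbitrary categories i, i', j, j'
log_or = np.log(N[i, j] * N[i2, j2] / (N[i2, j] * N[i, j2]))

# Inner product rule, formula (1).
inner_rule = np.sum(lam * (F_std[i] - F_std[i2]) * (G_std[j] - G_std[j2]))

# Distance rule, formula (2), with symmetric scaling (an assumption of this sketch).
Fs, Gs = F_std * np.sqrt(lam), G_std * np.sqrt(lam)


def d2(a, b):
    """Squared Euclidean distance between two coordinate vectors."""
    return np.sum((a - b) ** 2)


distance_rule = 0.5 * (d2(Fs[i], Gs[j2]) + d2(Fs[i2], Gs[j])
                       - d2(Fs[i], Gs[j]) - d2(Fs[i2], Gs[j2]))

print(np.isclose(log_or, inner_rule), np.isclose(log_or, distance_rule))
```

The squared norms cancel in the distance rule, leaving only the cross products, which is why the two rules agree.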

Let ${\tilde{f}}_{i *}$ and ${\tilde{g}}_{j *}$ be the baseline row and column score vectors, respectively. Taking as baselines the averages with respect to i and j, which in this case are zero vectors ( ${\tilde{f}}_{i *} = 0$ and ${\tilde{g}}_{j *} = 0$ ), and taking into account formula (1), the OR of the (i, j)-th pair of categories with respect to the baseline (Eshima et al., 2001) can be defined as

O R_{{\tilde{f}}_{i} 0 {\tilde{g}}_{j} 0} = exp (\sum_{m = 1}^{M} λ_{m} {\tilde{f}}_{im} {\tilde{g}}_{jm})

This OR is theoretical and can be interpreted as the contribution of the (i, j)-th pair of categories to the association between the X and Y variables. Taking the logarithm and using vector notation, the previous quantity can be written as

log O R_{{\tilde{f}}_{i} 0 {\tilde{g}}_{j} 0} = {\tilde{f}}_{i} Λ {\tilde{g}}_{j}^{T} .

Denoting by $\bar{S}$ the matrix of dimension I × J, whose generic element is $log O R_{{\tilde{f}}_{i} 0 {\tilde{g}}_{j} 0}$ , we have $\bar{S} = \tilde{F} Λ {\tilde{G}}^{T}$

In order to compute a synthesis measure of the complete set of ORs, the OR mean (Me) can be calculated by using formula (2)

\begin{array}{l} Me (O R_{ii' jj'}) = \sum_{i = 1}^{I} \sum_{j = 1}^{J} \sum_{m = 1}^{M} exp (\frac{1}{2} [{(f_{i' m} - g_{jm})}^{2} p_{i' j} \\ + {(f_{im} - g_{j' m})}^{2} p_{ij'} - {(f_{im} - g_{jm})}^{2} p_{ij} - {(f_{i' m} - g_{j' m})}^{2} p_{i' j'}]) \end{array}

Replacing f_{i'm} and g_{j'm} by their means with respect to i’ and j’, in this case zero vectors, we have:

\begin{array}{c} Me (O R_{i 0 j 0}) = \sum_{i = 1}^{I} \sum_{j = 1}^{J} \sum_{m = 1}^{M} exp (\frac{1}{2} [{(0 - g_{jm})}^{2} p_{i' j} \\ + {(f_{im} - 0)}^{2} p_{ij'} - {(f_{im} - g_{jm})}^{2} p_{ij} - {(0 - 0)}^{2} p_{i' j'}]) \\ = \sum_{m = 1}^{M} \sum_{i = 1}^{I} \sum_{j = 1}^{J} f_{im} g_{jm} p_{ij} \end{array}

Using standard coordinates we obtain

Me (O R_{i 0 j 0}) = \sum_{m = 1}^{M} λ_{m} \sum_{i = 1}^{I} \sum_{j}^{J} {\tilde{f}}_{im} {\tilde{g}}_{jm} p_{ij} = \sum_{m = 1}^{M} λ_{m} ρ_{m}

(3)

In the RC (M) model, this quantity is expressed by the Kullback–Leibler information (Eshima et al., 2001). This property is also true for the weighted LRA:

\begin{array}{c} Me (O R_{i 0 j 0}) = \sum_{i = 1}^{I} \sum_{j = 1}^{J} p_{ij} log (p_{ij} / p_{i •} p_{• j}) \\ + \sum_{i = 1}^{I} \sum_{j = 1}^{J} p_{i •} p_{• j} log (p_{i •} p_{• j} / p_{ij}) \\ = \sum_{i = 1}^{I} \sum_{j = 1}^{J} p_{ij} log p_{ij} - \sum_{i = 1}^{I} \sum_{j = 1}^{J} p_{i •} p_{• j} log p_{ij} \\ = \sum_{i = 1}^{I} \sum_{j = 1}^{J} (p_{ij} - p_{i •} p_{• j}) log p_{ij} \end{array}

This quantity measures the departure from the assumption of independence. As Me(OR_{i 0j 0}) is expressed by the Kullback–Leibler information, the larger this mean is, the stronger the association between X and Y. Dividing this mean by the sum of the singular values, we obtain an index, based on the log ORs, for studying the relationship between the X and Y variables:

I (X, Y) = \sum_{m = 1}^{M} λ_{m} ρ_{m} / \sum_{m = 1}^{M} λ_{m}

while the quantity $c_{m} = λ_{m} ρ_{m} / \sum_{m = 1}^{M} λ_{m}$ represents the contribution of the m-th pair of coordinate vectors to the relationship between the X and Y variables.
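The mean in formula (3), its Kullback–Leibler form and the derived index can be computed together. This is a sketch under assumptions: the table of counts is hypothetical (not the Sarno data), the variable names are ours, and all dimensions are retained (saturated case).

```python
import numpy as np

# Hypothetical 3 x 5 table of counts (illustrative only, not the Sarno data).
N = np.array([[20.0, 35.0, 12.0, 10.0, 25.0],
              [5.0, 30.0, 18.0, 15.0, 8.0],
              [2.0, 25.0, 6.0, 5.0, 3.0]])
P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
log_P = np.log(P)
A = log_P - (log_P @ c)[:, None] - (r @ log_P)[None, :] + r @ log_P @ c
Z = np.sqrt(r)[:, None] * A * np.sqrt(c)[None, :]

U, lam, Vt = np.linalg.svd(Z, full_matrices=False)
M = min(N.shape) - 1
F_std = U[:, :M] / np.sqrt(r)[:, None]
G_std = Vt[:M].T / np.sqrt(c)[:, None]
lam = lam[:M]

# rho_m = sum_i sum_j f~_im g~_jm p_ij, one value per dimension m.
rho = np.einsum("im,ij,jm->m", F_std, P, G_std)

me = np.sum(lam * rho)                              # formula (3)
sym_kl = np.sum((P - np.outer(r, c)) * log_P)       # Kullback-Leibler form
index = me / lam.sum()                              # I(X, Y)
contributions = lam * rho / lam.sum()               # c_m for each dimension

print(np.isclose(me, sym_kl))
print(index, contributions)
```

The contributions c_m sum to I(X, Y), so they split the index across the retained dimensions.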

From LRA to log-odds Ratio Analysis

The weighted LRA permits the indirect representation of the log ORs. Starting from the de Rooij and Anderson approach (2007), we propose a methodology based on the singular value decomposition of the log OR matrix to obtain its direct representation. Summary measures of association are also obtained. Unlike de Rooij and Anderson, we apply the method to the matrix containing the complete set of log odds ratios because, as shown below, it is linked to LRA.

Let L(S) be a two-way table of dimension $\tilde{I} \times \tilde{J}$ containing the complete set of log ORs. In this table the rows (resp. columns) are formed by all pairs of categories of X (resp. Y). Let B and D be two square diagonal matrices of dimensions $\tilde{I}$ and $\tilde{J}$ , respectively, with general terms $1 / \tilde{I}$ and $1 / \tilde{J}$ . Performing an SVD of L(S) with the matrices B and D, we have:

B^{1 / 2} L (S) D^{1 / 2} = UΛ V^{T}

We have called this analysis Unweighted Log Odds Ratio Analysis (ULORA). ULORA is linked to unweighted LRA and, consequently, to the Altham measure:

\begin{array}{l} tr (B^{1 / 2} L (S) D L (S^{T}) B^{1 / 2}) = tr ({(Z^{U})}^{T} Z^{U}) \\ = {(\frac{1}{\tilde{I} \tilde{J}} \sum_{\tilde{i}} \sum_{\tilde{j}} [log s_{\tilde{i} \tilde{j}}])}^{2} . \end{array}

This method does not take into account the weight structures of the rows and columns. Let $\tilde{B}$ and $\tilde{D}$ be two square diagonal matrices of dimensions $\tilde{I}$ and $\tilde{J}$ , respectively, with general terms $p_{i •} p_{i' •}$ and $p_{• j} p_{• j'}$ . Performing an SVD of L(S) with the matrices $\tilde{B}$ and $\tilde{D}$ , we get a weighted analysis of the log OR matrix (WLORA):

{\tilde{B}}^{1 / 2} L (S) {\tilde{D}}^{1 / 2} = UΛ V^{T}

The coordinates are:

We obtain a factorial representation in which pairs of categories of X and Y are drawn. Following this approach, at least one coordinate set should be drawn using vectors, and the points of the other set projected on these vectors to represent the weighted ORs. In this case we show that:

\begin{array}{l} tr (Z^{T} Z) = tr ({\tilde{B}}^{1 / 2} L (S) \tilde{D} L (S^{T}) {\tilde{B}}^{1 / 2}) \\ = \sum_{i < i'} \sum_{j < j'} p_{i •} p_{i' •} p_{• j} p_{• j'} {[log O R_{ii' jj'}]}^{2} . \end{array}

Therefore both the weighted LRA and WLORA decompose the same synthetic measure of the log ORs, which is a weighted version of Altham’s measure.
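The link between WLORA and the weighted LRA can be illustrated numerically. The sketch below is built on assumptions: the table of counts is hypothetical (not the Sarno data), the contrast matrices `R_con` and `C_con` are a device of ours for building L(S), and only singular values are computed.

```python
from itertools import combinations

import numpy as np

# Hypothetical 3 x 5 table of counts (illustrative only, not the Sarno data).
N = np.array([[20.0, 35.0, 12.0, 10.0, 25.0],
              [5.0, 30.0, 18.0, 15.0, 8.0],
              [2.0, 25.0, 6.0, 5.0, 3.0]])
P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
row_pairs = list(combinations(range(N.shape[0]), 2))
col_pairs = list(combinations(range(N.shape[1]), 2))


def contrasts(pairs, size):
    """One row per pair (k, k'): +1 in position k and -1 in position k'."""
    out = np.zeros((len(pairs), size))
    for a, (k, k2) in enumerate(pairs):
        out[a, k], out[a, k2] = 1.0, -1.0
    return out


R_con = contrasts(row_pairs, N.shape[0])
C_con = contrasts(col_pairs, N.shape[1])
log_S = R_con @ np.log(N) @ C_con.T       # complete set of log ORs, L(S)

# WLORA: weights p_i. p_i'. for row pairs and p_.j p_.j' for column pairs.
b_w = np.array([r[i] * r[i2] for i, i2 in row_pairs])
d_w = np.array([c[j] * c[j2] for j, j2 in col_pairs])
sv_wlora = np.linalg.svd(np.sqrt(b_w)[:, None] * log_S * np.sqrt(d_w)[None, :],
                         compute_uv=False)

# ULORA: constant weights 1 / I~ and 1 / J~.
sv_ulora = np.linalg.svd(log_S / np.sqrt(log_S.size), compute_uv=False)

# Total variance of the weighted LRA, tr(Z^T Z), for comparison.
log_P = np.log(P)
A = log_P - (log_P @ c)[:, None] - (r @ log_P)[None, :] + r @ log_P @ c
total_lra = np.sum(r[:, None] * c[None, :] * A ** 2)

print(np.isclose(np.sum(sv_wlora ** 2), total_lra))
```

In ULORA the sum of the squared singular values is, by construction, the mean of the squared log ORs.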

Results and discussion

The association between TPAHs (X) and sampling points (Y) is significant, with Pearson’s chi-squared statistic equal to 687.017. The complete set of log ORs is computed (Table 1).

Table 1 Complete set of log ORs

The log ORs are very different from 0; therefore the association is confirmed. The complete set of log ORs can be summarized by computing the Altham index and formula (3); the results are 0.72158 and 0.62545, respectively.

Afterwards, the unweighted and weighted LRA of the two-way contingency table were performed. Two dimensions are retained, and the factorial representations are presented in Figure 1 and Figure 2. These representations show the categories of X and Y. In Figure 1 we can observe that the first axis opposes a low level of pollution to a medium-high level of pollution, with “Source” and “500 meters” associated with a low level of pollution and the other categories of Y (“River”, “50 meters” and “150 meters”) linked with a medium-high level of pollution. Figure 2, instead, appears more readable and interesting thanks to the effect of the weighting system. In fact, “Source” and “500 meters” remain at a low level of pollution, but the other group is further divided into two more homogeneous groups: “River”, associated with a high level of pollution, and “50-150 meters”, associated with a medium level of pollution.

In order to improve the data analysis we divided the data table into three sub-tables, in which the last three categories of Y were further subdivided into three sub-levels according to the direction of detection (North-West, West, South-West). This division was made because the level of pollution of the sea was found to be influenced by the direction of detection. In this paper we show only the South-West direction (results for the other directions are available from the authors). In the factorial representation of the unweighted LRA (Figure 3) three associations are clear: “500 meters” with a low level of pollution, “50 and 150 meters” with a medium level of pollution and “River” with a high level of pollution. The position of the category “Source” is ambiguous: it lies midway between the low and high levels of pollution. In our opinion, this ambiguity depends on the different values of the margins of the data table. In order to take this feature of the data into account, we performed the weighted LRA, which allows us to include a system of weights. The factorial representation (Figure 4) is better: the ambiguity of the category “Source” is resolved, and it is correctly associated with a low level of pollution. As the marginal relative frequencies of the data table (rows: 0.614, 0.284, 0.102; columns: 0.149, 0.419, 0.145, 0.143, 0.144) are different, the weighted LRA is preferred. Moreover, for the weighted LRA, the indirect representation of the log ORs can be appreciated. For example, consider log OR_L,M;S,R and log OR_L,H;R,500: according to formula (2), each log OR depends on the lengths of the distances between categories. In our case OR_L,M;S,R is clearly greater than 1, since the solid lines are much longer than the dotted ones; for the second, OR_L,H;R,500, the contrary happens and therefore it is smaller than 1. We also fitted the RC (2) association model using the marginal proportions as weights. The results are very similar to those obtained by the weighted LRA.

Subsequently, to study the association between X and Y, we considered the matrix of the complete set of log ORs (Table 2).

The ULORA and WLORA were then performed. The factorial representations on the retained axes are shown in Figure 5 and Figure 6. At first glance the factorial representations look very similar, but differences exist: in Figure 6 the categories “SR” and “MH” have larger coordinates than in Figure 5.

Table 2 Complete set of log ORs (South-West)

This is a consequence of the system of weights: in WLORA we take the marginal relative frequencies of the rows and the columns as a weighting system. When there are large differences among the marginal relative frequencies of the rows or of the columns, it can be relevant to take this information into account; therefore the WLORA is preferred. ULORA and WLORA permit the visualization of the log ORs through the projection of the column points onto the row vectors (or vice versa). For example, we focused our attention on three log ORs computed using the same row vector: log OR_L,M;S,R, log OR_L,M;S,50 and log OR_L,M;S,150. The projections differ between the two analyses, depending on the system of weights. In Figure 6 we can see that there is a strong association between the categories “Source-River” and the Low-High level of pollution, between “Source-50 meter” and the Low-Medium level of pollution, and between “Source-150 meter” and the Low-Medium level of pollution. This association can be justified both by the proximity of the categories and through the log ORs (represented by the projections of the first category on the second). Moreover, in this case an interpretation in terms of variations is also possible. In fact, log OR_L,H;S,R tells us that when we switch from “Source” to “River” it is very likely that the level of pollution jumps from Low to High; the same happens when we switch from “Source” to “50 meters”, where it is very likely that the pollution goes from Low to Medium. Therefore, given the sequence of the measured points, it is possible to argue that, passing from “River” to the sea, the pollution decreases from high to medium.

Conclusions

In this paper we discussed the RC (M) association model and weighted LRA, providing a justification for the logarithmic transformation and the weighting system. RC (M) association models and LRA are based on the Newton–Raphson (NR) algorithm and the SVD, respectively. The convergence of the NR algorithm depends on the starting point. The SVD, on the contrary, is extremely stable and computationally simpler. Regarding the number of dimensions, LRA allows the number of dimensions to be retained to be chosen after the analysis. One criterion could be the variance explained by the first components. Another criterion could be the application of a bootstrap or jackknife procedure to verify the stability of the singular values. In the RC (M) association models, by contrast, main effects and interaction terms are estimated for a given dimensionality. We then extended LRA to the study of log ORs, obtaining the indirect representation of the log ORs and its synthesis measures. Finally, we applied the SVD to the unweighted and weighted log OR matrices (ULORA and WLORA), obtaining a direct representation of the log ORs. We also obtained summary measures of association. The ULORA and WLORA were applied to the complete set of log ORs in order to link these methods to LRA.

In further studies we intend to extend the introduced methodologies to three-way contingency tables (Gallo and Simonacci 2013), to the other types of ORs (i.e. cumulative, continuation and global) and to the ratio of two contingency tables. Consider two contingency tables of the same dimensions: the first, X_ij, representing the target population, and the second, Y_ij, a subset of this population with a specific character. Let X_ij ~ Po(τ_ij) and Y_ij | X_ij = k_ij ~ Bin(k_ij, p_ij). It can be shown that Y_ij ~ Po(τ_ij p_ij). Then the methodologies presented could be extended to the analysis of the ratios r_ij = y_ij/x_ij.
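The Poisson result stated above follows from a standard thinning argument. A short derivation is sketched here, dropping the subscripts ij and writing τ and p for τ_ij and p_ij (notation simplified by us):

```latex
\begin{aligned}
P(Y = y) &= \sum_{k = y}^{\infty} P(X = k)\, P(Y = y \mid X = k)
          = \sum_{k = y}^{\infty} \frac{e^{-\tau} \tau^{k}}{k!} \binom{k}{y} p^{y} (1 - p)^{k - y} \\
         &= \frac{e^{-\tau} (\tau p)^{y}}{y!} \sum_{k = y}^{\infty} \frac{[\tau (1 - p)]^{k - y}}{(k - y)!}
          = \frac{e^{-\tau} (\tau p)^{y}}{y!}\, e^{\tau (1 - p)}
          = \frac{e^{-\tau p} (\tau p)^{y}}{y!} .
\end{aligned}
```

The last expression is the probability mass function of a Poisson distribution with mean τp.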

Endnotes

^aA continuity and symmetry correction can also be applied when one discrete distribution is approximated by the normal distribution.

^bOther weight systems can be applied, for example p_i• + p_•j for square tables.

References

Aitchison JS: Relative variation diagrams for describing patterns of variability in compositional data. Math Geol 1990, 22: 487-512. 10.1007/BF00890330
Article Google Scholar
Aitchison J, Greenacre MJ: Biplots of compositional data. Appl Stat 2002, 51: 375-392.
Google Scholar
Altham PME: The measurement of association of rows and columns for an r x s contingency table. J R Stat Soc B 1970, 32: 63-73.
Google Scholar
Bihari N, Fafandel M, Hamer B, Kralj-Bilen B: PAH content, toxicity and genotoxicity of coastal marine sediments from the Rovinj area, Northern Adriatic, Croatia. Sci Total Environ 2007, 366: 602-611.
Article Google Scholar
Caricchia AM, Chiavarini S, Cremisini C, Marrini F, Morabito R: PAHs, PCBs and DDE in the Northern Adriatic Sea. Mar Pollut Bull 1993, 26: 581-583. 10.1016/0025-326X(93)90411-C
Article Google Scholar
Choulakian V: Exploratory analysis of contingency tables by loglinear formulation and generalization o Correspondence analysis. Psychometrika 1988, 53(2):235-250. 10.1007/BF02294135
Article Google Scholar
D’Ambra L: Least squares criterion for asymmetric dependence models in three-way contingency table. Unité de biometrie, Montpellier technical report n° 8802. 1988.
Google Scholar
de Rooij M, Anderson CJ: Visualizing, Summarizing, and Comparing Odds Ratio Structures. Methodology: Eur J Res Methods Behav Soc Sci 2007, 3(4):139-148.
Article Google Scholar
de Rooij M, Heiser WJ: Graphical representations and odds ratios in a distance-association model for the analysis of cross-classified data. Psychometrika 2005, 70: 99-123. 10.1007/s11336-000-0848-1
Article Google Scholar
Edwardes MD, Baltzan M: The generalization of the odds ratio, risk ratio and risk difference to r × k tables. Stat Med 2000, 19: 1901-1914. 10.1002/1097-0258(20000730)19:14<1901::AID-SIM514>3.0.CO;2-V
Article Google Scholar
Escoufier Y, Junga S: Least squares approximation of frequencies or their logarithms. Int Stat Rev 1986, 54(3):279-283. 10.2307/1403057
Article Google Scholar
Eshima N, Tubata M, Tsujitani M: Property of the RC (M) association model and a summury measure of association in the contingency table. J Japan Stat Soc 2001, 31(1):15-26. 10.14490/jjss1995.31.15
Article Google Scholar
Gallo M, Simonacci V: A procedure for three analysis of compositions. Electron J Appl Stat Anal 2013, 06(02):202-210.
Google Scholar
Goodman LA: Simple models for the analysis of association in cross classifications having ordered categories. J Am Stat Assoc 1979, 74: 537-552. 10.1080/01621459.1979.10481650
Article Google Scholar
Goodman LA: The analysis of cross-classified data having ordered and/or unordered categories: Association models, correlation models, and asymmetric models for contingency tables with or without missing entries. Ann Stat 1985, 13: 10-69. 10.1214/aos/1176346576
Greenacre MJ: Power transformations in correspondence analysis. Comput Stat Data Anal 2009, 53(8):3107-3116. 10.1016/j.csda.2008.09.001
Greenacre MJ, Lewi PJ: Distributional equivalence and subcompositional coherence in the analysis of compositional data, contingency table and ratio scale measurement. J Classif 2009, 26(1):29-54. 10.1007/s00357-009-9027-y
Lehr RE, Jerima DM: Metabolic activations of polycyclic hydrocarbons. Arch Environ Contam Toxicol 1997, 39: 1-6.
Luo XJ, Chen SJ, Mai BX, Yang QS, Sheng GY, Fu JM: Polycyclic aromatic hydrocarbons in suspended particulate matter and sediments from the Pearl River Estuary and adjacent coastal areas, China. Environ Pollut 2006, 139: 9-20. 10.1016/j.envpol.2005.05.001
Sarnacchiaro P, Diez S, Montuori P: Polycyclic Aromatic Hydrocarbons Pollution in a Coastal Environment: the Statistical Analysis of Dependence to Estimate the Source of Pollution. Curr Anal Chem 2012, 8: 300-309. 10.2174/157341112800392607
Tolosa I, Bayona JM, Albaigés J: Aliphatic and polycyclic aromatic hydrocarbons and sulfur/oxygen derivatives in northwestern Mediterranean sediments: spatial and temporal variability, fluxes and budgets. Environ Sci Technol 1995, 29: 2519-2527. 10.1021/es00010a010
White KL: An overview of immunotoxicology and polycyclic aromatic hydrocarbons. Environ Carcinogenesis Rev 1986, 2: 163-202.
Xu J, Yu Y, Wang P, Guo W, Dai S, Sun H: Polycyclic aromatic hydrocarbons in the surface sediments from Yellow River, China. Chemosphere 2007, 67: 1408-1414. 10.1016/j.chemosphere.2006.10.074
Yan LS: Study of carcinogenic mechanisms for aromatic hydrocarbons – extended bay region theory and its quantitative model. Carcinogenesis 1985, 6: 1-6. 10.1093/carcin/6.1.1

Author information

Authors and Affiliations

Department of Economics, Second University of Naples, Corso Gran Priorato di Malta, 81043, Capua, CE, Italy
Ida Camminatiello & Antonello D’Ambra
University of Rome Unitelma Sapienza, Viale Regina Elena 295, 00161, Rome, Italy
Pasquale Sarnacchiaro

Corresponding author

Correspondence to Ida Camminatiello.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

All authors read and approved the final manuscript.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0), which permits use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Camminatiello, I., D’Ambra, A. & Sarnacchiaro, P. The association in a two-way contingency table through log odds ratio analysis: the case of Sarno river pollution. SpringerPlus 3, 384 (2014). https://doi.org/10.1186/2193-1801-3-384

Received: 27 March 2014
Accepted: 30 June 2014
Published: 28 July 2014
DOI: https://doi.org/10.1186/2193-1801-3-384