The association in a two-way contingency table through log odds ratio analysis: the case of Sarno river pollution
© Camminatiello et al.; licensee Springer. 2014
Received: 27 March 2014
Accepted: 30 June 2014
Published: 28 July 2014
In this paper we propose a general framework for the analysis of the complete set of log Odds Ratios (ORs) generated by a two-way contingency table. Starting from the RC (M) association model and hypothesizing a Poisson distribution for the counts of the two-way contingency table, we obtain the weighted Log Ratio Analysis, which we extend to the study of log ORs. In particular, we obtain an indirect representation of the log ORs and some synthesis measures. Then, to study the matrix of log ORs, we perform a generalized Singular Value Decomposition that allows us to obtain a direct representation of the log ORs, together with summary measures of association. We consider the matrix of the complete set of ORs because it is linked to the two-way contingency table in terms of variance and it allows us to represent all the ORs on a factorial plane. Finally, a two-way contingency table, which crosses pollution of the Sarno river and sampling points, is analyzed to illustrate the proposed framework.
Polycyclic aromatic hydrocarbons (PAHs) are a group of lipophilic contaminants widespread in the environment. This class of compounds has been widely studied (Tolosa et al., 1995; Caricchia et al., 1993) because of its carcinogenic and mutagenic properties (Lehr and Jerima, 1997; Yan, 1985; White 1986).
PAHs are produced by both anthropogenic and natural processes and can be introduced into the environment through various routes. Anthropogenic inputs can originate from the incomplete combustion of organic matter (pyrolytic) and the discharge of crude oil-related material (petrogenic). PAHs can also originate from natural processes such as short-term diagenetic degradation of biogenic precursors (diagenesis). Each source (i.e. pyrolytic, petrogenic and diagenetic) gives rise to characteristic PAH patterns. Currently, the interest in multivariate statistical methodologies for identifying the main sources of PAH pollution and for quantifying the incidence of each source of pollution on total pollution levels, particularly in coastal environments, is increasing (Luo et al., 2006; Bihari et al., 2007; Xu et al., 2007; Sarnacchiaro et al., 2012). This study is part of a large project which has the objective of enhancing the knowledge of pollution in the Sarno River and its environmental impact on the gulf of Naples. This project has attempted to assess the pollution derived from local industries, agriculture and urban impact (Sarnacchiaro et al., 2012). In the present work, we have studied the association between the level of PAH pollution and the sampling points.
The analysis of the association between variables arranged in an I × J two-way contingency table is a widely discussed topic. In this paper we focus on the Odds Ratios (ORs) as a measure of association. In a two-way contingency table the total number of ORs that can be computed may be too large; for their synthesis, four main alternative or complementary strategies have been proposed. The first consists in the computation of statistical measures (Altham, 1970). The second is based on the construction of a model for the frequencies and studies the ORs through the interaction between the row and column variables. The log-linear model for two-way contingency tables belongs to this class. The third solution is the RC (M) association model (Goodman, 1985), which is more parsimonious than the usual log-linear model (Choulakian, 1988). The fourth strategy considers the Singular Value Decomposition (SVD) of the matrix containing the basic set of log ORs (de Rooij and Anderson 2007).
In this paper we propose a general framework for the analysis of the complete set of log ORs generated by a two-way contingency table. The road map of the framework is the following: we start from the RC (M) association model and hypothesize a Poisson distribution for the counts of the two-way contingency table. Then, for the parameter estimation of RC (M), we use an alternative approach based on least squares. The matrix used for estimating the bilinear part of RC (M) is linked to the matrix used in log-ratio analysis (Greenacre, 2009); moreover, we extend log-ratio analysis (LRA) to the study of log ORs. We then connect these methodologies with de Rooij and Anderson's approach and introduce some important properties that give a deeper understanding of the association between variables through ORs. Unlike de Rooij and Anderson, we consider the matrix of the complete set of ORs because, as we will show, this matrix is linked to the two-way contingency table and it allows us to represent all the ORs on a factorial plane. Moreover, the spanning cell odds ratios are useful when one of the categories defines a control or reference group. In that case, all other categories are described against this reference group. The local odds ratios are useful for ordinal variables when all local odds ratios are larger than or equal to 1 (de Rooij and Anderson 2007).
Materials and methods
The research plan - study area, sampling points
Nicknamed “the most polluted river in Europe”, the Sarno River originates in south-western Italy and has a watershed of about 715 km². An intensive sampling campaign was conducted in the spring of 2008. Surface sediment samples were collected at four locations along the Sarno river (near the source of the river, just before and after the junction with Alveo Comune and at the river mouth) and nine points in the continental shelf around the river mouth (three points, one for each direction North-West, West and South-West, were sampled 50 m from the Sarno River mouth, another three points 150 m away and, finally, another three points 500 m from the river mouth). The collected data were arranged in a two-way contingency table. The row variable is TPAHs (X) with three categories: Low (L), Medium (M), High (H), and the column variable is the sampling points (Y) with five categories: Source (S), River (R), 50 m from the Sarno River mouth (50 m), 150 m from the Sarno River mouth (150 m), 500 m from the Sarno River mouth (500 m).
Log-ratio and log odds ratio analysis
Let $N = (n_{ij})$ be a two-way contingency table that cross-classifies n units according to the I row categories and J column categories of the X and Y variables, respectively. Let $X_i$ and $Y_j$ be the i-th and j-th categories of X and Y, and let $\pi_{ij}$ be the probability that $X = X_i$ and $Y = Y_j$. The matrix of proportions is denoted by $P = n^{-1}N$, with general term $p_{ij}$. The marginal relative frequencies of the i-th row and j-th column of P are $p_{i\bullet}$ and $p_{\bullet j}$, and they may be represented in vector or matrix form. In this paper, the vector r (resp. c) has the $p_{i\bullet}$ (resp. $p_{\bullet j}$) as elements, while $D_r$ (resp. $D_c$) is the diagonal matrix of these quantities.
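Let $\mathrm{OR}_{i,i';j,j'} = (\pi_{ij}\,\pi_{i'j'})/(\pi_{ij'}\,\pi_{i'j})$, with $i < i'$ and $j < j'$, be the OR. The complete set of ORs for table N is composed of [I(I − 1)]/2 × [J(J − 1)]/2 ORs and it can be placed in a two-way table of dimension [I(I − 1)]/2 × [J(J − 1)]/2, with one row for each pair of row categories and one column for each pair of column categories.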
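As an illustration of how the complete set of log ORs is arranged, the following minimal sketch builds the matrix with one row per pair of row categories and one column per pair of column categories. The function name `complete_log_odds_ratios` and the counts are hypothetical and are not the Sarno data; the sketch also assumes all cells are strictly positive.

```python
import itertools

import numpy as np


def complete_log_odds_ratios(table):
    """Return the matrix of all log ORs of a strictly positive two-way table.

    Rows index the pairs (i, i') with i < i', columns the pairs (j, j')
    with j < j'.
    """
    log_n = np.log(np.asarray(table, dtype=float))
    n_rows, n_cols = log_n.shape
    row_pairs = list(itertools.combinations(range(n_rows), 2))
    col_pairs = list(itertools.combinations(range(n_cols), 2))
    log_or = np.empty((len(row_pairs), len(col_pairs)))
    for a, (i, i2) in enumerate(row_pairs):
        for b, (j, j2) in enumerate(col_pairs):
            # log OR = log n_ij + log n_i'j' - log n_ij' - log n_i'j
            log_or[a, b] = (log_n[i, j] + log_n[i2, j2]
                            - log_n[i, j2] - log_n[i2, j])
    return log_or, row_pairs, col_pairs


# Hypothetical 3 x 5 counts, used only for illustration (NOT the Sarno data).
counts = np.array([[20, 12, 8, 10, 15],
                   [10, 14, 12, 9, 6],
                   [5, 16, 9, 7, 4]])
log_or, row_pairs, col_pairs = complete_log_odds_ratios(counts)
print(log_or.shape)  # 3 row pairs by 10 column pairs
```

For a 3 × 5 table this gives 3 × 10 = 30 log ORs, matching the count [I(I − 1)]/2 × [J(J − 1)]/2.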
From association model to log Ratio Analysis
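The RC (M) association model (Goodman, 1985) can be written as $\pi_{ij} = \alpha_i \beta_j \exp\left(\sum_{m=1}^{M} \phi_m \mu_{im} \nu_{jm}\right)$, where $\mu_{im}$ and $\nu_{jm}$ are the $X_i$ and $Y_j$ scores on dimension m (standard coordinates), $\phi_m$ is a measure of the strength of the association between X and Y, and $\alpha_i$ and $\beta_j$ are the main effects of X and Y, respectively. With respect to the scores, centring, normalization and orthogonality constraints are assumed.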
Assuming the previous constraints and that the counts in the IJ cells follow a multinomial distribution with parameters n and $\pi_{ij}$, the parameters are estimated by the maximum likelihood method.
An alternative estimation method is based on the least squares procedure. Let $N_{ij} \sim \mathrm{Po}(\tau_{ij})$, with $\tau_{ij} = n\pi_{ij}$, be a random variable; applying the logarithm transformation, we obtain the difference $\log(N_{ij}/n) - \log(\pi_{ij})$.
Substituting the observed relative frequencies for the probabilities and taking into account the constraints, we obtain the least squares estimates of the parameters $\log(\beta_j)$ and $\log(\alpha_i)$,
where L denotes the logarithm transformation. Based on the different centring systems, a comparison among CA, weighted LRA and RC (M) has been carried out (Greenacre and Lewi 2009). By analogy, the weighting system of weighted LRA is the same as that of CA. This choice can be better justified as follows.
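Considering $N_{ij}$, when $\tau_{ij} \to +\infty$, then $Z_{ij} = (N_{ij} - \tau_{ij})/\sqrt{\tau_{ij}}$ is approximately a standard normal random variable (footnote a). If we consider the random variable $\log(N_{ij})$, applying the Taylor series around $\tau_{ij}$, we can say that $\log(\tau_{ij}) + (N_{ij} - \tau_{ij})/\tau_{ij}$ provides a useful approximation to $\log(N_{ij})$ when $N_{ij}$ is close to $\tau_{ij}$, thus $\log(N_{ij}) \cong \log(\tau_{ij}) + Z_{ij}/\sqrt{\tau_{ij}}$. Observing that $E[Z_{ij}] \cong 0$ and $\mathrm{Var}[Z_{ij}] \cong 1$, it follows that $\log(N_{ij})$ is approximately normal, then $E[\log(N_{ij})] \cong \log(\tau_{ij})$ and $\mathrm{Var}[\log(N_{ij})] \cong 1/\tau_{ij}$. Under the independence hypothesis $\tau_{ij}$ can be estimated by $n p_{i\bullet} p_{\bullet j}$, justifying the weighting system based on the row and column marginal totals of P.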
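The approximation for the mean and variance of the log-count can be checked numerically. The short simulation below is only a sanity check under assumed values (the mean `tau = 400`, the sample size and the seed are arbitrary choices, not taken from the paper).

```python
import numpy as np

# Assumed values for illustration only.
rng = np.random.default_rng(0)
tau = 400.0
draws = rng.poisson(tau, size=200_000)

# The logarithm is undefined at zero; for a mean this large a zero count
# essentially never occurs, but the filter keeps the sketch safe.
draws = draws[draws > 0]
log_draws = np.log(draws)

mean_log = log_draws.mean()
var_log = log_draws.var()

print("mean of log(N):", mean_log, " log(tau):", np.log(tau))
print("variance of log(N):", var_log, " 1/tau:", 1.0 / tau)
```

With a large Poisson mean, the simulated mean of log(N) should be close to log(τ) and the simulated variance close to 1/τ; the agreement deteriorates for small τ, which is where the approximation is weakest.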
The principal and standard coordinates for rows and columns are computed from the SVD of the weighted, double-centred matrix of log-proportions, as in weighted LRA (Greenacre and Lewi 2009).
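A minimal sketch of weighted LRA in its usual formulation (Greenacre and Lewi 2009) may help fix ideas: the log-proportions are double-centred using the row and column masses, weighted by the square roots of the masses, and decomposed by SVD. The function name `weighted_lra`, the dictionary keys and the counts are hypothetical, and the notation may differ from the formulas of the paper.

```python
import numpy as np


def weighted_lra(table):
    """Weighted log-ratio analysis of a strictly positive two-way table."""
    P = np.asarray(table, dtype=float)
    P = P / P.sum()
    r = P.sum(axis=1)                      # row masses p_i.
    c = P.sum(axis=0)                      # column masses p_.j
    L = np.log(P)
    # Weighted double centring of the log-proportions.
    Y = L - (L @ c)[:, None] - (r @ L)[None, :] + r @ L @ c
    # Weight by the square roots of the masses, then take the SVD.
    S = np.sqrt(r)[:, None] * Y * np.sqrt(c)[None, :]
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    row_std = U / np.sqrt(r)[:, None]      # standard row coordinates
    col_std = Vt.T / np.sqrt(c)[:, None]   # standard column coordinates
    return {
        "r": r, "c": c, "Y": Y, "sv": sv,
        "row_std": row_std, "col_std": col_std,
        "row_prin": row_std * sv,          # principal row coordinates
        "col_prin": col_std * sv,          # principal column coordinates
    }


# Hypothetical 3 x 5 counts, used only for illustration (NOT the Sarno data).
counts = np.array([[20, 12, 8, 10, 15],
                   [10, 14, 12, 9, 6],
                   [5, 16, 9, 7, 4]])
res = weighted_lra(counts)
total_inertia = np.sum(res["sv"] ** 2)
print("singular values:", res["sv"])
print("total inertia:", total_inertia)
```

The squared singular values add up to the weighted sum of squares of the centred log-proportions, so the share of each dimension can be read as a proportion of explained variance.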
Altham’s index is an association measure based on the ORs. It can also be defined on the local odds ratios and the spanning cell odds ratios. For a detailed discussion of this measure, see Edwardes and Baltzan (2000).
Weighted LRA properties
where $d^2(f_i, g_j)$ is the squared Euclidean distance between the points with coordinates $f_i$ and $g_j$ on the m dimensions. Thanks to these properties, the factorial representation of the weighted LRA can be explained both in terms of the inner product rule (type I) and of the distance rule (type II). For a type I representation, at least one coordinate set should be drawn using vectors, and the points of the other set projected onto these vectors to represent the relationship. For type II, the categories of both sets can be represented by points in Euclidean space, with the distance between the points describing the relationship between the categories of the two sets. These properties make it possible to visualize the categories and the log-ORs in the same factorial plane. Unfortunately, these important properties do not hold for unweighted LRA.
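The inner product rule can be illustrated with a small numerical check: with all dimensions retained, a log OR is recovered as the inner product between a difference of row points and a difference of column points. The sketch assumes the usual asymmetric scaling (principal coordinates for rows, standard coordinates for columns); the names `F`, `Gamma` and `log_or_from_map` and the counts are hypothetical.

```python
import numpy as np

# Hypothetical 3 x 5 counts, used only for illustration (NOT the Sarno data).
counts = np.array([[20, 12, 8, 10, 15],
                   [10, 14, 12, 9, 6],
                   [5, 16, 9, 7, 4]], dtype=float)
P = counts / counts.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
L = np.log(P)

# Weighted double centring and SVD, as in weighted LRA.
Y = L - (L @ c)[:, None] - (r @ L)[None, :] + r @ L @ c
S = np.sqrt(r)[:, None] * Y * np.sqrt(c)[None, :]
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

F = U * sv / np.sqrt(r)[:, None]        # principal row coordinates
Gamma = Vt.T / np.sqrt(c)[:, None]      # standard column coordinates


def log_or_from_map(i, i2, j, j2):
    """Inner product of a row difference vector and a column difference vector."""
    return (F[i] - F[i2]) @ (Gamma[j] - Gamma[j2])


direct = np.log(counts[0, 0] * counts[1, 1] / (counts[0, 1] * counts[1, 0]))
from_map = log_or_from_map(0, 1, 0, 1)
print(direct, from_map)
```

In a two-dimensional display only the first two columns of `F` and `Gamma` are plotted, so the projection gives an approximation of the log OR whose quality depends on the variance explained by those dimensions.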
Denoting by the matrix of dimension I × J, whose generic element is , we have
while the quantity represents the contribution of the m-th pair of coordinate vectors to the relationship between the X and Y variables.
From LRA to log-odds Ratio Analysis
The weighted LRA permits the indirect representation of the log ORs. Starting from the approach of de Rooij and Anderson (2007), we propose a methodology based on the singular value decomposition of the log OR matrix to obtain its direct representation. Summary measures of association are also obtained. Unlike de Rooij and Anderson, we apply the method to the matrix containing the complete set of log odds ratios because, as shown below, it is linked to LRA.
Therefore both the weighted LRA and ULORA decompose a synthetic measure of the log ORs. This last one is a weighted version of Altham’s measure.
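The link between the decomposition of the log OR matrix and weighted LRA can be sketched numerically. In the code below the unweighted analysis is a plain SVD of the complete log OR matrix, while the weighted analysis uses products of marginal relative frequencies as weights for the pairs of categories; this choice of pair weights is an assumption made for the sketch and may not coincide with the exact weights of the paper. All names and counts are hypothetical.

```python
import itertools

import numpy as np

# Hypothetical 3 x 5 counts, used only for illustration (NOT the Sarno data).
counts = np.array([[20, 12, 8, 10, 15],
                   [10, 14, 12, 9, 6],
                   [5, 16, 9, 7, 4]], dtype=float)
P = counts / counts.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
L = np.log(P)


def pair_contrasts(k):
    """Pairs (a, b) with a < b and the matrix of contrasts e_a - e_b."""
    pairs = list(itertools.combinations(range(k), 2))
    C = np.zeros((len(pairs), k))
    for row, (a, b) in enumerate(pairs):
        C[row, a], C[row, b] = 1.0, -1.0
    return pairs, C


row_pairs, A = pair_contrasts(P.shape[0])
col_pairs, B = pair_contrasts(P.shape[1])
log_or = A @ L @ B.T                     # complete set of log ORs

# Unweighted analysis: plain SVD of the log OR matrix.
sv_u = np.linalg.svd(log_or, compute_uv=False)

# Weighted analysis (assumed weights: products of the marginal masses).
w_row = np.array([r[a] * r[b] for a, b in row_pairs])
w_col = np.array([c[a] * c[b] for a, b in col_pairs])
S = np.sqrt(w_row)[:, None] * log_or * np.sqrt(w_col)[None, :]
U, sv_w, Vt = np.linalg.svd(S, full_matrices=False)
row_coords = U * sv_w / np.sqrt(w_row)[:, None]   # principal, row pairs
col_coords = Vt.T / np.sqrt(w_col)[:, None]       # standard, column pairs

# Total inertia of weighted LRA on the same table, for comparison.
Y = L - (L @ c)[:, None] - (r @ L)[None, :] + r @ L @ c
lra_inertia = np.sum(r[:, None] * c[None, :] * Y ** 2)

print("sum of squared log ORs:", np.sum(sv_u ** 2))
print("weighted total:", np.sum(sv_w ** 2), " weighted LRA inertia:", lra_inertia)
```

Under these assumed weights the weighted sum of squared log ORs equals the total inertia of weighted LRA, which is one way to read the statement that the complete set of log ORs is linked to the contingency table in terms of variance.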
Results and discussion
Complete set of log ORs
The log ORs are very different from 0; therefore, the association is confirmed. The complete set of log ORs can be summarized by computing the Altham index and formula (3); the results are 0.72158 and 0.62545, respectively.
Subsequently, to study the association between X and Y, we considered the matrix of the complete set of log ORs (Table 2).
Complete set of log ORs (South-West)
This is a consequence of the system of weights; in fact, in WLORA we take into consideration the marginal relative frequencies of the rows and the columns as a weighting system. When there are large differences among the marginal relative frequencies of the rows or of the columns, it can be relevant to take this information into account, and WLORA is therefore preferred. ULORA and WLORA permit the visualization of the log-ORs through the projection of the column points onto the row vectors (or vice versa). For example, we focused our attention on three log-ORs computed using the same row vector: $\log \mathrm{OR}_{L,M;S,R}$, $\log \mathrm{OR}_{L,M;S,50}$ and $\log \mathrm{OR}_{L,M;S,150}$. The projections differ in the two analyses depending on the system of weights. In Figure 6 we can see that there is a strong association between the categories “Source-River” and Low-High level of pollution, “Source-50 meter” and Low-Medium level of pollution, and “Source-150 meter” and Low-Medium level of pollution. This association can be justified both by the proximity of the categories and through the log-ORs (represented by the projections of the first category on the second). Moreover, in this case it is also possible to make an interpretation in terms of variations. In fact, $\log \mathrm{OR}_{L,H;S,R}$ tells us that when we switch from “Source” to “River” it is very likely that the level of pollution jumps from Low to High; the same happens when we switch from “Source” to “50 meters”, where it is very likely that the pollution goes from Low to Medium. Therefore, given the sequence of the measured points, it is possible to argue that, passing from “River” to the sea, the pollution decreases from High to Medium.
In this paper we discussed the RC (M) association model and weighted LRA, providing a justification for the logarithm transformation and the weighting system. RC (M) association models and LRA are based on the Newton–Raphson (NR) algorithm and the SVD, respectively. The convergence of the NR algorithm depends on the starting point. By contrast, the SVD is extremely stable and computationally simpler. Regarding the number of dimensions, LRA allows the number of dimensions to be retained to be chosen after the decomposition. One criterion could be the variance explained by the first components. Another criterion could be the application of a bootstrap or jackknife procedure to verify the stability of the singular values. In the RC (M) association models, the main effects and interaction terms are estimated for a given dimensionality. We then extended LRA to the study of log-ORs, obtaining the indirect representation of the log ORs and its synthesis measures. Finally, we applied the SVD to the unweighted and weighted log-OR matrices (ULORA and WLORA), obtaining a direct representation of the log-ORs together with summary measures of association. ULORA and WLORA were applied to the complete set of log-ORs in order to link these methods to LRA.
In future work we intend to extend the introduced methodologies to three-way contingency tables (Gallo and Simonacci 2013), to the other types of ORs (i.e. cumulative, continuation and global) and to the ratio of two contingency tables. Consider two contingency tables of the same dimensions: the first, $X_{ij}$, representing the target population and the second, $Y_{ij}$, a subset of this population with a specific characteristic. Let $X_{ij} \sim \mathrm{Po}(\tau_{ij})$ and $Y_{ij} \mid X_{ij} = k_{ij} \sim \mathrm{Bin}(k_{ij}, p_{ij})$. It can be shown that $Y_{ij} \sim \mathrm{Po}(\tau_{ij} p_{ij})$. The methodologies presented could then be extended to the analysis of the ratios $r_{ij} = y_{ij}/x_{ij}$.
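The Poisson result for the subset counts follows from a standard marginalization argument, sketched here for a single cell with the subscripts ij dropped for readability (so τ, p, k and y stand for the cell-specific quantities):

```latex
\begin{aligned}
P(Y = y)
  &= \sum_{k \ge y} P(X = k)\, P(Y = y \mid X = k)
   = \sum_{k \ge y} \frac{e^{-\tau}\tau^{k}}{k!}\binom{k}{y} p^{y}(1-p)^{k-y} \\
  &= \frac{e^{-\tau}(\tau p)^{y}}{y!}
     \sum_{k \ge y} \frac{\bigl(\tau(1-p)\bigr)^{k-y}}{(k-y)!}
   = \frac{e^{-\tau}(\tau p)^{y}}{y!}\, e^{\tau(1-p)}
   = \frac{e^{-\tau p}(\tau p)^{y}}{y!}.
\end{aligned}
```

The last expression is the probability mass function of a Poisson distribution with mean τp, which is the stated result.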
a. A continuity and symmetry correction can also be applied when a discrete distribution is approximated by the normal distribution.
b. Other weight systems can be applied, for example $p_{i\bullet} + p_{\bullet j}$ for square tables.
- Aitchison JS: Relative variation diagrams for describing patterns of variability in compositional data. Math Geol 1990, 22: 487-512. 10.1007/BF00890330
- Aitchison J, Greenacre MJ: Biplots of compositional data. Appl Stat 2002, 51: 375-392.
- Altham PME: The measurement of association of rows and columns for an r x s contingency table. J R Stat Soc B 1970, 32: 63-73.
- Bihari N, Fafandel M, Hamer B, Kralj-Bilen B: PAH content, toxicity and genotoxicity of coastal marine sediments from the Rovinj area, Northern Adriatic, Croatia. Sci Total Environ 2007, 366: 602-611.
- Caricchia AM, Chiavarini S, Cremisini C, Marrini F, Morabito R: PAHs, PCBs and DDE in the Northern Adriatic Sea. Mar Pollut Bull 1993, 26: 581-583. 10.1016/0025-326X(93)90411-C
- Choulakian V: Exploratory analysis of contingency tables by loglinear formulation and generalization of Correspondence analysis. Psychometrika 1988, 53(2):235-250. 10.1007/BF02294135
- D’Ambra L: Least squares criterion for asymmetric dependence models in three-way contingency table. Unité de biometrie, Montpellier technical report n° 8802. 1988.
- de Rooij M, Anderson CJ: Visualizing, Summarizing, and Comparing Odds Ratio Structures. Methodology: Eur J Res Methods Behav Soc Sci 2007, 3(4):139-148.
- de Rooij M, Heiser WJ: Graphical representations and odds ratios in a distance-association model for the analysis of cross-classified data. Psychometrika 2005, 70: 99-123. 10.1007/s11336-000-0848-1
- Edwardes MD, Baltzan M: The generalization of the odds ratio, risk ratio and risk difference to r × k tables. Stat Med 2000, 19: 1901-1914. 10.1002/1097-0258(20000730)19:14<1901::AID-SIM514>3.0.CO;2-V
- Escoufier Y, Junga S: Least squares approximation of frequencies or their logarithms. Int Stat Rev 1986, 54(3):279-283. 10.2307/1403057
- Eshima N, Tubata M, Tsujitani M: Property of the RC (M) association model and a summary measure of association in the contingency table. J Japan Stat Soc 2001, 31(1):15-26. 10.14490/jjss1995.31.15
- Gallo M, Simonacci V: A procedure for three analysis of compositions. Electron J Appl Stat Anal 2013, 06(02):202-210.
- Goodman LA: Simple models for the analysis of association in cross classifications having ordered categories. J Am Stat Assoc 1979, 74: 537-552. 10.1080/01621459.1979.10481650
- Goodman LA: The analysis of cross-classified data having ordered and/or unordered categories: Association models, correlation models, and asymmetric models for contingency tables with or without missing entries. Ann Stat 1985, 13: 10-69. 10.1214/aos/1176346576
- Greenacre MJ: Power transformations in correspondence analysis. Comput Stat Data Anal 2009, 53(8):3107-3116. 10.1016/j.csda.2008.09.001
- Greenacre MJ, Lewi PJ: Distributional equivalence and subcompositional coherence in the analysis of compositional data, contingency table and ratio scale measurement. J Classif 2009, 26(1):29-54. 10.1007/s00357-009-9027-y
- Lehr RE, Jerima DM: Metabolic activations of polycyclic hydrocarbons. Arch Environ Contam Toxicol 1997, 39: 1-6.
- Luo XJ, Chen SJ, Mai BX, Yang QS, Sheng GY, Fu JM: Polycyclic aromatic hydrocarbons in suspended particulate matter and sediments from the Pearl River Estuary and adjacent coastal areas, China. Environ Pollut 2006, 139: 9-20. 10.1016/j.envpol.2005.05.001
- Sarnacchiaro P, Diez S, Montuori P: Polycyclic Aromatic Hydrocarbons Pollution in a Coastal Environment: the Statistical Analysis of Dependence to Estimate the Source of Pollution. Curr Anal Chem 2012, 8: 300-309. 10.2174/157341112800392607
- Tolosa I, Bayona JM, Albaigés J: Aliphatic and polycyclic aromatic hydrocarbons and sulfur/oxygen derivatives in northwestern Mediterranean sediments: spatial and temporal variability, fluxes and budgets. Environ Sci Technol 1995, 29: 2519-2527. 10.1021/es00010a010
- White KL: An overview of immunotoxicology and polycyclic aromatic hydrocarbons. Environ Carcinogenesis Rev 1986, 2: 163-202.
- Xu J, Yu Y, Wang P, Guo W, Dai S, Sun H: Polycyclic aromatic hydrocarbons in the surface sediments from Yellow River, China. Chemosphere 2007, 67: 1408-1414. 10.1016/j.chemosphere.2006.10.074
- Yan LS: Study of carcinogenic mechanisms for aromatic hydrocarbons – extended bay region theory and its quantitative model. Carcinogenesis 1985, 6: 1-6. 10.1093/carcin/6.1.1
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.