### 2.1 Data source

Data from the Surveillance, Epidemiology, and End Results (SEER) Program on newly diagnosed cases of female breast cancer from 1990 to 2009 were analyzed. SEER provides high-quality, population-based incidence data covering up to 26% of the US population. During the study period, *N* = 446,726 diagnosed cases of female breast cancer were registered in the nine SEER registries considered here: Atlanta, Connecticut, Detroit, Hawaii, Iowa, New Mexico, San Francisco-Oakland, Seattle-Puget Sound and Utah. Figure 1 depicts the histogram and kernel density curve of age at diagnosis for each year. Two key features are readily identified: a bimodal pattern and a notable change in both the shape and the location of the peaks. Notice that, while the disease was more frequent in patients older than the inflection point during the first half of the 1990s, diagnosis became more common among younger women in the following years.

The predictors of age at diagnosis were trend, computed as the first-degree orthogonal polynomial basis of year + (month number − 1)/12, and the factors defined in uppercase as follows:

- SITE, the histopathologic subtype: 1) duct carcinomas, codes 8500-8508 and 8521-8523 (80%); 2) lobular carcinomas, codes 8520 and 8524 (9%); and 3) other (11%).
- ER, estrogen receptor status: 1) positive (60%); and 2) negative or borderline (17%); with 23% missing.
- GRADE, tumor grade: 1) well differentiated (16%); 2) moderately differentiated (33%); and 3) poorly differentiated, undifferentiated or anaplastic (31%); with 20% missing.
- EXTENSION: 1) in situ, without underlying tumor, or no evidence of tumor (18%); 2) confined to breast tissue and fat, including nipple and/or areola (72%); and 3) invasive components and further extension (8%); with 2% missing.
- LYMPH: 1) no lymph node involvement (70%); and 2) lymph node involvement (24%); with 6% missing.
- SIZE, tumor size: 1) less than 2 cm (70%); and 2) 2 cm or more (25%); with 5% missing.
- LATERALITY: 1) right (49%); and 2) left (51%).
- RACE: 1) White (84%); 2) Black (9%); and 3) Asian or Pacific Islander (7%).
- MARRIED, marital status at diagnosis: 1) single (never married, 12%); and 2) married (including common law), separated, divorced, widowed, unmarried or domestic partner (84%); with 4% missing.

Only one-sided laterality cases were considered in the dataset, since cases with breast cancer in both sides, which were rare (less than 0.5%), may have a different histopathology in each side. American Indians, Alaska Natives and other unspecified races (less than 0.1%) were not included either. Time-dependent variables were created by forming trend-by-factor interaction terms.
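The trend variable described above can be constructed as a first-degree orthogonal polynomial of decimal time. The sketch below assumes the unit-norm scaling produced by R's `poly(t, 1)` (centered to mean zero and scaled to unit Euclidean norm); the function and variable names are illustrative, not from the study's code:

```python
import numpy as np

def trend_basis(year, month):
    """First-degree orthogonal polynomial basis for decimal time.

    Centers and scales t = year + (month - 1)/12 so the resulting
    column has mean 0 and unit Euclidean norm, as in R's poly(t, 1).
    """
    t = np.asarray(year, dtype=float) + (np.asarray(month, dtype=float) - 1.0) / 12.0
    centered = t - t.mean()
    return centered / np.sqrt(np.sum(centered ** 2))

# Example: one observation per month over the study period 1990-2009
years = np.repeat(np.arange(1990, 2010), 12)
months = np.tile(np.arange(1, 13), 20)
trend = trend_basis(years, months)
```

Under this scaling the magnitude of the trend values shrinks with the sample size, which is consistent with the small five-number summary reported later for the *n* = 20,000 sample.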

### 2.2 Statistical methods

The random variable *Y* is defined as age at diagnosis, and the model-based clustering technique employed here is a two-component mixture model that estimates both the underlying component distributions and the memberships of the two unlabelled groups. Specifically, the cumulative distribution function (c.d.f.) of the mixture model is defined as a weighted sum of two Gaussian component c.d.f.'s as follows:

$$G(y) = (1-\pi)\,\Phi\!\left(\frac{y-\mu_1}{\sigma_1}\right) + \pi\,\Phi\!\left(\frac{y-\mu_2}{\sigma_2}\right) \tag{1}$$

where $y$ takes all real values, $\Phi$ represents the c.d.f. of the standard normal distribution, and the unknown mixture proportion $\pi$ ($0 \le \pi \le 1$), the two within-cluster means $\mu_1$ and $\mu_2$, and the two within-cluster variances $\sigma_1^2$ and $\sigma_2^2$ are to be estimated; the clusters indexed 1 and 2 will be referred to as the young cluster and the old cluster, respectively. To include auxiliary variables in the split-population model in equation 1, the two within-cluster means and variances are specified as functions of covariates as

$$\mu_k(\mathbf{x}) = \mathbf{x}^T\boldsymbol{\beta}_k \quad\text{and}\quad \sigma_k^2(\mathbf{x}) = \exp\!\left(\mathbf{x}^T\boldsymbol{\gamma}_k\right), \tag{2}$$

where $\mathbf{x}$ is the vector of explanatory variables, which includes the intercept, $\boldsymbol{\beta}_k$ and $\boldsymbol{\gamma}_k$ are vectors of coefficients, and $k = 1, 2$. Similarly, the mixture proportion is specified through a logit function as follows:

$$\pi(\mathbf{x}) = \frac{1}{1+\exp\!\left(\mathbf{x}^T\boldsymbol{\delta}\right)}, \tag{3}$$

where $\boldsymbol{\delta}$ is the vector of coefficients. A similar formulation was proposed by Villani et al. (2009). Finite mixture models are discussed extensively in Everitt and Hand (1981) and McLachlan and Basford (1988).
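The model in equations 1-3 can be sketched as follows. This is a Python/NumPy illustration rather than the R code used in the study, and the parameter-packing convention (stacking $\boldsymbol{\beta}_1$, $\boldsymbol{\beta}_2$, $\boldsymbol{\gamma}_1$, $\boldsymbol{\gamma}_2$, $\boldsymbol{\delta}$ into one vector) is an assumption for the example:

```python
import numpy as np

def _norm_pdf(y, mu, sigma):
    """Gaussian density, elementwise."""
    z = (y - mu) / sigma
    return np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def unpack(theta, p):
    """Split theta into beta_1, beta_2, gamma_1, gamma_2, delta (each length p)."""
    return [theta[i * p:(i + 1) * p] for i in range(5)]

def mixture_density(y, X, theta):
    """Density of the two-component mixture defined by equations (1)-(3).

    X is an n x p design matrix whose first column is the intercept.
    """
    b1, b2, g1, g2, d = unpack(theta, X.shape[1])
    mu1, mu2 = X @ b1, X @ b2                  # within-cluster means, eq. (2)
    s1 = np.exp(0.5 * (X @ g1))                # sigma_k(x) = exp(x' gamma_k / 2)
    s2 = np.exp(0.5 * (X @ g2))
    pi = 1.0 / (1.0 + np.exp(X @ d))           # mixing proportion, eq. (3)
    return (1.0 - pi) * _norm_pdf(y, mu1, s1) + pi * _norm_pdf(y, mu2, s2)

def neg_log_lik(theta, y, X):
    """Negative log-likelihood -l(theta), to be minimized numerically."""
    return -np.sum(np.log(mixture_density(y, X, theta)))
```

With an intercept-only design matrix this reduces to the plain five-parameter mixture of equation 1; each additional column of `X` adds one coefficient to every component of $\boldsymbol{\theta}$.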

The estimators used in this study are obtained by maximum likelihood; the log-likelihood of the data $\{y_1, \ldots, y_n\}$ is $l(\boldsymbol{\theta}) = \sum_{i=1}^{n} \log g(y_i; \boldsymbol{\theta})$, where $g(y; \boldsymbol{\theta})$ is the density function corresponding to the c.d.f. of the mixture model in equation 1 and $\boldsymbol{\theta}$ is the vector of all unknown parameters. The log-likelihood can be maximized with general-purpose optimizers to obtain the maximum likelihood estimates and their standard errors. In this study, the function nlm in the R language was used to optimize $l(\boldsymbol{\theta})$. The likelihood surface in the analysis presented here was well behaved, and the optimization always converged to the same solution from different starting values. The assumptions made about the mixture model may be checked by calculating the conditional randomized quantile residuals proposed by Dunn and Smyth (1996), defined by $r_i = \Phi^{-1}\!\left[\hat{G}(y_i; \mathbf{x}_i)\right]$, where $\hat{G}(y_i; \mathbf{x}_i)$ is the fitted cumulative distribution function and $i = 1, \ldots, n$. Since such residuals are exactly standard normal under the assumed model, simple plots checking that they behave as observed values of independent standard normal variates indicate the quality of the fit.
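The residual check can be sketched in Python using the standard library's `statistics.NormalDist` for $\Phi$ and $\Phi^{-1}$; parameters are shown as scalars for an intercept-only fit (with covariates they would simply become per-observation arrays):

```python
import numpy as np
from statistics import NormalDist

_std = NormalDist()  # standard normal distribution

def mixture_cdf(y, mu1, s1, mu2, s2, pi):
    """Fitted mixture c.d.f. from equation (1), evaluated elementwise."""
    y = np.asarray(y, dtype=float)
    cdf1 = np.array([_std.cdf(z) for z in (y - mu1) / s1])
    cdf2 = np.array([_std.cdf(z) for z in (y - mu2) / s2])
    return (1.0 - pi) * cdf1 + pi * cdf2

def quantile_residuals(y, mu1, s1, mu2, s2, pi):
    """Dunn-Smyth quantile residuals r_i = Phi^{-1}[G-hat(y_i)].

    Exactly standard normal when the fitted model is correct, so a
    normal Q-Q plot of these residuals directly checks the fit.
    """
    G = mixture_cdf(y, mu1, s1, mu2, s2, pi)
    return np.array([_std.inv_cdf(g) for g in G])
```

Because the Gaussian components are continuous, no extra randomization step is needed here; the randomization in Dunn and Smyth's proposal matters only for distributions with discrete mass.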

The dummy variables corresponding to the levels of the factors were coded with treatment contrasts (Chambers and Hastie 1992), which set the coefficients of the baseline level of each categorical variable equal to 0; here, the baseline level is the first category of the corresponding factor, as described above. Since around 40% of the cases in the entire dataset have at least one missing predictor, missing data were handled with the principled method of multiple imputation. The inference procedure consisted of generating multiple stochastically "completed" datasets with the *mice* package in the R statistical language (van Buuren and Groothuis-Oudshoorn 2011), which uses a chained-equations algorithm; analyzing each completed dataset with the complete-data model; and combining the results using Rubin's rules (Rubin 1987).
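Rubin's rules for combining the analyses of the completed datasets can be sketched as follows; this is a generic implementation for a single coefficient (or a vector of coefficients), not tied to the output format of *mice*:

```python
import numpy as np

def rubins_rules(estimates, variances):
    """Pool point estimates and squared standard errors from M imputations.

    estimates, variances: shape (M,) for one coefficient, or (M, p) for p
    coefficients. Returns the pooled estimate q-bar and the total variance
    T = W + (1 + 1/M) * B, where W is the within-imputation variance and
    B the between-imputation variance.
    """
    est = np.asarray(estimates, dtype=float)
    var = np.asarray(variances, dtype=float)
    M = est.shape[0]
    qbar = est.mean(axis=0)          # pooled point estimate
    W = var.mean(axis=0)             # within-imputation variance
    B = est.var(axis=0, ddof=1)      # between-imputation variance
    T = W + (1.0 + 1.0 / M) * B      # total variance
    return qbar, T
```

The pooled standard error is then $\sqrt{T}$; a large $B$ relative to $W$ signals that the missing data contribute substantially to the uncertainty in that coefficient.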

Since the chained-equations algorithm is computationally expensive for large datasets, a random sample of size *n* = 20,000 was drawn, from which forty imputations were generated. The five-number summary of trend in the analyzed sample was: minimum = −0.0132, 25th percentile = −0.0058, median = 0.0003, 75th percentile = 0.0061 and maximum = 0.0116. The missing-data mechanism was assumed to be missing at random (MAR) (Little and Rubin 2002), which specifies that the probability that a value is missing depends only on values of variables that were actually measured. Including as many variables as possible in the imputation model yields multiple imputations that tend to minimize bias and make the MAR assumption more plausible, reducing the need for special adjustments for more complex missing-data mechanisms (Schafer 1997).

Variable selection in the mixture model was addressed with the "impute, then select" strategy: multiple imputation is performed first, and Bayesian variable selection is then applied to each of the completed datasets (Yang et al. 2005). The variables included in the final model are those appearing in at least 50 per cent of the models selected across the imputed datasets (Wood et al. 2008). To determine the most appropriate covariates to include in the model for each imputed dataset, the Bayesian information criterion (BIC) (Schwarz 1978) was adopted as the main model-choice criterion. If $n_p$ denotes the number of parameters in the model and $n$ the number of individuals in the dataset, the BIC criterion chooses the model for which $-2\,\hat{l} + n_p \log(n)$ is smallest, where $\hat{l}$ is the maximized log-likelihood. Backward elimination was employed to arrive at the best-fitting model in each imputed dataset. Both the variable selection process and the combined results were based on the forty completed datasets.
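The BIC-guided backward elimination can be sketched generically as below; `fit` stands in for refitting the mixture model with a given set of variables and is an assumption of the example, not the study's code:

```python
import numpy as np

def bic(max_log_lik, n_params, n):
    """Bayesian information criterion: -2 * l-hat + n_params * log(n)."""
    return -2.0 * max_log_lik + n_params * np.log(n)

def backward_eliminate(candidates, fit):
    """Greedy backward elimination by BIC.

    candidates: list of variable names; fit(vars) must return the tuple
    (max_log_lik, n_params, n) for the model containing exactly `vars`.
    Repeatedly drops a variable while doing so lowers the BIC.
    """
    current = list(candidates)
    best = bic(*fit(current))
    improved = True
    while improved and current:
        improved = False
        for v in current:
            reduced = [u for u in current if u != v]
            b = bic(*fit(reduced))
            if b < best:                     # dropping v improves BIC
                best, current, improved = b, reduced, True
                break
    return current, best
```

In the study this procedure is run once per imputed dataset, and the forty selected variable sets are then combined by the 50 per cent inclusion rule described above.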