2.3. LCM

In the LCM, the random variable is postulated as a draw from a population which is an additive mixture of $C$ distinct subpopulations in proportions $\pi_1, \ldots, \pi_C$, where $\sum_{j=1}^{C} \pi_j = 1$ and $\pi_j \geq 0$ $(j = 1, \ldots, C)$. The mixture density for observation $i$, $i = 1, \ldots, n$, is given by

$$f(y_i \mid \theta) = \sum_{j=1}^{C-1} \pi_j f_j(y_i \mid \theta_j) + \pi_C f_C(y_i \mid \theta_C), \quad i = 1, \ldots, n, \qquad (2.8)$$

where each term in the sum on the right-hand side is the product of the mixing probability $\pi_j$ and the component (subpopulation) density $f_j(y_i \mid \theta_j)$. The $\pi_j$ are unknown constants that are estimated along with all other parameters, denoted $\theta$.[2] Also, $\pi_C = (1 - \sum_{j=1}^{C-1} \pi_j)$. For identification (normalization), we use the labelling restriction $\pi_1 \geq \pi_2 \geq \cdots \geq \pi_C$, which can always be satisfied by rearrangement post estimation.

The component densities of the $C$-component finite mixture are specified as

$$f_j(y_i \mid \theta_j) = \frac{\Gamma(y_i + \psi_{j,i})}{\Gamma(\psi_{j,i})\,\Gamma(y_i + 1)} \left(\frac{\psi_{j,i}}{\lambda_{j,i} + \psi_{j,i}}\right)^{\psi_{j,i}} \left(\frac{\lambda_{j,i}}{\lambda_{j,i} + \psi_{j,i}}\right)^{y_i}, \qquad (2.9)$$

where $j = 1, 2, \ldots, C$ are the latent classes, $\lambda_{j,i} = \exp(x_i' \beta_j)$ and $\psi_{j,i} = (1/\alpha_j)\lambda_{j,i}^{k}$. Note that $(\beta_j, \alpha_j)$ are unrestricted across components.

The conditional mean of the count variable is given by

$$E(y_i \mid x_i) = \bar{\lambda}_i = \sum_{j=1}^{C} \pi_j \lambda_{j,i} \qquad (2.10)$$

and the variance by

$$V(y_i \mid x_i) = \sum_{j=1}^{C} \pi_j \lambda_{j,i}^2 \left[1 + \alpha_j \lambda_{j,i}^{-k}\right] + \bar{\lambda}_i - \bar{\lambda}_i^2. \qquad (2.11)$$

Both the mean and the variance in the LCM are, in general, different from their standard NB counterparts. The LCM can also accommodate over- and underdispersed data relative to the NBM, but does so in a different manner than the TPM. Because this density is very flexible (permitting, for example, multimodal marginal distributions) and easily captures long right tails, it is likely to accommodate patterns of overdispersion expected in our data.

[2] In general, $\pi_j$ may be parameterized as a function of covariates. However, such models are often fraught with identification problems if separating information is not available. If separating information is available, identification is feasible (Duan et al., 1983).
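For concreteness, the objects in eqs. (2.8)–(2.11) can be evaluated directly. The following is a minimal Python/NumPy sketch of the mixture log-likelihood and the implied moments; it is not the authors' SAS/IML implementation (see Section 2.5), and the function and argument names are ours.

```python
import numpy as np
from scipy.special import gammaln

def nb_logpmf(y, lam, psi):
    """Log of the negative binomial component density, eq. (2.9)."""
    return (gammaln(y + psi) - gammaln(psi) - gammaln(y + 1.0)
            + psi * np.log(psi / (lam + psi))
            + y * np.log(lam / (lam + psi)))

def lcm_loglik(y, X, betas, alphas, pis, k=1.0):
    """Log-likelihood of the C-component finite mixture, eq. (2.8).

    betas  -- list of C coefficient vectors (unrestricted across classes)
    alphas -- list of C dispersion parameters, alpha_j > 0
    pis    -- C mixing proportions, non-negative and summing to one
    k      -- exponent in psi_{j,i} = (1/alpha_j) * lambda_{j,i}**k
    """
    log_terms = []
    for beta_j, alpha_j, pi_j in zip(betas, alphas, pis):
        lam = np.exp(X @ beta_j)          # lambda_{j,i} = exp(x_i' beta_j)
        psi = lam**k / alpha_j            # psi_{j,i} = (1/alpha_j) lambda^k
        log_terms.append(np.log(pi_j) + nb_logpmf(y, lam, psi))
    # log-sum-exp over components for numerical stability
    L = np.column_stack(log_terms)
    m = L.max(axis=1, keepdims=True)
    return float(np.sum(m[:, 0] + np.log(np.exp(L - m).sum(axis=1))))

def lcm_mean_var(X, betas, alphas, pis, k=1.0):
    """Conditional mean, eq. (2.10), and variance, eq. (2.11)."""
    pis = np.asarray(pis)
    lams = np.column_stack([np.exp(X @ b) for b in betas])    # n x C
    mean = lams @ pis                                         # eq. (2.10)
    quad = (lams**2 * (1.0 + np.asarray(alphas) * lams**(-k))) @ pis
    return mean, quad + mean - mean**2                        # eq. (2.11)
```

Maximizing lcm_loglik subject to the adding-up and non-negativity constraints on the $\pi_j$ is exactly the constrained estimation problem taken up in Section 2.5.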
2.4. Properties of LCM

LCMs offer a flexible way of specifying mixtures of densities. There are a number of advantages to using a discrete rather than a continuous mixing distribution. First, the finite mixture representation provides a natural and intuitively attractive representation of heterogeneity in a finite, usually small, number of latent classes, each of which may be regarded as a "type" or a "group". Second, the finite mixture approach is semiparametric: it does not require any distributional assumptions for the mixing variable. The approach is an alternative to either nonparametric estimation or forcing the data through the straitjacket of a one-component parametric density. Third, the results of Laird (1978) and Heckman and Singer (1984) suggest that estimates of such finite mixture models may provide good numerical approximations even if the underlying mixing distribution is continuous. The structure of the moments given above shows how the mixture model "decomposes" the information contained in a one-component model.[3] Finally, the choice of a continuous mixing density for some parametric count models is sometimes restrictive and computationally intractable because the marginal density may not have an analytical solution.

Note that in the NB model, the response of $E(y)$ to a covariate $x$ is fixed by $\exp(x'\beta)$. If observations in the right tail have a different response to changes in $x$, the NB model could not capture that effect. The TPM loosens the parametric straitjacket by allowing different parameters for $\Pr(y = 0)$ and $E(y \mid y > 0)$. However, the TPM is not likely to capture differential responses to changes in $x$ in the right tail of the distribution because the response of $E(y \mid y > 0)$ to a covariate $x$ is fixed by $\exp(x'\beta)$. In the LCM, the response of $E(y)$ to a covariate $x$ is determined by two or more sets of interactions between parameters and covariates (depending on the number of components), and is therefore more likely to accommodate differential responsiveness[4] (see the worked comparison following the footnotes below).

A finite mixture characterization is especially attractive if the mixture components have a natural interpretation. However, this is not essential. A finite mixture may simply be a way of flexibly and parsimoniously modeling the data, with each mixture component providing a local approximation to some part of the true distribution. A caveat to the foregoing discussion is that the LCM may fit the data better simply because outliers, influential observations, or contaminated observations are present in the data. The LCM will capture this phenomenon through additional mixture components. Hence, it is desirable that the hypothesis of the LCM be supported both by a priori reasoning and by meaningful a posteriori differences in the behavior of latent classes.

[3] Lindsay (1995) provides a detailed theoretical analysis; Haughton (1997) surveys computational issues and available software.
[4] We thank an anonymous referee for suggesting this intuition.
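To make the differential-responsiveness argument concrete, differentiate the two conditional means. The comparison below follows directly from eq. (2.10) and the NB mean specification with a scalar covariate; it is our illustration rather than an equation appearing in the original text.

```latex
% Response of the conditional mean to a scalar covariate x:
% one-component NB versus C-component LCM (constant mixing proportions)
\frac{\partial E(y \mid x)}{\partial x} = \beta \exp(x\beta)
    \qquad \text{(NB)}
\frac{\partial E(y \mid x)}{\partial x}
    = \sum_{j=1}^{C} \pi_j \beta_j \exp(x\beta_j)
    \qquad \text{(LCM)}
```

In the LCM the response is a $\pi$-weighted combination of $C$ distinct terms, each with its own slope and level. The relative contribution of each class's term changes with $x$, so responsiveness in the right tail of the distribution can differ from that near the center in a way that the single term $\beta \exp(x\beta)$ cannot replicate.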
2.5. Maximum likelihood and cluster-robust standard errors

Both the TPM and the LCM are estimated using (pseudo) maximum likelihood. The standard TPM is computationally simple because the two parts of the likelihood function can be estimated separately. On the other hand, estimation of the LCM is not straightforward. A comprehensive discussion of maximum likelihood estimation of the LCM can be found in McLachlan and Peel (2000). The likelihood functions of finite mixture models can have multiple local maxima, so it is important to ensure that the algorithm converges to the global maximum. In general, random perturbation or grid search techniques, or algorithms such as simulated annealing (Goffe et al., 1994), designed to seek the global optimum, may be utilized. In this study, to ensure against the possibility of achieving (false) convergence at a local maximum, each model was estimated using a number of different sets of starting values. No problems of convergence to local maxima were observed. Moreover, if a model with too many points of support is chosen, one or more points of support may be degenerate, i.e. the $\pi$'s associated with those densities may be zero. In such cases, the solution to the maximum likelihood problem lies on the boundary of the parameter space. This can cause estimation algorithms to fail, especially if unconstrained maximization algorithms are used. Constrained maximization algorithms are preferred. We estimate LCM models by maximum likelihood using the Broyden–Fletcher–Goldfarb–Shanno quasi-Newton constrained maximization algorithm in SAS/IML (SAS Institute, 1997).

Although the standard sandwich variance formula (Cameron and Trivedi, 1998, p. 31) is robust to certain types of misspecification, it does not account for cluster effects in the sample. Because the RHIE data are clustered by construction, adjustments for such clustering are desirable. To do so, the sandwich variance matrix formula is extended to the case of clustered observations:

$$V(\hat{\theta}) = \left(\sum_{i=1}^{N} \frac{\partial^2 \log f_i}{\partial\theta\,\partial\theta'}\right)^{-1} \left[\sum_{g=1}^{G} \left(\sum_{i=1}^{N_g} \frac{\partial \log f_{ig}}{\partial\theta}\right) \left(\sum_{i=1}^{N_g} \frac{\partial \log f_{ig}}{\partial\theta}\right)'\,\right] \left(\sum_{i=1}^{N} \frac{\partial^2 \log f_i}{\partial\theta\,\partial\theta'}\right)^{-1},$$

where $f_i = f(y_i \mid x_i, \theta)$ is the data density function, $N_g$ denotes the number of elements in group (cluster) $g$, $G$ is the number of groups (clusters), and $\theta$ denotes the vector of all unknown parameters. This approach has the advantage that no specific assumption about the form of intra-cluster dependence is necessary. Intuitively, one averages over the likelihood scores within each cluster instead of treating cluster elements as independent. In all goodness of fit (GoF) and hypothesis test statistics, we use this "cluster-robust" variance in place of the "standard" formula (a code sketch implementing this formula appears after eq. (2.12) below).

2.6. Model comparison and selection

The NBM is nested within the TPM and the LCM, but the TPM and LCM are non-nested. Therefore, we use three criteria to compare models in-sample, each of which is designed for the comparison of non-nested models. These include two traditional model selection criteria based on penalized likelihood, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), which are valid even in the presence of model misspecification (Sin and White, 1996). Models with smaller values of $\mathrm{AIC} = -2\ln L + 2K$ and $\mathrm{BIC} = -2\ln L + K\ln(N)$, where $\ln L$ is the maximized log-likelihood, $K$ is the number of parameters in the model, and $N$ is the sample size, are preferred. We also use Andrews' GoF test (Andrews, 1988a,b). The GoF test is based on a $\chi^2$ diagnostic statistic given by $S = (f - \hat{f})'\,\hat{\Sigma}^{-1}(f - \hat{f})$, where $f - \hat{f}$ is the vector of differences between sample and fitted cell frequencies, $q$ is the number of cells created by partitioning of the data, and $\hat{\Sigma}$ is its estimated covariance matrix. Under the null hypothesis of no misspecification, the test has an asymptotic $\chi^2(q - 1)$ distribution (Andrews, 1988b). When the test statistic is formed using the maximum likelihood estimator, computation of the test statistic is simplified. Let $A$ be the $N \times q$ matrix with $i$th row given by $f_i - \hat{f}_i$, let $B$ be the $N \times K$ matrix with $i$th row given by $(\partial/\partial\theta')\log f_i(y_i \mid \theta)$, and let $H = [A \; B]$. Then

$$\mathrm{GoF} = \mathbf{1}'H(H'H)^{+}H'\mathbf{1}, \qquad (2.12)$$

where $\mathbf{1}$ is a column vector of ones, i.e. GoF is $NR^2$ from the regression of $\mathbf{1}$ on $H$ (see Andrews, 1988a, Appendix 5, for details). Our implementation of the GoF adjusts for cluster effects by first summing the elements of $H$ within clusters.
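Given per-observation scores and the Hessian, both the cluster-robust variance of Section 2.5 and the information criteria above are mechanical to compute. The following is a minimal Python/NumPy sketch rather than the authors' SAS/IML code; function and argument names are illustrative.

```python
import numpy as np

def cluster_robust_vcov(scores, hessian, cluster_ids):
    """Cluster-robust sandwich variance of Section 2.5.

    scores      -- (N, K) per-observation scores d log f_i / d theta,
                   evaluated at the ML estimate
    hessian     -- (K, K) sum over i of d^2 log f_i / d theta d theta'
    cluster_ids -- (N,) cluster labels
    """
    bread = np.linalg.inv(hessian)
    meat = np.zeros_like(hessian)
    for g in np.unique(cluster_ids):
        s_g = scores[cluster_ids == g].sum(axis=0)  # sum scores within cluster
        meat += np.outer(s_g, s_g)                  # cluster-level outer product
    return bread @ meat @ bread

def aic_bic(loglik, n_params, n_obs):
    """Penalized-likelihood criteria of Section 2.6 (smaller is better)."""
    return (-2.0 * loglik + 2.0 * n_params,
            -2.0 * loglik + n_params * np.log(n_obs))
```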
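Eq. (2.12) likewise reduces to an uncentered-$R^2$ computation. A sketch under the same caveats: passing B as None yields the modified statistic (MGoF) used for the hold-out comparison in Section 2.7, where $H$ is replaced by $A$.

```python
import numpy as np

def andrews_gof(A, B=None, cluster_ids=None):
    """Andrews' GoF statistic via eq. (2.12): GoF = 1'H(H'H)^+ H'1.

    A -- (N, q) matrix with ith row f_i - fhat_i (sample minus fitted
         cell frequencies); B -- (N, K) score matrix, or None for the
         MGoF variant with H = A.
    If cluster_ids is given, rows of H are first summed within clusters,
    mirroring the paper's cluster adjustment.
    """
    H = A if B is None else np.hstack([A, B])
    if cluster_ids is not None:
        H = np.vstack([H[cluster_ids == g].sum(axis=0)
                       for g in np.unique(cluster_ids)])
    Hones = H.T @ np.ones(H.shape[0])
    # 1'H (H'H)^+ H'1 = N * (uncentered R^2 from regressing 1 on H)
    return float(Hones @ np.linalg.pinv(H.T @ H) @ Hones)
```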
There is an important point of detail concerning our use of the above test. Because our sample size is quite large, a model comparison based on significance tests of fixed size is likely to lead to the rejection of all models we will consider. This is a well-known difficulty of hypothesis testing with fixed significance levels in the classical framework. Moreover, previous investigations of the properties of this test suggest that, at conventional critical values, it leads to overrejection of the true null (Deb and Trivedi, 1997; Cameron and Trivedi, 1998). However, it seems appropriate to rank models by the P-values associated with the test, the model with the largest P-value being preferred. Hence, in addition to the formal $\chi^2$-test, we use the size of the statistic informally as a measure of fit, with a smaller statistic indicating better fit. Furthermore, we also graphically compare empirical and fitted cell probabilities.

2.7. Cross-validation

A common criticism of in-sample model selection methods is that they induce over-fitting in the case of complicated models. Consequently, the selected model may not be the best model. This bias can be avoided by treating one sample as a "training sample" used for estimation, and then using a second "hold-out" sample for forecast comparison.

Using parameter estimates from the training sample, we calculate three measures of performance for each model using the hold-out sample. The log-likelihood value is the most direct measure of the out-of-sample fit of the model. In order to continue to penalize models with large numbers of parameters, we also use the AIC. We do not use the BIC because it adds a penalty for the sample size in addition to a penalty for the number of parameters, which is not appropriate in a cross-validation exercise. Finally, we use a modified version of the Andrews statistic as a heuristic, with the expectation that models with better fit will have smaller values of the modified GoF in the hold-out sample. We modify Eq. (2.12) to

$$\mathrm{MGoF} = \mathbf{1}'A(A'A)^{+}A'\mathbf{1},$$

where $H$ is replaced by $A$, i.e. we assume that the parameters exactly maximized the likelihood function in the training sample.

3. Data and summary statistics

We use data from the RHIE for this study. The experiment, conducted by the RAND Corporation from 1974 to 1982, is the longest and largest controlled social experiment in medical care research. The main goal of the experiment was to assess how a patient's use of health services is affected by types of health insurance, including both fee-for-service and health maintenance organizations (HMOs). In the RHIE, data were collected from about 8000 enrollees in 2823 families, from six sites across the country. Each family was enrolled in one of fourteen different HIS insurance plans for either 3 or 5 years. The plans