Count data and GLMs: choosing among Poisson, negative binomial, and zero-inflated models
5 Replies
Ecologists commonly collect data representing counts of organisms. Generalized linear models (GLMs) provide a powerful tool for analyzing count data. [As mentioned previously, you should generally not transform your data to fit a linear model and, particularly, do not log-transform count data.] The starting point for count data is a GLM with Poisson-distributed errors, but not all count data meet the assumptions of the Poisson distribution. Thus, we need to test if the variance>mean or if the number of zeros is greater than expected. Below, we will walk through the basic steps to determine which GLM to use to analyze your data.
First, we will define a few of the variables that are used repeatedly throughout the subsequent code. [Note: click here to get all code from this post in a single file.] We are using an unrealistic sample size for most ecological studies because we do not want to be misled by a strange draw from any of the distributions.
Poisson model
Now, we run the GLM and set the error distribution to Poisson,
which produces the following output.
Poisson model
We first test if a Poisson model fits this data. Remember that we are trying to simulate the steps you would take if you did not know the properties of the distribution (beyond knowing that you have integers bound at zero and infinity).
Negative binomial model
Poisson model
We first fit the Poisson model.
Negative binomial model
Next, we fit a negative binomial model.
Zero-inflated Poisson models
We load the 'pscl' package to fit the zero-inflated model. First, we fit a model where we assume that the probability of zero is the same for both treatments (with '~ Trt|1').
If we fit a zero-inflated model to test a treatment effect for both the counts and the zeros (with '~ Trt|Trt'),
We can test for overdispersion in the count part of the zero-inflated model by specifying a negative binomial distribution.
source: http://datavoreconsulting.com/programming-tips/count-data-glms-choosing-poisson-negative-binomial-zero-inflated-poisson/
First, we will define a few of the variables that are used repeatedly throughout the subsequent code. [Note: click here to get all code from this post in a single file.] We are using an unrealistic sample size for most ecological studies because we do not want to be misled by a strange draw from any of the distributions.
Poisson data
We generate random variates from a Poisson distribution with the rpois( ) function. Because mean=variance in a Poisson distribution, we only need to specify the number of random variates and the expected mean of the distribution.Poisson model
Now, we run the GLM and set the error distribution to Poisson,
Call: glm(formula = Response ~ Trt, family = poisson, data = data.pois) Deviance Residuals: Min 1Q Median 3Q Max -3.1000 -0.6647 0.1045 0.6032 2.2611 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 2.30558 0.03158 73.02 < 2e-16 *** TrtB -0.74323 0.05562 -13.36 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 374.94 on 199 degrees of freedom Residual deviance: 183.85 on 198 degrees of freedom AIC: 934.08 Number of Fisher Scoring iterations: 4We test for goodness-of-fit of the model with a chi-square test based on the residual deviance and degrees of freedom.
[1] 0.7565471The GOF test indicates that the Poisson model fits the data (p > 0.05). If this were your actual data, you could breathe a sigh of relief because you could stop here. Well, not quite here. You will still want to use the model to predict mean counts for each treatment and standard errors for each parameter.
Trt Mean SE 1 A 10.03 0.3167015 2 B 4.77 0.2184031Because we used a large sample size, the predicted means are similar to the expected means of 10 and 5.
Negative binomial data
Next we will use the 'MASS' package to generate random deviates from a negative binomial distribution, which involves a parameter, theta, that controls the variance of the distribution.Poisson model
We first test if a Poisson model fits this data. Remember that we are trying to simulate the steps you would take if you did not know the properties of the distribution (beyond knowing that you have integers bound at zero and infinity).
Call: glm(formula = Response ~ Trt, family = poisson, data = data.nb) Deviance Residuals: Min 1Q Median 3Q Max -3.7258 -1.4732 -0.0846 1.1024 4.5505 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 2.32923 0.03120 74.64 < 2e-16 *** TrtB -0.70589 0.05428 -13.01 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 740.63 on 199 degrees of freedom Residual deviance: 560.82 on 198 degrees of freedom AIC: 1283.8 Number of Fisher Scoring iterations: 5
[1] 0As expected, the Poisson model does not fit the data (p < 0.05), but we will still take a look at the predicted values.
Trt Mean SE 1 A 10.27 0.3204684 2 B 5.07 0.2251666Make a note of the SEs in this output because we will come back to this after we run a GLM based on a negative binomial error distribution.
Negative binomial model
Call: glm.nb(formula = Response ~ Trt, data = data.nb, init.theta = 4.114111123, link = log) Deviance Residuals: Min 1Q Median 3Q Max -2.5705 -0.8806 -0.0454 0.5676 2.2889 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 2.32923 0.05835 39.920 < 2e-16 *** TrtB -0.70589 0.08836 -7.989 1.36e-15 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for Negative Binomial(4.1141) family taken to be 1) Null deviance: 280.56 on 199 degrees of freedom Residual deviance: 216.42 on 198 degrees of freedom AIC: 1136.8 Number of Fisher Scoring iterations: 1 Theta: 4.114 Std. Err.: 0.668 2 x log-likelihood: -1130.825The model estimates the dispersion parameter as 4.114; we set theta = 5 when generating random variates.
[1] 0.1756968The GOF test indicates that the negative binomial model fits the data.
Trt Mean SE 1 A 10.27 0.5992233 2 B 5.07 0.3364221Here you see the 'danger' of ignoring overdispersion in the Poisson model. The SE estimates are lower for the Poisson model than for the negative binomial model, which increases the likelihood of incorrectly detecting a significant treatment effect in the Poisson model.
Zero-inflated Poisson data
Lastly, we will add more more layer of complication to the story. If you have lots of zeros in your data, and have determined that Poisson and negative binomial models do not fit your data well, then you should turn to zero-inflated models with either Poisson or negative binomial error distributions. We need the 'VGAM' package to generate random variates from a zero-inflated Poisson distribution using the rzipois( ) function. The 3rd argument to the rzipois( ) function specifies the probability of drawing a zero beyond the expected number of zeros for a Poisson distribution with the specified mean. Here were are introducing a relatively small proportion of extra zeros and the same proportion for each treatment.Poisson model
We first fit the Poisson model.
Call: glm(formula = Response ~ Trt, family = poisson, data = data.zip) Deviance Residuals: Min 1Q Median 3Q Max -3.9166 -1.8131 0.1183 1.3324 2.8984 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 2.03732 0.03611 56.42 < 2e-16 *** TrtB -0.64107 0.06147 -10.43 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 869.33 on 199 degrees of freedom Residual deviance: 754.93 on 198 degrees of freedom AIC: 1337.3 Number of Fisher Scoring iterations: 5
[1] 0The Poisson model does not fit the data.
Trt Mean SE 1 A 7.67 0.2769476 2 B 4.04 0.2009973The Poisson model also does not predict the correct mean counts.
Negative binomial model
Next, we fit a negative binomial model.
Call: glm.nb(formula = Response ~ Trt, data = data.zip, init.theta = 1.47231218, link = log) Deviance Residuals: Min 1Q Median 3Q Max -2.3189 -0.9259 0.0472 0.5432 1.2916 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 2.03732 0.08998 22.643 < 2e-16 *** TrtB -0.64107 0.13177 -4.865 1.14e-06 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for Negative Binomial(1.4723) family taken to be 1) Null deviance: 280.22 on 199 degrees of freedom Residual deviance: 256.67 on 198 degrees of freedom AIC: 1122 Number of Fisher Scoring iterations: 1 Theta: 1.472 Std. Err.: 0.234 2 x log-likelihood: -1116.006
[1] 0.003149441The negative binomial model does not fit the data.
Trt Mean SE 1 A 7.67 0.6901218 2 B 4.04 0.3889176And does not predict the correct means.
Zero-inflated Poisson models
We load the 'pscl' package to fit the zero-inflated model. First, we fit a model where we assume that the probability of zero is the same for both treatments (with '~ Trt|1').
Call: zeroinfl(formula = Response ~ Trt | 1, data = data.zip) Pearson residuals: Min 1Q Median 3Q Max -1.5073 -0.9286 0.1330 0.8941 2.1819 Count model coefficients (poisson with log link): Estimate Std. Error z value Pr(>|z|) (Intercept) 2.26040 0.03612 62.584 < 2e-16 *** TrtB -0.55430 0.06203 -8.937 < 2e-16 *** Zero-inflation model coefficients (binomial with logit link): Estimate Std. Error z value Pr(>|z|) (Intercept) -1.1903 0.1681 -7.082 1.42e-12 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Number of iterations in BFGS optimization: 13 Log-likelihood: -477 on 3 DfAs expected, the model output indicates that there are significantly more zeros than expected for a Poisson distribution.
Trt Count Zero 1 A 9.586959 0.2332005 2 B 5.507434 0.2332005The zero-inflated model predicts the correct mean counts and probability of zero.
If we fit a zero-inflated model to test a treatment effect for both the counts and the zeros (with '~ Trt|Trt'),
Call: zeroinfl(formula = Response ~ Trt | Trt, data = data.zip) Pearson residuals: Min 1Q Median 3Q Max -1.62159 -0.96198 0.06977 0.91545 2.20244 Count model coefficients (poisson with log link): Estimate Std. Error z value Pr(>|z|) (Intercept) 2.26039 0.03612 62.580 < 2e-16 *** TrtB -0.55348 0.06194 -8.936 < 2e-16 *** Zero-inflation model coefficients (binomial with logit link): Estimate Std. Error z value Pr(>|z|) (Intercept) -1.3866 0.2501 -5.545 2.94e-08 *** TrtB 0.3769 0.3383 1.114 0.265 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Number of iterations in BFGS optimization: 16 Log-likelihood: -476.4 on 4 Dfwe see that there are significantly more zeros than expected, but that the probability of zero is not significantly different between the two treatments.
Trt Count Zero 1 A 9.586842 0.1999451 2 B 5.511897 0.2670400Zero-inflated negative binomial model
We can test for overdispersion in the count part of the zero-inflated model by specifying a negative binomial distribution.
Call: zeroinfl(formula = Response ~ Trt | 1, data = data.zip, dist = "negbin") Pearson residuals: Min 1Q Median 3Q Max -1.5072 -0.9285 0.1330 0.8940 2.1818 Count model coefficients (negbin with log link): Estimate Std. Error z value Pr(>|z|) (Intercept) 2.26040 0.03612 62.574 < 2e-16 *** TrtB -0.55431 0.06203 -8.935 < 2e-16 *** Log(theta) 10.26954 66.78568 0.154 0.878 Zero-inflation model coefficients (binomial with logit link): Estimate Std. Error z value Pr(>|z|) (Intercept) -1.1903 0.1681 -7.082 1.42e-12 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Theta = 28840.7026 Number of iterations in BFGS optimization: 32 Log-likelihood: -477 on 4 DfThe estimated theta parameter is not significant indicating that the zero-inflated Poisson model is appropriate.
source: http://datavoreconsulting.com/programming-tips/count-data-glms-choosing-poisson-negative-binomial-zero-inflated-poisson/