
Generalized Linear Models

In [1]: import numpy as np

In [2]: import statsmodels.api as sm

In [3]: from scipy import stats

In [4]: from matplotlib import pyplot as plt

GLM: Binomial response data

Load data

In this example, we use the Star98 dataset, taken with permission from Jeff Gill (2000), Generalized Linear Models: A Unified Approach. Codebook information can be obtained by typing:

In [5]: print(sm.datasets.star98.NOTE)

Number of Observations - 303 (counties in California).

Number of Variables - 13 and 8 interaction terms.

Definition of variable names::

    NABOVE   - Total number of students above the national median for the math
               section.
    NBELOW   - Total number of students below the national median for the math
               section.
    LOWINC   - Percentage of low income students
    PERASIAN - Percentage of Asian students
    PERBLACK - Percentage of black students
    PERHISP  - Percentage of Hispanic students
    PERMINTE - Percentage of minority teachers
    AVYRSEXP - Sum of teachers' years in educational service divided by the
               number of teachers.
    AVSALK   - Total salary budget including benefits divided by the number of
               full-time teachers (in thousands)
    PERSPENK - Per-pupil spending (in thousands)
    PTRATIO  - Pupil-teacher ratio.
    PCTAF    - Percentage of students taking UC/CSU prep courses
    PCTCHRT  - Percentage of charter schools
    PCTYRRND - Percentage of year-round schools

    The variables below are interaction terms of the variables defined above.

    PERMINTE_AVYRSEXP
    PERMINTE_AVSAL
    AVYRSEXP_AVSAL
    PERSPEN_PTRATIO
    PERSPEN_PCTAF
    PTRATIO_PCTAF
    PERMINTE_AVYRSEXP_AVSAL
    PERSPEN_PTRATIO_PCTAF

Load the data and add a constant to the exogenous (independent) variables:

In [6]: data = sm.datasets.star98.load()

In [7]: data.exog = sm.add_constant(data.exog, prepend=False)

The dependent variable is N by 2 (Success: NABOVE, Failure: NBELOW):

In [8]: print(data.endog[:5, :])
[[ 452.  355.]
 [ 144.   40.]
 [ 337.  234.]
 [ 395.  178.]
 [   8.   57.]]
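
Each row of endog sums to the number of students tested in that county, so the observed success proportion is NABOVE divided by the row total. A quick check along these lines (not part of the original session):

    trials = data.endog.sum(axis=1)        # NABOVE + NBELOW per county
    print(data.endog[:5, 0] / trials[:5])  # observed success proportions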

The independent variables include all the other variables described above, as well as the interaction terms:

In [9]: print(data.exog[:2, :])
[[    34.3973     23.2993     14.2353     11.4111     15.9184     14.7065
      59.1573      4.4452     21.7102     57.0328      0.         22.2222
     234.1029    941.6881    869.9948     96.5066    253.5224   1238.1955
   13848.8985   5504.0352      1.    ]
 [    17.3651     29.3284      8.2349      9.3149     13.6364     16.0832
      59.504       5.2676     20.4428     64.6226      0.          0.
     219.3169    811.4176    957.0166    107.6843    340.4061   1321.0664
   13050.2233   6958.8468      1.    ]]
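
With prepend=False, add_constant appends the intercept as the last column, which is why each row above ends in 1. The design matrix should therefore be 303 x 21 (the 20 regressors plus the constant):

    print(data.exog.shape)  # expected: (303, 21)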

Fit and summary

In [10]: glm_binom = sm.GLM(data.endog, data.exog, family=sm.families.Binomial())

In [11]: res = glm_binom.fit()

In [12]: print(res.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:           ['y1', 'y2']   No. Observations:                  303
Model:                            GLM   Df Residuals:                      282
Model Family:                Binomial   Df Model:                           20
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -2998.6
Date:                Thu, 24 Oct 2013   Deviance:                       4078.8
Time:                        04:01:36   Pearson chi2:                 4.05e+03
No. Iterations:                     6                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1            -0.0168      0.000    -38.749      0.000        -0.018    -0.016
x2             0.0099      0.001     16.505      0.000         0.009     0.011
x3            -0.0187      0.001    -25.182      0.000        -0.020    -0.017
x4            -0.0142      0.000    -32.818      0.000        -0.015    -0.013
x5             0.2545      0.030      8.498      0.000         0.196     0.313
x6             0.2407      0.057      4.212      0.000         0.129     0.353
x7             0.0804      0.014      5.775      0.000         0.053     0.108
x8            -1.9522      0.317     -6.162      0.000        -2.573    -1.331
x9            -0.3341      0.061     -5.453      0.000        -0.454    -0.214
x10           -0.1690      0.033     -5.169      0.000        -0.233    -0.105
x11            0.0049      0.001      3.921      0.000         0.002     0.007
x12           -0.0036      0.000    -15.878      0.000        -0.004    -0.003
x13           -0.0141      0.002     -7.391      0.000        -0.018    -0.010
x14           -0.0040      0.000     -8.450      0.000        -0.005    -0.003
x15           -0.0039      0.001     -4.059      0.000        -0.006    -0.002
x16            0.0917      0.015      6.321      0.000         0.063     0.120
x17            0.0490      0.007      6.574      0.000         0.034     0.064
x18            0.0080      0.001      5.362      0.000         0.005     0.011
x19            0.0002   2.99e-05      7.428      0.000         0.000     0.000
x20           -0.0022      0.000     -6.445      0.000        -0.003    -0.002
const          2.9589      1.547      1.913      0.056        -0.073     5.990
==============================================================================
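
Since the Binomial family uses the logit link by default, exponentiating a coefficient gives the multiplicative change in the odds of scoring above the national median for a one-unit increase in that regressor. A short sketch using the fitted parameters:

    odds_ratios = np.exp(res.params)  # coefficients on the odds-ratio scale
    print(odds_ratios[:5])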

Quantities of interest

Total number of trials for the first observation:

In [13]: print(data.endog[0].sum())
807.0

Parameter estimates:

In [14]: print(res.params)
[-0.0168  0.0099 -0.0187 -0.0142  0.2545  0.2407  0.0804 -1.9522 -0.3341
 -0.169   0.0049 -0.0036 -0.0141 -0.004  -0.0039  0.0917  0.049   0.008
  0.0002 -0.0022  2.9589]

The corresponding t-values:

In [15]: print(res.tvalues)
[-38.7491  16.5047 -25.1822 -32.8179   8.4983   4.2125   5.775   -6.1619
  -5.4532  -5.1687   3.9212 -15.8783  -7.3909  -8.4496  -4.0592   6.3211
   6.5743   5.3623   7.4281  -6.4451   1.913 ]

First differences: we hold all explanatory variables constant at their means and vary the percentage of low income students (LOWINC, the first column) to assess its effect on the predicted response:

In [16]: means = data.exog.mean(axis=0)

In [17]: means25 = means.copy()

In [18]: means25[0] = stats.scoreatpercentile(data.exog[:, 0], 25)

In [19]: means75 = means.copy()

In [20]: means75[0] = stats.scoreatpercentile(data.exog[:, 0], 75)

In [21]: resp_25 = res.predict(means25)

In [22]: resp_75 = res.predict(means75)

In [23]: diff = resp_75 - resp_25

The interquartile first difference for the percentage of low income students in a school district is:

In [24]: print('%2.4f%%' % (diff * 100))
-11.8753%
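
The same calculation generalizes to any regressor and any pair of percentiles. A small helper along these lines (hypothetical, not part of the original session) keeps the bookkeeping in one place:

    def first_difference(res, exog, col, lower=25, upper=75):
        # Hold every other column at its mean and move column `col` from
        # the `lower` to the `upper` percentile of its sample distribution.
        base = exog.mean(axis=0)
        lo, hi = base.copy(), base.copy()
        lo[col] = stats.scoreatpercentile(exog[:, col], lower)
        hi[col] = stats.scoreatpercentile(exog[:, col], upper)
        return res.predict(hi) - res.predict(lo)

    print(first_difference(res, data.exog, 0))  # should reproduce diff above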

Plots

We extract information that will be used to draw some interesting plots:

In [25]: nobs = res.nobs

In [26]: y = data.endog[:, 0] / data.endog.sum(1)

In [27]: yhat = res.mu

Plot yhat vs y:

In [28]: plt.figure();

In [29]: plt.scatter(yhat, y);

In [30]: line_fit = sm.OLS(y, sm.add_constant(yhat, prepend=False)).fit().params

In [31]: fit = lambda x: line_fit[1] + line_fit[0] * x  # better way in scipy? See the sketch below.

In [32]: plt.plot(np.linspace(0, 1, nobs), fit(np.linspace(0, 1, nobs)));

In [33]: plt.title('Model Fit Plot');

In [34]: plt.ylabel('Observed values');

In [35]: plt.xlabel('Fitted values');
[Figure: Model Fit Plot (glm_fitted.png)]
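
To answer the inline question above: scipy can fit the same simple regression line in one call with stats.linregress (a sketch using the stats module imported earlier):

    slope, intercept, r_value, p_value, std_err = stats.linregress(yhat, y)
    fit = lambda x: intercept + slope * x  # same line as the OLS version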

Plot yhat vs. Pearson residuals:

In [36]: plt.figure();

In [37]: plt.scatter(yhat, res.resid_pearson);

In [38]: plt.plot([0.0, 1.0], [0.0, 0.0], 'k-');

In [39]: plt.title('Residual Dependence Plot');

In [40]: plt.ylabel('Pearson Residuals');

In [41]: plt.xlabel('Fitted values');
[Figure: Residual Dependence Plot (glm_resids.png)]

Histogram of standardized deviance residuals

In [42]: plt.figure();

In [43]: resid = res.resid_deviance.copy()

In [44]: resid_std = (resid - resid.mean()) / resid.std()

In [45]: plt.hist(resid_std, bins=25);

In [46]: plt.title('Histogram of standardized deviance residuals');
[Figure: Histogram of standardized deviance residuals (glm_hist_res.png)]

QQ Plot of Deviance Residuals

In [47]: from statsmodels import graphics

In [48]: graphics.gofplots.qqplot(resid, line='r');
[Figure: QQ plot of deviance residuals (glm_qqplot.png)]

GLM: Gamma for proportional count response

Load data

In the example above, we printed the NOTE attribute to learn about the Star98 dataset. Statsmodels datasets ship with other useful information. For example:

In [49]: print(sm.datasets.scotland.DESCRLONG)

This data is based on the example in Gill and describes the proportion of
voters who voted Yes to grant the Scottish Parliament taxation powers.
The data are divided into 32 council districts.  This example's explanatory
variables include the amount of council tax collected in pounds sterling as
of April 1997 per two adults before adjustments, the female percentage of
total claims for unemployment benefits as of January, 1998, the standardized
mortality rate (UK is 100), the percentage of labor force participation,
regional GDP, the percentage of children aged 5 to 15, and an interaction term
between female unemployment and the council tax.

The original source files and variable information are included in
/scotland/src/
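
The other dataset modules expose the same kind of metadata; assuming the standard attribute names (DESCRSHORT, NOTE, COPYRIGHT), a loop like the following sketch prints them for any dataset:

    for attr in ('DESCRSHORT', 'NOTE', 'COPYRIGHT'):
        print(getattr(sm.datasets.scotland, attr))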

Load the data and add a constant to the exogenous variables:

In [50]: data2 = sm.datasets.scotland.load()

In [51]: data2.exog = sm.add_constant(data2.exog, prepend=False)

In [52]: print(data2.exog[:5, :])
[[   712.      21.     105.      82.4  13566.      12.3  14952.       1. ]
 [   643.      26.5     97.      80.2  13566.      15.3  17039.5      1. ]
 [   679.      28.3    113.      86.3   9611.      13.9  19215.7      1. ]
 [   801.      27.1    109.      80.4   9483.      13.6  21707.1      1. ]
 [   753.      22.     115.      64.7   9265.      14.6  16566.       1. ]]

In [53]: print(data2.endog[:5])
[ 60.3  52.3  53.4  57.   68.7]

Fit and summary

In [54]: glm_gamma = sm.GLM(data2.endog, data2.exog, family=sm.families.Gamma())

In [55]: glm_results = glm_gamma.fit()

In [56]: print(glm_results.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                      y   No. Observations:                   32
Model:                            GLM   Df Residuals:                       24
Model Family:                   Gamma   Df Model:                            7
Link Function:          inverse_power   Scale:                0.00358428317349
Method:                          IRLS   Log-Likelihood:                -83.017
Date:                Thu, 24 Oct 2013   Deviance:                     0.087389
Time:                        04:01:37   Pearson chi2:                   0.0860
No. Iterations:                     5                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1          4.962e-05   1.62e-05      3.060      0.002      1.78e-05  8.14e-05
x2             0.0020      0.001      3.824      0.000         0.001     0.003
x3         -7.181e-05   2.71e-05     -2.648      0.008        -0.000 -1.87e-05
x4             0.0001   4.06e-05      2.757      0.006      3.23e-05     0.000
x5         -1.468e-07   1.24e-07     -1.187      0.235     -3.89e-07  9.56e-08
x6            -0.0005      0.000     -2.159      0.031        -0.001 -4.78e-05
x7         -2.427e-06   7.46e-07     -3.253      0.001     -3.89e-06 -9.65e-07
const         -0.0178      0.011     -1.548      0.122        -0.040     0.005
==============================================================================
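
The Gamma family defaults to the inverse power link shown in the summary, so the linear predictor X*beta estimates 1/mu. Under that reading, the fitted means can be recovered by hand; a quick consistency check:

    eta = data2.exog.dot(glm_results.params)  # linear predictor
    print(1.0 / eta[:5])                      # invert the inverse_power link
    print(glm_results.mu[:5])                 # fitted means from the model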