Getting started

This very simple case-study is designed to get you up-and-running quickly with statsmodels. Starting from raw data, we will show the steps needed to estimate a statistical model and to draw a diagnostic plot. We will only use functions provided by statsmodels or its pandas and patsy dependencies.

Loading modules and functions

After installing statsmodels and its dependencies, we load a few modules and functions:

In [1]: from __future__ import print_function

In [2]: import statsmodels.api as sm

In [3]: import pandas

In [4]: from patsy import dmatrices

pandas builds on numpy arrays to provide rich data structures and data analysis tools. The pandas.DataFrame function provides labelled arrays of (potentially heterogenous) data, similar to the R “data.frame”. The pandas.read_csv function can be used to convert a comma-separated values file to a DataFrame object.

patsy is a Python library for describing statistical models and building Design Matrices using R-like formulas.

Data

We download the Guerry dataset, a collection of historical data used in support of Andre-Michel Guerry’s 1833 Essay on the Moral Statistics of France. The data set is hosted online in comma-separated values format (CSV) by the Rdatasets repository. We could download the file locally and then load it using read_csv, but pandas takes care of all of this automatically for us:

In [5]: df = sm.datasets.get_rdataset("Guerry", "HistData", cache=True).data
---------------------------------------------------------------------------
gaierror                                  Traceback (most recent call last)
/usr/lib/python3.7/urllib/request.py in do_open(self, http_class, req, **http_conn_args)
   1316                 h.request(req.get_method(), req.selector, req.data, headers,
-> 1317                           encode_chunked=req.has_header('Transfer-encoding'))
   1318             except OSError as err: # timeout error

/usr/lib/python3.7/http/client.py in request(self, method, url, body, headers, encode_chunked)
   1228         """Send a complete request to the server."""
-> 1229         self._send_request(method, url, body, headers, encode_chunked)
   1230 

/usr/lib/python3.7/http/client.py in _send_request(self, method, url, body, headers, encode_chunked)
   1274             body = _encode(body, 'body')
-> 1275         self.endheaders(body, encode_chunked=encode_chunked)
   1276 

/usr/lib/python3.7/http/client.py in endheaders(self, message_body, encode_chunked)
   1223             raise CannotSendHeader()
-> 1224         self._send_output(message_body, encode_chunked=encode_chunked)
   1225 

/usr/lib/python3.7/http/client.py in _send_output(self, message_body, encode_chunked)
   1015         del self._buffer[:]
-> 1016         self.send(msg)
   1017 

/usr/lib/python3.7/http/client.py in send(self, data)
    955             if self.auto_open:
--> 956                 self.connect()
    957             else:

/usr/lib/python3.7/http/client.py in connect(self)
   1383 
-> 1384             super().connect()
   1385 

/usr/lib/python3.7/http/client.py in connect(self)
    927         self.sock = self._create_connection(
--> 928             (self.host,self.port), self.timeout, self.source_address)
    929         self.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

/usr/lib/python3.7/socket.py in create_connection(address, timeout, source_address)
    706     err = None
--> 707     for res in getaddrinfo(host, port, 0, SOCK_STREAM):
    708         af, socktype, proto, canonname, sa = res

/usr/lib/python3.7/socket.py in getaddrinfo(host, port, family, type, proto, flags)
    747     addrlist = []
--> 748     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
    749         af, socktype, proto, canonname, sa = res

gaierror: [Errno -3] Temporary failure in name resolution

During handling of the above exception, another exception occurred:

URLError                                  Traceback (most recent call last)
<ipython-input-5-cab9fdf84142> in <module>()
----> 1 df = sm.datasets.get_rdataset("Guerry", "HistData", cache=True).data

/build/statsmodels-0.9.0/.pybuild/cpython3_3.7_statsmodels/build/statsmodels/datasets/utils.py in get_rdataset(dataname, package, cache)
    289                      "master/doc/"+package+"/rst/")
    290     cache = _get_cache(cache)
--> 291     data, from_cache = _get_data(data_base_url, dataname, cache)
    292     data = read_csv(data, index_col=0)
    293     data = _maybe_reset_index(data)

/build/statsmodels-0.9.0/.pybuild/cpython3_3.7_statsmodels/build/statsmodels/datasets/utils.py in _get_data(base_url, dataname, cache, extension)
    220     url = base_url + (dataname + ".%s") % extension
    221     try:
--> 222         data, from_cache = _urlopen_cached(url, cache)
    223     except HTTPError as err:
    224         if '404' in str(err):

/build/statsmodels-0.9.0/.pybuild/cpython3_3.7_statsmodels/build/statsmodels/datasets/utils.py in _urlopen_cached(url, cache)
    211     # not using the cache or didn't find it in cache
    212     if not from_cache:
--> 213         data = urlopen(url, timeout=3).read()
    214         if cache is not None:  # then put it in the cache
    215             _cache_it(data, cache_path)

/usr/lib/python3.7/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    220     else:
    221         opener = _opener
--> 222     return opener.open(url, data, timeout)
    223 
    224 def install_opener(opener):

/usr/lib/python3.7/urllib/request.py in open(self, fullurl, data, timeout)
    523             req = meth(req)
    524 
--> 525         response = self._open(req, data)
    526 
    527         # post-process response

/usr/lib/python3.7/urllib/request.py in _open(self, req, data)
    541         protocol = req.type
    542         result = self._call_chain(self.handle_open, protocol, protocol +
--> 543                                   '_open', req)
    544         if result:
    545             return result

/usr/lib/python3.7/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
    501         for handler in handlers:
    502             func = getattr(handler, meth_name)
--> 503             result = func(*args)
    504             if result is not None:
    505                 return result

/usr/lib/python3.7/urllib/request.py in https_open(self, req)
   1358         def https_open(self, req):
   1359             return self.do_open(http.client.HTTPSConnection, req,
-> 1360                 context=self._context, check_hostname=self._check_hostname)
   1361 
   1362         https_request = AbstractHTTPHandler.do_request_

/usr/lib/python3.7/urllib/request.py in do_open(self, http_class, req, **http_conn_args)
   1317                           encode_chunked=req.has_header('Transfer-encoding'))
   1318             except OSError as err: # timeout error
-> 1319                 raise URLError(err)
   1320             r = h.getresponse()
   1321         except:

URLError: <urlopen error [Errno -3] Temporary failure in name resolution>

The Input/Output doc page shows how to import from various other formats.

We select the variables of interest and look at the bottom 5 rows:

In [6]: vars = ['Department', 'Lottery', 'Literacy', 'Wealth', 'Region']

In [7]: df = df[vars]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-7-817b52d314c7> in <module>()
----> 1 df = df[vars]

NameError: name 'df' is not defined

In [8]: df[-5:]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-8-6b38bebfca7f> in <module>()
----> 1 df[-5:]

NameError: name 'df' is not defined

Notice that there is one missing observation in the Region column. We eliminate it using a DataFrame method provided by pandas:

In [9]: df = df.dropna()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-9-1ac174be4d67> in <module>()
----> 1 df = df.dropna()

NameError: name 'df' is not defined

In [10]: df[-5:]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-10-6b38bebfca7f> in <module>()
----> 1 df[-5:]

NameError: name 'df' is not defined

Substantive motivation and model

We want to know whether literacy rates in the 86 French departments are associated with per capita wagers on the Royal Lottery in the 1820s. We need to control for the level of wealth in each department, and we also want to include a series of dummy variables on the right-hand side of our regression equation to control for unobserved heterogeneity due to regional effects. The model is estimated using ordinary least squares regression (OLS).

Design matrices (endog & exog)

To fit most of the models covered by statsmodels, you will need to create two design matrices. The first is a matrix of endogenous variable(s) (i.e. dependent, response, regressand, etc.). The second is a matrix of exogenous variable(s) (i.e. independent, predictor, regressor, etc.). The OLS coefficient estimates are calculated as usual:

\[\hat{\beta} = (X'X)^{-1} X'y\]

where \(y\) is an \(N \times 1\) column of data on lottery wagers per capita (Lottery). \(X\) is \(N \times 7\) with an intercept, the Literacy and Wealth variables, and 4 region binary variables.

The patsy module provides a convenient function to prepare design matrices using R-like formulas. You can find more information here.

We use patsy’s dmatrices function to create design matrices:

In [11]: y, X = dmatrices('Lottery ~ Literacy + Wealth + Region', data=df, return_type='dataframe')
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-11-e357b4316ca3> in <module>()
----> 1 y, X = dmatrices('Lottery ~ Literacy + Wealth + Region', data=df, return_type='dataframe')

NameError: name 'df' is not defined

The resulting matrices/data frames look like this:

In [12]: y[:3]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-12-941693b55cb6> in <module>()
----> 1 y[:3]

NameError: name 'y' is not defined

In [13]: X[:3]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-13-153642677bb3> in <module>()
----> 1 X[:3]

NameError: name 'X' is not defined

Notice that dmatrices has

  • split the categorical Region variable into a set of indicator variables.
  • added a constant to the exogenous regressors matrix.
  • returned pandas DataFrames instead of simple numpy arrays. This is useful because DataFrames allow statsmodels to carry-over meta-data (e.g. variable names) when reporting results.

The above behavior can of course be altered. See the patsy doc pages.

Model fit and summary

Fitting a model in statsmodels typically involves 3 easy steps:

  1. Use the model class to describe the model
  2. Fit the model using a class method
  3. Inspect the results using a summary method

For OLS, this is achieved by:

In [14]: mod = sm.OLS(y, X)    # Describe model
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-14-5408baa0a82d> in <module>()
----> 1 mod = sm.OLS(y, X)    # Describe model

NameError: name 'y' is not defined

In [15]: res = mod.fit()       # Fit model
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-15-7e8bcaf10af6> in <module>()
----> 1 res = mod.fit()       # Fit model

NameError: name 'mod' is not defined

In [16]: print(res.summary())   # Summarize model
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-16-963a1fa5b90f> in <module>()
----> 1 print(res.summary())   # Summarize model

NameError: name 'res' is not defined

The res object has many useful attributes. For example, we can extract parameter estimates and r-squared by typing:

In [17]: res.params
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-17-a6ed515e2925> in <module>()
----> 1 res.params

NameError: name 'res' is not defined

In [18]: res.rsquared
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-18-6c0195efc793> in <module>()
----> 1 res.rsquared

NameError: name 'res' is not defined

Type dir(res) for a full list of attributes.

For more information and examples, see the Regression doc page

Diagnostics and specification tests

statsmodels allows you to conduct a range of useful regression diagnostics and specification tests. For instance, apply the Rainbow test for linearity (the null hypothesis is that the relationship is properly modelled as linear):

In [19]: sm.stats.linear_rainbow(res)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-19-052802ccfec2> in <module>()
----> 1 sm.stats.linear_rainbow(res)

NameError: name 'res' is not defined

Admittedly, the output produced above is not very verbose, but we know from reading the docstring (also, print(sm.stats.linear_rainbow.__doc__)) that the first number is an F-statistic and that the second is the p-value.

statsmodels also provides graphics functions. For example, we can draw a plot of partial regression for a set of regressors by:

In [20]: sm.graphics.plot_partregress('Lottery', 'Wealth', ['Region', 'Literacy'],
   ....:                              data=df, obs_labels=False)
   ....: 
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-20-ebbe148b620d> in <module>()
      1 sm.graphics.plot_partregress('Lottery', 'Wealth', ['Region', 'Literacy'],
----> 2                              data=df, obs_labels=False)

NameError: name 'df' is not defined
_images/gettingstarted_0.png

More

Congratulations! You’re ready to move on to other topics in the Table of Contents