Prediction (out of sample)

In [1]:
%matplotlib inline

import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
# Design matrix: linear, sine, and quadratic terms, plus a constant.
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
# True model: y = 5 + 0.5*x1 + 0.5*sin(x1) - 0.02*(x1 - 5)**2 + noise
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.982
Model:                            OLS   Adj. R-squared:                  0.981
Method:                 Least Squares   F-statistic:                     843.6
Date:                Mon, 29 Apr 2019   Prob (F-statistic):           3.35e-40
Time:                        18:51:16   Log-Likelihood:                -3.6548
No. Observations:                  50   AIC:                             15.31
Df Residuals:                      46   BIC:                             22.96
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.7733      0.093     51.600      0.000       4.587       4.960
x1             0.5441      0.014     38.136      0.000       0.515       0.573
x2             0.4882      0.056      8.705      0.000       0.375       0.601
x3            -0.0235      0.001    -18.728      0.000      -0.026      -0.021
==============================================================================
Omnibus:                        1.495   Durbin-Watson:                   2.599
Prob(Omnibus):                  0.473   Jarque-Bera (JB):                1.409
Skew:                           0.388   Prob(JB):                        0.494
Kurtosis:                       2.730   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[ 4.18682316  4.69452619  5.1625743   5.56436005  5.8828785   6.11352104
  6.26483261  6.35710774  6.41905627  6.48308635  6.57997992  6.73383529
  6.9581079   7.25339955  7.60735963  7.99671419  8.39108986  8.75800407
  9.06820185  9.30046234  9.44508533  9.50548496  9.49762878  9.44741437
  9.38641438  9.34668927  9.35551988  9.43092523  9.57870205  9.79147243
 10.04989594 10.32584727 10.58703705 10.802317   10.94679744 11.0059366
 10.97792982 10.87400604 10.71658289 10.53558357 10.36352055 10.23015384
 10.15760151 10.1567058  10.22525169 10.34833089 10.5007912  10.65136895
 10.76782552 10.82224409]
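
A quick sanity check (a sketch, not part of the original notebook): predicting on the training design matrix simply reproduces the fitted values already stored on the results object.

# In-sample predictions coincide with the fitted values.
assert np.allclose(ypred, olsres.fittedvalues)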

Create a new sample of explanatory variables Xnew, predict out of sample, and plot

In [5]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.77748156 10.60183559 10.31445222  9.95896286  9.59280181  9.27314433
  9.04290813  8.92024564  8.89409966  8.92691051]
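
Point predictions alone convey no uncertainty. As a sketch (get_prediction is part of the statsmodels results API, though not used in the original example), summary_frame reports confidence intervals for the mean and prediction intervals for new observations:

# 95% intervals for the out-of-sample predictions (alpha=0.05).
pred = olsres.get_prediction(Xnew)
frame = pred.summary_frame(alpha=0.05)
# 'mean' is the point prediction; 'obs_ci_lower'/'obs_ci_upper' bound a new observation.
print(frame[["mean", "obs_ci_lower", "obs_ci_upper"]])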

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
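# In-sample and out-of-sample predictions, concatenated into one continuous curve.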
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate use of the identity transform; i.e., we don't want any formula-expansion magic from using **2.
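
As a quick illustration (a sketch using patsy directly, not part of the original notebook), the I() term is just the plain arithmetic column:

import patsy

# Column 0 of the design matrix is the intercept; column 1 is the I() term,
# which matches the raw numpy computation exactly.
dm = patsy.dmatrix("I((x1 - 5) ** 2)", {"x1": x1})
assert np.allclose(np.asarray(dm)[:, 1], (x1 - 5) ** 2)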

In [8]:
res.params
Out[8]:
Intercept           4.773309
x1                  0.544079
np.sin(x1)          0.488218
I((x1 - 5) ** 2)   -0.023459
dtype: float64

Now we only have to pass the single variable x1, and we get the transformed right-hand-side variables automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
0    10.777482
1    10.601836
2    10.314452
3     9.958963
4     9.592802
5     9.273144
6     9.042908
7     8.920246
8     8.894100
9     8.926911
dtype: float64
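
The interval machinery from above works here as well: with a formula-based model, get_prediction accepts the same dict and applies the formula's transformations automatically (again a sketch, not part of the original example).

# 95% confidence and prediction intervals for the formula-based forecasts.
print(res.get_prediction(exog=dict(x1=x1n)).summary_frame(alpha=0.05))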