Problem description
After developing a small ARMAX forecasting model for in-sample analysis, I'd like to predict some data out of sample.
The time series I use for the forecast calculation starts at 2013-01-01 and ends at 2013-12-31.
Here is the data I am working with:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from pandas import date_range, DatetimeIndex

hr = np.loadtxt("Data_2013_17.txt")
index = date_range(start='2013-1-1', end='2013-12-31', freq='D')
df = pd.DataFrame(hr, index=index)
holidays = ['2013-1-1', '2013-3-29', '2013-4-1', '2013-5-1', '2013-5-9',
            '2013-5-20', '2013-10-3', '2013-12-25', '2013-12-26']
# holidays for all Bundesländer
# Index subtraction acts as a set difference in the older pandas used here;
# newer pandas spells this df.asfreq('B').index.difference(DatetimeIndex(holidays))
idx = df.asfreq('B').index - DatetimeIndex(holidays)
indexed_df = df.reindex(idx)
# indexed_df = df.asfreq('B') (includes holidays)
# 'D'=day
# 'B'=business day
# W@MON=shows only mondays

# external variable
hr_ = np.loadtxt("Data_2_2013.txt")
index = date_range(start='2013-1-1', end='2013-12-31', freq='D')
df = pd.DataFrame(hr_, index=index)
idx2 = df.asfreq('B').index - DatetimeIndex(holidays)
external_df1 = df.reindex(idx2)
external_df = external_df1.fillna(external_df1.mean())
Output:
                0
2013-01-02  49.56
2013-01-03  48.09
2013-01-04  36.79
2013-01-07  60.84
2013-01-08  59.72
2013-01-09  61.88
2013-01-10  57.95
2013-01-11  56.29
2013-01-14  57.89
2013-01-15  64.49
2013-01-16  58.92
2013-01-17  62.30
2013-01-18  55.92
2013-01-21  55.67
2013-01-22  60.73
2013-01-23  60.12
2013-01-24  65.70
2013-01-25  55.15
2013-01-28  51.79
2013-01-29  39.69
2013-01-30  37.90
2013-01-31  37.60
2013-02-01  41.26
2013-02-04  29.18
2013-02-05  39.55
2013-02-06  47.57
2013-02-07  51.97
2013-02-08  46.95
2013-02-11  42.79
2013-02-12  51.83
...           ...
2013-11-18  58.04
2013-11-19  62.96
2013-11-20  63.90
2013-11-21  64.09
2013-11-22  64.78
2013-11-25  59.59
2013-11-26  70.69
2013-11-27  61.57
2013-11-28  47.87
2013-11-29  34.61
2013-12-02  68.77
2013-12-03  77.84
2013-12-04  63.09
2013-12-05  40.94
2013-12-06  38.60
2013-12-09  65.79
2013-12-10  68.98
2013-12-11  77.86
2013-12-12  76.44
2013-12-13  85.90
2013-12-16  53.51
2013-12-17  73.67
2013-12-18  59.76
2013-12-19  53.11
2013-12-20  38.33
2013-12-23  36.93
2013-12-24  11.30
2013-12-27  30.32
2013-12-30  39.94
2013-12-31  31.27

[252 rows x 1 columns]

                0
2013-01-02  70770
2013-01-03  74155
2013-01-04  74286
2013-01-07  75360
2013-01-08  76910
2013-01-09  78561
2013-01-10  77427
2013-01-11  75260
2013-01-14  78738
2013-01-15  78286
2013-01-16  79568
2013-01-17  79761
2013-01-18  77518
2013-01-21  80089
2013-01-22  79915
2013-01-23  78607
2013-01-24  79761
2013-01-25  77908
2013-01-28  79873
2013-01-29  80535
2013-01-30  76340
2013-01-31  78244
2013-02-01  77749
2013-02-04  79125
2013-02-05  79001
2013-02-06  77837
2013-02-07  77495
2013-02-08  75372
2013-02-11  73856
2013-02-12  77494
...           ...
2013-11-18  76292
2013-11-19  77420
2013-11-20  74993
2013-11-21  76658
2013-11-22  74769
2013-11-25  78347
2013-11-26  77756
2013-11-27  79648
2013-11-28  80075
2013-11-29  78587
2013-12-02  76867
2013-12-03  76070
2013-12-04  80344
2013-12-05  81736
2013-12-06  79617
2013-12-09  78085
2013-12-10  78430
2013-12-11  78120
2013-12-12  77735
2013-12-13  75872
2013-12-16  78651
2013-12-17  76180
2013-12-18  75867
2013-12-19  76018
2013-12-20  71101
2013-12-23  66841
2013-12-24  64557
2013-12-27  66747
2013-12-30  64787
2013-12-31  61101

[252 rows x 1 columns]

Descriptive statistics of ts:
                0
count  252.000000
mean    44.583651
std     11.708938
min     11.300000
25%     34.597500
50%     44.200000
75%     51.947500
max     85.900000
Skewness of endog_var: [ 0.44315988]
Kurtsosis of endog_var: [ 3.18049689]
Correlation hr & hr_: (0.71074420030220553, 2.0635001219278823e-57)
Augmented Dickey-Fuller Test for endog_var: (-2.9282259926181839, 0.042162780619902182, {'5%': -2.8698573654386559, '1%': -3.4492269328800189, '10%': -2.5712010851306641}, <statsmodels.tsa.stattools.ResultsStore object at 0x111e2ca50>)
Choosing the p and q values:
In:
arma_mod = sm.tsa.ARMA(indexed_df, (3,3), external_df).fit()
z = arma_mod.params
print('P- and Q-Values:')
print(z)
Output:
P- and Q-Values:
const      19.674538
0           0.000345
ar.L1.0    -0.062796
ar.L2.0     0.340800
ar.L3.0     0.436345
ma.L1.0     0.613498
ma.L2.0     0.057267
ma.L3.0    -0.415455
dtype: float64
/Applications/anaconda/lib/python2.7/site-packages/statsmodels-0.6.1-py2.7-macosx-10.5-x86_64.egg/statsmodels/base/model.py:466: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
Here's what I do to forecast out of sample:
In:
start_pred = '2014-1-3'
end_pred = '2014-1-3'
predict_price1 = arma_mod1.predict(start_pred, end_pred, external_df)#, dynamic=True)
print('Predicted Price (ARMAX): {}'.format(predict_price1))
Output:
Traceback (most recent call last):
  File "<ipython-input-34-ad7feec95e4a>", line 6, in <module>
    predict_price1 = arma_mod1.predict(start_pred, end_pred, external_df)#, dynamic=True)
  File "/Applications/anaconda/lib/python2.7/site-packages/statsmodels-0.6.1-py2.7-macosx-10.5-x86_64.egg/statsmodels/base/wrapper.py", line 92, in wrapper
    return data.wrap_output(func(results, *args, **kwargs), how)
  File "/Applications/anaconda/lib/python2.7/site-packages/statsmodels-0.6.1-py2.7-macosx-10.5-x86_64.egg/statsmodels/tsa/arima_model.py", line 1441, in predict
    return self.model.predict(self.params, start, end, exog, dynamic)
  File "/Applications/anaconda/lib/python2.7/site-packages/statsmodels-0.6.1-py2.7-macosx-10.5-x86_64.egg/statsmodels/tsa/arima_model.py", line 711, in predict
    start = self._get_predict_start(start, dynamic)
  File "/Applications/anaconda/lib/python2.7/site-packages/statsmodels-0.6.1-py2.7-macosx-10.5-x86_64.egg/statsmodels/tsa/arima_model.py", line 646, in _get_predict_start
    method)
  File "/Applications/anaconda/lib/python2.7/site-packages/statsmodels-0.6.1-py2.7-macosx-10.5-x86_64.egg/statsmodels/tsa/arima_model.py", line 376, in _validate
    start = _index_date(start, dates)
  File "/Applications/anaconda/lib/python2.7/site-packages/statsmodels-0.6.1-py2.7-macosx-10.5-x86_64.egg/statsmodels/tsa/base/datetools.py", line 57, in _index_date
    "an integer" % date)
ValueError: There is no frequency for these dates and date 2014-01-03 00:00:00 is not in dates index. Try giving a date that is in the dates index or use an integer
I don't understand this error!
The ARIMA source code, i.e. 'datetools.py', tells me the following:
    except KeyError as err:
        freq = _infer_freq(dates)
        if freq is None:
            #TODO: try to intelligently roll forward onto a date in the
            # index. Waiting to drop pandas 0.7.x support so this is
            # cleaner to do.
            raise ValueError("There is no frequency for these dates and "
                             "date %s is not in dates index. Try giving a "
                             "date that is in the dates index or use "
                             "an integer" % date)

        # we can start prediction at the end of endog
        if _idx_from_dates(dates[-1], date, freq) == 1:
            return len(dates)

        raise ValueError("date %s not in date index. Try giving a "
                         "date that is in the dates index or use an integer"
                         % date)


def _date_from_idx(d1, idx, freq):
    """
    Returns the date from an index beyond the end of a date series.
    d1 is the datetime of the last date in the series. idx is the
    index distance of how far the next date should be from d1. Ie., 1 gives
    the next date from d1 at freq.

    Notes
    -----
    This does not do any rounding to make sure that d1 is actually on the
    offset. For now, this needs to be taken care of before you get here.
    """
So that means it should be possible to forecast out of sample. I just don't understand where and how I need to change my objects.
I found some older posts, but they don't tell me what to do either: "Python out of sample forecasting ARIMA predict()" and https://stats.stackexchange.com/questions/76160/im-not-sure-that-statsmodels-is-predicting-out-of-sample
How can I forecast data out of sample with the information given above?
Any help is much appreciated.
Recommended answer
Two problems. First, as the error message indicates, '2014-1-3' isn't in your data. You need to start the prediction within one time step of your data, as the docs should mention.
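To make the "within one time step" rule concrete, here is a minimal sketch of the check that the quoted `datetools.py` excerpt appears to perform. The helper `is_valid_start` is hypothetical (it is not part of statsmodels), and the plain business-day index is an assumption used only for illustration:

```python
import pandas as pd


def is_valid_start(index, start):
    """Hypothetical helper, not a statsmodels function.

    Returns True if `start` is either a date in `index` or exactly one
    frequency step after the last date, mirroring the logic in the
    quoted datetools.py excerpt.
    """
    start = pd.Timestamp(start)
    if start in index:
        return True
    if index.freq is None:
        # Without a frequency there is no way to define "the next step".
        return False
    return start == index[-1] + index.freq


# Assumption: a regular business-day index covering 2013 (holidays kept).
bday_index = pd.bdate_range('2013-01-01', '2013-12-31')

print(is_valid_start(bday_index, '2013-06-03'))  # in-sample date
print(is_valid_start(bday_index, '2014-01-01'))  # one business day after the end
print(is_valid_start(bday_index, '2014-01-03'))  # three business days after the end
```

With this index, '2014-01-03' is three business days past 2013-12-31, so it would be rejected even if the frequency were defined.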
Second, your data doesn't have a defined frequency. By removing the holidays from the business-day frequency data, you lose any sense of what the next day is; there's no way for us to know what the next day is supposed to be. You could code up a custom date offset for pandas, but that would be some work.
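The loss of frequency can be seen directly in pandas. The sketch below rebuilds the question's index from the same holiday list (written with zero-padded dates) and then shows one possible custom offset, pandas' `CustomBusinessDay`. Whether statsmodels 0.6.1 would accept such an offset is not established by this answer, so treat that part purely as an illustration of the idea:

```python
import pandas as pd
from pandas.tseries.offsets import CustomBusinessDay

# Same holidays as in the question, parsed into timestamps.
holiday_dates = pd.to_datetime([
    '2013-01-01', '2013-03-29', '2013-04-01', '2013-05-01', '2013-05-09',
    '2013-05-20', '2013-10-03', '2013-12-25', '2013-12-26',
])

# Business days of 2013 with the holidays removed, as in the question.
bdays = pd.bdate_range('2013-01-01', '2013-12-31')
idx = bdays.difference(holiday_dates)

print(len(idx))            # 252, matching the question's row count
print(idx.freq)            # None: the index no longer carries a frequency
print(pd.infer_freq(idx))  # None: the irregular gaps cannot be inferred

# A custom offset that skips weekends and the listed holidays.
custom_bday = CustomBusinessDay(holidays=list(holiday_dates))
next_day = pd.Timestamp('2013-12-24') + custom_bday
print(next_day)            # 2013-12-27, skipping the 25th and 26th

# An index built with the custom offset keeps that offset as its frequency.
custom_idx = pd.date_range('2013-01-01', '2013-12-31', freq=custom_bday)
```

Note that the holiday list only covers 2013, so stepping past 2013-12-31 with this offset would need the 2014 holidays as well.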
The easiest workaround is to use numpy arrays and drop the pandas DatetimeIndex.
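A minimal sketch of that workaround follows. It uses synthetic arrays in place of the question's data files, and `forecast_positions` is a hypothetical helper. It relies on the quoted `datetools.py` excerpt, where the first out-of-sample step maps to `len(dates)`. The assumption that future exogenous values must be supplied for the forecast period is mine and should be checked against the statsmodels documentation:

```python
import numpy as np


def forecast_positions(n_obs, steps=1):
    """Hypothetical helper: integer (start, end) positions for `steps`
    out-of-sample forecasts after `n_obs` observations (0-based)."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return n_obs, n_obs + steps - 1


# Synthetic stand-ins for the question's data (assumption). With the real
# DataFrames this would be something like indexed_df[0].values and
# external_df[0].values, which drops the DatetimeIndex.
rng = np.random.RandomState(0)
endog = 45 + 10 * rng.randn(252)
exog = 76000 + 3000 * rng.randn(252)

# Made-up exogenous value for the single forecast step (assumption).
future_exog = np.array([[77000.0]])

start, end = forecast_positions(len(endog), steps=1)
print(start, end)  # 252 252

# With the statsmodels 0.6.1 calls shown in the question, the fit and the
# forecast might then look like this (not executed here):
#   arma_mod = sm.tsa.ARMA(endog, (3, 3), exog).fit()
#   forecast = arma_mod.predict(start, end, exog=future_exog)
```

Because the positions are plain integers, no frequency is needed; the trade-off is that you have to map the forecast positions back to calendar dates yourself.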