Problem description
I have a multi-year time series and want the bounds between which 95% of my data lie. I want to look at this by season of the year ('DJF', 'MAM', 'JJA', 'SON').
I tried the following:
    import pandas as pd
    import numpy as np

    FRAC_2_TAIL = 0.025

    yr_idx = pd.date_range(start='2005-01-30', end='2008-02-02', freq='D')
    data = np.random.rand(len(yr_idx))
    df = pd.DataFrame(index=yr_idx, data=data, columns=['a'])

    month_num_to_season = {
        1: 'DJF', 2: 'DJF',
        3: 'MAM', 4: 'MAM', 5: 'MAM',
        6: 'JJA', 7: 'JJA', 8: 'JJA',
        9: 'SON', 10: 'SON', 11: 'SON',
        12: 'DJF'}

    grouped = df.groupby(lambda x: month_num_to_season.get(x.month))
    low_bounds = grouped.quantile(FRAC_2_TAIL)
    high_bounds = grouped.quantile(1 - FRAC_2_TAIL)
It works, in the sense that it gives:
    DJF    0.021284
    JJA    0.024769
    MAM    0.030149
    SON    0.041784
but it takes a very long time on my minutely-frequency, decade-long data sets.
I can make use of a TimeGrouper to get almost what I want:
    gp_time = df.groupby(pd.TimeGrouper('QS-DEC'))
    low_bounds = gp_time.agg(lambda x: x.quantile(FRAC_2_TAIL))
but that gives separate output for each season of each year, with no obvious way to combine the quantile limits across the years:
    2004-12-01    0.036755
    2005-03-01    0.034271
    ...
    2007-09-01    0.098833
    2007-12-01    0.068948
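For readers on a recent pandas, where `pd.TimeGrouper` is no longer available, the same per-quarter grouping can be sketched with `pd.Grouper(freq='QS-DEC')`. This is a minimal illustration on the daily example data from the question, not a fix for the underlying problem: it still yields one row per season per year.

```python
import numpy as np
import pandas as pd

FRAC_2_TAIL = 0.025

# Same daily example range as in the question.
yr_idx = pd.date_range(start='2005-01-30', end='2008-02-02', freq='D')
df = pd.DataFrame({'a': np.random.rand(len(yr_idx))}, index=yr_idx)

# 'QS-DEC' bins start on 1 Dec, 1 Mar, 1 Jun and 1 Sep of each year,
# so every (season, year) pair becomes its own group.
gp_time = df.groupby(pd.Grouper(freq='QS-DEC'))
low_bounds = gp_time.quantile(FRAC_2_TAIL)

# One row per season per year, rather than one row per season.
print(low_bounds)
```

Note that averaging these per-year quantiles would not, in general, reproduce the quantile of the pooled seasonal data, which is why the grouping key needs to be the season label itself.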
I've also tried making a freq='QS-DEC' time series of 'DJF', 'MAM', etc. to minimize the dictionary lookups, then upsampling it to df.index.freq and grouping on that. It is slow and memory-heavy too.
It seems like I'm missing something obvious.
Edit, in light of @JohnE's comment:
It is the dict lookup in the groupby that takes the time. Using 5 years of minutely data:
    %%timeit
    grouped = df.groupby(lambda x: month_num_to_season.get(x.month))
    > 13.3 s per loop
The quantile calculation itself is fast:
    %%timeit
    low_bounds = grouped.quantile(FRAC_2_TAIL)
    > 2.94 ms per loop
Adding a season column and grouping on that is similar in overall timing, again dominated by the dict lookup:
    SEAS = 'season'

    %%timeit
    df[SEAS] = [month_num_to_season.get(t_stamp.month) for t_stamp in df.index]
    > 13.1 s per loop

    %%timeit
    gp_on_col = df.groupby(SEAS)
    > 10000 loops, best of 3: 62.7 μs per loop

    %%timeit
    gp_on_col.quantile(FRAC_2_TAIL)
    > 753 ms per loop
I re-implemented the method of building a quarterly season dataframe, to minimize the dict lookups, and then upsampling it. This now looks like a substantial improvement; I do not know how I made it so slow before:
    SEASON_HALO = pd.datetools.relativedelta(months=4)
    start_with_halo = df.index.min() - SEASON_HALO
    end_with_halo = df.index.max() + SEASON_HALO
    > 84.1 μs per loop

    seasonal_idx = pd.DatetimeIndex(start=start_with_halo, end=end_with_halo, freq='QS-DEC')
    seasonal_ts = pd.DataFrame(index=seasonal_idx)
    > 440 μs per loop

    seasonal_ts[SEAS] = [month_num_to_season.get(t_stamp.month) for t_stamp in seasonal_ts.index]
    > 1.25 s per loop

    seasonal_minutely_ts = seasonal_ts.resample(df.index.freq, fill_method='ffill')
    > 5.12 ms per loop

    df_via_resample = df.join(seasonal_minutely_ts)
    > 47 ms per loop

    gp_up_sample = df_via_resample.groupby(SEAS)
    > 63.4 μs per loop

    gp_up_sample.quantile(FRAC_2_TAIL)
    > 834 ms per loop
That totals roughly 2 s, versus about 13 s for the other methods.
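The quarterly-label approach above relies on older pandas API (`pd.datetools`, `start=`/`end=` arguments to `pd.DatetimeIndex`, `fill_method=` in `resample`) that recent releases may no longer accept. As a rough sketch only, the same idea might be written with `pd.DateOffset`, `pd.date_range` and `reindex(method='ffill')`; these substitutions are assumptions made for illustration, not the original code, and the example uses daily rather than minutely data to stay small.

```python
import numpy as np
import pandas as pd

FRAC_2_TAIL = 0.025
SEAS = 'season'

month_num_to_season = {
    1: 'DJF', 2: 'DJF', 3: 'MAM', 4: 'MAM', 5: 'MAM', 6: 'JJA',
    7: 'JJA', 8: 'JJA', 9: 'SON', 10: 'SON', 11: 'SON', 12: 'DJF'}

yr_idx = pd.date_range(start='2005-01-30', end='2008-02-02', freq='D')
df = pd.DataFrame({'a': np.random.rand(len(yr_idx))}, index=yr_idx)

# Pad the range so that a quarter start always precedes the first sample.
season_halo = pd.DateOffset(months=4)
seasonal_idx = pd.date_range(start=df.index.min() - season_halo,
                             end=df.index.max() + season_halo,
                             freq='QS-DEC')

# Only one dict lookup per quarter instead of one per sample.
season_labels = pd.Series(
    [month_num_to_season[t_stamp.month] for t_stamp in seasonal_idx],
    index=seasonal_idx, name=SEAS)

# Forward-fill each quarter's label onto every timestamp of the data.
season_per_row = season_labels.reindex(df.index, method='ffill')

low_bounds = df.groupby(season_per_row)['a'].quantile(FRAC_2_TAIL)
print(low_bounds)
```

Using `reindex` directly onto `df.index` avoids building an intermediate upsampled frame and joining it back, which may also help with the memory use mentioned earlier.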
Recommended answer
In case it helps, I would suggest replacing the following list comprehension and dict lookup, which you identified as slow:
    month_to_season_dct = {
        1: 'DJF', 2: 'DJF',
        3: 'MAM', 4: 'MAM', 5: 'MAM',
        6: 'JJA', 7: 'JJA', 8: 'JJA',
        9: 'SON', 10: 'SON', 11: 'SON',
        12: 'DJF'
    }
    grp_ary = [month_to_season_dct.get(t_stamp.month) for t_stamp in df.index]
with the following, which uses a numpy array as a lookup table indexed by month number (index 0 is an unused placeholder):
    month_to_season_lu = np.array([
        None,
        'DJF', 'DJF',
        'MAM', 'MAM', 'MAM',
        'JJA', 'JJA', 'JJA',
        'SON', 'SON', 'SON',
        'DJF'
    ])
    grp_ary = month_to_season_lu[df.index.month]
Here's a timeit comparison of the two approaches on ~3 years of minutely data:
    In [16]: timeit [month_to_season_dct.get(t_stamp.month) for t_stamp in df.index]
    1 loops, best of 3: 12.3 s per loop

    In [17]: timeit month_to_season_lu[df.index.month]
    1 loops, best of 3: 549 ms per loop
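To show how the lookup table slots back into the original task, here is a minimal end-to-end sketch that groups by the resulting array and computes both bounds. It uses the daily example data from the question; wrapping the month index in `np.asarray` is a defensive assumption so the indexing behaves the same whether `df.index.month` is a NumPy array or a pandas Index.

```python
import numpy as np
import pandas as pd

FRAC_2_TAIL = 0.025

yr_idx = pd.date_range(start='2005-01-30', end='2008-02-02', freq='D')
df = pd.DataFrame({'a': np.random.rand(len(yr_idx))}, index=yr_idx)

# Position i holds the season of month i; position 0 is never used.
month_to_season_lu = np.array([
    None,
    'DJF', 'DJF',
    'MAM', 'MAM', 'MAM',
    'JJA', 'JJA', 'JJA',
    'SON', 'SON', 'SON',
    'DJF'
])

# Vectorised lookup: one fancy-indexing call instead of a Python loop.
grp_ary = month_to_season_lu[np.asarray(df.index.month)]

# Group all years together by season and take both tails.
grouped = df.groupby(grp_ary)['a']
low_bounds = grouped.quantile(FRAC_2_TAIL)
high_bounds = grouped.quantile(1 - FRAC_2_TAIL)

print(pd.DataFrame({'low': low_bounds, 'high': high_bounds}))
```

Because the grouping key is the season label alone, each bound is computed over the pooled data from every year, which is what the question asked for.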