用相邻日期的平均数据填补数据空白

问题描述

Imagine a data frame with multiple variables measured every 30 min. Every time series inside this data frame has gaps at possibly different positions. These gaps are to be replaced by some kind of running mean, lets say +/- 2 days. For example, if at day 4 07:30 I have missing data, I want to replace a NaN entry with the average of the measurements at 07:30 at day 2, 3, 5 and 6. Note that it is also possible that, for example, day 5, 07:30 is also NaN -- in this case, this is should be excluded from the average that is to replace the missing measurement at day 4 (should be possible with np.nanmean?)

I am not sure how to do this. Right now, I would probably loop over every single row and column in the data frame and write a really bad hack along the lines of np.mean(df.ix[[i-48, i, i+48], "A"]), but I feel there must be a more pythonic/pandas-y way?

Sample data set:

import numpy as np
import pandas as pd

# generate a 1-week time series
dates = pd.date_range(start="2014-01-01 00:00", end="2014-01-07 00:00", freq="30min")
df = pd.DataFrame(np.random.randn(len(dates),3), index=dates, columns=("A", "B", "C"))

# generate some artificial gaps
df.ix["2014-01-04 10:00":"2014-01-04 11:00", "A"] = np.nan
df.ix["2014-01-04 12:30":"2014-01-04 14:00", "B"] = np.nan
df.ix["2014-01-04 09:30":"2014-01-04 15:00", "C"] = np.nan

print df["2014-01-04 08:00":"2014-01-04 16:00"]

                            A         B         C
2014-01-04 08:00:00  0.675720  2.186484 -0.033969
2014-01-04 08:30:00 -0.897217  1.332437 -2.618197
2014-01-04 09:00:00  0.299395  0.837023  1.346117
2014-01-04 09:30:00  0.223051  0.913047       NaN
2014-01-04 10:00:00       NaN  1.395480       NaN
2014-01-04 10:30:00       NaN -0.800921       NaN
2014-01-04 11:00:00       NaN -0.932760       NaN
2014-01-04 11:30:00  0.057219 -0.071280       NaN
2014-01-04 12:00:00  0.215810 -1.099531       NaN
2014-01-04 12:30:00 -0.532563       NaN       NaN
2014-01-04 13:00:00 -0.697872       NaN       NaN
2014-01-04 13:30:00 -0.028541       NaN       NaN
2014-01-04 14:00:00 -0.073426       NaN       NaN
2014-01-04 14:30:00 -1.187419  0.221636       NaN
2014-01-04 15:00:00  1.802449  0.144715       NaN
2014-01-04 15:30:00  0.446615  1.013915 -1.813272
2014-01-04 16:00:00 -0.410670  1.265309 -0.198607

[17 rows x 3 columns]

(An even more sophisticated tool would also exclude measurements from the averaging procdure that were themselves created by averaging, but that doesn't necessarily have to be included in an answer, since I believe this may make things too complicated for now. )

/edit: A sample solution that I'm not really happy with:

# specify the columns of df where gaps should be filled
cols = ["A", "B", "C"]
for col in cols:
    for idx, rows in df.iterrows():
        if np.isnan(df.ix[idx, col]):
            # replace with mean of adjacent days
            df.ix[idx, col] = np.nanmean(df.ix[[idx-48, idx+48], col])

There is two things I don't like about this solution:

If there is a single line missing or duplicated anywhere, this fails. In the last line, I would like to subtract "one day" all the time, no matter if that is 47, 48 or 49 rows away. Also, it would be good of I could extend the range (e.g. -3 days to +3 days) without manually writing a list for the index.
I would like to get rid of the loops, if that is possible.

解决方案

This should be a faster and more concise way to do it. Main thing is to use the shift() function instead of the loop. Simple version would be this:

df[ df.isnull() ] = np.nanmean( [ df.shift(-48), df.shift(48) ] )

It turned out to be really hard to generalize this, but this seems to work:

df[ df.isnull() ] = np.nanmean( [ df.shift(x).values for x in 
                                     range(-48*window,48*(window+1),48) ], axis=0 )

I'm not sure, but suspect there might be a bug with nanmean and it's also the same reason you got missing values yourself. It seems to me that nanmean cannot handle nans if you feed it a dataframe. But if I convert to an array (with .values) and use axis=0 then it seems to work.

Check results for window=1:

print df.ix["2014-01-04 12:30":"2014-01-04 14:00", "B"]
print df.ix["2014-01-03 12:30":"2014-01-03 14:00", "B"]
print df.ix["2014-01-05 12:30":"2014-01-05 14:00", "B"]    

2014-01-04 12:30:00    0.940193     # was nan, now filled
2014-01-04 13:00:00    0.078160
2014-01-04 13:30:00   -0.662918
2014-01-04 14:00:00   -0.967121

2014-01-03 12:30:00    0.947915     # day before
2014-01-03 13:00:00    0.167218
2014-01-03 13:30:00   -0.391444
2014-01-03 14:00:00   -1.157040

2014-01-05 12:30:00    0.932471     # day after
2014-01-05 13:00:00   -0.010899
2014-01-05 13:30:00   -0.934391
2014-01-05 14:00:00   -0.777203

Regarding problem #2, it will depend on your data but if you precede the above with

df = df.resample('30min')

that will give you a row of nans for all the missing rows and then you can fill them in the same as all the other nans. That's probably the simplest and fastest way if it works.

Alternatively, you could do something with groupby. My groupby-fu is weak but to give you the flavor of it, something like:

df.groupby( df.index.hour ).fillna(method='pad')

would correctly deal the issue of missing rows, but not the other things.