Problem description
Imagine a data frame with multiple variables measured every 30 min. Every time series inside this data frame has gaps at possibly different positions. These gaps are to be replaced by some kind of running mean, let's say +/- 2 days. For example, if at day 4 07:30 I have missing data, I want to replace the NaN entry with the average of the measurements at 07:30 on days 2, 3, 5 and 6. Note that it is also possible that, for example, day 5, 07:30 is also NaN -- in this case, it should be excluded from the average that replaces the missing measurement at day 4 (should be possible with np.nanmean?).
I am not sure how to do this. Right now, I would probably loop over every single row and column in the data frame and write a really bad hack along the lines of np.mean(df["A"].iloc[[i-48, i, i+48]]), but I feel there must be a more pythonic/pandas-y way?
Sample data set:
    import numpy as np
    import pandas as pd

    # generate a 1-week time series
    dates = pd.date_range(start="2014-01-01 00:00", end="2014-01-07 00:00", freq="30min")
    df = pd.DataFrame(np.random.randn(len(dates), 3), index=dates, columns=["A", "B", "C"])

    # generate some artificial gaps
    df.loc["2014-01-04 10:00":"2014-01-04 11:00", "A"] = np.nan
    df.loc["2014-01-04 12:30":"2014-01-04 14:00", "B"] = np.nan
    df.loc["2014-01-04 09:30":"2014-01-04 15:00", "C"] = np.nan

    print(df.loc["2014-01-04 08:00":"2014-01-04 16:00"])

                                A         B         C
    2014-01-04 08:00:00  0.675720  2.186484 -0.033969
    2014-01-04 08:30:00 -0.897217  1.332437 -2.618197
    2014-01-04 09:00:00  0.299395  0.837023  1.346117
    2014-01-04 09:30:00  0.223051  0.913047       NaN
    2014-01-04 10:00:00       NaN  1.395480       NaN
    2014-01-04 10:30:00       NaN -0.800921       NaN
    2014-01-04 11:00:00       NaN -0.932760       NaN
    2014-01-04 11:30:00  0.057219 -0.071280       NaN
    2014-01-04 12:00:00  0.215810 -1.099531       NaN
    2014-01-04 12:30:00 -0.532563       NaN       NaN
    2014-01-04 13:00:00 -0.697872       NaN       NaN
    2014-01-04 13:30:00 -0.028541       NaN       NaN
    2014-01-04 14:00:00 -0.073426       NaN       NaN
    2014-01-04 14:30:00 -1.187419  0.221636       NaN
    2014-01-04 15:00:00  1.802449  0.144715       NaN
    2014-01-04 15:30:00  0.446615  1.013915 -1.813272
    2014-01-04 16:00:00 -0.410670  1.265309 -0.198607

    [17 rows x 3 columns]

(An even more sophisticated tool would also exclude measurements from the averaging procedure that were themselves created by averaging, but that doesn't necessarily have to be included in an answer, since I believe this may make things too complicated for now.)
/edit: A sample solution that I'm not really happy with:
    # specify the columns of df where gaps should be filled
    cols = ["A", "B", "C"]
    for col in cols:
        for pos, idx in enumerate(df.index):
            if np.isnan(df.loc[idx, col]):
                # replace with mean of adjacent days (48 rows before and after)
                df.loc[idx, col] = np.nanmean(df[col].iloc[[pos - 48, pos + 48]])

There are two things I don't like about this solution:
- If there is a single line missing or duplicated anywhere, this fails. In the last line, I would like to subtract "one day" all the time, no matter if that is 47, 48 or 49 rows away. Also, it would be good if I could extend the range (e.g. -3 days to +3 days) without manually writing a list for the index.
- I would like to get rid of the loops, if that is possible.
This should be a faster and more concise way to do it. The main thing is to use the shift() function instead of the loop. A simple version would be this:
    df[df.isnull()] = np.nanmean([df.shift(-48).values, df.shift(48).values], axis=0)
It turned out to be really hard to generalize this, but this seems to work:
    window = 1  # number of days to include on each side
    df[df.isnull()] = np.nanmean(
        [df.shift(x).values for x in range(-48 * window, 48 * (window + 1), 48)],
        axis=0,
    )

I'm not sure, but I suspect there is a bug in nanmean, which may also be why your own attempt left missing values. It seems to me that nanmean cannot handle NaNs if you feed it a data frame. But if I convert to an array (with .values) and use axis=0, then it seems to work.
Check results for window=1:
    print(df.loc["2014-01-04 12:30":"2014-01-04 14:00", "B"])
    print(df.loc["2014-01-03 12:30":"2014-01-03 14:00", "B"])
    print(df.loc["2014-01-05 12:30":"2014-01-05 14:00", "B"])

    2014-01-04 12:30:00    0.940193    # was nan, now filled
    2014-01-04 13:00:00    0.078160
    2014-01-04 13:30:00   -0.662918
    2014-01-04 14:00:00   -0.967121

    2014-01-03 12:30:00    0.947915    # day before
    2014-01-03 13:00:00    0.167218
    2014-01-03 13:30:00   -0.391444
    2014-01-03 14:00:00   -1.157040

    2014-01-05 12:30:00    0.932471    # day after
    2014-01-05 13:00:00   -0.010899
    2014-01-05 13:30:00   -0.934391
    2014-01-05 14:00:00   -0.777203
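The shift-based fill above can be wrapped in a small function. What follows is only a sketch under stated assumptions: the function name `fill_gaps_from_neighbor_days` and its `rows_per_day` parameter are made up for illustration, the index is assumed to be regular (exactly 48 rows per day), and `.loc` is used because `.ix` is no longer available in recent pandas versions.

```python
import warnings

import numpy as np
import pandas as pd


def fill_gaps_from_neighbor_days(df, window=2, rows_per_day=48):
    """Fill NaNs with the NaN-aware mean of the same row offset on +/- `window` days.

    Hypothetical helper; assumes a regular index with `rows_per_day` rows per day.
    """
    # Row offsets of the neighboring days; 0 (the row itself) is left out.
    offsets = [k * rows_per_day for k in range(-window, window + 1) if k != 0]
    # Stack the shifted copies into an array of shape (n_offsets, n_rows, n_cols).
    stacked = np.stack([df.shift(o).to_numpy(dtype=float) for o in offsets])
    with warnings.catch_warnings():
        # nanmean warns when every neighbor is NaN; the result is then NaN.
        warnings.simplefilter("ignore", category=RuntimeWarning)
        means = np.nanmean(stacked, axis=0)
    replacement = pd.DataFrame(means, index=df.index, columns=df.columns)
    # fillna only touches cells that are NaN in the original frame.
    return df.fillna(replacement)


dates = pd.date_range("2014-01-01 00:00", "2014-01-07 00:00", freq="30min")
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.standard_normal((len(dates), 3)), index=dates, columns=list("ABC"))
demo.loc["2014-01-04 10:00":"2014-01-04 11:00", "A"] = np.nan

filled = fill_gaps_from_neighbor_days(demo, window=2)
print(filled.loc["2014-01-04 10:00":"2014-01-04 11:00", "A"])
```

Because all means are computed from the original frame before anything is written back, a filled value never feeds into another filled value, which covers the "exclude averaged measurements" wish from the question in this one-pass setting.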
Regarding the problem of missing rows, it will depend on your data, but if you precede the above with

    df = df.resample('30min').mean()
that will give you a row of nans for all the missing rows and then you can fill them in the same as all the other nans. That's probably the simplest and fastest way if it works.
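As a small illustration of the re-gridding idea, here is a sketch with made-up data in which one record is dropped and then restored as a NaN row. In recent pandas versions `resample` returns an intermediate object rather than a data frame, so an aggregation has to be chosen; `.mean()` is an assumption here, picked because it also averages any duplicated timestamps.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2014-01-01 00:00", "2014-01-07 00:00", freq="30min")
full = pd.DataFrame({"A": np.arange(len(dates), dtype=float)}, index=dates)

# Simulate a logger that skipped one record.
missing = pd.Timestamp("2014-01-03 07:30")
irregular = full.drop(index=missing)

# Re-grid to a strict 30-minute index; the skipped record comes back as NaN.
regular = irregular.resample("30min").mean()

print(len(irregular), len(regular))
print(regular.loc[missing])
```

After this step every day again has the same number of rows, so a fixed shift of 48 rows corresponds to exactly one day.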
Alternatively, you could do something with groupby. My groupby-fu is weak but to give you the flavor of it, something like:
    df.groupby(df.index.time).ffill()

would correctly deal with the issue of missing rows, but not the other things.
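The groupby idea can probably be pushed a bit further. The following is a sketch, not a tested recipe for real data: it groups by time of day (`df.index.time`, so that 07:00 and 07:30 end up in different groups) and takes a centered rolling mean inside each group. The function name `fill_by_time_of_day` is hypothetical, and the approach assumes that no whole day is missing for a given clock time, since the rolling window counts rows within the group, not calendar days.

```python
import numpy as np
import pandas as pd


def fill_by_time_of_day(df, window=2):
    """Fill NaNs with the mean of the same clock time on up to `window` days either side.

    Hypothetical helper; assumes each clock time has one row per consecutive day.
    """
    size = 2 * window + 1
    # Within each time-of-day group, consecutive rows are consecutive days,
    # so a centered rolling mean spans +/- `window` days. Rolling means skip
    # NaNs, which excludes the gap itself and any missing neighbors.
    neighbor_mean = df.groupby(df.index.time).transform(
        lambda s: s.rolling(size, center=True, min_periods=1).mean()
    )
    # Only cells that are NaN in the original frame are replaced.
    return df.fillna(neighbor_mean)


dates = pd.date_range("2014-01-01 00:00", "2014-01-07 00:00", freq="30min")
# Each value is simply the day of the month, which makes the result easy to check.
demo = pd.DataFrame({"A": dates.day.to_numpy(dtype=float)}, index=dates)
day4 = pd.Timestamp("2014-01-04 07:30")
day5 = pd.Timestamp("2014-01-05 07:30")
demo.loc[day4, "A"] = np.nan
demo.loc[day5, "A"] = np.nan

filled = fill_by_time_of_day(demo, window=2)
print(filled.loc[[day4, day5], "A"])
```

In this example day 5 at 07:30 is missing too, so the fill for day 4 is the mean of days 2, 3 and 6 only, which is the behavior asked for in the question.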