Pandas: aligning multiple dataframes with a timestamp index

Question

This has been the bane of my life for the past couple of days. I have several pandas DataFrames that contain time series data with irregular frequencies, and I am trying to align them into a single DataFrame.

Below is some code with representative dataframes df1, df2, and df3 (I actually have n=5, and would appreciate a solution that works for all n>2):

# df1, df2, df3 are given at the bottom
import pandas as pd
import datetime

# I can align df1 to df2 easily
df1aligned, df2aligned = df1.align(df2)
# And then concatenate into a single dataframe
combined_1_n_2 = pd.concat([df1aligned, df2aligned], axis=1)
# Since I don't know any better, I then try to align df3 to combined_1_n_2 manually:
combined_1_n_2.align(df3)
# error: Reindexing only valid with uniquely valued Index objects

I have an idea why I get this error, so I get rid of the duplicate indices in combined_1_n_2 and try again:

combined_1_n_2 = combined_1_n_2.groupby(combined_1_n_2.index).first()
combined_1_n_2.align(df3)  # But I still get the same error
# error: Reindexing only valid with uniquely valued Index objects

Why am I getting this error? Even if this worked, it would be completely manual and ugly. How can I align more than two time series and combine them in a single dataframe?

Data:

df1 = pd.DataFrame(
    {'price': [62.1250, 62.2500, 62.2375, 61.9250, 61.9125]},
    index=pd.to_datetime(['2008-06-01 06:03:59.614000', '2008-06-01 06:03:59.692000',
                          '2008-06-01 06:15:42.004000', '2008-06-01 06:15:42.083000',
                          '2008-06-01 06:17:01.654000']))

# Note: df2 contains a duplicated timestamp (06:15:05.399000)
df2 = pd.DataFrame(
    {'price': [241.0625, 241.5000, 241.3750, 241.2500, 241.3750]},
    index=pd.to_datetime(['2008-06-01 06:13:34.524000', '2008-06-01 06:13:34.602000',
                          '2008-06-01 06:15:05.399000', '2008-06-01 06:15:05.399000',
                          '2008-06-01 06:15:42.082000']))

df3 = pd.DataFrame(
    {'price': [67.656, 67.875, 67.8125, 67.75, 67.6875]},
    index=pd.to_datetime(['2008-06-01 06:03:52.281000', '2008-06-01 06:03:52.359000',
                          '2008-06-01 06:13:34.848000', '2008-06-01 06:13:34.926000',
                          '2008-06-01 06:15:05.321000']))

Recommended answer

Your specific error is due to the column names of combined_1_n_2 having duplicates (both columns are named 'price'). By default align works on both axes, so it tries to reindex the columns as well as the index, which is why removing the duplicate index values did not help. You could rename the columns, and then the second align would work.

An alternative is to chain join calls. join merges frames on the index, and rsuffix disambiguates the overlapping 'price' column name, as below.

In [23]: df1.join(df2, how='outer', rsuffix='_1').join(df3, how='outer', rsuffix='_2')
Out[23]: 
                              price   price_1  price_2
2008-06-01 06:03:52.281000      NaN       NaN  67.6560
2008-06-01 06:03:52.359000      NaN       NaN  67.8750
2008-06-01 06:03:59.614000  62.1250       NaN      NaN
2008-06-01 06:03:59.692000  62.2500       NaN      NaN
2008-06-01 06:13:34.524000      NaN  241.0625      NaN
2008-06-01 06:13:34.602000      NaN  241.5000      NaN
2008-06-01 06:13:34.848000      NaN       NaN  67.8125
2008-06-01 06:13:34.926000      NaN       NaN  67.7500
2008-06-01 06:15:05.321000      NaN       NaN  67.6875
2008-06-01 06:15:05.399000      NaN  241.3750      NaN
2008-06-01 06:15:05.399000      NaN  241.2500      NaN
2008-06-01 06:15:42.004000  62.2375       NaN      NaN
2008-06-01 06:15:42.082000      NaN  241.3750      NaN
2008-06-01 06:15:42.083000  61.9250       NaN      NaN
2008-06-01 06:17:01.654000  61.9125       NaN      NaN
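
For n frames, the same chained join can be folded over a list rather than written out by hand. The following is only a rough sketch, not part of the original answer: the `frames` dict, its labels, and the shortened data are hypothetical stand-ins for df1..dfn, and it assumes each frame holds a single 'price' column as in the question. Renaming each column up front means no suffixes are needed.

```python
from functools import reduce

import pandas as pd

# Hypothetical stand-ins for df1..dfn (labels and shortened data are illustrative).
frames = {
    "a": pd.DataFrame(
        {"price": [62.1250, 62.2375]},
        index=pd.to_datetime(["2008-06-01 06:03:59.614", "2008-06-01 06:15:42.004"]),
    ),
    # This frame has a duplicated timestamp, like df2 in the question.
    "b": pd.DataFrame(
        {"price": [241.0625, 241.3750, 241.2500]},
        index=pd.to_datetime(["2008-06-01 06:13:34.524", "2008-06-01 06:15:05.399",
                              "2008-06-01 06:15:05.399"]),
    ),
    "c": pd.DataFrame(
        {"price": [67.6560, 67.8125]},
        index=pd.to_datetime(["2008-06-01 06:03:52.281", "2008-06-01 06:13:34.848"]),
    ),
}

# Give every frame a unique column name so the joins never collide.
renamed = [df.rename(columns={"price": f"price_{label}"})
           for label, df in frames.items()]

# Fold an outer join on the index over the list of frames.
combined = reduce(lambda left, right: left.join(right, how="outer"), renamed)
print(combined)
```

Each frame contributes its own column, and timestamps missing from a frame show up as NaN in that frame's column, as in the output above. If the frames share many timestamps and several of them contain duplicates, the joins can multiply rows, so it may be worth deduplicating first in that case.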