Problem description
I am trying to find duplicate rows in a pandas DataFrame.
    import pandas as pd

    df = pd.DataFrame(data=[[1,2],[3,4],[1,2],[1,4],[1,2]], columns=['col1','col2'])
    df
    Out[15]:
       col1  col2
    0     1     2
    1     3     4
    2     1     2
    3     1     4
    4     1     2

    duplicate_bool = df.duplicated(subset=['col1','col2'], keep='first')
    duplicate = df.loc[duplicate_bool == True]
    duplicate
    Out[16]:
       col1  col2
    2     1     2
    4     1     2
Is there a way to add a column holding the index of the first occurrence (the one that is kept)?
    duplicate
    Out[16]:
       col1  col2  index_original
    2     1     2               0
    4     1     2               0
Note: df could be very large in my case.
Recommended answer
Use groupby to create a new column holding the index of each row's first occurrence, then call duplicated:
    df['index_original'] = df.groupby(['col1', 'col2']).col1.transform('idxmin')
    df[df.duplicated(subset=['col1','col2'], keep='first')]

       col1  col2  index_original
    2     1     2               0
    4     1     2               0

Details
I group by the first two columns, then call transform with idxmin to get the first index of each group. Because col1 is constant within a group, idxmin returns the label of the group's first row.
    df.groupby(['col1', 'col2']).col1.transform('idxmin')

    0    0
    1    1
    2    0
    3    3
    4    0
    Name: col1, dtype: int64
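As a possible alternative (not part of the original answer), the same first-occurrence labels can be obtained by grouping the index itself and taking the first label of each group, which does not rely on how idxmin breaks ties. This is only a sketch; the variable name first_index is an assumption made for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]],
    columns=["col1", "col2"],
)

# Turn the index into a Series, group it by the key columns, and
# broadcast the first index label of each group back to every row.
# `first_index` is a hypothetical name, not from the original post.
first_index = (
    df.index.to_series()
    .groupby([df["col1"], df["col2"]])
    .transform("first")
)
print(first_index.tolist())
```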
duplicated gives me a boolean mask of the rows I want to keep (every occurrence after the first):
    df.duplicated(subset=['col1','col2'], keep='first')

    0    False
    1    False
    2     True
    3    False
    4     True
    dtype: bool
The rest is just boolean indexing.
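Putting the steps together, a minimal sketch could wrap them in a helper that leaves the input frame unmodified. The function name duplicates_with_first_index and its subset parameter are hypothetical, and the sketch assumes the key columns contain no missing values, since groupby drops NaN keys by default.

```python
import pandas as pd


def duplicates_with_first_index(df, subset):
    """Return duplicated rows plus the index label of their first occurrence.

    Hypothetical helper for illustration; assumes no NaN in `subset` columns.
    """
    # The first key column is constant within each group, so idxmin
    # yields the label of the first row of that group.
    first_index = df.groupby(subset)[subset[0]].transform("idxmin")
    # True for every occurrence after the first one.
    mask = df.duplicated(subset=subset, keep="first")
    # Boolean indexing, then attach the matching first-occurrence labels.
    return df.loc[mask].assign(index_original=first_index[mask])


df = pd.DataFrame(
    data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]],
    columns=["col1", "col2"],
)
duplicate = duplicates_with_first_index(df, ["col1", "col2"])
print(duplicate)
```

Returning a new frame through assign, rather than writing a column into df, keeps the original data untouched, which may matter when df is large and shared with other code.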