Question
I have a pandas DataFrame that records online activity as "clicks" during a user session. There are as many as 50,000 unique users, and the DataFrame has around 1.5 million rows, so most users have multiple records.
The four columns are a unique user ID ("User_ID"), the date the user began the service ("Registration"), the date the user used the service ("Session"), and the total number of clicks in that session ("clicks").
The DataFrame is organized as follows:
    User_ID  Registration  Session     clicks
    2349876  2012-02-22    2014-04-24  2
    1987293  2011-02-01    2013-05-03  1
    2234214  2012-07-22    2014-01-22  7
    9874452  2010-12-22    2014-08-22  2
    ...
(There is also a default index starting at 0, but User_ID could be set as the index instead.)
I would like to aggregate the total number of clicks per user since the Registration date. The resulting DataFrame (or pandas Series) would list each User_ID with its "Total_Clicks":
    User_ID  Total_Clicks
    2349876  722
    1987293  341
    2234214  220
    9874452  1405
    ...
How does one do this in pandas? Is this done with .agg()? Each User_ID needs to be summed individually.
With 1.5 million records, does this scale?
Recommended answer
IIUC, you can use groupby, sum and reset_index:
    print(df)
       User_ID Registration     Session  clicks
    0  2349876   2012-02-22  2014-04-24       2
    1  1987293   2011-02-01  2013-05-03       1
    2  2234214   2012-07-22  2014-01-22       7
    3  9874452   2010-12-22  2014-08-22       2

    print(df.groupby('User_ID')['clicks'].sum().reset_index())
       User_ID  clicks
    0  1987293       1
    1  2234214       7
    2  2349876       2
    3  9874452       2
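As a self-contained sketch of the same idea, the snippet below rebuilds the sample rows from the question and adds one hypothetical repeat session for user 2349876 (an assumption, not part of the original data) so that the per-user sum is visible. It also names the result column Total_Clicks, as the question asked, by passing `name=` to `Series.reset_index`.

```python
import pandas as pd

# Sample rows from the question, plus one HYPOTHETICAL extra session
# (user 2349876, 5 clicks) so that a user has more than one record.
df = pd.DataFrame({
    "User_ID": [2349876, 1987293, 2234214, 9874452, 2349876],
    "Registration": ["2012-02-22", "2011-02-01", "2012-07-22",
                     "2010-12-22", "2012-02-22"],
    "Session": ["2014-04-24", "2013-05-03", "2014-01-22",
                "2014-08-22", "2014-05-01"],
    "clicks": [2, 1, 7, 2, 5],
})

# Sum the clicks of each user, then turn the result back into a
# two-column DataFrame with the column name requested in the question.
total_clicks = (
    df.groupby("User_ID")["clicks"]
      .sum()
      .reset_index(name="Total_Clicks")
)
print(total_clicks)
```

Dropping `.reset_index(...)` leaves a Series indexed by User_ID, which is often enough if the totals are only looked up by user.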
If the first column, User_ID, is the index:
    print(df)
            Registration     Session  clicks
    User_ID
    2349876   2012-02-22  2014-04-24       2
    1987293   2011-02-01  2013-05-03       1
    2234214   2012-07-22  2014-01-22       7
    9874452   2010-12-22  2014-08-22       2

    print(df.groupby(level=0)['clicks'].sum().reset_index())
       User_ID  clicks
    0  1987293       1
    1  2234214       7
    2  2349876       2
    3  9874452       2
Or:
    print(df.groupby(df.index)['clicks'].sum().reset_index())
       User_ID  clicks
    0  1987293       1
    1  2234214       7
    2  2349876       2
    3  9874452       2
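On the question of whether this is a job for .agg(): it can be, although for a single sum it gives the same result as `.sum()`. The sketch below, using a small frame indexed by User_ID with one hypothetical repeat session, checks that the index-based groupby forms and `.agg("sum")` agree. The last line uses named aggregation, which assumes pandas 0.25 or later.

```python
import pandas as pd

# Small frame indexed by User_ID; the second 2349876 row is HYPOTHETICAL.
df = pd.DataFrame(
    {"clicks": [2, 1, 7, 2, 5]},
    index=pd.Index([2349876, 1987293, 2234214, 9874452, 2349876],
                   name="User_ID"),
)

# Three spellings of the same per-user sum.
by_level = df.groupby(level=0)["clicks"].sum()
by_index = df.groupby(df.index)["clicks"].sum()
by_agg = df.groupby(level="User_ID")["clicks"].agg("sum")

# Named aggregation sets the output column name in one step.
summary = (
    df.groupby(level="User_ID")
      .agg(Total_Clicks=("clicks", "sum"))
      .reset_index()
)
print(summary)
```

`.agg()` becomes more useful when several statistics are wanted at once, for example a sum and a session count per user.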
As Alexander pointed out, you need to filter the data before the groupby if any Session date is earlier than the Registration date for its User_ID:
    print(df)
       User_ID Registration     Session  clicks
    0  2349876   2012-02-22  2014-04-24       2
    1  1987293   2011-02-01  2013-05-03       1
    2  2234214   2012-07-22  2014-01-22       7
    3  9874452   2010-12-22  2014-08-22       2

    print(df[df.Session >= df.Registration].groupby('User_ID')['clicks'].sum().reset_index())
       User_ID  clicks
    0  1987293       1
    1  2234214       7
    2  2349876       2
    3  9874452       2
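The comparison `df.Session >= df.Registration` is safest when both columns hold real datetimes rather than strings. A minimal sketch, assuming the dates arrive as ISO-formatted strings and using hypothetical rows in which one user has a session before registration:

```python
import pandas as pd

# HYPOTHETICAL rows: user 2349876 has one session dated before registration.
df = pd.DataFrame({
    "User_ID": [2349876, 2349876, 1987293],
    "Registration": ["2012-02-22", "2012-02-22", "2011-02-01"],
    "Session": ["2012-01-15", "2014-04-24", "2013-05-03"],
    "clicks": [3, 2, 1],
})

# Parse both date columns so the comparison is chronological.
for col in ["Registration", "Session"]:
    df[col] = pd.to_datetime(df[col])

# Keep only sessions on or after registration, then sum per user.
valid = df[df["Session"] >= df["Registration"]]
totals = valid.groupby("User_ID")["clicks"].sum().reset_index()
print(totals)
```

The filter works row by row, so only the early session is dropped and the user's later sessions still count toward the total.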
I changed the third row of the data to give a better sample:
    print(df)
            Registration     Session  clicks
    User_ID
    2349876   2012-02-22  2014-04-24       2
    1987293   2011-02-01  2013-05-03       1
    2234214   2012-07-22  2012-01-22       7
    9874452   2010-12-22  2014-08-22       2

    print(df.Session >= df.Registration)
    User_ID
    2349876     True
    1987293     True
    2234214    False
    9874452     True
    dtype: bool

    print(df[df.Session >= df.Registration])
            Registration     Session  clicks
    User_ID
    2349876   2012-02-22  2014-04-24       2
    1987293   2011-02-01  2013-05-03       1
    9874452   2010-12-22  2014-08-22       2

    df1 = df[df.Session >= df.Registration]
    print(df1.groupby(df1.index)['clicks'].sum().reset_index())
       User_ID  clicks
    0  1987293       1
    1  2349876       2
    2  9874452       2
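On the scaling question: a groupby followed by sum runs as a vectorized operation, so 1.5 million rows should not need any per-user Python loop. One way to check on your own machine is to time it on synthetic data of the same shape. In the sketch below the user IDs and click counts are randomly generated (purely hypothetical values), and the timing will vary with hardware and pandas version.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data with roughly the shape described in the question:
# 1.5 million rows spread over at most 50,000 users. All values are random.
rng = np.random.default_rng(0)
n_rows, n_users = 1_500_000, 50_000

big = pd.DataFrame({
    "User_ID": rng.integers(1_000_000, 1_000_000 + n_users, size=n_rows),
    "clicks": rng.integers(1, 10, size=n_rows),
})

# Time only the aggregation step.
start = time.perf_counter()
totals = big.groupby("User_ID")["clicks"].sum()
elapsed = time.perf_counter() - start

print(f"{len(totals)} users aggregated in {elapsed:.3f} s")
```

If the real data also needs the Session/Registration filter, the boolean mask is vectorized in the same way and can be timed with the same pattern.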