Problem description
Take a SQL table with the following 4 fields:
Id,TimeStamp,Item,UserId
I would like to determine the most common sequences of Item for a UserId within a session. A session would simply be defined by a time threshold (i.e. if there are no entries for X minutes, any later entries are grouped into a new session).
Ideally, the sequences of Items could be grouped fuzzily, so that sequences differing in only one or two places are still counted as the same and grouped together.
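One way to picture the fuzzy grouping is with an edit distance between sequences. The following is only a sketch in Python, not something from the original post: `edit_distance`, `fuzzy_group`, the `max_diff` parameter and the sample `paths` are all hypothetical names and data, and the greedy strategy (most frequent exact sequence becomes the representative) is just one assumption about what "counted as the same" could mean.

```python
from collections import Counter

def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def fuzzy_group(sequences, max_diff=1):
    """Greedy grouping: frequent exact sequences become representatives first."""
    exact = Counter(tuple(s) for s in sequences)
    groups = {}  # representative sequence -> total count
    for seq, n in exact.most_common():
        for rep in groups:
            if edit_distance(seq, rep) <= max_diff:
                groups[rep] += n  # close enough: fold into existing group
                break
        else:
            groups[seq] = n       # nothing close: start a new group
    return groups

# Hypothetical session paths (made-up data, for illustration only)
paths = [(1, 2, 3), (1, 2, 3), (1, 2, 1, 3), (1, 2, 1, 3, 4), (5, 6)]
groups = fuzzy_group(paths, max_diff=1)
print(groups)
# {(1, 2, 3): 3, (1, 2, 1, 3, 4): 1, (5, 6): 1}
```

Here `(1, 2, 1, 3)` is one insertion away from `(1, 2, 3)` and is folded into it, while `(1, 2, 1, 3, 4)` is two edits away and stays separate. The result depends on the order in which representatives are picked, so this is a heuristic rather than a proper clustering.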
Does anyone know how I might tackle this problem in SQL?
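Update:
To clarify, let's assume the Items are grocery store aisles. I have a month of people visiting the grocery store. The basic question is which aisles people visit and in what order. Do they most often go 1,2,3 or 1,2,1,3,4?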
(Right now I am curious about the paths users take through our sites, but you know, a grocery store is more visual).
Update 2:
Here is a simple case:
CREATE TABLE #StoreActivity (
    id int,
    CreationDate datetime,
    Isle int,
    UserId int
)

INSERT INTO #StoreActivity VALUES
    (1, CAST('12-1-2011 03:10:01' AS datetime), 1, 2222),
    (2, CAST('12-1-2011 03:10:07' AS datetime), 1, 1111),
    (3, CAST('12-1-2011 03:10:12' AS datetime), 2, 2222),
    (4, CAST('12-1-2011 04:10:01' AS datetime), 1, 2222),
    (5, CAST('12-1-2011 04:10:23' AS datetime), 2, 2222)

SELECT * FROM #StoreActivity

DROP TABLE #StoreActivity

/* So with the above data, we have 2 sequences if we declare a session or visit
   dead when there is no activity for a minute:
   `1,2` (with a count of 2), and `1` (with a count of 1) */
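To make the one-minute session rule concrete, here is a sketch that loads the same five rows into an in-memory SQLite database from Python and splits them into sessions with `LAG` and a running `SUM`. This is an illustration under assumptions, not the method from the post: SQLite stands in for SQL Server, the dates are rewritten in ISO format (reading `12-1-2011` as 1 December 2011), and `gaps`, `sessions`, `new_session`, `session_no` and `SESSION_GAP_SECONDS` are made-up names. The same `LAG` / `SUM ... OVER` pattern should carry over to SQL Server 2012 or later, with `DATEDIFF` replacing the `strftime` arithmetic.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE StoreActivity "
    "(id INTEGER, CreationDate TEXT, Isle INTEGER, UserId INTEGER)"
)
conn.executemany("INSERT INTO StoreActivity VALUES (?, ?, ?, ?)", [
    (1, "2011-12-01 03:10:01", 1, 2222),
    (2, "2011-12-01 03:10:07", 1, 1111),
    (3, "2011-12-01 03:10:12", 2, 2222),
    (4, "2011-12-01 04:10:01", 1, 2222),
    (5, "2011-12-01 04:10:23", 2, 2222),
])

SESSION_GAP_SECONDS = 60  # assumed threshold: one minute of inactivity

query = """
WITH gaps AS (
    SELECT *,
           -- 1 when the row starts a new session (first row, or gap too long)
           CASE WHEN CAST(strftime('%s', CreationDate) AS INTEGER)
                   - CAST(strftime('%s', LAG(CreationDate) OVER (
                         PARTITION BY UserId ORDER BY CreationDate, id)) AS INTEGER)
                   <= :gap
                THEN 0 ELSE 1 END AS new_session
    FROM StoreActivity
),
sessions AS (
    SELECT *,
           -- running total of session starts = session number per user
           SUM(new_session) OVER (
               PARTITION BY UserId ORDER BY CreationDate, id) AS session_no
    FROM gaps
)
SELECT UserId, session_no, Isle
FROM sessions
ORDER BY UserId, session_no, CreationDate, id
"""

# Collect the ordered Isle values of each (user, session) and count the paths
sequences = {}
for user_id, session_no, isle in conn.execute(query, {"gap": SESSION_GAP_SECONDS}):
    sequences.setdefault((user_id, session_no), []).append(str(isle))

counts = Counter(",".join(path) for path in sequences.values())
print(counts.most_common())
# [('1,2', 2), ('1', 1)]
```

The counts agree with the expected result stated in the comment of the sample script: `1,2` twice and `1` once.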
Recommended answer
WITH q AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY TimeStamp, Id) AS rn,
           ROW_NUMBER() OVER (PARTITION BY UserId, Item ORDER BY TimeStamp, Id) AS rnd
    FROM mytable
)
SELECT *, rnd - rn AS sequence
FROM q
The sequence column will be shared among all records in a sequence for a given UserId. You can group on it or do whatever you like.
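To see what the `rnd - rn` difference looks like on actual rows, here is a small sketch that runs the same query in SQLite from Python. The six rows are made up, the timestamps are plain integers, and the alias is shortened to `seq`; none of that comes from the answer. On this data the difference stays constant while a user's consecutive rows have the same Item, but two runs of different Items can end up with the same value, so it seems safer to group on `(UserId, Item, seq)` than on `seq` alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable "
    "(Id INTEGER, TimeStamp INTEGER, Item INTEGER, UserId INTEGER)"
)
# Hypothetical data: one user visiting items 1, 1, 2, 2, 1, 1 in that order
items = [1, 1, 2, 2, 1, 1]
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?)",
    [(i, i, item, 1) for i, item in enumerate(items, start=1)],
)

rows = conn.execute("""
WITH q AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY TimeStamp, Id) AS rn,
           ROW_NUMBER() OVER (PARTITION BY UserId, Item ORDER BY TimeStamp, Id) AS rnd
    FROM mytable
)
SELECT Id, Item, rnd - rn AS seq
FROM q
ORDER BY Id
""").fetchall()
print(rows)
# [(1, 1, 0), (2, 1, 0), (3, 2, -2), (4, 2, -2), (5, 1, -2), (6, 1, -2)]

# Distinct (Item, seq) pairs in order of first appearance: one per run
runs = list(dict.fromkeys((item, seq) for _id, item, seq in rows))
print(runs)
# [(1, 0), (2, -2), (1, -2)]
```

Rows 3 to 6 all get `-2` even though they cover two different Items, which is why the Item is kept in the grouping key here. The query looks only at row order, not at the time between rows, so a time-threshold session split would still have to be added separately.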