T-SQL: data mining for time-series patterns

Problem description

Take a SQL table with the following 4 fields:

Id,TimeStamp,Item,UserId

I would like to determine the most common sequences of Item for a UserId in a session. A session would simply be defined by a time threshold (i.e. if there are no entries for X minutes, any future entries would be grouped into a new session).

Ideally, the sequence of Items could have a sort of fuzzy grouping where one or two differences in the sequence could still be counted as the same and grouped together.

Anyone know how I might tackle this problem in SQL?

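Update:
To clarify, let's pretend the Items are grocery store aisles. I have a month of data on people going through the grocery store. The basic question is which aisles people visit, and in what order. Do they most often go 1,2,3 or 1,2,1,3,4?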

(Right now I am curious about paths of users on our sites, but you know, grocery store is more visual).

Update 2:
Here is a simple case:

CREATE TABLE #StoreActivity
(
    id int,
    CreationDate datetime,
    Isle int,
    UserId int
);

INSERT INTO #StoreActivity
VALUES
    (1, CAST('12-1-2011 03:10:01' AS datetime), 1, 2222),
    (2, CAST('12-1-2011 03:10:07' AS datetime), 1, 1111),
    (3, CAST('12-1-2011 03:10:12' AS datetime), 2, 2222),
    (4, CAST('12-1-2011 04:10:01' AS datetime), 1, 2222),
    (5, CAST('12-1-2011 04:10:23' AS datetime), 2, 2222);

SELECT * FROM #StoreActivity;
DROP TABLE #StoreActivity;

/* With the above data, if we declare a session (visit) over once there has been no activity for a minute, we have 2 distinct sequences: `1,2` (with a count of 2) and `1` (with a count of 1) */

Recommended answer

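WITH    q AS
        (
        SELECT  *,
                -- position of the row among all of the user's rows
                ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY TimeStamp, Id) AS rn,
                -- position of the row among the user's rows for the same Item
                ROW_NUMBER() OVER (PARTITION BY UserId, Item ORDER BY TimeStamp, Id) AS rnd
        FROM    mytable
        )
SELECT  *,
        -- stays constant across consecutive rows that share the same Item
        rnd - rn AS sequence
FROM    q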

The sequence column will be shared among all records in a sequence for a given UserId. You can group on it or do whatever you like.
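
The row-number difference above stays constant across consecutive rows that share the same Item, so by itself it does not apply the X-minute timeout from the question. A minimal sketch of one way to add that step follows. It assumes the #StoreActivity sample table from the question is still populated (run it before the DROP TABLE statement), a one-minute threshold, and SQL Server 2017 or later for STRING_AGG (LAG needs 2012 or later). The names @ThresholdSeconds, IsNewSession, SessionNo and IslePath are made up for illustration.

```sql
-- Hypothetical session timeout: one minute, as in the question's comment.
DECLARE @ThresholdSeconds int = 60;

WITH flagged AS
        (
        -- Flag a row as a session start when the user has no previous row,
        -- or when the previous row is older than the threshold.
        SELECT  *,
                CASE
                    WHEN DATEDIFF(SECOND,
                                  LAG(CreationDate) OVER (PARTITION BY UserId ORDER BY CreationDate, id),
                                  CreationDate) <= @ThresholdSeconds
                    THEN 0
                    ELSE 1   -- also taken when LAG returns NULL (first row per user)
                END AS IsNewSession
        FROM    #StoreActivity
        ),
sessions AS
        (
        -- A running total of session starts numbers the sessions per user.
        SELECT  *,
                SUM(IsNewSession) OVER (PARTITION BY UserId ORDER BY CreationDate, id
                                        ROWS UNBOUNDED PRECEDING) AS SessionNo
        FROM    flagged
        ),
paths AS
        (
        -- Collapse each session into an ordered, comma-separated path.
        SELECT  UserId,
                SessionNo,
                STRING_AGG(CAST(Isle AS varchar(10)), ',')
                    WITHIN GROUP (ORDER BY CreationDate, id) AS IslePath
        FROM    sessions
        GROUP BY UserId, SessionNo
        )
-- Count how often each path occurs across all users and sessions.
SELECT  IslePath,
        COUNT(*) AS Occurrences
FROM    paths
GROUP BY IslePath
ORDER BY Occurrences DESC;
```

On the sample data this should return `1,2` with a count of 2 and `1` with a count of 1, matching the comment in the question. Fuzzy grouping of near-identical paths is not covered by this sketch.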