Sequential pattern mining is a popular pattern mining task. It is used to find subsequences that appear frequently appear in a set of sequences. Sequential pattern mining is used to analyze data that is represented as sequences of symbols.
The input data for sequential pattern mining is a sequence database. It is a set of sequences. Each sequence is an ordered list of itemsets. And an itemset is a set of symbols. For example, consider the sequence database below, which represents shopping data. This database contains four sequences (rows) called S1, S2, S3 and S4.
|S1||(a), (a b c), (a c), (d), (c f)|
|S2||(a d), ( c ) , (b c), (a e)|
|S3||(e f), (a b), (d f), ( c ), (b)|
|S4||(e), (g), (a f), ( c ), (b), ( c )|
The first sequence indicates that a customer has purchased some item (a). Then the customer has bought some items a, b and c together. Then the customer, has bought items a and c together. Then, he bought item d, and finally the customer bought items c and f together.
The goal of sequential pattern mining is to find all the frequent sequential patterns. A frequent sequential pattern is a subsequence that appears in many input sequences. To find frequent sequential patterns, a user must have a sequence database (as the one presented above), and must also choose a value for a parameter called minsup , which means minimum support threshold.
The output of frequent sequential pattern mining is all subsequences that appear at least minsup sequences of the database.
For example, for the above database and minsup = 2, the three following subsequences are frequent sequential patterns:
(b c) (a) because it appears in at least two sequences
(f) (b) because it appears in at least two sequences
(f) (b) ( c ) because it appears in at least two sequences
There are also other frequent sequential patterns, but here only a few are listed for brevity.
There are many applications. A few examples are:
Numerous algorithms have been designed for sequential pattern mining. But most of them are inspired by one of the following classic algorithms: SPAM, SPADE, PrefixSpan, GSP, CM-SPAM and CM-SPADE.
Here are a few survey papers that gives an overview of itemset mining.
To apply frequent itemset mining, the SPMF software provides open-source efficient implementations of many algorithms and variations. The SPMF software can be downloaded from the website: http://www.philippe-fournier-viger.com/spmf/ .
To install the software, you may follow the instructions on the download page of that website. Then, you may check the documentation page which provides examples of how to run various algorithms such as GSP, PrefixSpan, CM-SPAM, CM-SPADE and others for sequential pattern mining. Besides, you may check the datasets page of that website provides several benchmark datasets for testing the algorithms and comparing their performance.