sequential_pattern_mining

**Sequential pattern mining** is a popular pattern mining task. It is used to find subsequences that appear frequently appear in a set of sequences. **Sequential pattern mining** is used to analyze data that is represented as sequences of symbols.

The input data for sequential pattern mining is a **sequence database. **It is a set of sequences. Each **sequence **is an ordered list of** itemsets. **And an **itemset** is a set of symbols. For example, consider the sequence database below, which represents shopping data. This database contains four sequences (rows) called S1, S2, S3 and S4.

Sequence | items |

S1 | (a), (a b c), (a c), (d), (c f) |

S2 | (a d), ( c ) , (b c), (a e) |

S3 | (e f), (a b), (d f), ( c ), (b) |

S4 | (e), (g), (a f), ( c ), (b), ( c ) |

The first sequence indicates that a customer has purchased some item (a). Then the customer has bought some items a, b and c together. Then the customer, has bought items a and c together. Then, he bought item d, and finally the customer bought items c and f together.

The goal of sequential pattern mining is to find all the **frequent sequential patterns. A frequent sequential pattern **is a subsequence that appears in many input sequences. To find frequent sequential patterns, a user must have a **sequence database** (as the one presented above), and must also choose a value for a parameter called** minsup** , which means

The **output of frequent sequential pattern mining **is all subsequences that appear at least *minsup* sequences of the database.

For example, for the above database and *minsup* = 2, the three following subsequences are frequent sequential patterns:

(b c) (a) because it appears in at least two sequences

(f) (b) because it appears in at least two sequences

(f) (b) ( c ) because it appears in at least two sequences

There are also other frequent sequential patterns, but here only a few are listed for brevity.

There are many applications. A few examples are:

- Find the frequent sequences of purchases made by customers in a store (each sequence is a list of purchased items by a customer).
- Find the frequent sequence of locations visited by tourists in a city (each sequence is a list of locations visited by a tourist).
- Find the frequent sequences of words that appear in a book (each sentence is a sequence of words)

Numerous algorithms have been designed for sequential pattern mining. But most of them are inspired by one of the following classic algorithms: **SPAM**, **SPADE**, **PrefixSpan**, **GSP**, **CM-SPAM **and **CM-SPADE**.

Here are a few survey papers that gives an overview of itemset mining.

- Fournier-Viger, P., Lin, J. C.-W., Kiran, R. U., Koh, Y. S., Thomas, R. (2017). A Survey of Sequential Pattern Mining. Data Science and Pattern Recognition (DSPR), vol. 1(1), pp. 54-77.
- Gan, W., Lin, J. C. W., Fournier-Viger, P., Chao, H.-C., Yu, P. S. (2019)
**A Survey of Parallel Sequential Pattern Mining**. ACM Transactions on Knowledge Discovery from Data (TKDD), 13(3): 25:1-25:34.

- …

To apply** frequent itemset mining**, the SPMF software provides open-source efficient implementations of many algorithms and variations. The **SPMF software** can be downloaded from the website: http://www.philippe-fournier-viger.com/spmf/ .

To install the software, you may follow the instructions on the **download page **of that website. Then, you may check the **documentation page **which provides examples of how to run various algorithms such as GSP, PrefixSpan, CM-SPAM, CM-SPADE and others for sequential pattern mining. Besides, you may check the **datasets ****page **of that website provides several benchmark datasets for testing the algorithms and comparing their performance.

sequential_pattern_mining.txt · Last modified: 2021/06/21 11:22 by philfv