association_rule_mining

**Association rule mining** is a popular pattern mining task. It is used to find strong associations between values in some data. Association rule mining is used to analyze data that has the form of a table described using binary attributes.

A typical example of such data is a **transaction database,** which lists the items (products) purchased by one or more customers. For example, the table below shows a small transaction database containing four transactions. The first transaction indicates that a customer has purchased some items **pasta****, lemon, bread, **and **orange **together.

Transaction | items appearing in the transaction |

T1 | {pasta, lemon, bread, orange} |

T2 | {pasta, lemon} |

T3 | {pasta, orange, cake} |

T4 | {pasta, lemon, orange, cake} |

Each transaction has a name called its **transaction identifier. **For example, the transaction identifiers of the four transactions depicted above are T1, T2, T3 and T4, respectively.

The goal of association rule mining is to find all the **association rules ** that are valid. An ** association rules ** has the form X –> Y indicating that if some set of values X appears in a transaction, the set of values Y is also likely to appear in the same transaction. It is required that the sets X and Y are not empty and do not contain the same values.

An example association rule is pasta =⇒ lemon which indicates that if someone buys pasta, he is likely to also buy lemon. In this example X = {pasta} and Y = {lemon}. The sets X and Y are said to be itemsets and pasta and lemon are values that are said to be items.

To find rules that are interesting two measures have been initially proposed, called **support** and **confidence**. The **support of a rule** is how many transactions contain X and Y together. The **confidence of a rule** is how many times X and Y appear together divided by how many times X appears. For example, the support of pasta =⇒ lemon is 3 because it appears in three transactions, which are T1, T2 and T3. The confidence of that rule is 0.75 because pasta and lemon appears in three transactions, while pasta appears in 4 transactions. Thus 3/4 = 0.75. This rule can thus be interpreted as if a customer buys pasta 75% of the time he will also buy lemon.

To find association rules, a user must have a **transaction database** (as the one presented above), and must also choose a value for a parameter called ** minsup** , which means

The **output of association rule mining ** is all valid association rules that is all rules that have a support that is no less than *minsup* and a confidence that is not less than *minconf* in the database.

For example, for the above database, *minsup* = 2 and *minconf* = 0.75, the following association rules are found:

- lemon =⇒ pasta support: 3 confidence: 1
- pasta =⇒ lemon support: 3 confidence: 0,75
- orange =⇒ pasta support: 3 confidence: 1
- pasta =⇒ orange support: 3 confidence: 0,75
- cake =⇒ pasta support: 2 confidence: 1
- cake =⇒ orange support: 2 confidence: 1
- lemon orange =⇒ pasta support: 2 confidence: 1
- orange cake =⇒ pasta support: 2 confidence: 1
- pasta cake =⇒ orange support: 2 confidence: 1
- cake =⇒ pasta orange support: 2 confidence: 1

For example, the association rule orange cake =⇒ pasta is said to be an association rule because it appears in 2 transactions, which is no less than *minsup*, and the confidence of this rule is 1, which is no less than *minconf*.

There are many applications. A few examples are:

- Finding the products purchased together by several customers in a store (each transaction indicates a set of products purchased by a customer).
- Finding the courses that students often select together in a university (each course is an item, and the course selection of each student is a transaction).
- Finding the words that frequently appear together in the same sentence of a text (each sentence is a transaction and each word is an item).

Finding association rule is done in two steps:

- Finding all the frequent itemsets in a database that have a support greater or equal to
*minsup* - Ordered List Item Combining pairs of frequent itemsets to generate association rules. Each rule that has a confidence that is no less than
*minconf/ is then output.*

Because the first step is the most challenging, most researchers are working on that first step to develop more efficient algorithms. The second step is generally always done using the same algorithm initially proposed by Agrawal & Srikant (1993).

Here are a few survey papers that gives an overview of itemset mining (the first step for generating association rules).

- Fournier-Viger, P., Lin, J. C.-W., Vo, B, Chi, T.T., Zhang, J., Le, H. B. (2017). A Survey of Itemset Mining. WIREs Data Mining and Knowledge Discovery, Wiley, e1207 doi: 10.1002/widm.1207, 18 pages.
- Luna, J. M., Fournier-Viger, P., Ventura, S. (2019). Frequent Itemset Mining: a 25 Years Review. WIREs Data Mining and Knowledge Discovery, Wiley, 9(6):e1329.

*R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. Research Report RJ 9839, IBM Almaden Research Center, San Jose, California, June 1994.*The problem of association rule mining and frequent itemset mining (originally called large itemset mining) was first introduced in this paper. This paper describes two algorithms to generate association rules from frequent itemset

To apply **association rule mining**, the SPMF software provides open-source efficient implementations of many algorithms and variations. The **SPMF software** can be downloaded from the website: http://www.philippe-fournier-viger.com/spmf/ .

To install the software, you may follow the instructions on the **download page **of that website. Then, you may check the **documentation page **which provides examples of how to run various algorithms for association rule mining. Datasets can also be found on the datasets page of the SPMF website.

association_rule_mining.txt · Last modified: 2021/06/22 10:39 by philfv