Association rules in java

Assocation rules are a concept in which relationships between different elements of a common set can be established. For example, a study may be undertaken to determine the impact of one externally employed parent on childrens GPA in a household, or two externally employed parents, etc.

It is similar to correlation conceptually, but it simply looks for the most frequently occurring combinations.

The most common usage, and canonical example, is market basket analysis. In this case, the goal is to determine the likelihood of a shopper to purchase one SKU when another SKU has been purchased. Using this, we can provide suggestions to shoppers, much like Amazon does on their website.

This is infinitely easier in the R language, but this is an example of the math under the hood. It is an O(n) algorithm, which means the entire list is only read one time.

The one thing to consider is that depending on the number of SKU’s in your catalog, this can result in a large hashmap. The maximum size can be estimated by using a derivative of the Gauss Series Summation, or ((#SKU’s – 1) * #SKU’s) / 2. For example, 10,000 SKU’s would result in a maximum hashmap size of 49,995,000, or just under 50MB. Given the fact the algorithm below stores the SKU combination (SKU1 and SKU2), as well as the count of times that combination was found, you can multiply the hashmap size * the datatype size of each element noted to get the theoretical maximum size. Using this, you can ensure you start the JVM with a large enough heap.

On to the code…

import java.util.*;
import java.io.*;
/*
File structure is company_id, store_id, transactionnum, date, sku

the pk are the items above
as such, we need to still do the same thing.
keep a running list of producta --> productb pairs
as we read the file, we will take each pair for an order and add it to a hashmap
if the combo already exists, increment its counter
if not, add it with 1
print the hashmap at the end

for each order, build its map of items - how do we define an order?  company,store,workstation,txn#,businessdate
at the end, read the map
*/

public class MarketBasketAnalysis2 {
  public static void main(String args[]) {
    try {
      BufferedReader bf = new BufferedReader(new FileReader(args[0]));
      String line = "";
      HashMap hm = new HashMap();
      while ((line = bf.readLine()) != null) {
		String[] s = line.split(",",-1);
		if (s.length < 2)
		  continue;
		//if map has sale, add item to its array
		ArrayList tmp = hm.get(s[0]);
		if (tmp == null) {
		  tmp = new ArrayList();
	    }
        tmp.add(s[1]);
        Collections.sort(tmp);
        hm.put(s[0],tmp);
      }
      System.out.println("File read with " + hm.size() + " lines");
      HashMap hm2 = new HashMap();
      for (Map.Entry entry : hm.entrySet()) {
		ArrayList tmp = entry.getValue();
		for (int j = 0; j < tmp.size() - 1; j++) {
		  if (tmp.get(j) != tmp.get(j + 1)) {
		    if (hm2.get(tmp.get(0)) == null) {
			  hm2.put(tmp.get(j) + "~" + tmp.get(j + 1),1);
		    }
		    else {
 	          hm2.put(tmp.get(j) + "~" + tmp.get(j + 1),(int)hm2.get(tmp.get(j) + "~" + tmp.get(j + 1)) + 1);
		    }
		  }
	    }
	  }
      for (Map.Entry entry : hm2.entrySet()) {
        String key = entry.getKey();
        int value = entry.getValue();
        String s[] = key.split("~");
        if (value > 1)
          System.out.println(key + " " + value);
      }
    }
    catch (Exception ex) {
      ex.printStackTrace();
    }
  }
}

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.