Market Basket Analysis with Python

There is a very elegant example at http://blog.derekfarren.com/2015/02/how-to-implement-large-scale-market.html.

What follows is simply a complete script built from what the link above provides, along with an example of our output. Our file was formatted slightly differently, so we had to change the split command. We also sorted the results by number of matches so that only the most relevant pairs are printed. It worked fine for 21 million rows, with the timings shown…
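
As an illustration of the split change mentioned above, here is a minimal sketch of turning one line of the input file into a list of items. The function name `parse_basket`, the `sep` parameter, and the pipe-delimited sample line are all assumptions made for this example; the actual layout of our file is not shown in this post.

```python
def parse_basket(line, sep=","):
    """Split one line of the input file into a list of item codes.

    Assumption: one basket per line, items separated by `sep`.
    """
    fields = (field.strip() for field in line.strip().split(sep))
    # Drop empty fields left by trailing or doubled separators.
    return [field for field in fields if field]

# Made-up sample line, only to show the call.
sample = parse_basket("009841037|009842037|004795899\n", sep="|")
print(sample)
```

Keeping the separator as a parameter means a differently formatted file only needs a different `sep` value rather than a change to the counting code.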


import time, operator
from collections import defaultdict
from itertools import combinations

#-----------------------------------------------------------------------------------------------------------

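def update_candidates(item_lst, candidate_dct, pass_nbr):
  # candidate_dct maps item tuples to counts (e.g. a defaultdict(int))
  if pass_nbr==1:
    # first pass: count every single item as a 1-tuple
    for item in item_lst:
      candidate_dct[(item,)]+=1
  else:
    # collect the items of this basket that belong to a tuple kept from the previous pass
    frequent_items_set = set()
    for item_tuple in combinations(sorted(item_lst), pass_nbr-1):
      if item_tuple in candidate_dct:
        frequent_items_set.update(item_tuple)

    # count every combination of pass_nbr items drawn from those surviving items
    for item_set in combinations(sorted(frequent_items_set), pass_nbr):
      candidate_dct[item_set]+=1

  return candidate_dct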

#-----------------------------------------------------------------------------------------------------------

def clear_items(candidate_dct, support, pass_nbr):
  for item_tuple, cnt in candidate_dct.items():
    if cnt
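
The listing above is cut off at `if cnt`, so the rest of the pruning step is not shown here. Below is a minimal sketch of what such a step could look like, under stated assumptions: the helper name `prune_candidates` is hypothetical, and both the `cnt >= support` test and the tuple-length test are guesses at the intent, not the original code. It builds a new dictionary instead of deleting entries in place, because deleting from a dictionary while iterating over `.items()` raises a `RuntimeError` on Python 3.

```python
from collections import defaultdict

def prune_candidates(candidate_dct, support, pass_nbr):
    """Keep tuples of length pass_nbr that were counted at least `support` times.

    Assumption: this stands in for the truncated clear_items function;
    the exact conditions used there are not visible in the listing.
    """
    kept = defaultdict(int)
    for item_tuple, cnt in candidate_dct.items():
        if cnt >= support and len(item_tuple) == pass_nbr:
            kept[item_tuple] = cnt
    return kept

# Made-up counts for illustration only.
counts = defaultdict(int, {("a",): 5, ("b",): 1, ("a", "b"): 3})
pruned = prune_candidates(counts, support=2, pass_nbr=1)
print(dict(pruned))
```

Returning a `defaultdict(int)` keeps the result usable as the `candidate_dct` argument of `update_candidates` on the next pass.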

...timings below...

C:\Windows\System32>c:\Python27\python.exe c:\users\showard\basketAnalysis.py
pass number 1 took 70.578000 and found 8367 candidates
pass number 2 took 118.376000 and found 15773 candidates
('009841037', '009842037') 10542
('006414291', '006416500') 8619
('004795899', '004795901') 7560
('008621338', '008621497') 6280
('005051466', '005051803') 5635
('004795786', '004795899') 5408
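
The pairs above are listed from most to least frequent. A minimal sketch of how that ordering can be produced, assuming the counts sit in a dictionary keyed by item tuple: the helper name `top_pairs` and the sample counts are made up for illustration, and `operator.itemgetter` is used only because the script already imports `operator`.

```python
import operator

def top_pairs(counts, n):
    """Return the n (item_tuple, count) entries with the highest counts."""
    ranked = sorted(counts.items(), key=operator.itemgetter(1), reverse=True)
    return ranked[:n]

# Made-up counts, not the real results.
pair_counts = {("A", "B"): 12, ("A", "C"): 40, ("B", "C"): 7}

for pair, cnt in top_pairs(pair_counts, 2):
    print(pair, cnt)
```

Sorting on the count alone leaves ties in their original dictionary order, which is enough when only the strongest few pairs are of interest.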
