{"id":5875,"date":"2016-10-20T13:44:25","date_gmt":"2016-10-20T18:44:25","guid":{"rendered":"http:\/\/appcrawler.com\/wordpress\/?p=5875"},"modified":"2016-10-20T14:05:40","modified_gmt":"2016-10-20T19:05:40","slug":"market-basket-analysis-with-python","status":"publish","type":"post","link":"http:\/\/appcrawler.com\/wordpress\/2016\/10\/20\/market-basket-analysis-with-python\/","title":{"rendered":"Market Basket Analysis with Python"},"content":{"rendered":"<p>A very elegant example is at <a href=\"http:\/\/blog.derekfarren.com\/2015\/02\/how-to-implement-large-scale-market.html\" target=_blank>http:\/\/blog.derekfarren.com\/2015\/02\/how-to-implement-large-scale-market.html<\/a>.<\/p>\n<p>Below is a complete script based on what the link above provides, along with an example of our output. Our file was formatted slightly differently, so we had to change the split command. We also sorted by match count to print only the most frequent matches. It worked fine for 21 million rows, with the timings shown&#8230;<\/p>\n<pre>\r\n\r\nimport time, operator\r\nfrom collections import defaultdict\r\nfrom itertools import combinations\r\n\r\n#-----------------------------------------------------------------------------------------------------------\r\n\r\ndef update_candidates(item_lst, candidate_dct, pass_nbr):\r\n  # Pass 1 counts single items; later passes count itemsets of size pass_nbr\r\n  if pass_nbr==1:\r\n    for item in item_lst:\r\n      candidate_dct[(item,)]+=1\r\n  else:\r\n    # Keep only the items that appear in an itemset that survived the previous pass\r\n    frequent_items_set = set()\r\n    for item_tuple in combinations(sorted(item_lst), pass_nbr-1):\r\n      if item_tuple in candidate_dct:\r\n        frequent_items_set.update(item_tuple)\r\n\r\n    for item_set in combinations(sorted(frequent_items_set), pass_nbr):\r\n      candidate_dct[item_set]+=1\r\n\r\n  return candidate_dct\r\n\r\n#-----------------------------------------------------------------------------------------------------------\r\n\r\ndef clear_items(candidate_dct, support, pass_nbr):\r\n  # Iterate over a copy of the items so entries can be deleted inside the loop\r\n  for item_tuple, cnt in list(candidate_dct.items()):\r\n    if 
cnt &lt; support or len(item_tuple) &lt; pass_nbr:\r\n      del candidate_dct[item_tuple]\r\n  return candidate_dct\r\n\r\n#-----------------------------------------------------------------------------------------------------------\r\n\r\ndef main(file_location, support, itemset_size):\r\n  candidate_dct = defaultdict(int)\r\n  for i in range(itemset_size):\r\n    now = time.time()\r\n    candidate_dct = data_pass(file_location, support, pass_nbr=i+1, candidate_dct=candidate_dct)\r\n    print(\"pass number %i took %f and found %i candidates\" % (i+1, time.time()-now, len(candidate_dct)))\r\n  return candidate_dct\r\n\r\n#-----------------------------------------------------------------------------------------------------------\r\n\r\ndef data_pass(file_location, support, pass_nbr, candidate_dct):\r\n  with open(file_location, 'r') as f:\r\n    for line in f:\r\n      # The item list is the first double-quoted field of each line, comma-separated\r\n      tmp = line.split('\"')\r\n      item_lst = tmp[1].split(\",\")\r\n      candidate_dct = update_candidates(item_lst, candidate_dct, pass_nbr)\r\n    \r\n    candidate_dct = clear_items(candidate_dct, support, pass_nbr)\r\n\r\n    return candidate_dct\r\n\r\n#-----------------------------------------------------------------------------------------------------------\r\n\r\nsupport = 100\r\nitemset_size = 2\r\nitemsets_dct = main(r\"h:\\export.csv\", support, itemset_size)\r\n\r\ni=0\r\n\r\n# Print the most frequent itemsets first\r\nfor itemset, frequency in sorted(itemsets_dct.items(), key=operator.itemgetter(1), reverse=True):\r\n  if i &lt;= 100:\r\n    print(\"%s %s\" % (itemset, frequency))\r\n  else:\r\n    break\r\n  i += 1\r\n<\/pre>\n<p>The timings and the most frequent pairs from our run are below.<\/p>\n<pre>\r\nC:\\Windows\\System32>c:\\Python27\\python.exe c:\\users\\showard\\basketAnalysis.py\r\npass number 1 took 70.578000 and found 8367 candidates\r\npass number 2 took 118.376000 and found 15773 candidates\r\n('009841037', '009842037') 10542\r\n('006414291', '006416500') 8619\r\n('004795899', '004795901') 
7560\r\n('008621338', '008621497') 6280\r\n('005051466', '005051803') 5635\r\n('004795786', '004795899') 5408\r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>A very elegant example is at http:\/\/blog.derekfarren.com\/2015\/02\/how-to-implement-large-scale-market.html. Below is a complete script based on what the link above provides, along with an example of our output. Our file was formatted slightly differently, so we had to change the split command. We&hellip;<\/p>\n<p class=\"more-link-p\"><a class=\"more-link\" href=\"http:\/\/appcrawler.com\/wordpress\/2016\/10\/20\/market-basket-analysis-with-python\/\">Read more &rarr;<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_mi_skip_tracking":false,"footnotes":""},"categories":[59,8,54,4],"tags":[],"_links":{"self":[{"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/posts\/5875"}],"collection":[{"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/comments?post=5875"}],"version-history":[{"count":8,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/posts\/5875\/revisions"}],"predecessor-version":[{"id":5884,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/posts\/5875\/revisions\/5884"}],"wp:attachment":[{"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/media?parent=5875"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/categories?post=5875"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/tags?post=5875"}],"curies":[{"name":"wp","href":"
https:\/\/api.w.org\/{rel}","templated":true}]}}