{"id":2960,"date":"2013-05-29T10:25:10","date_gmt":"2013-05-29T15:25:10","guid":{"rendered":"http:\/\/appcrawler.com\/wordpress\/?p=2960"},"modified":"2013-05-29T20:05:20","modified_gmt":"2013-05-30T01:05:20","slug":"pig-script-to-group-url-requests-in-jboss","status":"publish","type":"post","link":"http:\/\/appcrawler.com\/wordpress\/2013\/05\/29\/pig-script-to-group-url-requests-in-jboss\/","title":{"rendered":"Pig script to group URL requests in JBOSS"},"content":{"rendered":"<p>As we move towards an enterprise data analytics platform, I take every opportunity I can to come up with simple jobs in Hadoop, Hive, and Pig.<\/p>\n<p>Below is one I ran in Pig that counts requests per URL, ignoring the query string, and lists the 50 most requested URLs.<\/p>\n<p>Script&#8230;<\/p>\n<pre lang=\"text\" line=\"1\">\r\n[root@expressdb1 pig-0.11.1]# cat urls.pig\r\nregister '.\/contrib\/piggybank\/java\/piggybank.jar';\r\ndefine DECODE org.apache.pig.piggybank.evaluation.decode.Decode();\r\np = load '\/user\/hive\/warehouse\/requests\/localhost_access_log.2013-04-22.log.1' using PigStorage(' ') as (ip,username,time,tz,method,url:chararray,proto,status,size,ms);\r\nf = limit p 10;\r\nd = foreach p generate DECODE(INDEXOF(url,'?'),-1,url,SUBSTRING(url,0,INDEXOF(url,'?'))) as url;\r\ng = group d by url;\r\ncnt = foreach g generate group, COUNT(d) as c;\r\nb = order cnt by c desc, group;\r\nf = limit b 50;\r\ndump f;\r\n[root@expressdb1 pig-0.11.1]#\r\n<\/pre>\n<p>&#8230;and output of run&#8230;<\/p>\n<pre lang=\"text\" line=\"1\">\r\n[root@expressdb1 pig-0.11.1]# bin\/pig -4 nolog.conf -f 
urls.pig\r\n(\/checkout\/gadgets\/minicartcontents.jsp,159267)\r\n(\/includes\/header_tools.jsp,159221)\r\n(\/static\/js\/s_code_exp.jsp,146266)\r\n(\/catalog\/gadgets\/recently_viewed_items.jsp,102142)\r\n(\/static\/js\/refinements.js,79580)\r\n(\/catalog\/gadgets\/productList_filter.jsp,78103)\r\n(\/akamai\/akamai-sureroute-test-object.htm,62116)\r\n(\/catalog\/gadgets\/color_size_gadget.jsp,60576)\r\n(\/mobile\/includes\/mobile_header_tools.jsp,44725)\r\n(\/mobile\/static\/js\/s_code_exp.jsp,44064)\r\n(\/user\/login.jsp,41088)\r\n(\/,37225)\r\n(\/static\/js\/zoomer.js,35368)\r\n(\/catalog\/gadgets\/zoomerDroplet.jsp,27592)\r\n(\/mobile\/catalog\/gadgets\/categoryProductList.jsp,20261)\r\n(\/mobile\/catalog\/gadgets\/product_details_color_size_gadget.jsp,18528)\r\n(\/favicon.ico,14687)\r\n(\/exp-mobile-favicon.png,12458)\r\n(\/checkout\/basket.jsp,8567)\r\n(\/catalog\/gadgets\/express_view.jsp,8130)\r\n(\/common\/hp_subscribe.jsp,8117)\r\n(\/static\/js\/expressView.js,7607)\r\n(\/mobile\/,6598)\r\n(\/mobile\/content.jsp,6409)\r\n(\/catalog\/actions\/cart-submit.jsp,5640)\r\n(\/includes\/shoppingCartItemCount.jsp,5503)\r\n(\/catalog\/urls\/cart-submit-success.jsp,5179)\r\n(\/catalog\/product_detail.jsp,4406)\r\n(\/search\/search.jsp,4390)\r\n(\/content.jsp,4228)\r\n(\/mobile\/images\/linked-arrow.png,3507)\r\n(\/mobile\/bestselling_background.jpg,3008)\r\n(\/checkout\/checkout.jsp,2676)\r\n(\/mobile\/linked-arrow.png,2262)\r\n(\/catalog\/search_results.jsp,2174)\r\n(\/catalog\/gadgets\/fs_color_size_gadget.jsp,1969)\r\n(\/mobile\/catalog\/search_results.jsp,1925)\r\n(\/mobile\/exp-mobile-favicon.png,1897)\r\n(\/mobile\/includes\/shoppingCartItemCount.jsp,1796)\r\n(\/user\/overview.jsp,1719)\r\n(\/mobile\/static\/img\/backgrounds\/listArrow.png,1660)\r\n(\/catalog\/search.cmd,1600)\r\n(\/mobile\/favicon.ico,1585)\r\n(\/mobile\/checkout\/basket.jsp,1529)\r\n(\/mobile\/catalog\/category_listing.jsp,1442)\r\n(\/checkout\/gadgets\/removeItem.jsp,1429)\r\n(\/health.jsp,1
416)\r\n(\/clothing\/Women\/sec\/womenCategory,1406)\r\n(\/catalog\/category_listing.jsp,1309)\r\n(\/checkout\/,1306)\r\n[root@expressdb1 pig-0.11.1]#\r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>As we move towards an enterprise data analytics platform, I take every opportunity I can to come up with simple jobs in Hadoop, Hive, and Pig. Below is one I ran in Pig that groups the top 50 URL requests&hellip;<\/p>\n<p class=\"more-link-p\"><a class=\"more-link\" href=\"http:\/\/appcrawler.com\/wordpress\/2013\/05\/29\/pig-script-to-group-url-requests-in-jboss\/\">Read more &rarr;<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_mi_skip_tracking":false,"footnotes":""},"categories":[19,21,46],"tags":[],"_links":{"self":[{"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/posts\/2960"}],"collection":[{"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/comments?post=2960"}],"version-history":[{"count":6,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/posts\/2960\/revisions"}],"predecessor-version":[{"id":2993,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/posts\/2960\/revisions\/2993"}],"wp:attachment":[{"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/media?parent=2960"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/categories?post=2960"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/appcrawler.com\/wordpress\/wp-json\/wp\/v2\/tags?post=2960"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}