Comparing web content using Python

We have an active/passive data center setup. We replicate the database data using GoldenGate, and maintain the application software through regular build processes. ATG publishing is included in the database data, and the targeter files are maintained via rsync. We have not been able to force the passive data center to read the newly copied targeter files into memory, so we periodically restart the passive data center application servers so they are ready to take traffic.

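We wanted an alert whenever the home page served by the passive data center differed from the one served by the active data center, for example when a new promotion is displayed on the site. We used the BeautifulSoup library to pull the hero element out of each home page and compare the two.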

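#!/usr/bin/env python
# Python 2 script (urllib2 and the print statement).

import urllib2
from bs4 import BeautifulSoup
from mymail import pythonMail  # mail helper used for the alerts

# Fetch the home page from each data center.
active_html = urllib2.urlopen("http://active").read()
passive_html = urllib2.urlopen("http://passive").read()

# Extract the hero element from each page. Naming the parser explicitly
# keeps both pages parsed the same way on every host.
ahp = BeautifulSoup(active_html, "html.parser").find(id="home-page-hero")
chp = BeautifulSoup(passive_html, "html.parser").find(id="home-page-hero")

# Note: find() returns None when the element is missing, so two pages that
# both lack the hero element compare as equal.
if ahp != chp:
  print "content differs, must restart"
  pythonMail("[email protected]",['[email protected]'],"Content differs in passive DC","Please restart ATG on ecm01-15 in the passive data center","html","2")
else:
  print "content matches"
  pythonMail("[email protected]",['[email protected]'],"Content matches in passive DC, no action required","","html","3")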
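
The script reports "content matches" when the hero element is missing from both pages, which could hide a broken home page. A minimal Python 3 sketch of the same comparison that treats a missing element as its own case might look like the following. The function names `hero_fragment` and `compare_pages` and the `"missing"` result are our own illustration and not part of the script above; only the `home-page-hero` id comes from it.

```python
from bs4 import BeautifulSoup

HERO_ID = "home-page-hero"  # element id taken from the script above


def hero_fragment(html, element_id=HERO_ID):
    """Return the element with the given id as a string, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    element = soup.find(id=element_id)
    return None if element is None else str(element)


def compare_pages(active_html, passive_html, element_id=HERO_ID):
    """Return 'match', 'differ', or 'missing' for the two pages."""
    active = hero_fragment(active_html, element_id)
    passive = hero_fragment(passive_html, element_id)
    # Hypothetical extra state: flag a page that lacks the element at all.
    if active is None or passive is None:
        return "missing"
    return "match" if active == passive else "differ"


if __name__ == "__main__":
    # Inline sample pages stand in for the two fetched home pages.
    page_a = '<html><body><div id="home-page-hero"><p>Spring sale</p></div></body></html>'
    page_b = '<html><body><div id="home-page-hero"><p>Summer sale</p></div></body></html>'
    print(compare_pages(page_a, page_a))
    print(compare_pages(page_a, page_b))
    print(compare_pages(page_a, "<html><body></body></html>"))
```

Because the comparison works on HTML strings rather than on live URLs, it can be exercised with saved pages before it is pointed at the two data centers.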
