Processing HTML using Python

The other day I found myself with a list of several hundred galaxies morphologically classified as either compact or uncertain in a 2010 paper by Gendre et al. The flux contours of these galaxies are particularly small, and the objects are often unresolved by the FIRST VLA survey. Still, I wanted to get an upper limit on their physical size by querying the FIRST catalog for their deconvolved major axis diameters. This measurement, in arcsec, can be converted to a projected linear size in kpc if the redshift is known. Of course, I didn’t want to type in the position coordinates for each source individually! So I wrote a Python script that accesses the FIRST catalog search webpage, fills in the RA and DEC for each of my objects (provided in a text file), and returns the major axis diameter of the closest match.

In the process I had to learn about parsing HTML pages using a Python module called Mechanize and the BeautifulSoup package. The same approach could be extended to all sorts of work-related and personal tasks that involve pulling information from online content. I found the introduction by Weekend Codes particularly helpful in getting started.

So this is the .dat file I had to work with. In total it contains 322 objects. The first six columns, which hold the right ascension and declination of each object, are the only data I use to query the FIRST search form. The rest of the information just helps me keep track of each object’s name, type, redshift, flux, etc.
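As a rough sketch of how one line of such a file can be turned into a query string, here is a small Python 3 helper. The function name `parse_source_line` and the sample lines are made up for illustration; the only assumptions taken from the script in this post are that columns 0 to 5 hold the RA and DEC, column 6 holds the object name, and the header line contains the string 'Name'.

```python
# Sketch only: the sample lines below are invented and are not taken from
# the real .dat file. Assumed layout: columns 0-5 hold RA (h m s) and
# DEC (d m s), column 6 holds the object name.

def parse_source_line(line):
    """Return (ra_dec, name) for one whitespace-separated line, or None to skip it."""
    items = line.split()
    # Skip blank lines, the header line, and lines too short to hold 7 columns
    if len(items) < 7 or 'Name' in line:
        return None
    ra_dec = ' '.join(items[0:6])  # the string that goes into the search form
    name = items[6]
    return ra_dec, name

# Hypothetical data line and header line
sample = "10 01 23.4 +28 47 05.1 HYPOTHETICAL-1 compact 0.123\n"
header = "RA_h RA_m RA_s Dec_d Dec_m Dec_s Name Type z\n"

parsed = parse_source_line(sample)
skipped = parse_source_line(header)
```

Returning None for lines that cannot be parsed lets the calling loop skip them instead of failing with an IndexError part way through a long list.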


So one by one I wanted my script to take the position information from the list, fill in the online form and return the result. If done manually, this is the returned page from one queried position coordinate:

Here I’ve selected a search radius of 15 arcsec, and for this particular object the FIRST catalog returns two matches. I’ve also selected the text output format, which opens a new window containing just text. I’m interested in the deconvolved major axis of the closest match (in this case, a 10.45 arcsec diameter at a search distance of 0.5 arcsec). This value gives a good first approximation to the true source size. It comes from fitting an elliptical Gaussian to the object after removing the beam information from the map (a process known as deconvolution).
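Picking out the closest match can be sketched in a few lines of Python 3. The rows below are placeholders, not real catalog output: they only assume, as the script in this post does, that the search distance is in column 0 and the deconvolved major axis is in column 11 of each whitespace-separated line. The helper name `closest_match` is hypothetical.

```python
# Hypothetical rows shaped like the text output. Only columns 0 and 11 carry
# values; every other column is a placeholder 'x'.
rows = [
    " ".join(["10.2"] + ["x"] * 10 + ["3.10"]),
    " ".join(["0.5"] + ["x"] * 10 + ["10.45"]),
]

def closest_match(lines):
    """Return (distance, major_axis) as floats for the nearest match, or None."""
    matches = []
    for line in lines:
        cols = line.split()
        if len(cols) < 12:
            continue  # blank or malformed line
        matches.append((float(cols[0]), float(cols[11])))
    if not matches:
        return None  # nothing was found inside the search radius
    return min(matches, key=lambda m: m[0])

best = closest_match(rows)

# Distances read from a page are strings, and strings sort character by
# character, so they should be converted to numbers before comparing.
text_order = sorted(["10.2", "2.5"])
numeric_order = sorted(["10.2", "2.5"], key=float)
```

Converting to float before comparing matters because, as text, '10.2' sorts ahead of '2.5'. Returning None when there are no matches also gives the caller a chance to record an empty result rather than crash.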

So let’s create a script to automate this process!

First I downloaded and installed the Mechanize module (to access webpages and fill the online forms) and BeautifulSoup (to ease the parsing of the HTML). See my tutorial on installing modules and packages here.

Now we’re ready to go. Begin a new Python script (e.g. name.py) with:

import sys
import string
import mechanize
from BeautifulSoup import BeautifulSoup

And define the input and output files:

sourcefile='input_file_name.dat'
outfile='output_file_name.dat'

This next part is copied from the Weekend Codes tutorial linked above. Essentially, we invoke Mechanize to emulate a browser.

# Browser
br = mechanize.Browser()

# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)

# Follow refresh 0 but don't hang on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

# User-Agent
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

Now, the output is going to be another text file with a list of objects and their returned major axis sizes, among other things, so let’s initialize that file with a header containing the titles of each column:

fileout = open(outfile, 'w')
fileout.write('Name RA_DEC Search_Dist/arcsec Majaxis_Size/arcsec \n')
fileout.close()

And now we loop over the lines of the input file and begin querying the webpage:

infile = open(sourcefile, 'r')
for line in infile:
  #skip blank lines and the header line containing the string 'Name'
  if 'Name' in line or not line.strip():
    continue
  #split the line into columns on whitespace
  items=line.split()

  #join the six RA and DEC columns into one space-separated string
  RA_DEC=' '.join(items[0:6])
  ObjName=items[6]

  # Open the site we want to query
  br.open('http://sundog.stsci.edu/cgi-bin/searchfirst')

  # Select the first (index zero) form
  br.select_form(nr=0)

  # Fill in form. Note that this requires knowledge of what the form fields are called.
  # See Weekend Codes for help with this.
  br.form['RA'] = RA_DEC
  br.form['Radius'] = '15' #a 15 arcsec search radius
  br.form['Text']= ['1'] #outputs in HTML (0) or Text (1)

  # Submit query
  br.submit()

  #read in the returned webpage and parse into a 'soup' using BeautifulSoup:
  html = br.response().read()
  soup = BeautifulSoup(html)
  #split into strings separated by line
  txtsoup=str(soup).split('\n')

  #read string lines into a list 'data'
  data=[]
  for i in txtsoup:
    data.append(i)
  del data[0:14] #remove the header: this slice drops the first 14 lines (indices 0-13)
  del data[-1] #remove empty last element

  #define new empty lists to which we'll append search distances and major axis sizes
  dist=[] #list of search distances, in arcsec.
  majax=[] #list of returned deconvolved major axis diameters, in arcsec.
  for n in range(len(data)):
    #split each line from data on runs of whitespace, which is the default for split()
    dataline=data[n].split()
    dist.append(dataline[0])
    majax.append(dataline[11])
  #pair each search distance with its major axis size
  listarray=list(zip(dist, majax))
  #sort in place by shortest search distance, since they're not always in order on returned webpage;
  #compare as numbers, because the values are still strings at this point
  listarray.sort(key=lambda x: float(x[0]))

  #create joined string separated by single space to print to outfile
  nextline=' '.join([ObjName,RA_DEC,listarray[0][0],listarray[0][1],'\n'])

  #open output file in append mode and write to file
  fileout = open(outfile, 'a')
  fileout.write(nextline)
  fileout.close()
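The field names used in the script ('RA', 'Radius' and 'Text') have to be read off the search page itself. One way to do this without opening a browser is to list the named fields in the page's HTML. The sketch below is in Python 3 and uses only the standard library; the `FormFieldLister` class and the HTML snippet are made up for illustration and are not a copy of the real FIRST search form.

```python
from html.parser import HTMLParser

class FormFieldLister(HTMLParser):
    """Collect the name attribute of every input, select and textarea tag."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag in ('input', 'select', 'textarea'):
            name = dict(attrs).get('name')
            if name:  # unnamed controls (e.g. a bare submit button) are skipped
                self.fields.append((tag, name))

# Hypothetical form markup that reuses the field names from the script
sample_html = """
<form action="/cgi-bin/searchfirst" method="post">
  <input type="text" name="RA">
  <input type="text" name="Radius">
  <select name="Text">
    <option value="0">HTML</option>
    <option value="1">Text</option>
  </select>
  <input type="submit" value="Search">
</form>
"""

lister = FormFieldLister()
lister.feed(sample_html)
field_names = [name for tag, name in lister.fields]
```

For a live page, the string passed to `feed()` would be the downloaded HTML of the search page. The fact that the script assigns a list to `br.form['Text']` suggests that this field is a list-type control such as a select menu, which is why it is drawn that way here.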

The end result is an output .dat file (or .txt file if you prefer) that contains the name, position, closest search distance and corresponding deconvolved major axis diameter for each object. I then feed this file into other scripts that convert the arcsec diameter into kpc using the known redshift, and plot the results on a graph of radio power versus projected linear size.
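For completeness, here is a minimal sketch of the arcsec to kpc conversion in Python 3. It is not the script I actually use: it assumes a flat Lambda-CDM cosmology with H0 = 70 km/s/Mpc and a matter density of 0.3 (both are assumptions, so swap in your own values), ignores radiation, and the function names are invented for this example.

```python
import math

C_KM_S = 299792.458  # speed of light in km/s

def kpc_per_arcsec(z, h0=70.0, omega_m=0.3, steps=1000):
    """Projected kpc per arcsec at redshift z, flat Lambda-CDM (assumed values)."""
    omega_l = 1.0 - omega_m  # flatness: matter + dark energy = 1

    def inv_e(zp):
        # 1 / E(z), the inverse of the dimensionless Hubble parameter
        return 1.0 / math.sqrt(omega_m * (1.0 + zp) ** 3 + omega_l)

    # Trapezoidal integration of 1/E(z') from 0 to z
    dz = z / steps
    integral = 0.5 * (inv_e(0.0) + inv_e(z))
    for i in range(1, steps):
        integral += inv_e(i * dz)
    integral *= dz

    d_c = (C_KM_S / h0) * integral  # comoving distance in Mpc
    d_a = d_c / (1.0 + z)           # angular diameter distance in Mpc
    arcsec_in_rad = math.pi / (180.0 * 3600.0)
    return d_a * 1000.0 * arcsec_in_rad

def linear_size_kpc(theta_arcsec, z, **cosmo):
    """Projected linear size in kpc of an angle theta_arcsec at redshift z."""
    return theta_arcsec * kpc_per_arcsec(z, **cosmo)

scale_z1 = kpc_per_arcsec(1.0)
```

With these assumed parameters the scale at z = 1 comes out at roughly 8 kpc per arcsec, so a deconvolved major axis of a few arcsec corresponds to a few tens of kpc at that redshift.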

This code could probably easily be made more efficient and cleaner, but it does the job and makes my work a whole lot quicker. The same principles could be used to query just about any webpage form, including online email access and the like.