Beautiful Soup

Beautiful Soup is a very handy Python module that implements a quick-and-dirty HTML parser. From the Introduction:

Wouldn’t it be nice if there were a parser that could do the tree traversal stuff for you? You could tell it “Find all the links”, or “Find all the links of class externalLink”, or “Find all the links whose urls match ‘foo.com’”, or “Find the table heading that’s got bold text, then give me that text.”

Beautiful Soup can do all this–and less. It won’t choke if you give it ill-formed markup: it’ll just give you access to a correspondingly ill-formed data structure. It doesn’t care if you give it fake HTML tags or if the namespaces are wrong. It accepts that you’re doing this to get some data into a more usable format. It appreciates that if the data were well-formed to begin with, you probably wouldn’t be doing what you’re doing.
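
To make those four requests concrete, here is a rough sketch of how they might look in code. It assumes the current `bs4` package (`pip install beautifulsoup4`) with Python's built-in `html.parser` backend; the HTML snippet and the variable names are made up for illustration, and the markup is deliberately sloppy.

```python
import re
from bs4 import BeautifulSoup

# Made-up, deliberately ill-formed markup: missing </th>, </tr> and </p>.
html = """
<table><tr><th><b>Price</b><th>Name</table>
<a class="externalLink" href="http://foo.com/page">Foo</a>
<a href="/local">Local</a>
<p>An unclosed paragraph
"""

soup = BeautifulSoup(html, "html.parser")

# "Find all the links"
all_links = soup.find_all("a")

# "Find all the links of class externalLink"
external = soup.find_all("a", class_="externalLink")

# "Find all the links whose urls match foo.com"
foo_links = soup.find_all("a", href=re.compile(r"foo\.com"))

# "Find the table heading that's got bold text, then give me that text"
bold_heading = next(th for th in soup.find_all("th") if th.find("b"))
bold_heading_text = bold_heading.find("b").get_text()

print(len(all_links), [a["href"] for a in external], bold_heading_text)
```

Other parser backends (such as `lxml` or `html5lib`) may repair the broken markup differently, so the exact tree can vary, but the lookups themselves stay the same.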

I just used it against a crusty old CMS that was holding some horribly marked-up HTML data hostage. Good tool.

