Monthly archive for August 2011

Hagen Fritsch

lxml-based BeautifulSoup loader

ElementSoup already lets you create an etree document using the more fault-tolerant BeautifulSoup parser. The opposite direction, however (i.e. creating a BeautifulSoup document using the lxml parser), was not yet possible.
In my experience, BeautifulSoup’s API is much more intuitive and useful, especially for quick scraping and data manipulation tasks. So the only reason to use lxml in the first place is that its parser is much quicker and consumes less memory.
Recently I had a workflow built around BeautifulSoup documents, but found that BeautifulSoup was too slow to parse my document of several MB. So here is lxmlsouper, a tool that uses lxml to parse the document and creates the BeautifulSoup DOM from the result, which is at least a lot quicker than the native way.

Notes: feel free to swap the etree implementation for whichever one you like best. Also, this does not emulate the BeautifulSoup API on top of etree; it uses the etree data to create a BeautifulSoup document from scratch, copying everything.
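To give a rough idea of what "copying everything" involves, here is a minimal sketch of the tree walk. It is not the actual lxmlsouper code. To stay self-contained it makes two assumptions: it uses the standard library's xml.etree.ElementTree as the etree implementation, and a hypothetical Node class stands in for BeautifulSoup's tag objects. The helper names copy_element and to_markup are made up for this illustration. The main thing to get right is that etree keeps text in the text and tail attributes, while a BeautifulSoup-style DOM keeps it as ordinary child nodes.

```python
import xml.etree.ElementTree as ET


class Node(object):
    """Hypothetical stand-in for a BeautifulSoup tag: name, attrs, ordered children."""

    def __init__(self, name, attrs):
        self.name = name
        self.attrs = dict(attrs)
        self.children = []  # mix of Node objects and plain text strings


def copy_element(elem):
    """Recursively copy an etree element into a Node tree."""
    node = Node(elem.tag, elem.attrib)
    # etree stores the text before the first child on the element itself ...
    if elem.text:
        node.children.append(elem.text)
    for child in elem:
        node.children.append(copy_element(child))
        # ... and the text that follows a child in that child's `tail`.
        if child.tail:
            node.children.append(child.tail)
    return node


def to_markup(node):
    """Serialize the copied tree back to markup (no escaping; illustration only)."""
    if not isinstance(node, Node):
        return node
    attrs = "".join(' %s="%s"' % (k, v) for k, v in sorted(node.attrs.items()))
    inner = "".join(to_markup(c) for c in node.children)
    return "<%s%s>%s</%s>" % (node.name, attrs, inner, node.name)


root = ET.fromstring('<p class="x">Hello <b>big</b> world</p>')
tree = copy_element(root)
print(to_markup(tree))  # <p class="x">Hello <b>big</b> world</p>
```

A real loader would create BeautifulSoup objects instead of Node and would also have to deal with comments, entities and encodings, which this sketch ignores.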


import io
import lxmlsouper
# Read the file as Unicode text and close it again afterwards.
with io.open("bigfile.html", encoding="utf8") as f:
    data = f.read()
soup = lxmlsouper.fastSoupLoader(data)