Chris Olds | 93 | 12-16-2005 07:42 PM ET (US)
I'm using ElementTree 1.2.6 with Python 2.4 on WinXP. With ElementTree.py, I can define entities by setting the entity dict in the XMLTreeBuilder object. With cElementTree, I get different behavior depending on whether a DOCTYPE is present in the file. If I have a DOCTYPE, parsing works, but I get a segfault when the program finishes. If I do not have a DOCTYPE, I get 'undefined entity' exceptions, but no segfault.
doc = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE patent-application-publication SYSTEM "pap-v15-2001-01-31.dtd" []>
<patent-application-publication>
<subdoc-abstract>
<paragraph id="A-0001" lvl="0">A new and distinct cultivar of Begonia plant named &lsquo;BCT9801BEG&rsquo;.</paragraph>
</subdoc-abstract>
</patent-application-publication>"""
#from elementtree import ElementTree as et
import cElementTree as et
entities = {
    u'rsquo' : u"\u2019", # <!--=single quotation mark, right -->
    u'lsquo' : u"\u2018", # <!--=single quotation mark, left -->
}
parser = et.XMLTreeBuilder()
parser.entity.update(entities)
parser.feed(doc)
t = parser.close()
print t.find('.//paragraph').text