How to output CDATA using ElementTree

I've discovered that cElementTree is about 30 times faster than xml.dom.minidom and I'm rewriting my XML encoding/decoding code. However, I need to output XML that contains CDATA sections and there doesn't seem to be a way to do that with ElementTree.

Can it be done?

After a bit of work, I found the answer myself. Looking at the ElementTree.py source code, I found there was special handling of XML comments and processing instructions. What they do is create a factory function for the special element type that uses a special (non-string) tag value to differentiate it from regular elements.

def Comment(text=None):
    element = Element(Comment)
    element.text = text
    return element

Then, in the _write function of ElementTree that actually outputs the XML, there's special-case handling for comments:

if tag is Comment:
    file.write("<!-- %s -->" % _escape_cdata(node.text, encoding))
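
The same mechanism can be observed from the outside with the standard library's xml.etree.ElementTree. This is only a small sketch (the element name "data" and the comment text are made up for illustration): it checks that a comment element's tag is the Comment factory itself and that the serializer emits comment syntax for it.

```python
import xml.etree.ElementTree as ET

root = ET.Element("data")
comment = ET.Comment("generated file")
root.append(comment)

# The comment element's tag is the Comment factory function itself,
# not a string, which is how the serializer tells it apart.
is_special = comment.tag is ET.Comment

xml = ET.tostring(root, encoding="unicode")
print(is_special)
print(xml)
```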

In order to support CDATA sections, I created a factory function called CDATA, extended the ElementTree class, and changed the _write function to handle the CDATA elements.

This still doesn't help if you want to parse an XML document with CDATA sections and then output it again with the CDATA sections intact, but it at least allows you to create XML with CDATA sections programmatically, which is what I needed to do.

The implementation seems to work with both ElementTree and cElementTree.

import elementtree.ElementTree as etree
#~ import cElementTree as etree

def CDATA(text=None):
    element = etree.Element(CDATA)
    element.text = text
    return element

class ElementTreeCDATA(etree.ElementTree):
    def _write(self, file, node, encoding, namespaces):
        if node.tag is CDATA:
            text = node.text.encode(encoding)
            file.write("\n<![CDATA[%s]]>\n" % text)
        else:
            etree.ElementTree._write(self, file, node, encoding, namespaces)

if __name__ == "__main__":
    import sys

    text = """
    <?xml version='1.0' encoding='utf-8'?>
    <text>
    This is just some sample text.
    </text>
    """

    e = etree.Element("data")
    cdata = CDATA(text)
    e.append(cdata)
    et = ElementTreeCDATA(e)
    et.write(sys.stdout, "utf-8")
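
The parsing limitation mentioned above is easy to reproduce with the standard library alone. A minimal sketch, using a made-up sample document: the parser returns the CDATA content as ordinary text, so nothing in the tree records that it was ever a CDATA section.

```python
import xml.etree.ElementTree as ET

source = "<data><![CDATA[<b>bold</b> & more]]></data>"
root = ET.fromstring(source)

# The parser hands back the CDATA content as ordinary text...
text = root.text

# ...so the serializer escapes it instead of re-emitting a CDATA section.
round_tripped = ET.tostring(root, encoding="unicode")
print(text)
print(round_tripped)
```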

lxml has support for CDATA and API like ElementTree.
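
For illustration, a short sketch of what that looks like, assuming the third-party lxml package is installed. The element names and text are invented for the example; etree.CDATA and the strip_cdata parser option are part of lxml.

```python
from lxml import etree  # third-party; assumes lxml is installed

# Generating: wrap the text in etree.CDATA before assigning it.
el = etree.Element("content")
el.text = etree.CDATA("a <b>string</b>")
serialized = etree.tostring(el, encoding="unicode")
print(serialized)

# Parsing: by default CDATA is folded into plain text;
# strip_cdata=False asks the parser to keep the sections.
parser = etree.XMLParser(strip_cdata=False)
root = etree.fromstring("<root><![CDATA[x < y]]></root>", parser)
round_tripped = etree.tostring(root, encoding="unicode")
print(round_tripped)
```

Unlike the ElementTree workarounds, this needs no patching of private serializer functions.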

Here is a variant of gooli's solution that works for Python 3.2:

import xml.etree.ElementTree as etree

def CDATA(text=None):
    element = etree.Element('![CDATA[')
    element.text = text
    return element

etree._original_serialize_xml = etree._serialize_xml
def _serialize_xml(write, elem, qnames, namespaces):
    if elem.tag == '![CDATA[':
        write("\n<%s%s]]>\n" % (
                elem.tag, elem.text))
        return
    return etree._original_serialize_xml(
        write, elem, qnames, namespaces)
etree._serialize_xml = etree._serialize['xml'] = _serialize_xml


if __name__ == "__main__":
    import sys

    text = """
    <?xml version='1.0' encoding='utf-8'?>
    <text>
    This is just some sample text.
    </text>
    """

    e = etree.Element("data")
    cdata = CDATA(text)
    e.append(cdata)
    et = etree.ElementTree(e)
    et.write(sys.stdout.buffer.raw, "utf-8")

Solution:

import xml.etree.ElementTree as ElementTree

def CDATA(text=None):
    # Use a pseudo-tag that the patched serializer below can recognize.
    element = ElementTree.Element('![CDATA[')
    element.text = text
    return element

ElementTree._original_serialize_xml = ElementTree._serialize_xml
def _serialize_xml(write, elem, qnames, namespaces, short_empty_elements, **kwargs):
    if elem.tag == '![CDATA[':
        write("\n<{}{}]]>\n".format(elem.tag, elem.text))
        if elem.tail:
            # _escape_cdata is a module-level helper, so it must be qualified.
            write(ElementTree._escape_cdata(elem.tail))
    else:
        return ElementTree._original_serialize_xml(write, elem, qnames, namespaces, short_empty_elements, **kwargs)

ElementTree._serialize_xml = ElementTree._serialize['xml'] = _serialize_xml

if __name__ == "__main__":
    import sys

    text = """
    <?xml version='1.0' encoding='utf-8'?>
    <text>
    This is just some sample text.
    </text>
    """

    e = ElementTree.Element("data")
    cdata = CDATA(text)
    e.append(cdata)
    # encoding="unicode" makes write() emit str, which sys.stdout accepts.
    ElementTree.ElementTree(e).write(sys.stdout, encoding="unicode")

Background:

I don't know whether earlier versions of the proposed code worked well, or whether the ElementTree module has since been updated, but I ran into problems using this trick:

etree._original_serialize_xml = etree._serialize_xml
def _serialize_xml(write, elem, qnames, namespaces):
    if elem.tag == '![CDATA[':
        write("\n<%s%s]]>\n" % (
                elem.tag, elem.text))
        return
    return etree._original_serialize_xml(
        write, elem, qnames, namespaces)
etree._serialize_xml = etree._serialize['xml'] = _serialize_xml

The problem with this approach is that after handling this special case, the serializer treats the element as a normal tag again. I was getting something like:

<textContent>
<![CDATA[this was the code I wanted to put inside of CDATA]]>
<![CDATA[>this was the code I wanted to put inside of CDATA</![CDATA[>
</textContent>

And of course we know that will only cause plenty of errors. Why was that happening, though?

The answer is in this little guy:

return etree._original_serialize_xml(write, elem, qnames, namespaces)

Once we have trapped our CDATA element and written it out, we don't want it to go through the original serialize function again. Therefore the original serialize function must be called only when the element is not CDATA. We were missing an "else" before returning the original function.

Moreover, in my version of the ElementTree module, the serialize function insisted on a "short_empty_elements" argument. So the most recent version I would recommend looks like this (also with "tail" handling):

from xml.etree import ElementTree

# To test this, create a testing.xml file in the same folder as the script.
xmlParsedWithET = ElementTree.parse("testing.xml")
root = xmlParsedWithET.getroot()

def CDATA(text=None):
    element = ElementTree.Element('![CDATA[')
    element.text = text
    return element

ElementTree._original_serialize_xml = ElementTree._serialize_xml

def _serialize_xml(write, elem, qnames, namespaces, short_empty_elements, **kwargs):
    if elem.tag == '![CDATA[':
        write("\n<{}{}]]>\n".format(elem.tag, elem.text))
        if elem.tail:
            # _escape_cdata is a module-level helper, so it must be qualified.
            write(ElementTree._escape_cdata(elem.tail))
    else:
        return ElementTree._original_serialize_xml(write, elem, qnames, namespaces, short_empty_elements, **kwargs)

ElementTree._serialize_xml = ElementTree._serialize['xml'] = _serialize_xml


text = """
<?xml version='1.0' encoding='utf-8'?>
<text>
This is just some sample text.
</text>
"""
cdata = CDATA(text)
root.append(cdata)

# Tests. getchildren() was removed in Python 3.9, so index the element instead;
# root[-1] is the CDATA element that was just appended.
print(root)
print(root[-1])
print(root[-1].text + "\n\nyay!")

The output I got was:

<Element 'Database' at 0x10062e228>
<Element '![CDATA[' at 0x1021cc9a8>

<?xml version='1.0' encoding='utf-8'?>
<text>
This is just some sample text.
</text>


yay!

I wish you the same result!

Actually, this code has a bug: it doesn't catch ]]> appearing in the data you are inserting as CDATA.

As per "Is there a way to escape a CDATA end token in xml?",

you should break the data into two CDATA sections in that case, splitting the ]]> between the two.

Basically: data = data.replace("]]>", "]]]]><![CDATA[>") (not necessarily correct, please verify)
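
One way to check that replacement is to wrap a payload containing ]]> and parse the result back. This is a minimal sketch; cdata_section is a hypothetical helper name, and the payload is invented for the example.

```python
import xml.etree.ElementTree as ET

def cdata_section(data):
    """Wrap data in CDATA, splitting any ']]>' across two sections."""
    # ']]>' would end the section early, so close the section after ']]'
    # and reopen a new one before '>'.
    safe = data.replace("]]>", "]]]]><![CDATA[>")
    return "<![CDATA[" + safe + "]]>"

payload = "if (a[b[0]]> c) { run(); }"
document = "<data>" + cdata_section(payload) + "</data>"

# Parsing the result should give back the original payload unchanged.
recovered = ET.fromstring(document).text
print(recovered == payload)
```

The same replacement could be applied to elem.text inside the patched serializer before it is written out.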

Comments
  • > I need to output XML that contains CDATA sections Why? It seems a strange requirement.
  • It's a requirement I have - chunks of CDATA are sometimes much more human-readable.
  • @bortzmeyer It's useful for adding HTML to KML (Google Maps XML files).
  • This does not seem possible anymore since the write method is not there, and the _serialize* functions are static
  • What should I do since I can't use _write? So that means I can't use xml.elementtree? This is terrible.
  • This recipe won't work for Python 2.7 or 3.2 (and 3.3) - check @amaury's answer below. Basically, the new ElementTree does not have a "_write" method that can be overridden anymore.
  • There is a CDATA element for etree you can use directly. lxml.de/api/lxml.etree.CDATA-class.html
  • This is huge from the "don't roll your own XML parser" perspective.
  • @iny I think your lxml link is broken.
  • This should work for Python 2.7 as well - as the original recipe does not. I just came up with another thing that is more complicated than this.
  • This needs updating to add the encoding kwarg to the _serialize_xml def
  • For Python 2.7, add an encoding arg to the serialize signature: change def _serialize_xml(write, elem, qnames, namespaces): to def _serialize_xml(write, elem, encoding, qnames, namespaces): and change write, elem, qnames, namespaces) to write, elem, encoding, qnames, namespaces). Also change et.write(sys.stdout.buffer.raw, "utf-8") to et.write(sys.stdout, "utf-8")
  • Thanks for copying my solution and posting it as yours! Way to go! Good luck bro!
  • Thank you! Your solution works great for me in Python 3.4.3, and it's really interesting that you only posted it yesterday, and I need it today. Haven't tested in 3.5, but I guess it will break sooner or later still, probably in the next version. Sigh.