Parsing XML using PHP - Which includes ampersands and other characters

I'm trying to parse an XML file and one of the fields looks like the following:

<link>http://foo.com/this-platform/scripts/click.php?var_a=a&var_b=b&varc=http%3A%2F%2Fwww.foo.com%2Fthis-section-here%2Fperf%2F229408%3Fvalue%3D0222%26some_variable%3Dmeee</link>

This seems to break the parser. I think it might have something to do with the & in the link?

My code is quite simple:

<?php

$xml = simplexml_load_file("files/this.xml");

echo $xml->getName() . "<br />";

foreach($xml->children() as $child) {
  echo $child->getName() . ": " . $child . "<br />";
}
?>

Any ideas how I can resolve this?

The XML snippet you posted is not valid. Ampersands have to be escaped, which is why the parser complains.
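
As a rough sketch of what a valid version of that field looks like: each & in the URL is written as &amp;, and SimpleXML hands back the original URL with plain ampersands. The <item> wrapper element and the shortened URL are assumptions made for illustration; only the <link> field comes from the question.

```php
<?php
// Hypothetical wrapper element and shortened URL; only <link> is from the question.
$validXml = <<<XML
<item>
  <link>http://foo.com/this-platform/scripts/click.php?var_a=a&amp;var_b=b</link>
</item>
XML;

$xml = simplexml_load_string($validXml);

// Casting the element to a string decodes &amp; back to a literal &.
$link = (string) $xml->link;
echo $link, PHP_EOL;
```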

Your XML feed is not valid XML : the & should be escaped as &amp;

This means you cannot use an XML parser on it :-(

A possible "solution" (feels wrong, but should work) would be to replace '&' that are not part of an entity by '&amp;', to get a valid XML string before loading it with an XML parser.

In your case, considering this :

$str = <<<STR
<xml>
  <link>http://foo.com/this-platform/scripts/click.php?var_a=a&var_b=b&varc=http%3A%2F%2Fwww.foo.com%2Fthis-section-here%2Fperf%2F229408%3Fvalue%3D0222%26some_variable%3Dmeee</link>
</xml>
STR;

You might use a simple call to str_replace, like this :

$str = str_replace('&', '&amp;', $str);

And, then, parse the string (now XML-valid) that's in $str :

$xml = simplexml_load_string($str);
var_dump($xml);

In this case, it should work...

But note that you must take care with entities: if the feed already contains an entity like '&gt;', it must not be turned into '&amp;gt;'!

Which means that such a simple call to str_replace is not the right solution : it will probably break stuff on many XML feeds !
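
To make that pitfall concrete, here is a small sketch. The input string is made up for illustration: it has one bare ampersand and one entity that is already escaped.

```php
<?php
// Hypothetical input: <a> holds a bare ampersand, <b> holds an existing entity.
$str = '<xml><a>x&y</a><b>1 &gt; 0</b></xml>';

// Naive replacement escapes every ampersand, including the one in &gt;
$naive = str_replace('&', '&amp;', $str);
$xml = simplexml_load_string($naive);

$a = (string) $xml->a;  // the bare ampersand is repaired
$b = (string) $xml->b;  // the existing entity was double-escaped
echo $a, PHP_EOL, $b, PHP_EOL;
```

The document now parses, but <b> no longer contains a greater-than sign; it contains the literal text &gt;.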

Up to you to find the right way to do that replacement -- maybe with some kind of regex...

It breaks the parser because your XML is invalid - & should be encoded as &amp;.

If your XML already has some escaping, this way it will be preserved and unescaped ampersands will be fixed:

$brokenXmlText = file_get_contents("files/this.xml");
$fixed = preg_replace('/&(?!lt;|gt;|quot;|apos;|amp;|#)/', '&amp;', $brokenXmlText);
$xml = simplexml_load_string($fixed);
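
A slightly stricter variant of the same idea, as a sketch only: it also insists that a numeric reference is well formed (&#38; or &#x26;) before leaving it alone, and it collects libxml errors instead of letting warnings through. The helper name escapeBareAmpersands and the sample input are assumptions for illustration, and the type hints assume PHP 7 or later.

```php
<?php
// Hypothetical helper: escapes only ampersands that do not start a
// predefined XML entity or a well-formed numeric character reference.
function escapeBareAmpersands(string $text): string
{
    return preg_replace(
        '/&(?!(?:lt|gt|quot|apos|amp);|#\d+;|#x[0-9a-fA-F]+;)/',
        '&amp;',
        $text
    );
}

// Made-up sample: bare ampersand in <link>, existing entities in <note>.
$broken = '<xml><link>http://foo.com/click.php?var_a=a&var_b=b</link>'
        . '<note>1 &gt; 0 &#38; ok</note></xml>';
$fixed = escapeBareAmpersands($broken);

libxml_use_internal_errors(true);   // collect parse errors instead of emitting warnings
$xml = simplexml_load_string($fixed);

if ($xml === false) {
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message), PHP_EOL;
    }
    libxml_clear_errors();
} else {
    echo (string) $xml->link, PHP_EOL;
}
```

This still only knows the five predefined XML entities. A feed that declares its own entities in a DTD would need those names added to the pattern.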

The comment by mjv resolved it:

Alternatively to using &amp;, you may consider putting the urls and other XML-unfriendly content in <![CDATA[your_stuff_here]]>, i.e. a Character Data block
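
A minimal sketch of the CDATA approach, which only applies if the file can be changed at the source. The document below is made up for illustration. Inside a CDATA section the raw ampersands are legal, and casting the element to a string returns the content unchanged.

```php
<?php
// Hypothetical document: raw ampersands are allowed inside a CDATA section.
$str = '<xml><link><![CDATA[http://foo.com/click.php?var_a=a&var_b=b]]></link></xml>';

$xml = simplexml_load_string($str);

// The string cast yields the CDATA content as plain text.
$link = (string) $xml->link;
echo $link, PHP_EOL;
```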

Comments
  • I can't change the feed unfortunately, so there is no other way other than regex to do this?
  • Alternatively to using &amp;, you may consider putting the urls and other XML-unfriendly content in <![CDATA[your_stuff_here]]>, i.e. a Character Data block
  • I can't change the feed unfortunately, so there is no other way other than regex to do this? Will this: <link><![CDATA[link]]></link> fix it? (if i can change the file?)
  • Mjv - if you want to place your comment in the form of an answer i will accept as it's made life easier now and my xml "valid"...