How to unescape HTML entities but leave XML entities untouched?

This is the input:

<div>The price is &lt; 5 &euro;</div>

It is valid HTML but not valid XML (because &euro; is not declared in the DTD). Valid XML would look like:

<div>The price is &lt; 5 &#8364;</div>

Can you recommend a Java library that can help me unescape HTML entities and convert them to XML entities?

The list of all HTML named character references is available at http://www.whatwg.org/specs/web-apps/current-work/multipage/entities.json

If you can tolerate the occasional mistake, you could just go over that file and replace all named character references that are not allowed in stand-alone XML with the corresponding numeric character reference.
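As a rough sketch of that replacement, the class below rewrites named references as decimal numeric references and leaves the five XML-predefined entities alone. The class name, method name and the four-entry lookup table are illustrative assumptions; a real table would be loaded from the entities.json file mentioned above. It needs Java 9 or later.

```java
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityToNumeric {

    // The five entities predefined by XML; these are left untouched.
    private static final Set<String> XML_PREDEFINED =
        Set.of("lt", "gt", "amp", "quot", "apos");

    // Illustrative subset only (assumption); load the full list from entities.json.
    private static final Map<String, Integer> HTML_NAMED = Map.of(
        "euro", 0x20AC,
        "nbsp", 0x00A0,
        "copy", 0x00A9,
        "auml", 0x00E4);

    // Matches references of the form &name; and captures the name.
    private static final Pattern NAMED_REF = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    public static String toNumericReferences(String input) {
        Matcher m = NAMED_REF.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String name = m.group(1);
            Integer codePoint = HTML_NAMED.get(name);
            // Unknown names and XML-predefined names are copied through unchanged.
            String replacement = (codePoint == null || XML_PREDEFINED.contains(name))
                ? m.group()
                : "&#" + codePoint + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints <div>The price is &lt; 5 &#8364;</div>
        System.out.println(toNumericReferences("<div>The price is &lt; 5 &euro;</div>"));
    }
}
```

Because this works on the raw string, it knows nothing about element boundaries and will also rewrite anything that merely looks like a reference.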

That simple approach can run into problems though if your input is HTML, not XHTML:

<script>var y=1, lt = 3, x = y&lt; alert(x);</script>

contains a script element whose content is not encoded using entities, so naively replacing the &lt; here will break the script. Other elements such as <xmp> and <style> can have similar problems, as can CDATA sections in foreign XML elements.

If you need a really faithful conversion, or if your HTML is messy, your best bet might be to parse the HTML to a DOM using something like nu.validator and then serialize that DOM as valid XML (see "How to pretty print XML from Java?").

Even if your input is XHTML, you might need to worry about character sequences that look like entities in CDATA sections. Again, parse and re-render might be your best option.

Apache Commons Lang's StringEscapeUtils.unescapeHtml would do. XML APIs in general escape XML entities themselves: if you set a DOM attribute or text content containing &, the generated output contains &amp;. You can leave the characters in UTF-8; there is no need to turn them into numeric entities.

Of course you could also process the HTML DTD, which would fill in the characters too, but this may take tens of seconds. There are very many entities and DTD includes, and the servers hosting them are slow, so it is better to set up a local XML catalog or a caching entity handler for those DTDs.

import org.apache.commons.lang.StringEscapeUtils;

    String html = "<div>The price is &lt; 5 &euro;</div>";
    String text = StringEscapeUtils.unescapeHtml(html);
    System.out.println("Text: " + text);

Output on Linux with a UTF-8 locale:

Text: <div>The price is < 5 €</div>

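This shows that attribute values and inner text should be handled piecewise: unescaping the whole document at once also turns &lt; into a bare <, which is no longer valid markup.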
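The remark that XML APIs escape on output can be illustrated with JDK classes only. This is a minimal sketch, not part of the original answer: the class and method names are made up for the example, and the input is assumed to be text that has already been unescaped.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DomEscapeDemo {

    // Puts already-unescaped text into a <div> and serializes it as XML.
    public static String wrapInDiv(String plainText) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element div = doc.createElement("div");
            // The DOM stores raw characters; escaping happens at serialization.
            div.setTextContent(plainText);
            doc.appendChild(div);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The bare '<' in the text comes out as &lt; in the serialized XML.
        System.out.println(wrapInDiv("The price is < 5 \u20AC"));
    }
}
```

Handling each text node or attribute value this way means only the text is unescaped, and the serializer takes care of re-escaping < and &.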

Using Apache Commons Text, a class that only replaces the HTML-specific entities:

import org.apache.commons.text.translate.AggregateTranslator;
import org.apache.commons.text.translate.CharSequenceTranslator;
import org.apache.commons.text.translate.EntityArrays;
import org.apache.commons.text.translate.LookupTranslator;
import org.apache.commons.text.translate.NumericEntityUnescaper;


public class HtmlEscapeUtils {

  /**
   * @see {@link org.apache.commons.text.StringEscapeUtils#UNESCAPE_HTML4}
   */
  public static final CharSequenceTranslator UNESCAPE_HTML_SPECIFIC =
      new AggregateTranslator(
          new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE),
          new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE),
          new NumericEntityUnescaper());


  /**
   * @see {@link org.apache.commons.text.StringEscapeUtils#unescapeHtml4(String)}
   * @param input - HTML String with e.g. &quot; &amp; &auml;
   * @return XML String, HTML4 entities replaced, but XML entities remain (e.g. &quot; and &amp;)
   */
  public static final String unescapeHtmlToXml(final String input) {
    return UNESCAPE_HTML_SPECIFIC.translate(input);
  }

}

Comments
  • Do you want to do this to a whole document, or just the entity text? Are you trying to read an HTML file in as XML? (if so, there is more than just entities to worry about)
  • can you give a practical Java example, that would work with my texts (see above)?