How do I preserve line breaks when using jsoup to convert html to plain text?

I have the following code:

 import org.jsoup.Jsoup;

 public class NewClass {
     // Strips all tags; text() also normalises whitespace, so line breaks are lost
     public String noTags(String str) {
         return Jsoup.parse(str).text();
     }

     public static void main(String[] args) {
         String strings = "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN \">" +
         "<HTML> <HEAD> <TITLE></TITLE> <style>body{ font-size: 12px;font-family: verdana, arial, helvetica, sans-serif;}</style> </HEAD> <BODY><p><b>hello world</b></p><p><br><b>yo</b> <a href=\"http://google.com\">googlez</a></p></BODY> </HTML> ";

         NewClass text = new NewClass();
         System.out.println(text.noTags(strings));
     }
 }

And I have the result:

hello world yo googlez

But I want to break the line:

hello world
yo googlez

I have looked at jsoup's TextNode#getWholeText() but I can't figure out how to use it.

If there's a <br> in the markup I parse, how can I get a line break in my resulting output?

The real solution that preserves linebreaks should be like this:

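import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist; // newer jsoup releases name this class Safelist

public static String br2nl(String html) {
    if (html == null)
        return html;
    Document document = Jsoup.parse(html);
    // prettyPrint(false) makes html() preserve line breaks and spacing
    document.outputSettings(new Document.OutputSettings().prettyPrint(false));
    // mark each break with a literal backslash-n so it survives serialization
    document.select("br").append("\\n");
    document.select("p").prepend("\\n\\n");
    // turn the literal markers into real newlines
    String s = document.html().replaceAll("\\\\n", "\n");
    // strip every remaining tag without reformatting the text
    return Jsoup.clean(s, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
}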

It satisfies the following requirements:

  1. if the original html contains a newline (\n), it is preserved
  2. if the original html contains br or p tags, they are translated to newlines (\n).
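
A variation on the same idea, for cases where other block elements (div, li and so on) should also start a new line: walk the parsed tree and emit a newline for every br and around every block element. This is only a sketch; the class PlainTextSketch and its helpers are hypothetical names, and it relies on Element.isBlock(), Node.traverse() and TextNode.text() behaving as documented in your jsoup version. Unlike br2nl above, it normalises whitespace inside text nodes, so newlines in the source are not kept.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

// Hypothetical helper, not part of jsoup
final class PlainTextSketch {

    static String toPlainText(String html) {
        if (html == null) return null;
        final StringBuilder out = new StringBuilder();
        Jsoup.parse(html).body().traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    // text() returns the node's text with whitespace normalised
                    out.append(((TextNode) node).text());
                } else if (node instanceof Element) {
                    Element el = (Element) node;
                    if (el.tagName().equalsIgnoreCase("br")) {
                        out.append('\n');      // every <br> is a hard break
                    } else if (el.isBlock()) {
                        ensureNewline(out);    // block content starts on its own line
                    }
                }
            }

            @Override
            public void tail(Node node, int depth) {
                if (node instanceof Element && ((Element) node).isBlock()) {
                    ensureNewline(out);        // and ends its line
                }
            }
        });
        return out.toString().trim();
    }

    // Appends a newline unless the buffer is empty or already ends with one
    private static void ensureNewline(StringBuilder out) {
        if (out.length() > 0 && out.charAt(out.length() - 1) != '\n') {
            out.append('\n');
        }
    }
}
```

For example, toPlainText("<div>line 1</div><div>line 2</div><div>line 3</div>") should give the three lines separated by single newlines. Whitespace-only text between block elements is not filtered here, so real-world markup may need an extra cleanup pass.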

Jsoup.clean(unsafeString, "", Whitelist.none(), new OutputSettings().prettyPrint(false));

We're using this method here:

public static String clean(String bodyHtml,
                       String baseUri,
                       Whitelist whitelist,
                       Document.OutputSettings outputSettings)

By passing it Whitelist.none() we make sure that all HTML is removed.

By passing new OutputSettings().prettyPrint(false) we make sure that the output is not reformatted and line breaks are preserved.

With

Jsoup.parse("A\nB").text();

the output is

"A B" 

and not

A

B

For this I'm using:

descrizione = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2n")).text();
text = descrizione.replaceAll("br2n", "\n");
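
To see what the pattern in the first line actually matches, here is a small check that needs only the JDK. The class name BrPatternSketch, the method mark and the stricter pattern are illustrative assumptions, not part of the answer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for experimenting with the <br> pattern
final class BrPatternSketch {
    // The pattern used above: case-insensitive, tolerates attributes and "/>"
    static final Pattern LOOSE = Pattern.compile("(?i)<br[^>]*>");
    // Assumed stricter variant: \b keeps it from matching tags that merely start with "br"
    static final Pattern STRICT = Pattern.compile("(?i)<br\\b[^>]*>");

    // Replaces every match of the pattern with the given marker text
    static String mark(Pattern pattern, String html, String marker) {
        return pattern.matcher(html).replaceAll(Matcher.quoteReplacement(marker));
    }
}
```

mark(BrPatternSketch.LOOSE, "a<br>b<BR/>c<br class=\"x\" />d", "|") yields "a|b|c|d", so all common spellings of the tag are caught. The loose pattern would also match a tag such as <broken>, which the stricter variant leaves alone. Note too that a placeholder like "br2n" can collide with real text that happens to contain the same characters.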

Try this by using jsoup:

public static String cleanPreserveLineBreaks(String bodyHtml) {

    // first pass: strip everything except br and p; pretty printing puts them on separate lines
    String prettyPrintedBodyFragment = Jsoup.clean(bodyHtml, "", Whitelist.none().addTags("br", "p"), new OutputSettings().prettyPrint(true));
    // second pass: strip the remaining tags; with prettyPrint disabled the line breaks are kept
    return Jsoup.clean(prettyPrintedBodyFragment, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
}

Since jsoup v1.11.2, we can use Element.wholeText().

Example code:

String cleanString = Jsoup.parse(htmlString).wholeText();

user121196's answer still works, but wholeText() also preserves the original whitespace (and so the alignment) of the text.
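
A minimal comparison of the two calls, assuming jsoup 1.11.2 or later on the classpath (the class WholeTextSketch and its method names are made up for illustration):

```java
import org.jsoup.Jsoup;

// Hypothetical wrapper contrasting text() with wholeText()
final class WholeTextSketch {
    // text() normalises whitespace, so a newline in the source becomes a space
    static String normalised(String html) {
        return Jsoup.parse(html).text();
    }

    // wholeText() returns the text nodes as written, newlines included
    static String raw(String html) {
        return Jsoup.parse(html).wholeText();
    }
}
```

With the input "A\nB", normalised returns "A B" while raw returns "A\nB". Whether a <br> tag itself shows up as a newline in wholeText() appears to depend on the jsoup version, so test that against the release you use.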

Comments
  • edit your text - there is no line break showing up in your question. In general please read the preview of your question before posting it, to check everything is showing up right.
  • I asked the same question (without the jsoup requirement) but I still do not have a good solution: stackoverflow.com/questions/2513707/…
  • see @zeenosaur 's answer.
  • This should be the selected answer
  • br2nl is not the most helpful or accurate method name
  • This is the best answer. But how about for (Element e : document.select("br")) e.after(new TextNode("\n", "")); which appends a real newline and not the sequence \n? See Node::after() and Elements::append() for the difference. The replaceAll() is not needed in this case. The same applies to p and other block elements.
  • @user121196's answer should be the chosen answer. If you still have HTML entities after you clean the input HTML, apply StringEscapeUtils.unescapeHtml(...) Apache commons to the output from the Jsoup clean.
  • See github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/… for a comprehensive answer to this problem.
  • This should be the only correct answer. All others assume that only br tags produce new lines. What about any other block element in HTML such as div, p, ul etc? All of them introduce new lines too.
  • With this solution, the html "<html><body><div>line 1</div><div>line 2</div><div>line 3</div></body></html>" produced the output: "line 1line 2line 3" with no new lines.
  • This doesn't work for me; <br>'s aren't creating line breaks.