java regex to retrieve link from text

I have an input String:

String text = "Some content which contains link as <A HREF=\"/relative-path/fruit.cgi?param1=abc&param2=xyz\">URL Label</A> and some text after it";

I want to convert this text to:

Some content which contains link as http://www.google.com/relative-path/fruit.cgi?param1=abc&param2=xyz&myParam=pqr (URL Label) and some text after it

So here:

1) I want to replace the link tag with a plain link. If the tag contains a label, it should go in parentheses after the URL.

2) If the URL is relative, I want to prefix the base URL (http://www.google.com).

3) I want to append a parameter to the URL. (&myParam=pqr)

I am having issues retrieving the tag with URL and label, and replacing it.

I wrote something like:

public static void main(String[] args) {
    String text = "Some content which contains link as <A HREF=\"/relative-path/fruit.cgi?param1=abc&param2=xyz\">URL Label</A> and some text after it";
    text = text.replaceAll("&lt;", "<");
    text = text.replaceAll("&gt;", ">");
    text = text.replaceAll("&amp;", "&");

    // this is not working
    Pattern p = Pattern.compile("href=\"(.*?)\"");
    Matcher m = p.matcher(text);
    String url = null;
    if (m.find()) {
        url = m.group(1);

    }
}

// helper method to append new query params once I have the url
public static URI appendQueryParams(String uriToUpdate, String queryParamsToAppend) throws URISyntaxException {
    URI oldUri = new URI(uriToUpdate);
    String newQueryParams = oldUri.getQuery();
    if (newQueryParams == null) {
        newQueryParams = queryParamsToAppend;
    } else {
        newQueryParams += "&" + queryParamsToAppend;  
    }
    URI newUri = new URI(oldUri.getScheme(), oldUri.getAuthority(),
            oldUri.getPath(), newQueryParams, oldUri.getFragment());
    return newUri;
}

Edit1:

Pattern p = Pattern.compile("HREF=\"(.*?)\"");

This works, but I want it to be case-insensitive: Href, HRef, href, hrEF, etc. should all match.

Also, how do I handle text that contains several URLs?

Edit2:

Some progress.

Pattern p = Pattern.compile("href=\"(.*?)\"");
Matcher m = p.matcher(text);
String url = null;
while (m.find()) {
  url = m.group(1);
  System.out.println(url);
}

This handles the case of multiple URLs.

The last pending issue is how to get hold of the label and replace the anchor tags in the original text with the URL and label.

Edit3:

By multiple URL cases, I mean that several URLs are present in the given text:

String text = "Some content which contains link as &lt;A HREF=\"/relative-path/fruit.cgi?param1=abc&amp;param2=xyz\"&gt;URL Label&lt;/A&gt; and some text after it and another link &lt;A HREF=\"/relative-path/vegetables.cgi?param1=abc&amp;param2=xyz\"&gt;URL2 Label&lt;/A&gt; and some more text";

Pattern p = Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(text);
String url = null;
while (m.find()) {
 url = m.group(1); // this variable should contain the link URL
 url = appendBaseURI(url);
 url = appendQueryParams(url, "license=ABCXYZ").toString(); // the helper above returns a URI
 System.out.println(url);
}

public static void main(String[] args) throws URISyntaxException {
    String text = "Some content which contains link as &lt;A HREF=\"/relative-path/fruit.cgi?param1=abc&amp;param2=xyz\"&gt;URL Label&lt;/A&gt; and some text after it and another link &lt;A HREF=\"/relative-path/vegetables.cgi?param1=abc&amp;param2=xyz\"&gt;URL2 Label&lt;/A&gt; and some more text";
    // Decode &lt;, &gt; and &amp; so that the anchor tags can be matched
    text = StringEscapeUtils.unescapeHtml4(text);
    // Group 1 is the href value, group 2 is the label
    Pattern p = Pattern.compile("<a href=\"(.*?)\">(.*?)</a>", Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(text);
    while (m.find()) {
        // The matcher keeps scanning the original string, so reassigning text here is safe
        text = text.replace(m.group(0), cleanUrlPart(m.group(1), m.group(2)));
    }
    System.out.println(text);
}

// Uses appendQueryParams from the question, which throws URISyntaxException
private static String cleanUrlPart(String url, String label) throws URISyntaxException {
    // Prefix the base URL only when the link is relative
    if (!url.startsWith("http") && !url.startsWith("www")) {
        if (url.startsWith("/")) {
            url = "http://www.google.com" + url;
        } else {
            url = "http://www.google.com/" + url;
        }
    }
    url = appendQueryParams(url, "myParam=pqr").toString();
    if (label != null && !label.isEmpty()) url += " (" + label + ")";
    return url;
}

Output

Some content which contains link as http://www.google.com/relative-path/fruit.cgi?param1=abc&param2=xyz&myParam=pqr (URL Label) and some text after it and another link http://www.google.com/relative-path/vegetables.cgi?param1=abc&param2=xyz&myParam=pqr (URL2 Label) and some more text

You can use StringEscapeUtils from Apache Commons Text to decode the HTML entities and then call replaceAll, e.g.:

import org.apache.commons.text.StringEscapeUtils;

String text = "Some content which contains link as &lt;A HREF=\"/relative-path/fruit.cgi?param1=abc&amp;param2=xyz\"&gt;URL Label&lt;/A&gt; and some text after it";
String output = StringEscapeUtils.unescapeHtml4(text).replaceAll("([^<]+).+\"(.*?)\">(.*?)<[^>]+>(.*)", "$1https://google.com$2&your_param ($3)$4");
System.out.print(output);
// Some content which contains link as https://google.com/relative-path/fruit.cgi?param1=abc&param2=xyz&your_param (URL Label) and some text after it

Demos:

  1. jdoodle
  2. Regex Explanation

// this is not working

It fails because your regex is case-sensitive: the pattern looks for lowercase href, but the text contains HREF.

Try:

Pattern p = Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);

Edit1: To get the label, use Pattern.compile("(?<=>).*?(?=</a>)", Pattern.CASE_INSENSITIVE) and m.group(0).
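
As an alternative to a separate lookaround pattern for the label, the URL and the label can be captured together as two groups of one case-insensitive pattern. The following is only a sketch: the class name LinkExtractor and the method extract are made up for illustration, and the pattern assumes href is the only attribute of the tag and is wrapped in double quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original question or answer.
class LinkExtractor {
    // Group 1 = href value, group 2 = label.
    // CASE_INSENSITIVE makes "<A HREF", "<a hRef", "</A>" etc. all match.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s+href=\"(.*?)\"\\s*>(.*?)</a>", Pattern.CASE_INSENSITIVE);

    // Returns one {url, label} pair per link found in the text.
    static List<String[]> extract(String text) {
        List<String[]> links = new ArrayList<>();
        Matcher m = LINK.matcher(text);
        while (m.find()) {
            links.add(new String[] { m.group(1), m.group(2) });
        }
        return links;
    }

    public static void main(String[] args) {
        String text = "one <A HREF=\"/a.cgi?x=1\">First</A> two <a hRef=\"/b.cgi\">Second</a>";
        for (String[] link : extract(text)) {
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}
```

Because find() is called in a loop, the same code covers text with several links.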

Edit2: To replace the tag (including label) with your final string, use:

text.replaceAll("(?i)<a href=\"(.*?)</a>", "new substring here")
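
When the replacement has to be computed separately for each link, a fixed replacement string is not enough. One possible approach is Matcher.appendReplacement together with appendTail, sketched below. The class name LinkReplacer and the method replaceLinks are hypothetical, the pattern again assumes a double-quoted href as the only attribute, and the replacement here is just "url (label)"; the base URL and extra parameter would be applied at the marked line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class LinkReplacer {
    private static final Pattern LINK = Pattern.compile(
            "<a\\s+href=\"(.*?)\"\\s*>(.*?)</a>", Pattern.CASE_INSENSITIVE);

    // Rewrites every anchor tag in the text as "url (label)".
    static String replaceLinks(String text) {
        Matcher m = LINK.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Any per-link processing of the URL would go here.
            String replacement = m.group(1) + " (" + m.group(2) + ")";
            // quoteReplacement keeps '$' and '\' in the URL from being
            // interpreted as group references by appendReplacement.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        // Copy whatever follows the last match.
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceLinks(
                "see <A HREF=\"/x?a=1&b=2\">X</A> and <a href=\"/y\">Y</a>."));
    }
}
```

Unlike calling text.replace inside the loop, this builds the result in a single pass and replaces each match at its own position, so two identical tags are handled independently.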

Almost there:

public static void main(String[] args) throws URISyntaxException {
        String text = "Some content which contains link as &lt;A HREF=\"/relative-path/fruit.cgi?param1=abc&amp;param2=xyz\"&gt;URL Label&lt;/A&gt; and some text after it and another link &lt;A HREF=\"/relative-path/vegetables.cgi?param1=abc&amp;param2=xyz\"&gt;URL2 Label&lt;/A&gt; and some more text";
        text = StringEscapeUtils.unescapeHtml4(text);
        System.out.println(text);
        System.out.println("**************************************");
        Pattern patternTag = Pattern.compile("<a([^>]+)>(.+?)</a>", Pattern.CASE_INSENSITIVE);
        Pattern patternLink = Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);
        Matcher matcherTag = patternTag.matcher(text);

        while (matcherTag.find()) {
            String href = matcherTag.group(1); // href
            String linkText = matcherTag.group(2); // link text
            System.out.println("Href: " + href);
            System.out.println("Label: " + linkText);
            Matcher matcherLink = patternLink.matcher(href);
            String finalText = null;
            if (matcherLink.find()) {
                String link = matcherLink.group(1);
                System.out.println("Link: " + link);
                finalText = getFinalText(link, linkText);
            }
            System.out.println("***************************************");
            // replacing logic goes here
        }
        System.out.println(text);
    }

    public static String getFinalText(String link, String label) throws URISyntaxException {
        link = appendBaseURI(link);
        link = appendQueryParams(link, "myParam=ABCXYZ");
        return link + " (" + label + ")";
    }

    public static String appendQueryParams(String uriToUpdate, String queryParamsToAppend) throws URISyntaxException {
        URI oldUri = new URI(uriToUpdate);
        String newQueryParams = oldUri.getQuery();
        if (newQueryParams == null) {
            newQueryParams = queryParamsToAppend;
        } else {
            newQueryParams += "&" + queryParamsToAppend;  
        }
        URI newUri = new URI(oldUri.getScheme(), oldUri.getAuthority(),
                oldUri.getPath(), newQueryParams, oldUri.getFragment());
        return newUri.toString();
    }

    public static String appendBaseURI(String url) {
        String baseURI = "http://www.google.com/";
        if (url.startsWith("/")) {
            url = url.substring(1, url.length());
        }
        if (url.startsWith(baseURI)) {
            return url;
        } else {
            return baseURI + url;
        }
    }
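
As a possible alternative to the string checks in appendBaseURI, java.net.URI can resolve a link against a base itself: a relative link is combined with the base, and an absolute link is returned unchanged. This is a minimal sketch; the class name BaseUrlResolver and the method toAbsolute are made up for illustration, and the base is assumed to end with "/".

```java
import java.net.URI;

// Hypothetical helper, not part of the original answer.
class BaseUrlResolver {
    // Resolves link against base. The base is assumed to end with "/".
    // URI.create throws IllegalArgumentException if either string is not
    // a valid URI (for example, if it contains spaces).
    static String toAbsolute(String base, String link) {
        return URI.create(base).resolve(link).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.google.com/";
        System.out.println(toAbsolute(base, "/relative-path/fruit.cgi?param1=abc&param2=xyz"));
        System.out.println(toAbsolute(base, "https://example.com/a?b=1"));
    }
}
```

With this approach the base URL is passed in as an argument rather than hard-coded, and links that already carry a scheme and host are left alone.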

Comments
  • Start by decoding the HTML entities: import org.apache.commons.lang.StringEscapeUtils; String entities_decode = StringEscapeUtils.unescapeHtml(text);
  • Oh, I didn't see this before posting my own answer. I am still struggling with the replacing part. I will try to finish my answer first and otherwise try yours. Thanks!
  • This is really sleek and will fit my requirements perfectly if it can handle the multiple-URL scenario. Also, I guess your solution assumes that the URL always has to be prefixed with google.com, which is not the case, as mentioned in point (2) of my question: I add the base URI only if it is missing. Thanks for the answer though! I will try to expand on it.
  • Make the base URL dynamic as well.
  • Thanks. I just found this out and have edited the question accordingly.
  • So this doesn't answer your question? If not, what's the next issue?
  • Three issues, actually: 1) How do I handle multiple URL cases? 2) How do I get hold of the label? 3) Once I have the URLs with the base URL prefixed and the parameter attached, how do I replace them in the original text?
  • 1) What do you mean by multiple URL cases? Can you update your question with an example? 2) I updated the answer for the label. 3) Just like you replaced before, do the reverse, and use replace instead of replaceAll.
  • Edited. I did not understand the replace part. What do you mean by "like you replaced before"?