I am doing some Bash shell scripting with curl. If my curl command returns any text, I know I have an error. The text returned by curl is usually HTML. I figured that if I could strip out all of the HTML tags, I could display the resulting text as an error message.

I was thinking of something like this:

sed -E 's/<.*?>//g' <<<$output_text

But I get sed: 1: "s/<.*?>//": RE error: repetition-operator operand invalid

If I replace *? with *, I don't get the error (and I don't get any text either). If I remove the global (g) flag, I get the same error.

This is on Mac OS X.

sed doesn't support non-greedy quantifiers such as *?.
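
A common workaround, sketched here under the assumption that no tag contains a literal > inside an attribute value, is to replace the non-greedy .*? with the negated character class [^>]*. It also suggests why plain * gave no text: the greedy <.*> typically matches from the first < to the last > on a line, which swallows everything in between. The sample output_text below is hypothetical and stands in for whatever curl returned; -E is not needed for this pattern.

```shell
# Hypothetical sample standing in for curl's HTML error output.
output_text='<html><body><h1>404 Not Found</h1><p>No such page.</p></body></html>'

# [^>]* matches everything up to the next '>', so each tag is removed on its own.
stripped=$(printf '%s\n' "$output_text" | sed 's/<[^>]*>//g')

# The greedy form matches from the first '<' to the last '>' on the line.
greedy=$(printf '%s\n' "$output_text" | sed 's/<.*>//g')

echo "stripped: $stripped"
echo "greedy:   $greedy"
```

This only handles tags that open and close on the same line, because sed processes its input one line at a time.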



Maybe a parser-based Perl solution?

perl -0777 -MHTML::Strip -nlE 'say HTML::Strip->new->parse($_)' file.html

You must first install the HTML::Strip module with the command cpan HTML::Strip.


You can use a standard OS X utility called textutil (see the man page):

textutil -convert txt file.html

will produce file.txt with the HTML tags stripped, or

textutil -convert txt -stdin -stdout < file.html | some_command

Another alternative

Some systems have the lynx text-only browser installed. You can use:

lynx -dump file.html #or
lynx -stdin -dump < file.html

But in your case, you may only be able to rely on pure sed or awk solutions, IMHO.

But if you have perl (and only lack the HTML::Strip module), the following is still better than sed:

perl -0777 -pe 's/<.*?>//sg'

because it also removes tags that span several lines, which are common. For example, an opening tag split across lines that ends with:

>link text</a>

Code for GNU sed:

sed '/</ {:k s/<[^>]*>//g; /</ {N; bk}}' file

This might fail; you would be better off using an HTML-parsing tool.

In Python:

import re
notag = re.sub("<.*?>", " ", html)

The drawback of this solution is that it doesn't remove javascript or

If you want to remove all HTML tags and also all script tags (and their contents), you can use the following:

sed -i 's/<script>.*<\/script>//g;/<script>/,/<\/script>/{/<script>/!{/<\/script>/!d}};s/<script>.*//g;s/.*<\/script>//g' "$file" &&
sed -i '/</ {:k s/<[^>]*>//g; /</ {N; bk}}' "$file" &&
sed -i -r '/^\s*$/d' "$file"

