Selecting and stripping img src in HTML string

I'm interested in stripping the S3 credentials from image tags within a block of text that is represented as a string in Python.

For each tag in the string (of which there can be many), I'd like to start at ".jpeg", end at the next instance of a quotation mark, and delete everything in between those locations.

For example, the following string:

<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&amp;X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&amp;X-Amz-Date=20190430T021347Z&amp;X-Amz-Expires=3600&amp;X-Amz-SignedHeaders=host&amp;X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>

Would become:

<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>

I'm struggling to figure out how to do this. Any help would be appreciated.

Thanks!

Regex is not the right tool for this job. A more robust solution is to use an HTML parser like BeautifulSoup to extract the src attribute of the img tag, and a URL parser to remove the query from the URL:

from bs4 import BeautifulSoup
from urllib.parse import urlsplit

input_str = '''<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&amp;X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&amp;X-Amz-Date=20190430T021347Z&amp;X-Amz-Expires=3600&amp;X-Amz-SignedHeaders=host&amp;X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>'''

soup = BeautifulSoup(input_str, "html.parser")
img_url = soup.find('img')['src']
new_url = urlsplit(img_url)._replace(query=None).geturl()
soup.find('img')['src'] = new_url
print(soup)

Output:

<p><img class="note-float-right" src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;"/><br/></p><p><br/></p><p> This is extra text in the body.</p>

Edit: if you have more than one img tag per string, you can use:

input_str = '''<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&amp;X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&amp;X-Amz-Date=20190430T021347Z&amp;X-Amz-Expires=3600&amp;X-Amz-SignedHeaders=host&amp;X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>
                <img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&amp;X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&amp;X-Amz-Date=20190430T021347Z&amp;X-Amz-Expires=3600&amp;X-Amz-SignedHeaders=host&amp;X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br><p><br></p><p> This is extra text in the body.</p>'''

soup = BeautifulSoup(input_str, "html.parser")

for img in soup.find_all('img'):
    img_url = img['src']
    new_url = urlsplit(img_url)._replace(query=None).geturl()
    img['src'] = new_url
print(soup)

This will update the src attribute of each img tag:

<p><img class="note-float-right" src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;"/><br/></p><p><br/></p><p> This is extra text in the body.</p>
<img class="note-float-right" src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;"/><br/><p><br/></p><p> This is extra text in the body.</p>

Assuming the string is stored in s:

import re

s = re.sub(r'\.jpeg[^"]*', '.jpeg', s)

This matches each ".jpeg" together with everything up to (but not including) the next quotation mark and replaces the match with ".jpeg", so the closing quote of the src attribute stays in place.
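As a quick check of this kind of pattern against more than one tag, here is a minimal sketch on a made-up two-image string (example.com and the query values are placeholders, not the asker's data). It uses a variant of the pattern that stops just before the closing quote, and `re.subn` so the number of replacements is visible:

```python
import re

# Made-up two-image sample; example.com and the query values are placeholders.
s = (
    '<p><img src="https://example.com/a.jpeg?X-Amz-Expires=3600&amp;X-Amz-Signature=abc" class="x"></p>'
    '<p><img src="https://example.com/b.jpeg?X-Amz-Expires=3600&amp;X-Amz-Signature=def"></p>'
)

# [^"]* stops right before the closing quote, so the quote is never consumed.
cleaned, count = re.subn(r'\.jpeg[^"]*', '.jpeg', s)

print(count)    # number of replacements made
print(cleaned)
```

Note that this still touches any ".jpeg" followed by non-quote characters anywhere in the string, not only inside img tags.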

Using re, you can find and remove everything between ? and the next "

text = re.sub(r'\?[^"]+', '', text)

Example code

text = '<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&amp;X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&amp;X-Amz-Date=20190430T021347Z&amp;X-Amz-Expires=3600&amp;X-Amz-SignedHeaders=host&amp;X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>'
expected_result = '<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>'

import re

result = re.sub(r'\?[^"]+', '', text)

print(result == expected_result) # True

EDIT: if the surrounding text can itself contain ? and ", make the pattern more specific by anchoring it to the file extension:

result = re.sub(r'\.jpeg\?[^"]+', '.jpeg', text)
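If the pattern should only ever touch the src of an img tag (as asked in the comments), one option is to match the attribute itself and pass a function as the replacement. The following is a rough sketch, not a full HTML parser: the names `IMG_SRC`, `_drop_query` and `strip_img_queries` are made up for illustration, and it assumes src values are double-quoted, as in the question.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Group 1: the tag up to and including src=" ; group 2: the URL; group 3: closing quote.
# Assumption: src values are always wrapped in double quotes.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]*)(")', re.IGNORECASE)

def _drop_query(match):
    parts = urlsplit(match.group(2))
    # Rebuild the URL with an empty query and fragment.
    clean = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return match.group(1) + clean + match.group(3)

def strip_img_queries(html):
    """Remove query strings from img src attributes, leaving other text alone."""
    return IMG_SRC.sub(_drop_query, html)

# Made-up sample with ? and " in the body text as well as in the tag.
text = ('<p>Is this "ok"? <img src="https://example.com/a.jpeg?sig=abc&amp;x=1" '
        'class="c"> Really? "Yes".</p>')
print(strip_img_queries(text))
```

Because only the matched src value is rewritten, question marks and quotes in the body text are left as they are, and the file extension no longer matters.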

Use BeautifulSoup to parse the HTML, then use urlparse to split each image URL.

Example:

from bs4 import BeautifulSoup
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


html = """<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&amp;X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&amp;X-Amz-Date=20190430T021347Z&amp;X-Amz-Expires=3600&amp;X-Amz-SignedHeaders=host&amp;X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>"""
soup = BeautifulSoup(html, "html.parser")

for img in soup.find_all("img"):   #Find all img tags
    o = urlparse(img["src"])       #Get URL
    print(o.scheme + "://" + o.netloc + o.path)

Output:

https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg
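Joining scheme, netloc and path by hand works for absolute URLs like the one above, but for a relative src such as "/media/photo.jpeg" it would produce a stray "://" prefix. A small sketch of an alternative using `urlsplit` and `urlunsplit` from the standard library follows; `strip_query` is a made-up helper name and the URLs are placeholders.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    """Return url without its query string or fragment."""
    parts = urlsplit(url)
    # urlunsplit only adds "://" when there is a scheme or netloc to join.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

print(strip_query("https://example.com/media/photo.jpeg?X-Amz-Expires=3600&X-Amz-Signature=abc"))
print(strip_query("/media/photo.jpeg?X-Amz-Expires=3600"))
```

Inside the loop above, the result could be assigned back with img["src"] = strip_query(img["src"]) so that printing the soup shows the cleaned markup.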

Comments
  • What have you tried so far ?
  • Why don't u split it at "?" and then get the first item from the list using index 0?
  • I think that I'd have to split at <img to start, right?
  • Is this part of a bigger xml @JasonHoward ? If yes you can use xml parsers to make your life easy!
  • Nope, it's not. it's basically just the contents of a short blog post.
  • don't really want the extra tags (html and body) added to the string. How can we prevent this? Thanks
  • Edited my answer to address both of those issues.
  • This is represented as a string. The fact that it contains html shouldn't matter. I'll try this.
  • Is there any way to modify this so that we first look for the presence of an image tag and only modify the contents of that?
  • We could define a function that checks if the string contains "<img>" or not and then perform the replacement? Is that something you are looking for? That function will just be needed to pass instead of the replacement string.
  • @JasonHoward The fact that you care about html tags means that the text being html is relevant.
  • @JasonHoward the topic of parsing HTML with regex has passed into Stack Overflow folklore
  • @glhr regex is not right tool for parsing HTML but this problem doesn't need to parse all HTML
  • What if the text in the body contains ? and "?