How to extract the text between two different closed HTML tags when the text is not inside either tag?


On a web page with many b tags with the same class names, I want to extract the text between 2 different closed HTML 'b' tags, specifically these b tags:

 <b style="display:block">Print Method:</b>
 "
                                On-demand inkjet (piezoelectric)"
<b style="display:block">Minimum Ink Droplet Volume:</b>

I tried to use the Beautiful Soup library to get the data by creating a table, using findAll.

b.text

It prints all the text from all the b tags. Is there any way I can get only the text in between those tags?

Here's the website where I'm getting the HTML from.


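There is a much easier way for the example you show. Use :contains with bs4 4.7.1+ (newer Soup Sieve releases deprecate :contains in favour of :-soup-contains, which takes the same argument).

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.epson.co.in/For-Home/Printers/EcoTank-Printers/EcoTank-L1110-Single-function-InkTank-Printer/p/C11CG89504')
soup = BeautifulSoup(r.content, 'lxml')
# The wanted text is the text node that directly follows the matching <b> tag.
print(soup.select_one('b:contains("Print Method:")').next_sibling.strip())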

You could also have done:

print(soup.select_one('.product-classifications b:has(+b)').next_sibling.strip())
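The sibling lookup used above can be tried offline on a small sample. The following is a minimal sketch, not the answer's exact code: the wrapping div, the helper name text_after_label, and the pairing of "3 pl" with the second label are assumptions for illustration, and it locates the tag with find(string=...) instead of the :contains selector.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup modelled on the question. The surrounding <div> and the
# "3 pl" value after the second label are assumptions for illustration.
html = """
<div>
  <b style="display:block">Print Method:</b>
  On-demand inkjet (piezoelectric)
  <b style="display:block">Minimum Ink Droplet Volume:</b>
  3 pl
</div>
"""


def text_after_label(soup, label):
    """Return the stripped text node directly following <b>label</b>, or None."""
    tag = soup.find("b", string=label)
    if tag is None:
        return None
    sibling = tag.next_sibling
    # next_sibling can be None or another tag; only accept a plain text node.
    if isinstance(sibling, NavigableString):
        return sibling.strip()
    return None


soup = BeautifulSoup(html, "html.parser")
print(text_after_label(soup, "Print Method:"))  # On-demand inkjet (piezoelectric)
```

Returning None for a missing label or a non-text sibling avoids the AttributeError that a chained .next_sibling.strip() would raise when the selector finds nothing.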



See below. Note that the code is not very efficient, since it scans every entry in the document.

from bs4 import BeautifulSoup

html = ''' <b style="display:block">Print Method:</b>
 "
                                On-demand inkjet (piezoelectric)"
<b style="display:block">Minimum Ink Droplet Volume:</b>'''

soup = BeautifulSoup(html, 'html.parser')
idx_lst = []
data_idx = -1
for idx, entry in enumerate(soup.contents):
    if entry.name == 'b':
        idx_lst.append(idx)
        if len(idx_lst) == 2:
            if idx_lst[1] - idx_lst[0] == 2:
                data_idx = idx_lst[0] + 1
                break
            else:
                idx_lst = []

if data_idx != -1:
    print(soup.contents[data_idx])

output

 "
                                On-demand inkjet (piezoelectric)"

The code below handles the real HTML

import requests
from bs4 import BeautifulSoup

URL = 'https://www.epson.co.in/For-Home/Printers/EcoTank-Printers/EcoTank-L1110-Single-function-InkTank-Printer/p/C11CG89504'

findings = set()
r = requests.get(URL)
if r.status_code == 200:
    soup = BeautifulSoup(r.text, 'html.parser')
    idx_lst = []
    data_idx = -1
    b_lst = soup.find_all('b', style='display:block')
    for entry in b_lst:
        for idx, x in enumerate(entry.parent.contents):
            if x.name == 'b' and idx not in idx_lst:
                idx_lst.append(idx)
            if len(idx_lst) == 2:
                if idx_lst[1] - idx_lst[0] == 2 or idx_lst[1] - idx_lst[0] == 3:
                    data_idx = idx_lst[0] + 1
                    findings.add(entry.parent.contents[data_idx].strip())
                    idx_lst = []
                else:
                    idx_lst = []

for idx, p in enumerate(findings, 1):
    print('{}) {}'.format(idx, p))

output

1) 215.9 x 1200 mm (8.5 x 47.24")
2) 1
3) ESC / P-R
4) 5760 x 1440 dpi (with Variable-Sized Droplet Technology)
5) Friction feed
6) Sound Power Level (Black / Colour): 6.6 B(A) / 6.3 B(A)
7) 180 nozzles Black, 59 nozzles per colour (Cyan, Magenta, Yellow)
8) On-demand inkjet (piezoelectric)
9) Bi-directional printing
10) Up to 33 ppm / 15 ppm
11) Legal, Indian-Legal (215 x 345 mm), 8.5 x 13", Letter, A4, 16K (195 x 270 mm), B5, A5, B6, A6, Hagaki (100 x 148 mm), 5 x 7", 4 x 6", Envelopes: #10, DL, C6
12) 3 pl
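If the goal is every label/value pair rather than a bare list of values, the index bookkeeping above can probably be avoided by asking each b tag for its next sibling. This is a hedged sketch on made-up sample markup: the div, the "Printing Direction:" and "Paper Feed Method:" labels, the "Heading only" tag, and the helper name label_value_pairs are assumptions, and only the values are taken from the output above.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical sample markup; the real page layout may differ.
html = """
<div class="product-classifications">
  <b style="display:block">Print Method:</b>
  On-demand inkjet (piezoelectric)
  <b style="display:block">Printing Direction:</b>
  Bi-directional printing
  <b style="display:block">Heading only</b><b style="display:block">Paper Feed Method:</b>
  Friction feed
</div>
"""


def label_value_pairs(soup):
    """Map each <b> label to the non-empty text node that directly follows it."""
    pairs = {}
    for b in soup.find_all("b", style="display:block"):
        sibling = b.next_sibling
        # Skip <b> tags followed by another tag or by whitespace only.
        if isinstance(sibling, NavigableString) and sibling.strip():
            label = b.get_text(strip=True).rstrip(":")
            pairs[label] = sibling.strip()
    return pairs


pairs = label_value_pairs(BeautifulSoup(html, "html.parser"))
for label, value in pairs.items():
    print("{}: {}".format(label, value))
```

Keeping the label next to each value also sidesteps the unordered set used above, which loses the information about which specification a value belongs to.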



I guess something like this is a bit more readable: simply filter on the attributes and extract the text.

from bs4 import BeautifulSoup
# Name the parser explicitly so bs4 does not have to guess one.
id_soup = BeautifulSoup('<b style="my id">  tralala </b>', 'html.parser')

if id_soup.b['style'] == 'my id':
    print(id_soup.text)
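When a document holds several b tags, the same attribute filter can be handed to find_all rather than checked by hand. A small sketch under assumed sample markup (the extra b tags and the wrapping p are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one matching tag, one without a style, one with another style.
html = '<p><b style="my id">  tralala </b><b>other</b><b style="x">nope</b></p>'
soup = BeautifulSoup(html, "html.parser")

# find_all with attrs keeps only the <b> tags whose style matches exactly;
# tags lacking a style attribute are skipped instead of raising KeyError.
matches = [b.get_text(strip=True) for b in soup.find_all("b", attrs={"style": "my id"})]
print(matches)  # ['tralala']
```

Note that this returns the text inside the matching tags; to get the text that follows a tag, its next_sibling would still be needed.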

Hope it helps :)
