Parse HTML page to get contents of <p> and <b> tags

I have a lot of HTML pages, each structured as a sequence of groups like this:

<p>
   <b> Keywords/Category:</b>
   "keyword_a, keyword_b"
</p>

The addresses of these pages are like https://some.page.org/year/0001, https://some.page.org/year/0002, etc.

How can I extract the keywords separately from each of these pages? I've tried BeautifulSoup, but without success. So far I've only written a program that prints the titles of the groups (the text between <b> and </b>).

from urllib.request import urlopen
from bs4 import BeautifulSoup

html_doc = urlopen('https://some.page.org/2018/1234').read()
soup = BeautifulSoup(html_doc, 'html.parser')
# Print the absolute address of every link on the page
for link in soup.find_all('a'):
    print('https://some.page.org' + link.get('href'))
# Print the text of every <b> element (the group titles)
for node in soup.find_all('b'):
    print(''.join(node.find_all(string=True)))

I can't test this without seeing the actual page source, but it seems you want the text of the <p> tags:

for node in soup.findAll('p'):
    print(node.text)
    # or: keywords = node.text.split(', ')
    # print(keywords)
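
Note that node.text also contains the bold label. A minimal sketch of one way to separate the two, using stripped_strings on a made-up snippet modelled on the question (the real pages may be laid out differently, and the surrounding quotes may not be present in the actual source):

```python
from bs4 import BeautifulSoup

# Sample markup is an assumption based on the question, not the real page.
sample_html = '''
<p>
   <b> Keywords/Category:</b>
   "keyword_a, keyword_b"
</p>
'''

soup = BeautifulSoup(sample_html, 'html.parser')

pairs = []
for node in soup.find_all('p'):
    # stripped_strings yields each non-blank text fragment, already trimmed:
    # the first is the <b> label, the rest is the keyword text.
    parts = list(node.stripped_strings)
    if len(parts) < 2:
        continue  # no label/keywords pair in this paragraph
    label, raw = parts[0], ' '.join(parts[1:])
    keywords = [k.strip() for k in raw.strip('"').split(',')]
    pairs.append((label, keywords))

print(pairs)
```

Each entry in pairs is a (label, keywords) tuple, so the keywords stay attached to the group they came from.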

You need to split your string, which in this case is the URL, on the / character, and then pick the chunks you want.

For example, if the URL is https://some.page.org/year/0001, I use the split function to split it on the / sign.

That turns it into a list. I then choose the pieces I need and convert them back to a string with the join() method. The split method is described in the Python documentation for str.split.
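
A small sketch of that idea, using the example address from the question. The variable names and the step of building the next page's address are illustrative assumptions, not something the question specifies:

```python
url = 'https://some.page.org/year/0001'

# Splitting on '/' gives: ['https:', '', 'some.page.org', 'year', '0001']
chunks = url.split('/')

# The last two chunks are the year segment and the page number.
year, page_id = chunks[-2], chunks[-1]

# Re-join the scheme and host, then build the address of the following page,
# keeping the four-digit zero padding.
base = '/'.join(chunks[:3])
next_url = '/'.join([base, year, '%04d' % (int(page_id) + 1)])

print(year, page_id)
print(next_url)
```

Joining with '/' rather than an empty string keeps the separators, so the result is again a usable URL.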

Assuming that for each block like

<p>
   <b> Keywords/Category:</b>
   "keyword_a, keyword_b"
</p>

you want to extract keyword_a and keyword_b together with their Keywords/Category label, an example would be:

 <p>
    <b>Mammals</b>
    "elephant, rhino"
 </p>
 <p>
    <b>Birds</b>
    "hummingbird, ostrich"
 </p>

Once you have the HTML code, you can do:

from bs4 import BeautifulSoup

html = '''<p>
    <b>Mammals</b>
    "elephant, rhino"
    </p>
    <p>
    <b>Birds</b>
    "hummingbird, ostrich"
    </p>'''

soup = BeautifulSoup(html, 'html.parser')

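p_elements = soup.find_all('p')
for p_element in p_elements:
    # Look up the <b> inside this <p>, not the first <b> in the whole document
    b_element = p_element.find('b')
    # Detach the label so that only the keywords remain in the <p> text
    b_element.extract()
    category = b_element.text.strip()
    keywords = p_element.text.strip()
    # Drop the surrounding quotes, then split on the separator
    keyword_a, keyword_b = keywords[1:-1].split(', ')
    print('Category:', category)
    print('Keyword A:', keyword_a)
    print('Keyword B:', keyword_b)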

Which prints:

Category: Mammals
Keyword A: elephant
Keyword B: rhino
Category: Birds
Keyword A: hummingbird
Keyword B: ostrich
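
The loop above expects exactly two keywords per group. If the number varies, one possible variant collects them into a dictionary instead. This is only a sketch: extract_groups is a hypothetical helper name, and the sample markup (including the extra keyword) is made up for illustration:

```python
from bs4 import BeautifulSoup


def extract_groups(html):
    """Return {category: [keywords]} for every <p> that contains a <b> label."""
    soup = BeautifulSoup(html, 'html.parser')
    groups = {}
    for p in soup.find_all('p'):
        b = p.find('b')
        if b is None:
            continue  # paragraph without a bold label, skip it
        category = b.get_text(strip=True).rstrip(':')
        b.extract()  # remove the label so only the keyword text is left
        raw = p.get_text().strip().strip('"')
        groups[category] = [k.strip() for k in raw.split(',') if k.strip()]
    return groups


# Made-up sample with a different number of keywords per group.
sample = '''<p><b>Mammals</b> "elephant, rhino, giraffe"</p>
<p><b>Birds</b> "ostrich"</p>'''

groups = extract_groups(sample)
print(groups)
```

The same function could then be called once per downloaded page, with the page's HTML passed in as the argument.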

Comments
  • It seems the data is inside p tags, but your code selects b tags. I think you should select p tags instead.
  • +1 for not using a regexp! (stackoverflow.com/questions/1732348/…)