Combining regex with HTML tags

I have the following text from an HTML page:

page = 
"""
<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1. Business/</font> Unless otherwise indicated by the context, we use the terms "GE" and "GECC" on the basis of consolidation described in Note 1 to the consolidated financial statements in Part II, Item 8. "Financial Statements and Supplementary Data" of this Form 10-K Report. Also, unless otherwise indicated by the context, "General Electric" means the parent company, General Electric Company (the Company).

General Electric’s address is 1 River Road, Schenectady, NY 12345-6999; we also maintain executive offices at 3135 Easton Turnpike, Fairfield, CT 06828-0001.

<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1A. Risk Factors</font>"""

I want to obtain the text between "Item 1. Business" and "Item 1A. Risk Factors". I cannot use BeautifulSoup because each page has a different HTML tag structure. I use the following code to get the text, but it does not work:

regexs = ('bold;\">\s*Item 1\.(.+?)bold;\">\s*Item 1A\.',   #<===pattern 1: with an attribute bold before the item subtitle
              'b>\s*Item 1\.(.+?)b>\s*Item 1A\.',               #<===pattern 2: with a tag <b> before the item subtitle
              'Item 1\.\s*<\/b>(.+?)Item 1A\.\s*<\/b>',         #<===pattern 3: with a tag <\b> after the item subtitle          
              'Item 1\.\s*Business\.\s*<\/b(.+?)Item 1A\.\s*Risk Factors\.\s*<\/b') #<===pattern 4: with a tag <\b> after the item+description subtitle 

for regex in regexs:
    match = re.search(regex, page, flags=re.IGNORECASE|re.DOTALL)  #<===search for the pattern in HTML using re.search from the re package. Ignore cases.
    if match:
        soup = BeautifulSoup(match.group(1), "html.parser") #<=== match.group(1) returns the texts inside the parentheses (.*?) 

            #soup.text removes the html tags and only keep the texts
            #rawText = soup.text.encode('utf8') #<=== you have to change the encoding the unicodes
        rawText = soup.text
        print(rawText)
        break

The expected output is:

Unless otherwise indicated by the context, we use the terms "GE" and "GECC" on the basis of consolidation described in Note 1 to the consolidated financial statements in Part II, Item 8. "Financial Statements and Supplementary Data" of this Form 10-K Report. Also, unless otherwise indicated by the context, "General Electric" means the parent company, General Electric Company (the Company).

General Electric’s address is 1 River Road, Schenectady, NY 12345-6999; we also maintain executive offices at 3135 Easton Turnpike, Fairfield, CT 06828-0001.

I think the first regex should match the pattern, but it does not.

EDIT: Here is the actual HTML page and the way to retrieve the text:

# Import the libraries
import requests
from bs4 import BeautifulSoup
import re
url = "https://www.sec.gov/Archives/edgar/data/40545/000004054513000036/geform10k2012.htm"
HEADERS = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"}
response = requests.get(url, headers=HEADERS)
print(response.status_code)

page = response.text
# Pre-process the HTML content: remove extra whitespace and combine everything into one line.
page = page.strip()  #<=== remove whitespace at the beginning and end
page = page.replace('\n', ' ') #<=== replace the \n (newline) character with a space
page = page.replace('\r', '') #<=== remove the \r (carriage return) character
page = page.replace('&nbsp;', ' ') #<=== replace "&nbsp;" (HTML entity for a non-breaking space) with a space
page = page.replace('&#160;', ' ') #<=== replace "&#160;" (numeric form of the same entity) with a space
page = page.replace('\xa0', ' ') #<=== replace the non-breaking space character itself with a space
page = page.replace('/s/', ' ') #<=== replace the literal text "/s/" with a space
while '  ' in page:
    page = page.replace('  ', ' ') #<=== collapse repeated spaces into one

What if you change your regex:

regexs = ('Item 1\.\s*Business\/(.*)',
          'Item 1\.\s*Business\.\s*<\/b(.+?)Item 1A\.\s*Risk Factors\.\s*<\/b')

Does it work?
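
It may also be worth checking pattern 1 against the sample itself. In the sample, the style attribute ends with 'FONT-WEIGHT: bold">', with no semicolon after 'bold', while pattern 1 requires 'bold;">'. Below is a minimal sketch of that check on a shortened stand-in for the sample. The page string and the names strict and relaxed are made up for illustration, and the real filing may use yet another attribute layout:

```python
import re

# Shortened stand-in for the sample in the question (same tag shape, assumed).
page = (
    '<font style="FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1. Business/</font> '
    'Body text of Item 1. '
    '<font style="FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1A. Risk Factors</font>'
)

flags = re.IGNORECASE | re.DOTALL

# Pattern 1 from the question: requires a literal ';' right after 'bold'.
strict = re.search(r'bold;">\s*Item 1\.(.+?)bold;">\s*Item 1A\.', page, flags)

# Same idea with the semicolon made optional; the closing part also consumes
# the opening '<font ...' of the second heading so it stays out of group 1.
relaxed = re.search(
    r'bold;?">\s*Item 1\.(.+?)<[^<>]*bold;?">\s*Item 1A\.', page, flags
)

print(strict)             # no match on this sample
print(relaxed.group(1))   # text between the two headings, tags still included
```

Group 1 of the relaxed match still contains the closing </font> of the first heading, so passing it through BeautifulSoup as in the question would still be needed to get plain text.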

Something like as follows?

import re
page =  """
<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1. Business/</font> Unless otherwise indicated by the context, we use the terms "GE" and "GECC" on the basis of consolidation described in Note 1 to the consolidated financial statements in Part II, Item 8. "Financial Statements and Supplementary Data" of this Form 10-K Report. Also, unless otherwise indicated by the context, "General Electric" means the parent company, General Electric Company (the Company).

General Electric’s address is 1 River Road, Schenectady, NY 12345-6999; we also maintain executive offices at 3135 Easton Turnpike, Fairfield, CT 06828-0001.

<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1A. Risk Factors</font>"""

data = re.search(r'Item 1\. Business/</font> (.*)(<font(.*)">Item 1A\. Risk Factors)', page, flags=re.DOTALL).group(1)
print(data)

I would first "parse" the HTML by greedily isolating all sequences of the type

<font[^>]*>([^<>]*)</font>([^<>]+)

which would give me something like,

( 'Item 1. Business/', 'Unless otherwise indicated ... CT 06828-0001.' ),
( 'Item 1A. Risk Factors', '...')

This also takes care of the problem indicated by your comment, "sometimes "Item 1 Business" and "Item 1A Risk Factors" are used within the text". Body text can only ever land in the second element of each tuple, so when looking for the headings you simply ignore that element.

Then I would check what is in the first element of each match to recognize "Item 1." vs "Item 1A.". The capture cycle would start as soon as it found the first keyword, skipping the keyword itself, and stop on finding the second.
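
A rough sketch of that two-step idea, on a shortened stand-in for the sample. The page string and the variable names are made up for illustration, and the pattern assumes the body text after each heading contains no further tags, which may not hold for the full filing:

```python
import re

# Shortened stand-in for the sample (assumed shape: heading in <font>, then plain text).
page = (
    '<font style="FONT-WEIGHT: bold">Item 1. Business/</font> First paragraph. '
    'Second paragraph. '
    '<font style="FONT-WEIGHT: bold">Item 1A. Risk Factors</font> Risk text.'
)

# Step 1: one (heading, text) tuple per <font>heading</font> followed by tag-free text.
pairs = re.findall(r'<font[^>]*>([^<>]*)</font>([^<>]+)', page)

# Step 2: start collecting at the "Item 1." heading, stop at the "Item 1A." heading.
collected = []
capturing = False
for heading, text in pairs:
    heading = heading.strip()
    if re.match(r'Item\s+1A\.', heading, re.IGNORECASE):
        break
    if re.match(r'Item\s+1\.', heading, re.IGNORECASE):
        capturing = True
    if capturing:
        collected.append(text.strip())

section = ' '.join(collected)
print(section)
```

Because the keywords are only tested against the heading element, an "Item 1A" mentioned inside the body text does not end the capture early.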

Sooo, I tried NOT TO USE "<font>" in the regex because you said it could vary, so I hope this works. In your scenario, though, there are many ways to break the regex, because XML in many cases, and definitely in your case, shouldn't really be parsed using regex.

>>> import re



>>> string  = '''
<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1. Business/</font> Unless otherwise indicated by the context, we use the terms "GE" and "GECC" on the basis of consolidation described in Note 1 to the consolidated financial statements in Part II, Item 8. "Financial Statements and Supplementary Data" of this Form 10-K Report. Also, unless otherwise indicated by the context, "General Electric" means the parent company, General Electric Company (the Company).

General Electric’s address is 1 River Road, Schenectady, NY 12345-6999; we also maintain executive offices at 3135 Easton Turnpike, Fairfield, CT 06828-0001.

<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1A. Risk Factors</font>'''




>>> result = re.findall(r'Item[\s]*1.[\s]*Business[/<]*[\S]*?[>]*[\s]+([\S\s]+?)[/<]+[\S\s]*?[>]*?Item 1A. Risk Factors', string)

#Output
>>> print(result[0])
Unless otherwise indicated by the context, we use the terms "GE" and "GECC" on the basis of consolidation described in Note 1 to the consolidated financial statements in Part II, Item 8. "Financial Statements and Supplementary Data" of this Form 10-K Report. Also, unless otherwise indicated by the context, "General Electric" means the parent company, General Electric Company (the Company).

General Electric’s address is 1 River Road, Schenectady, NY 12345-6999; we also maintain executive offices at 3135 Easton Turnpike, Fairfield, CT 06828-0001.

So, you're likely in for a world of hurt given the full text of the page. In all honesty, your description of the problem is very misleading, but anyway, this might be what you're looking for, BUT IT'S MASSIVE:

>>> import re
>>> import requests


>>> page = requests.get("https://www.sec.gov/Archives/edgar/data/40545/000004054513000036/geform10k2012.htm").text



>>> segment_of_page = re.findall(r'(?i)align=[\"]*center[\"]*[\S\ ]+?Part[\s]*I(?!I)[\S\s]+?Item[\S\s]*?1(?![\d]+)[\S\s]{1,50}Business[\S\s]{40,}?>Item[\S\s]{1,50}1A\.[\S\s]{1,50}(?=Risk)', page)



>>> parsed_data_sets = []

>>> # Split each long enough segment into the text fragments that sit between tags
>>> for segment in segment_of_page:
...     if len(segment) > 35:
...         parsed_data = re.findall(r'(?:<[\S\s]+?>)+([\S\s]+?)(?=<[\S\s]+?>)+', segment)
...         parsed_data_sets.extend(parsed_data)

>>> # Print only the fragments longer than 35 characters
>>> for parsed_data in parsed_data_sets:
...     if len(parsed_data) > 35:
...         print('\n\n\n===============\n\n')
...         print(parsed_data)





#Output
===============


Unless otherwise indicated by the context, we use the terms &#8220;GE&#8221; and &#8220;GECC&#8221; on the basis of consolidation described in Note 1 to the consolidated financial statements in Part II, Item 8. &#8220;Financial Statements and Supplementary Data&#8221; of this Form 10-K Report. Also, unless otherwise indicated by the context, &#8220;General Electric&#8221; means the parent company, General Electric Company (the Company).


===============


General Electric&#8217;s address is 1 River Road, Schenectady, NY 12345-6999; we also maintain executive offices at 3135 Easton Turnpike, Fairfield, CT 06828-0001.


===============


We are one of the largest and most diversified infrastructure and financial services corporations in the world. With products and services ranging from aircraft engines, power generation, oil and gas production equipment, and household appliances to medical imaging, business and consumer financing and industrial products, we serve customers in more than 100 countries and employ approximately 305,000 people worldwide. Since our incorporation in 1892, we have developed or acquired new technologies and services that have broadened and changed considerably the scope of our activities.


===============

Some of the document changed since you last extracted a string, but let me know if this works.
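
The fragments printed above still contain numeric character references such as &#8220;. If plain text is wanted, the standard library's html.unescape can decode them. A small sketch on a string shaped like that output (the raw variable is just an example, not part of the code above):

```python
import html

# Example fragment in the same shape as the printed output above.
raw = 'we use the terms &#8220;GE&#8221; and &#8220;GECC&#8221;. General Electric&#8217;s address is 1 River Road'

# html.unescape converts named and numeric character references to characters.
text = html.unescape(raw)
print(text)
```

The same call could be applied to each entry of parsed_data_sets before printing.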

Comments
  • What is your expected output?
  • Please see the edit
  • Are the output always in between Item 1 Business and Item 1A Risk factors?
  • Yes, almost always, however, if I don't use the tags, I might get wrong matches because sometimes "Item 1 Business" and "Item 1A Risk Factors" are used within the text
  • Have we already forgotten? stackoverflow.com/a/1732454/1428679
  • Sometimes there are different titles written with bold in between, that's why I also need to specify Item 1A Risk factors
  • Maybe it is better if I share the full code and the page
  • Maybe sharing the actual page and full code would work better
  • I am aware of the fact that regex is not the best solution for HTML in general; however, I don't have any other option because the HTML structure of the pages changes on the Edgar website.
  • Yes, I read your post, which is why I made a regex solution anyway. I've been in several situations where regex is the only choice for xml and find the Python XML libraries HORRIBLE so I totally understand. See if my solution works. Let me know.
  • It works for the example but not for the real page itself. In my opinion, this one should work for the page: 'bold;\">\sItem 1\.(.+?)bold;\">\sItem 1A\.', but it does not work...
  • Could you post the other examples so I can see specifically what I'd be building against? It's kind of hard to get an idea without seeing all of the original text or at least more examples.