Reading financial statements using REGEX
I'm working on a project where I have to read scanned images of financial statements. I used Tesseract 4 to convert the images to text. Here is a snippet of the output:
REVENUE 9,000,000 900,000
COST OF SALES 900,000 900,000
GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000
I would like to break the above into a list of three entries, where the first entry is the text, then the second and third entries would be the numbers. For example the first row would look something like this:
[[REVENUE], [9,000,000], [900,000]]
I came across this Stack Overflow post, where someone uses re.match() with the .groups() method to find the pattern: How to split strings into text and number?
I'm just being introduced to regex, and I'm struggling to understand the syntax and documentation. I'm using a cheat sheet for now, but I'm having a tough time figuring out how to go about this. Please help.
I wrote this regex based on your first expected output, but I am not sure what output you want for your third line.
([A-Za-z ]+)(?=\d|\S) matches the name, up to the first digit or symbol.
.*? skips the part of the string we do not care about.
([\d,]+)\s([\d,]+|(?=-\n|-$)) matches one or two groups of digits. If there is only one number, the second group is empty, and the number must be followed by a - and then a newline or the end of the text.
import re

regex = r"([A-Za-z ]+)(?=\d|\S).*?([\d,]+)\s([\d,]+|(?=-\n|-$))"

text = """
REVENUE 9,000,000 900,000
COST OF SALES 900,000 900,000
GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000
Business taxes 999 -
"""

print(re.findall(regex, text))
# [('REVENUE ', '9,000,000', '900,000'), ('COST OF SALES ', '900,000', '900,000'), ('GROSS PROFIT ', '900,000', '900,000'), ('Business taxes ', '999', '')]
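As a follow-up sketch, the tuples returned by re.findall can be cleaned up into the three-entry shape the question asks for. The helper name to_int and the choice to map an empty match (the "-" case) to None are assumptions made for illustration; they are not part of the answer above.

```python
import re

# Same pattern as in the answer above, unchanged.
regex = r"([A-Za-z ]+)(?=\d|\S).*?([\d,]+)\s([\d,]+|(?=-\n|-$))"

text = """
REVENUE 9,000,000 900,000
COST OF SALES 900,000 900,000
GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000
Business taxes 999 -
"""


def to_int(value):
    """Hypothetical helper: '9,000,000' -> 9000000, '' -> None."""
    return int(value.replace(",", "")) if value else None


# Strip the trailing space from each label and convert both number columns.
rows = [
    [label.strip(), to_int(first), to_int(second)]
    for label, first, second in re.findall(regex, text)
]

for row in rows:
    print(row)
# ['REVENUE', 9000000, 900000]
# ['COST OF SALES', 900000, 900000]
# ['GROSS PROFIT', 900000, 900000]
# ['Business taxes', 999, None]
```

Note that this pattern drops the parenthesised part of the GROSS PROFIT label, so it only fits if that text is not needed.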
Regexes are overkill for this problem as you've stated it. text.split() and a join of the items before the last two is better suited to this.
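lines = [
    "REVENUE 9,000,000 900,000",
    "COST OF SALES 900,000 900,000",
    "GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000",
]

out = []
for line in lines:
    parts = line.split()
    if len(parts) < 3:
        # A valid row needs a label plus two numeric columns.
        raise ValueError(f"expected a label and two numbers: {line!r}")
    if len(parts) == 3:
        out.append(parts)
    else:
        # Everything before the last two items is the label.
        out.append([' '.join(parts[0:len(parts)-2]), parts[-2], parts[-1]])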
out will contain
[['REVENUE', '9,000,000', '900,000'], ['COST OF SALES', '900,000', '900,000'], ['GROSS PROFIT (90%; 2016 - 90%)', '900,000', '900,000']]
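A shorter variant of the same idea, offered as a sketch rather than as part of the original answer, is to let str.rsplit do the work: with maxsplit=2 it splits at most twice from the right, so the label stays in one piece. It assumes, as above, that every row ends in exactly two number columns.

```python
lines = [
    "REVENUE 9,000,000 900,000",
    "COST OF SALES 900,000 900,000",
    "GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000",
]

out = []
for line in lines:
    # Split at most twice, starting from the right-hand side.
    parts = line.strip().rsplit(maxsplit=2)
    if len(parts) < 3:
        raise ValueError(f"expected a label and two numbers: {line!r}")
    out.append(parts)

print(out)
```

Unlike split() followed by join, this does not collapse runs of whitespace inside the label, which may matter for noisy OCR output.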
If the label text needs further extraction, you could use regexes, or you could simply look at the items in parts[0:len(parts)-2] and process them based on the words and numbers there.
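One possible way to process those label items, sketched under a loud assumption: purely alphabetic tokens are treated as the line-item name, and everything else (percentages, years, punctuation) is kept aside. The names name_words and extras are illustrative only, and the rule would need adjusting for labels such as hyphenated words.

```python
parts = "GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000".split()
label_parts = parts[0:len(parts) - 2]

name_words = []
extras = []
for token in label_parts:
    # Assumption: purely alphabetic tokens belong to the line-item name.
    if token.isalpha():
        name_words.append(token)
    else:
        extras.append(token)

name = " ".join(name_words)
print(name)    # GROSS PROFIT
print(extras)  # ['(90%;', '2016', '-', '90%)']
```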