I'm working on a project where I have to read scanned images of financial statements. I used Tesseract 4 to convert the image to text output, which looks like this (here is a snippet):

REVENUE 9,000,000 900,000

COST OF SALES 900,000 900,000

GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000

I would like to break each row above into a list of three entries, where the first entry is the text and the second and third entries are the numbers. For example, the first row would look something like this:

[[REVENUE], [9,000,000], [900,000]]

I came across this Stack Overflow post, where someone uses re.match() together with the .groups() method to find the pattern: How to split strings into text and number?

I'm new to regex and I'm struggling to understand the syntax and documentation. I'm working from a cheat sheet for now, but I'm having a tough time figuring out how to approach this. Please help.

I wrote this regex based on your first expected output, but I am not sure what output you want for your third line.

  1. ([A-Za-z ]+)(?=\d|\S) matches the name, up to the point where a digit or another non-whitespace character follows.
  2. .*? lazily skips the text we do not care about.
  3. ([\d,]+)\s([\d,]+|(?=-\n|-$)) matches one or two groups of digits. If there is only one group, it must be followed by whitespace and a "-" that ends the line or the text, and the last capture is then empty.
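
The three parts above can also be written out as one commented pattern. This is only a sketch using Python's re.VERBOSE flag; the name ROW_PATTERN and the short sample string are illustrative and not part of the original code.

```python
import re

# Same pattern as described above, spread over several lines with re.VERBOSE.
# ROW_PATTERN is an illustrative name. The space inside [A-Za-z ] is kept
# because whitespace inside a character class is not ignored in verbose mode.
ROW_PATTERN = re.compile(r"""
    ([A-Za-z ]+)     # 1. name: letters and spaces
    (?=\d|\S)        #    stop where a digit or other non-space character follows
    .*?              # 2. lazily skip text we do not care about
    ([\d,]+)         # 3. first group of digits and commas
    \s               #    one whitespace character
    (
        [\d,]+       #    second group of digits and commas
      | (?=-\n|-$)   #    or nothing, if a "-" ends the line or the text
    )
""", re.VERBOSE)

sample = "COST OF SALES 900,000 900,000\nBusiness taxes 999 -"
matches = ROW_PATTERN.findall(sample)
print(matches)
# [('COST OF SALES ', '900,000', '900,000'), ('Business taxes ', '999', '')]
```

Note that the captured name keeps its trailing space, so you may want to call .strip() on it.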

Test code (edited):

import re

regex = r"([A-Za-z ]+)(?=\d|\S).*?([\d,]+)\s([\d,]+|(?=-\n|-$))"

text = """
REVENUE 9,000,000 900,000

COST OF SALES 900,000 900,000

GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000

Business taxes 999 -
"""

print(re.findall(regex, text))
# [('REVENUE ', '9,000,000', '900,000'), ('COST OF SALES ', '900,000', '900,000'), ('GROSS PROFIT ', '900,000', '900,000'), ('Business taxes ', '999', '')]
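
If the label of the third row should keep its parenthetical, one option is to match each line on its own. The following is a minimal sketch under that assumption, not a replacement for the pattern above; parse_rows, the group names, and the choice to keep "-" as a literal string are all illustrative.

```python
import re

# Hypothetical variant: the label is everything before the last two columns,
# and a column is either digits with commas or a lone "-".
ROW = re.compile(r"(?P<label>.+?)\s+(?P<first>[\d,]+|-)\s+(?P<second>[\d,]+|-)")

def parse_rows(text):
    """Return [label, first, second] for each line that looks like a row."""
    rows = []
    for line in text.splitlines():
        match = ROW.fullmatch(line.strip())
        if match:  # blank lines and non-matching lines are skipped
            rows.append([match["label"], match["first"], match["second"]])
    return rows

sample = """
GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000

Business taxes 999 -
"""
print(parse_rows(sample))
# [['GROSS PROFIT (90%; 2016 - 90%)', '900,000', '900,000'], ['Business taxes', '999', '-']]
```

Because fullmatch() must consume the whole line, the digits inside the parentheses cannot be mistaken for the number columns.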

Regexes are overkill for this problem as you've stated it.

Splitting each line with str.split() and joining every item before the last two is better suited to this.

lines = ["REVENUE 9,000,000 900,000",
         "COST OF SALES 900,000 900,000",
         "GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000"]
out = []
for line in lines:
    parts = line.split()
    # A valid row needs a label plus two number columns.
    if len(parts) < 3:
        raise ValueError("expected a label and two numbers: %r" % line)
    # Everything before the last two items is the label.
    out.append([' '.join(parts[:-2]), parts[-2], parts[-1]])

out will contain

 [['REVENUE', '9,000,000', '900,000'], 
  ['COST OF SALES', '900,000', '900,000'], 
  ['GROSS PROFIT (90%; 2016 - 90%)', '900,000', '900,000']]
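
The same split can be sketched in one call with str.rsplit(), which splits from the right and leaves the label in one piece. split_row is a hypothetical helper name, and returning None for short lines is an assumption made for this sketch.

```python
def split_row(line):
    """Split a row into [label, number, number] using two splits from the right."""
    parts = line.strip().rsplit(None, 2)  # sep=None splits on runs of whitespace
    if len(parts) < 3:
        return None  # assumption: lines without two number columns are ignored
    return parts

print(split_row("GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000"))
# ['GROSS PROFIT (90%; 2016 - 90%)', '900,000', '900,000']
```

Unlike split() followed by join(), this leaves any repeated spaces inside the label as they are.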

If the label text needs further extraction, you could use regexes, or you could simply look at the items that make up the label (everything in parts before the last two entries) and process them based on the words and numbers there.
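
As one possible illustration of that further step, the sketch below separates a trailing parenthetical from the label. split_label is a hypothetical helper, and the assumption that any extra detail sits in parentheses at the end of the label comes only from the sample row.

```python
import re

def split_label(label):
    """Return (name, note); note is None when there is no trailing parenthetical."""
    match = re.fullmatch(r"(.*?)\s*\((.*)\)", label)
    if match:
        return match.group(1), match.group(2)
    return label, None

print(split_label("GROSS PROFIT (90%; 2016 - 90%)"))
# ('GROSS PROFIT', '90%; 2016 - 90%')
print(split_label("REVENUE"))
# ('REVENUE', None)
```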

