Need to extract tabular data from a text file in Python 3

I have output from a Quantum Chemistry program from which I wish to extract tabular data for input into a Python port of a FORTRAN program I wrote about 25 years ago.

Some of the output files are rather long, as many as 6000 lines, which precludes the use of a spreadsheet for processing.

A typical table is of the form:

                             CARTESIAN COORDINATES

   1    C        0.011987266    -0.003842185     0.006578784
   2    H        1.097152909    -0.003956163     0.013339310
   3    H       -0.349612312     1.019316731     0.001903075
   4    H       -0.344276148    -0.517463019    -0.880495291
   5    H       -0.355315644    -0.513266496     0.891567896

I'm not asking for someone to write the Python code for me, but rather to give me some guidance through the labyrinth of available code.

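I suggest you look into np.genfromtxt. The following code snippet reads the example data from your question, stored in a file called data.txt. Note that the 'S1' field holds a single byte, so a two-letter element symbol such as Cl would be truncated; use 'S2' or 'U2' if your files contain such atoms.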

import numpy as np
data = np.genfromtxt('data.txt', skip_header=2, dtype=[('id', 'i8'),('label','S1'),('x','f8'),('y','f8'),('z','f8')])
print(data)

Output:

 [(1, b'C',  0.01198727, -0.00384219,  0.00657878)
 (2, b'H',  1.09715291, -0.00395616,  0.01333931)
 (3, b'H', -0.34961231,  1.01931673,  0.00190307)
 (4, b'H', -0.34427615, -0.51746302, -0.88049529)
 (5, b'H', -0.35531564, -0.5132665 ,  0.8915679 )]

I would use readlines and split.

cc = 'CARTESIAN_COORDINATES.txt'

with open(cc) as data:
    lines = data.readlines()[2:] # skip first two lines
    for line in lines:
        ls = line.split()
        a, b, c, d, e = int(ls[0]), ls[1], float(ls[2]), float(ls[3]), float(ls[4])
        print(a, b, c, d, e)

Output:

1 C 0.011987266 -0.003842185 0.006578784
2 H 1.097152909 -0.003956163 0.01333931
3 H -0.349612312 1.019316731 0.001903075
4 H -0.344276148 -0.517463019 -0.880495291
5 H -0.355315644 -0.513266496 0.891567896
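
Since the table sits somewhere inside a much longer output file, the same split-based idea can be extended to look for the header first. The following is only a sketch under some assumptions: the function names parse_row and extract_tables are made up for illustration, every table is assumed to start after a line containing "CARTESIAN COORDINATES", and a table is assumed to end at the first line that is not a five-column row.

```python
def parse_row(line):
    """Return (index, symbol, x, y, z), or None if the line is not a table row."""
    fields = line.split()
    if len(fields) != 5:
        return None
    try:
        return (int(fields[0]), fields[1],
                float(fields[2]), float(fields[3]), float(fields[4]))
    except ValueError:
        return None


def extract_tables(lines, header="CARTESIAN COORDINATES"):
    """Yield each table that follows a header line as a list of row tuples."""
    table = None                      # None means "not inside a table"
    for line in lines:
        if header in line:
            if table:                 # a new header closes any open table
                yield table
            table = []
            continue
        if table is None:
            continue                  # unrelated output, ignore it
        row = parse_row(line)
        if row is not None:
            table.append(row)
        elif table:                   # first non-row line after data ends the table
            yield table
            table = None
    if table:                         # table ran to the end of the input
        yield table


# Made-up sample standing in for a long program output
sample = """\
 some unrelated program output
                             CARTESIAN COORDINATES

   1    C        0.011987266    -0.003842185     0.006578784
   2    H        1.097152909    -0.003956163     0.013339310

 more unrelated output 1 2 3
"""
tables = list(extract_tables(sample.splitlines()))
for row in tables[0]:
    print(*row)
```

Because extract_tables only iterates over lines, an open file object can be passed instead of sample.splitlines(), so a 6000-line file does not have to be read into memory at once.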

Regular expressions are built to extract things from data. If your tables are always well defined, you could extract them with one, for example: https://regex101.com/r/QUT2o3/2

import re

regex = r"(\d+ +\w+ (?: +-?\d+\.\d+){3}.+?(?:\n|\Z){2})+"

test_str = ("                      CARTESIAN COORDINATES\n\n"
    "   1    C        0.011987266    -0.003842185     0.006578784\n"
    "   2    H        1.097152909    -0.003956163     0.013339310\n"
    "   3    H       -0.349612312     1.019316731     0.001903075\n"
    "   4    H       -0.344276148    -0.517463019    -0.880495291\n"
    "   5    H       -0.355315644    -0.513266496     0.891567896\n\n\n\n"
    "                      CARTESIAN COORDINATES\n\n"
    "   1    C        0.011987266    -0.003842185     0.006578784\n"
    "   2    H        1.097152909    -0.003956163     0.013339310\n"
    "   3    H       -0.349612312     1.019316731     0.001903075\n"
    "   4    H       -0.344276148    -0.517463019    -0.880495291\n"
    "   5    H       -0.355315644    -0.513266496     0.891567896\n\n\n"
    "                      CARTESIAN COORDINATES\n\n"
    "   1    C        0.011987266    -0.003842185     0.006578784\n"
    "   2    H        1.097152909    -0.003956163     0.013339310\n"
    "   3    H       -0.349612312     1.019316731     0.001903075\n"
    "   4    H       -0.344276148    -0.517463019    -0.880495291\n"
    "   5    H       -0.355315644    -0.513266496     0.891567896")

Apply regex:

matches = re.findall(regex, test_str, re.MULTILINE | re.DOTALL)
for m in matches:
    print('\n'.join(x.strip() for x in m.splitlines()))

Output:

1    C        0.011987266    -0.003842185     0.006578784
2    H        1.097152909    -0.003956163     0.013339310
3    H       -0.349612312     1.019316731     0.001903075
4    H       -0.344276148    -0.517463019    -0.880495291
5    H       -0.355315644    -0.513266496     0.891567896

1    C        0.011987266    -0.003842185     0.006578784
2    H        1.097152909    -0.003956163     0.013339310
3    H       -0.349612312     1.019316731     0.001903075
4    H       -0.344276148    -0.517463019    -0.880495291
5    H       -0.355315644    -0.513266496     0.891567896

1    C        0.011987266    -0.003842185     0.006578784
2    H        1.097152909    -0.003956163     0.013339310
3    H       -0.349612312     1.019316731     0.001903075
4    H       -0.344276148    -0.517463019    -0.880495291
5    H       -0.355315644    -0.513266496     0.891567896
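
If you need typed values rather than the raw text of each table, one option is to match row by row with named groups. This is a rough sketch, not the pattern from the regex101 link: the name ROW and the group names are invented here, it assumes plain decimal numbers (no exponent notation), and it collects the rows of all tables into a single list.

```python
import re

# One table row: integer index, element symbol, three signed decimals.
# Hypothetical pattern; whitespace inside it is ignored because of re.VERBOSE.
ROW = re.compile(
    r"""^[ \t]*(?P<index>\d+)
        [ \t]+(?P<symbol>[A-Za-z]+)
        [ \t]+(?P<x>-?\d+\.\d+)
        [ \t]+(?P<y>-?\d+\.\d+)
        [ \t]+(?P<z>-?\d+\.\d+)[ \t]*$""",
    re.MULTILINE | re.VERBOSE,
)

text = (
    "                      CARTESIAN COORDINATES\n\n"
    "   1    C        0.011987266    -0.003842185     0.006578784\n"
    "   2    H        1.097152909    -0.003956163     0.013339310\n"
)

# Convert every matching line into an (index, symbol, x, y, z) tuple
rows = [
    (int(m["index"]), m["symbol"], float(m["x"]), float(m["y"]), float(m["z"]))
    for m in ROW.finditer(text)
]
print(rows)
```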
