Open and read a txt file that is space delimited

I have a space-separated txt file like the following:

2004          Temperature for KATHMANDU AIRPORT       
        Tmax  Tmin
     1  18.8   2.4 
     2  19.0   1.1 
     3  18.3   1.7 
     4  18.3   1.0 
     5  17.8   1.3 

I want to calculate the mean of Tmax and Tmin separately, but I am having a hard time reading the txt file. I tried the following, based on a link I found:

import re
list_b = []
list_d = []

with open('TA103019.95.txt', 'r') as f:
    for line in f:
        list_line = re.findall(r"[\d.\d+']+", line)
        list_b.append(float(list_line[1])) #appends second column
        list_d.append(float(list_line[3])) #appends fourth column

print list_b
print list_d

But it is giving me the error IndexError: list index out of range. What is wrong here?

A simple way to solve this is to use the split() function. Of course, you need to drop the first two lines:

import io

with io.open("path/to/file.txt", mode="r", encoding="utf-8") as f:
    next(f)  # skip the title line
    next(f)  # skip the "Tmax  Tmin" header line
    for line in f:
        print(line.split())
You get:

['1', '18.8', '2.4']
['2', '19.0', '1.1']
['3', '18.3', '1.7']
['4', '18.3', '1.0']
['5', '17.8', '1.3']

Quoting the documentation:

If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.
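For example, a quick sketch using one data line from the file above:

```python
line = "     1  18.8   2.4 \n"

# With no argument, split() collapses runs of whitespace and drops
# leading/trailing whitespace, so no empty strings appear:
print(line.split())      # ['1', '18.8', '2.4']

# Splitting on a single space instead keeps every empty field:
print(line.split(' '))
```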


import re
list_b = []
list_d = []

with open('TA103019.95.txt', 'r') as f:
    for line in f:
        # regex is corrected to match the decimal values only
        list_line = re.findall(r"\d+\.\d+", line) 

        # skip lines (headers, blanks) where two decimal values are not found
        if len(list_line) < 2:
            continue

        # indexes are corrected below
        list_b.append(float(list_line[0]))  # Tmax, the second column
        list_d.append(float(list_line[1]))  # Tmin, the third column

print list_b
print list_d

I have added my answer with some comments in the code itself.

You were getting the index out of range error because list_line contained only a single element (i.e. 2004 from the first line of the file) while you were trying to access indexes 1 and 3 of list_line.
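To see it concretely, here is what the two patterns find on the title line. Note that [\d.\d+']+ is a character class, so it matches runs of digits, dots, plus signs and apostrophes, not decimal numbers:

```python
import re

title = "2004          Temperature for KATHMANDU AIRPORT"

# The original pattern finds a single token on this line, so
# list_line[1] and list_line[3] are out of range:
print(re.findall(r"[\d.\d+']+", title))   # ['2004']

# The corrected pattern requires digits.digits, so it finds nothing
# on the header line, which can then be skipped:
print(re.findall(r"\d+\.\d+", title))     # []
```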


Full Solution

def readit(file_name, start_line=2):  # start_line - where your data starts (2 means the 3rd line, because we count from the 0th line)
    with open(file_name, 'r') as f:
        data = f.read().split('\n')
    data = [i.split(' ') for i in data[start_line:]]
    for i in range(len(data)):
        row = [sub for sub in data[i] if len(sub) != 0]
        if row:  # skip empty lines at the end of the file
            yield int(row[0]), float(row[1]), float(row[2])

iterator = readit('TA103019.95.txt')

index, tmax, tmin = zip(*iterator)

mean_Tmax = sum(tmax)/len(tmax)
mean_Tmin = sum(tmin)/len(tmin)
print('Mean Tmax: ',mean_Tmax)
print('Mean Tmin: ', mean_Tmin)

>>> ('Mean Tmax: ', 18.439999999999998)
>>> ('Mean Tmin: ', 1.5)

Thanks to Dan D. for a more elegant solution.
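For reference, the more compact version suggested in the comments, calling next(f) to skip the headers and yielding directly, might look like this (a sketch; the function name readit is kept from the answer above):

```python
def readit(file_name):
    """Yield (index, tmax, tmin) tuples, skipping the two header lines."""
    with open(file_name, 'r') as f:
        next(f)  # skip the title line
        next(f)  # skip the "Tmax  Tmin" header line
        for line in f:
            row = line.split()  # handles runs of spaces
            if row:             # skip blank lines
                yield int(row[0]), float(row[1]), float(row[2])

# Usage, as in the answer above:
# index, tmax, tmin = zip(*readit('TA103019.95.txt'))
```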

Simplify your life and avoid 're' for this problem.

Perhaps you are reading the header rows by mistake? If the format of the file is fixed, I usually "burn" the header rows with a line read before starting the loop, like:

with open(file_name, 'r') as f:
    f.readline()  # burn the title row
    f.readline()  # burn the "Tmax  Tmin" header row
    for line in f:
        tokens = line.split()   # tokenize the row on runs of whitespace

Then you have a list of tokens, which will be strings that you'll need to convert to int or float or whatever and go from there!

Put in a couple print statements to see what you are picking up...
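Putting that together, the whole task can be done with split() alone. This is a sketch: statistics.mean is standard library, and the two readline() calls assume the two header lines shown in the question:

```python
from statistics import mean

def column_means(file_name):
    tmax, tmin = [], []
    with open(file_name, 'r') as f:
        f.readline()  # burn the title row
        f.readline()  # burn the "Tmax  Tmin" header row
        for line in f:
            tokens = line.split()
            if tokens:  # skip blank lines
                tmax.append(float(tokens[1]))
                tmin.append(float(tokens[2]))
    return mean(tmax), mean(tmin)
```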


  • It is giving me error NameError: name 'io' is not defined
  • Use import io
  • You could have skipped two lines by calling next(f) twice and then in a single for line in f: you could have parsed and yielded each line. This would eliminate both lists data and processed. And the transpose can be done with zip as index, tmax, tmin = zip(*iterator).
  • Thanks. I partly edited the solution, not fully (I didn't know exactly how to implement next(f), but also didn't want to spend too much time on it). Feel free to edit the answer.
  • It is giving me error IndexError: list index out of range
  • It's because we didn't deal with the first two lines, which are not data.