Extract lines from text file using words as key from another text file

I am working on NLP and need to pre-process some data. I have two input files and need to generate an output file containing the intersection of the two, where the first file acts as the key: only the lines of the second file whose first word appears in the first file should be kept.

file 1 contains a list of words, one per line:

aided
from
the
poetry
to

file 2 :

the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581

of 0.70853 0.57088 -0.4716 0.18048 0.54449 0.72603 0.18157 -0.52393 0.10381 -0.17566 0.078852 -0.36216 -0.11829 -0.83336 0.11917 -0.16605 0.061555 -0.012719 -0.56623 0.013616 0.22851 -0.14396 -0.067549 -0.38157 -0.23698 -1.7037 -0.86692 -0.26704 -0.2589 0.1767 3.8676 -0.1613 -0.13273 -0.68881 0.18444 0.0052464 -0.33874 -0.078956 0.24185 0.36576 -0.34727 0.28483 0.075693 -0.062178 -0.38988 0.22902 -0.21617 -0.22562 -0.093918 -0.80375

to 0.68047 -0.039263 0.30186 -0.17792 0.42962 0.032246 -0.41376 0.13228 -0.29847 -0.085253 0.17118 0.22419 -0.10046 -0.43653 0.33418 0.67846 0.057204 -0.34448 -0.42785 -0.43275 0.55963 0.10032 0.18677 -0.26854 0.037334 -2.0932 0.22171 -0.39868 0.20912 -0.55725 3.8826 0.47466 -0.95658 -0.37788 0.20869 -0.32752 0.12751 0.088359 0.16351 -0.21634 -0.094375 0.018324 0.21048 -0.03088 -0.19722 0.082279 -0.09434 -0.073297 -0.064699 -0.26044

and 0.26818 0.14346 -0.27877 0.016257 0.11384 0.69923 -0.51332 -0.47368 -0.33075 -0.13834 0.2702 0.30938 -0.45012 -0.4127 -0.09932 0.038085 0.029749 0.10076 -0.25058 -0.51818 0.34558 0.44922 0.48791 -0.080866 -0.10121 -1.3777 -0.10866 -0.23201 0.012839 -0.46508 3.8463 0.31362 0.13643 -0.52244 0.3302 0.33707 -0.35601 0.32431 0.12041 0.3512 -0.069043 0.36885 0.25168 -0.24517 0.25381 0.1367 -0.31178 -0.6321 -0.25028 -0.38097

The output that I want in a new file (file 3) should be :

the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581

to 0.68047 -0.039263 0.30186 -0.17792 0.42962 0.032246 -0.41376 0.13228 -0.29847 -0.085253 0.17118 0.22419 -0.10046 -0.43653 0.33418 0.67846 0.057204 -0.34448 -0.42785 -0.43275 0.55963 0.10032 0.18677 -0.26854 0.037334 -2.0932 0.22171 -0.39868 0.20912 -0.55725 3.8826 0.47466 -0.95658 -0.37788 0.20869 -0.32752 0.12751 0.088359 0.16351 -0.21634 -0.094375 0.018324 0.21048 -0.03088 -0.19722 0.082279 -0.09434 -0.073297 -0.064699 -0.26044

The following code runs without any error, but the output file that I am getting is empty:

f1 = open('input_key.txt', 'r')
f2 = open('input_file.txt', 'r')
f3 = open('output_file.txt', 'w')

for word in f1.readlines():
    for line in f2.readlines():
        if word is line.strip().split()[0]:     
            f3.write(line)

f1.close()
f2.close()
f3.close()

I am unable to understand what is wrong here. Any help is appreciated. file2 and file3 have no extra lines in between; I just added those to make the question readable.

UPDATE: Thanks to the comments, I now know that the if statement is evaluating to False. Is there a way to fix this, or an alternative way to perform my task?

This will do what you want it to do:

f1 = open('input_key.txt', 'r')
f2 = open('input_file.txt', 'r')
f3 = open('output_file.txt', 'a')

for word in f1.readlines():
    for line in f2.readlines():
        if line != '\n' and word.strip() == line.strip().split()[0]:
            f3.write(line)
    f2.seek(0)

f1.close()
f2.close()
f3.close()

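You need to reset the file position at the end of each pass of the outer loop with f2.seek(0). The first call to f2.readlines() consumes the whole file, so every later call returns an empty list unless the position is moved back to the start.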
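
The effect of f2.seek(0) can be seen in isolation with a small sketch. It uses io.StringIO as a stand-in for a file opened with open(), and the three sample lines are shortened, made-up versions of the data in the question:

```python
import io

# io.StringIO stands in for a real file object; both keep track of a read position.
f2 = io.StringIO("the 0.418\nof 0.70853\nto 0.68047\n")

first_pass = f2.readlines()   # reads all three lines, leaving the position at the end
second_pass = f2.readlines()  # nothing left to read, so this is an empty list

f2.seek(0)                    # move the position back to the start
third_pass = f2.readlines()   # the three lines are available again

print(len(first_pass), len(second_pass), len(third_pass))
```

This is why, without the seek, only the first word of the key file is ever compared against any lines.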

I would also open output_file.txt in a (append) mode. To clear it out each time you run the script, delete it at the beginning with:

import os
if os.path.exists("output_file.txt"):  # os.remove raises FileNotFoundError if the file is missing
    os.remove("output_file.txt")

I would also use == instead of is. The is operator tests whether two names refer to the same object, not whether two values are equal.
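
A minimal sketch of the difference, using a made-up key word and a shortened data line. Whether is happens to return True for two equal strings is an implementation detail of the interpreter, so it should not be relied on:

```python
word = "the\n".strip()                   # a key word as read from the key file
first = "the 0.418 0.24968".split()[0]   # first token of a data line

same_value = (word == first)    # compares contents
same_object = (word is first)   # compares identity; typically False in CPython here

print(same_value, same_object)
```

Both strings hold the text "the", so == succeeds, while is only succeeds if both names point at the very same object.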

EDIT: I would look at wiesion's answer below about list comprehension for some tips on writing cleaner code

The keyword is is the identity operator: it checks whether two names refer to the same object.

== is the equality operator: it checks whether two values are equal.

if word is line.strip().split()[0]: 

Change it to

if word == line.strip().split()[0]: 

I just copied your files and wrote the code as I would if I were given your requirements:

with open("words.txt", "r") as word_file:
    words = [word.strip() for word in word_file.read().splitlines() if word.strip()]

with open("feed.txt", "r") as feed_file:
    lines = [line.strip() for line in feed_file.read().splitlines() if line.strip()]

with open('result.txt', 'w') as result_file:
    result_file.write("\n".join([line for line in lines if line.split()[0] in words]))

I am using list comprehensions here to avoid the nested loops.

If your word and input files are large, you should avoid reading the entire file into memory with comprehensions (thanks @Bayko for the reminder) and switch to:

words = []
with open("words.txt", "r") as word_file:
    # This reads the words file line by line instead of reading the entire file
    for word in word_file:
        word = word.strip()
        if word:
            words.append(word)

with open('result.txt', 'w') as result_file:
    with open("feed.txt", "r") as feed_file:
        # This reads the input file line by line instead of reading the entire file
        for line in feed_file:
            line = line.strip()
            if not line:
                continue
            if line.split()[0] in words:
                result_file.write(line + "\n")
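
As a variation on the streaming version above, the key words can be kept in a set rather than a list, so that each membership test does not scan every word. This is only a sketch: the function name filter_by_keys and its parameters are hypothetical, not part of the original answer.

```python
def filter_by_keys(key_path, data_path, out_path):
    """Copy the lines of data_path whose first token is listed in key_path."""
    with open(key_path, "r") as key_file:
        # A set gives fast membership tests; blank lines are skipped.
        keys = {line.strip() for line in key_file if line.strip()}

    kept = 0
    with open(data_path, "r") as data_file, open(out_path, "w") as out_file:
        # The data file is read line by line, so it is never held in memory whole.
        for line in data_file:
            tokens = line.split()
            if tokens and tokens[0] in keys:
                out_file.write(line.strip() + "\n")
                kept += 1
    return kept
```

It could be called as filter_by_keys("words.txt", "feed.txt", "result.txt"). As with the versions above, the output follows the order of the data file, not the order of the key file.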

Also, when I run your code locally:

if word is line.strip().split()[0]:     

IndexError: list index out of range

This error is caused by the empty lines. More importantly, f2.readlines() returns nothing after the first pass because you never call f2.seek(0) to reset the position, and == is not the same as is (see @Atterson's answer). With those issues fixed, your code would look like:

f1 = open('words.txt', 'r')
f2 = open('feed.txt', 'r')
f3 = open('result.txt', 'w')

for word in f1.readlines():
    word = word.strip()
    if not word:
        continue
    for line in f2.readlines():
        line = line.strip()
        if not line:
            continue
        if word == line.split()[0]:
            f3.write(line + "\n")
    f2.seek(0)

f1.close()
f2.close()
f3.close()

With both scripts my result.txt looks like

the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581

to 0.68047 -0.039263 0.30186 -0.17792 0.42962 0.032246 -0.41376 0.13228 -0.29847 -0.085253 0.17118 0.22419 -0.10046 -0.43653 0.33418 0.67846 0.057204 -0.34448 -0.42785 -0.43275 0.55963 0.10032 0.18677 -0.26854 0.037334 -2.0932 0.22171 -0.39868 0.20912 -0.55725 3.8826 0.47466 -0.95658 -0.37788 0.20869 -0.32752 0.12751 0.088359 0.16351 -0.21634 -0.094375 0.018324 0.21048 -0.03088 -0.19722 0.082279 -0.09434 -0.073297 -0.064699 -0.26044

You have three empty lines in your second file. After splitting each line, check whether the resulting list has any elements before indexing it; that solves the problem. Alternatively, wrap the lookup in a try/except for IndexError.
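
The try/except variant might look like the following sketch. The helper name first_token is hypothetical, and the sample lines are shortened versions of the data in the question:

```python
def first_token(line):
    """Return the first whitespace-separated token of line, or None if there is none."""
    try:
        return line.split()[0]
    except IndexError:
        # line.split() returned [] because the line was empty or whitespace only
        return None


lines = ["the 0.418 0.24968\n", "\n", "to 0.68047 -0.039263\n"]
tokens = [first_token(line) for line in lines]
print(tokens)
```

A caller can then skip any line for which the helper returns None instead of crashing on the blank ones.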

Comments
  • are u sure your code reaches f3.write(line)
  • I'd strip word as well just to be safe, but my guess is your if statement is evaluating false
  • print everything, and debug step by step. find where the error exactly is
  • When I run this, I'm getting an IndexError at the if statement...
  • @emsimpson92 tried it with strip but still the file is empty
  • Thank you. This is working. Though I am unable to understand how. Could you please elaborate more on the if statement ?
  • Just added more context, but line.strip().split()[0] returns a string so you would need to use == to see if 2 strings equal each other. is will check to see if 2 objects are the same. More info here: stackoverflow.com/questions/13650293/…
  • eh, took the f2.seek(0) from my answer instead of re-opening for each word? ;) Anyways it's not just about cleaner code, it is also about avoiding unnecessary nested loops (and filtering out empty rows, which turns out not be the case here).
  • Yeah I took the seek, it's much better than closing and opening the file over and over.
  • Tried that. Still not working. I actually came to is after == !
  • Might not have solved it, but this definitely wasn't doing you any favors
  • Yes, that was the second issue after not doing a f2.seek(0)
  • Your list comprehension approach is not optimal. What you are doing is storing the entire array in memory which in this case I guess is ok and makes for a cleaner code. However, if the files are big in size you will have to use fseek