Removing all stopwords defined in a file from a text in another file (Python)

I have two text files:

  1. stopwords.txt --> contains stopwords, one per line
  2. text.txt --> a big document file

I'm trying to remove all occurrences of stopwords (any word in the stopwords.txt file) from the text.txt file without using NLTK (school assignment).

How would I go about doing this? This is my code so far.

import re

with open('text.txt', 'r') as f, open('stopwords.txt','r') as st:
    f_content = f.read()
    #splitting text.txt by non-alphabetic characters
    processed = re.split('[^a-zA-Z]', f_content)

    st_content = st.read()
    #splitting stopwords.txt by new line
    st_list = re.split('\n', st_content)
    #print(st_list) to check it was working

    #what I'm trying to do is: traverse through the text. If stopword appears, 
    #remove it. otherwise keep it. 
    for word in st_list:
        f_content = f_content.replace(word, "")
        print(f_content) 

But when I run the code, it takes forever to output anything, and when it finally does, it just outputs the entire text file. (I'm new to Python, so let me know if I'm doing something fundamentally wrong!)

Here is what I use when I need to remove English stop words. I usually also use the corpus from nltk instead of my own file for stop words.

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
ps = PorterStemmer()

## Remove stop words
stops = set(stopwords.words("english"))
# `text` is assumed to already be a list of word tokens
text = [ps.stem(w) for w in text if w not in stops and len(w) >= 3]
text = list(set(text))  # remove duplicates
text = " ".join(text)

For your special case I would do something like:

stops = list_of_words_from_file
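
For example, a minimal sketch of building that list (as a set, for fast lookups) from the stopwords.txt file named in the question:

with open('stopwords.txt', 'r') as st:
    # read one stopword per line into a set for O(1) membership tests
    stops = set(line.strip() for line in st if line.strip())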

Let me know if this answers your question; I am not sure whether the problem is reading from the file or the stemming.

Edit: To remove all stopwords defined in a file from the text in another file, we can use str.replace():

for word in st_list:
    f_content = f_content.replace(word, "")
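
Note that str.replace() matches substrings, not whole words, so removing 'a' would also strip the 'a' out of 'apple'. A hedged sketch using a word-boundary regex instead, reusing the names from the snippet above:

import re

# remove each stopword only where it appears as a whole word;
# \b marks a word boundary, re.escape guards any special characters
for word in st_list:
    f_content = re.sub(r'\b' + re.escape(word) + r'\b', '', f_content)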

I think this kind of worked... but it's incredibly slow, so if anyone has suggestions on how to make it more efficient, I'd really appreciate it!

import re
from stemming.porter2 import stem as PT


with open('text.txt', 'r') as f, open('stopwords.txt', 'r') as st:

    f_content = f.read()
    # split on runs of non-alphabetic characters and drop the empty
    # strings that re.split leaves behind
    processed = [x.lower() for x in re.split('[^a-zA-Z]+', f_content) if x]

    st_content = st.read()
    # a set gives O(1) membership tests, which keeps the filter fast
    st_list = set(st_content.split())

    # remove stopwords before stemming, so they still match the list
    clean_text = [PT(x) for x in processed if x not in st_list]
    print(clean_text)

spaCy is one of the most versatile and widely used libraries in NLP. We can quickly and efficiently remove stopwords from the given text using spaCy. It has its own list of stopwords that can be imported as STOP_WORDS from the spacy.lang.en.stop_words module. Here’s how you can remove stopwords using spaCy in Python:
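
A minimal sketch of that approach, assuming the en_core_web_sm model is installed (the example sentence is borrowed from the comments below):

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load("en_core_web_sm")

text = "Hello I am from the UK"
# keep only the tokens whose lowercased text is not in spaCy's stopword list
filtered = [token.text for token in nlp(text) if token.text.lower() not in STOP_WORDS]
print(" ".join(filtered))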

Since you are facing performance issues, I would suggest using the subprocess module (available in both Python 2 and Python 3) to call the sed Linux command.

I know Python is really good for this kind of thing (and many others), but if you have a really big text.txt, I would try the old, ugly, and powerful command-line 'sed'.

Try something like:

sed -f stopwords.sed text.txt > output_file.txt
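
As suggested above, this can be launched from Python; a minimal sketch using subprocess (assumes a Unix-like system with sed installed):

import subprocess

# run the sed script over text.txt and write the result to output_file.txt
with open('output_file.txt', 'w') as out:
    subprocess.run(['sed', '-f', 'stopwords.sed', 'text.txt'],
                   stdout=out, check=True)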

For the stopwords.sed file, each stopword must be on a separate line, in the format below:

s|\<xxxxx\>||g

Where 'xxxxx' would be the stopword itself.

s|\<the\>||g

The line above would remove all occurrences of 'the' (without single quotes)
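
Since stopwords.txt already holds one word per line, a short, hedged sketch for generating stopwords.sed in that format (file names taken from the question):

# turn stopwords.txt (one word per line) into a sed script,
# one s|\<word\>||g substitution per stopword
with open('stopwords.txt') as src, open('stopwords.sed', 'w') as dst:
    for line in src:
        word = line.strip()
        if word:
            dst.write('s|\\<{}\\>||g\n'.format(word))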

Worth a try.

Comments
  • You seem to be calling st.read() outside of your with block, meaning st will be closed. Also, what is the problem with this code so far?
  • hey @JammyDodger thanks for the reply! I've updated my original question ^ with the problem. If you could help that would be much appreciated!!
  • that's the strategy I wanted to go for, but we're not allowed to use NLTK... we've been given a text file with a bunch of stopwords they want us to remove from the document file... any suggestions?!? thanks!!!
  • So you only need to delete certain words from a text? No stemming involved?
  • yes, so for example if stopwords.txt is [a, the, it, from], then text.txt [Hello I am from the UK] should become [Hello I am UK]. I also need to stem, but that's the next step! I've managed to get the stemming working, but I need to remove the stopwords first.
  • Check the snippet I added in the edit. Didn't get to test it, but should work.
  • hey! I think the snippet has the right idea, but the replace method originally didn't work because it was expecting two arguments (I updated that in my code above), and I'm still having problems :(