What should I do to reduce the running time of this process: matching text file keywords against a CSV column in Python?

I am using the following code: I have a dictionary file, Dictionary.txt, and a search text file, SearchText.csv, and I use regex to find, store, and count the matching keywords in each row.

I have a problem: some of the dictionary files contain thousands or hundreds of thousands of keywords, and processing takes far too long. I ran the code on one dictionary with 300,000 keywords, and after an hour it had not written a single row.

So, what should I do to reduce the running time of this process?

import csv
import re
import time

# Load every dictionary keyword once
allCities = open('Dictionary.txt', encoding="utf8").readlines()

timestr = time.strftime("%Y-%m-%d-(%H-%M-%S)")
with open('SearchText.csv') as descriptions, open('Result---' + timestr + '.csv', 'w', newline='') as output:
    descriptions_reader = csv.DictReader(descriptions)
    fieldnames = ['Sr_Num', 'Search', 'matched Keywords', 'Total matches']
    output_writer = csv.DictWriter(output, delimiter='|', fieldnames=fieldnames)
    output_writer.writeheader()
    line = 0
    for eachRow in descriptions_reader:
        matches = 0
        Sr_Num = eachRow['Sr_Num']
        description = eachRow['Text']
        citiesFound = set()
        # Scan the full description once for every keyword
        for eachcity in allCities:
            eachcity = eachcity.strip()
            if re.search('\\b' + eachcity + '\\b', description, re.IGNORECASE):
                citiesFound.add(eachcity)
                matches += 1
        if len(citiesFound) == 0:
            output_writer.writerow({'Sr_Num': Sr_Num, 'Search': description,
                                    'matched Keywords': " - ", 'Total matches': matches})
        else:
            output_writer.writerow({'Sr_Num': Sr_Num, 'Search': description,
                                    'matched Keywords': " , ".join(citiesFound), 'Total matches': matches})
        line += 1
        print(line)

print(" Process Complete ! ")

Here is an example of some rows from Dictionary.txt:

les Escaldes
Andorra la Vella
Umm al Qaywayn
Ras al Khaimah
Khawr Fakkn
Dubai
Dibba Al Fujairah
Dibba Al Hisn
Sharjah
Ar Ruways

Your biggest time waster is this line:

if re.search('\\b' + eachcity + '\\b', description, re.IGNORECASE):

You are searching the whole description for every city in allCities. That's a lot of searching. Consider splitting description into words once with nltk.word_tokenize(), converting the result to a set, converting allCities into a set as well, and taking the set intersection. Something like this:

citiesFound = set(nltk.word_tokenize(description)) & set(allCities)

No inner loop required.
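
A minimal runnable sketch of that idea (lowercasing is added here to mirror the re.IGNORECASE behavior; it assumes nltk is installed with the 'punkt' tokenizer data downloaded, and note that multi-word names such as "les Escaldes" span several tokens, so they would need extra handling, e.g. intersecting word bigrams too):

import nltk

# Build the keyword set once, normalized to lowercase
allCities = {line.strip().lower() for line in open('Dictionary.txt', encoding='utf8') if line.strip()}

def find_cities(description):
    # Tokenize each description once, then do a single hash-based set intersection
    words = {word.lower() for word in nltk.word_tokenize(description)}
    return words & allCities

This replaces 300,000 regex searches per row with one tokenization pass and one set intersection.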

Perform operations that only need to be executed once exactly once:

Instead of

eachcity.strip()

and

re.IGNORECASE

in the loop do

allCities = [city.strip().lower() for city in allCities]

outside of the loop, and convert description to lowercase.
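
Put together, a sketch of the hoisted version (re.escape is an addition here, an assumption to guard against regex metacharacters appearing in keywords):

import re

allCities = [line.strip().lower() for line in open('Dictionary.txt', encoding='utf8') if line.strip()]

for eachRow in descriptions_reader:
    description = eachRow['Text'].lower()  # lowercase once per row instead of re.IGNORECASE per keyword
    citiesFound = {city for city in allCities
                   if re.search(r'\b' + re.escape(city) + r'\b', description)}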

You can remove matches += 1 as well (it is the same as len(citiesFound)), but that will not give much of an improvement.

If you do not know where your bottleneck really is, run a profiler on your code to find the real culprit. There is also a SO question regarding profiling which is very useful.
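
For the profiler, the standard library's cProfile is enough to get started. A minimal sketch, assuming the matching code has been wrapped in a main() function:

import cProfile

cProfile.run('main()', sort='cumtime')  # list functions by cumulative time spent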

Another possibility is to use C or languages which are more optimized for text handling, like awk or sed.

Use databases instead of the file system.

In your case I'd probably use Elasticsearch or MongoDB. Those systems are made for handling large amounts of data.
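
As a rough sketch of the Elasticsearch route (official elasticsearch Python client, 8.x call signatures; the local node URL, index name, and sample data are assumptions): index each description once, then a single phrase query per keyword returns every matching row:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumes a locally running node

# Index each row of SearchText.csv once
es.index(index="descriptions", id="1", document={"text": "Flight from Dubai to Sharjah"})

# Elasticsearch handles word-boundary matching and scales to large corpora
hits = es.search(index="descriptions", query={"match_phrase": {"text": "dubai"}})
print(hits["hits"]["total"])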

In addition to Jan Christoph Terasa's answer:

1. allCities is a candidate for a set

So:

allCities = {city.strip().lower() for city in allCities}

and even more:

2. Use a set of precompiled regular expressions
allCities = {re.compile(r'\b' + city.strip().lower() + r'\b') for city in allCities}
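
A sketch of how the precompiled patterns would be used; a list of (city, pattern) pairs keeps the original keyword available for the output, and re.escape is added as an assumption in case a keyword contains regex metacharacters:

import re

compiled = [(city, re.compile(r'\b' + re.escape(city) + r'\b'))
            for city in (line.strip().lower() for line in open('Dictionary.txt', encoding='utf8'))
            if city]

for eachRow in descriptions_reader:
    description = eachRow['Text'].lower()
    citiesFound = {city for city, pattern in compiled if pattern.search(description)}

Compiling once avoids re-parsing each pattern for every (row, keyword) pair; Python's internal regex cache softens that cost for small dictionaries, but it is far too small for 300,000 patterns.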

Comments
  • csv.DictWriter is buffered. The .writerow() method does not immediately write the results to the disk file, so the fact that the file is empty does not mean there is no progress. Consider printing something to the console to track execution, or flush explicitly (see the sketch after these comments).
  • I like the set approach here. Could use a Counter after that to get counts of cities matched.
  • @TammoHeeren Sure. (Though not required in the OP.)
  • Correct. Thought OP was looking for individual counts, but was looking for total count only.
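
On the buffering point above, a minimal sketch of forcing each row to disk so progress is visible (at some cost in speed):

output_writer.writerow({'Sr_Num': Sr_Num, 'Search': description,
                        'matched Keywords': " , ".join(citiesFound), 'Total matches': matches})
output.flush()  # push the buffered row out so the result file grows visibly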