How to avoid loading a large file into a python script repeatedly?

I've written a Python script that takes a large file (a matrix of ~50k rows × ~500 columns) and uses it as a dataset to train a random forest model.

My script has two functions: one to load the dataset and the other to train the random forest model on that data. Both work fine, but loading the file takes ~45 seconds, and it's a pain to do this every time I want to train a subtly different model (I am testing many models on the same dataset). Here is the file-loading code:

import io

import numpy as np


def load_train_data(train_file):
    # Read in the training file
    train_f = io.open(train_file)
    train_id_list = []
    train_val_list = []
    for line in train_f:
        list_line = line.strip().split("\t")
        if list_line[0] != "Domain":          # skip header rows
            train_identifier = list_line[9]   # label column
            train_values = list_line[12:]     # feature columns
            train_id_list.append(train_identifier)
            train_val_float = [float(x) for x in train_values]
            train_val_list.append(train_val_float)
    train_f.close()
    train_val_array = np.asarray(train_val_list)

    return train_id_list, train_val_array

This returns a list of identifiers (column 9) and a NumPy array of the values from column 12 onward, which are the features used to train the random forest.

I am going to train many different forms of my model with the same data, so I just want to load the file once and have it available to feed into my random forest function. I think I want the loaded data to be an object (I am fairly new to Python).

If I understand you correctly, the dataset does not change, but the model parameters do, and you are changing the parameters after each run.

I would put the file-loading code in one file and run it in the Python interpreter. The data will then be loaded and stay in memory under whatever variable you assign it to.

Then you can import another file with your model code and run that with the training data as an argument.

If all your model changes can be expressed as parameters in a function call, all you need to do is import your model and then call the training function with different parameter settings.
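
In practice that workflow might look like the following sketch, typed into the interpreter. The module and function names (load_data.py, train_model.py, train_rf, the filename) are placeholders, not from the question:

from load_data import load_train_data
import train_model

train_ids, train_vals = load_train_data("train.txt")   # the slow step, done once

# train several variants against the same in-memory data
model_a = train_model.train_rf(train_ids, train_vals, n_estimators=100)
model_b = train_model.train_rf(train_ids, train_vals, n_estimators=500, max_depth=10)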

If you need to change the model code between runs, save it under a new filename, import that one, and run again, passing the source data to it.

If you don't want to save each model modification under a new filename, you may be able to use the reload functionality, depending on your Python version, though it is not generally recommended (see "Proper way to reload a python module from the console").
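
If you do go the reload route on Python 3, importlib.reload is the relevant call. A minimal sketch, assuming the model code lives in a module called train_model (a placeholder name):

import importlib
import train_model

# ... edit train_model.py ...

importlib.reload(train_model)   # re-executes the module so the edits take effect
model = train_model.train_rf(train_ids, train_vals, n_estimators=200)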

The simplest way would be to cache the results, like so:

_train_data_cache = {}

def load_cached_train_data(train_file):
    # Only parse the file the first time it is requested; afterwards,
    # return the copy already held in memory.
    if train_file not in _train_data_cache:
        _train_data_cache[train_file] = load_train_data(train_file)
    return _train_data_cache[train_file]
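
On Python 3 the same idea is available from the standard library via functools.lru_cache, assuming load_train_data from the question is importable and is only ever called with hashable arguments such as a file path:

from functools import lru_cache

@lru_cache(maxsize=None)
def load_cached_train_data(train_file):
    # The file is parsed on the first call; subsequent calls with the
    # same path return the cached (id_list, value_array) tuple.
    return load_train_data(train_file)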

Try to learn about Python data serialization. You would basically store the large file as a Python-specific serialized binary object using Python's marshal module. This would drastically speed up I/O on the file. See these benchmarks for performance variations. However, if these random forest models are all trained at the same time, then you could just train them against the dataset you already have in memory and release the training data after completion.
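
Note that marshal only handles Python's core built-in types; for the NumPy array in the question, pickle (or numpy's own np.save/np.load) is the more usual choice. A rough sketch with pickle, using a placeholder filename:

import pickle

# One-time conversion: parse the slow text file once, store a binary snapshot.
train_ids, train_vals = load_train_data("train.txt")
with open("train_data.pkl", "wb") as f:
    pickle.dump((train_ids, train_vals), f, protocol=pickle.HIGHEST_PROTOCOL)

# Later runs: loading the binary snapshot is much faster than re-parsing the text.
with open("train_data.pkl", "rb") as f:
    train_ids, train_vals = pickle.load(f)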

Load your data in IPython.

my_data = open("data.txt")  # or however you load your dataset

Write your code in a Python script, say example.py, that uses this data. At the top of example.py, add these lines:

import sys

# the value passed on the %run command line arrives via sys.argv
args = sys.argv
data = args[1]
...

Now run the script in IPython:

%run example.py $my_data

Now, when running your script repeatedly, you don't need to load the data multiple times.
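
A related IPython option, if you would rather not pass anything on the command line: %run -i runs the script inside the interactive namespace, so the script can use the already-loaded variables directly (the names below are placeholders):

# in the IPython session, done once:
train_ids, train_vals = load_train_data("data.txt")

# then, as often as needed:
%run -i example.py   # example.py can reference train_ids / train_vals directly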

Comments
  • I believe that if you run it in the Python console, you can load the file once and then load other files / call functions separately, without having to reload the file
  • You really should have a look at the pandas library for data handling. Manipulating data with it is child's play, and you will be able to grasp it fairly quickly if you have used R before. Specifically, have a look at the read_xxx functions in the documentation, which let you load different file formats into a DataFrame (a short example follows below).
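
For example, the tab-separated training file from the question could be read in one call. A sketch, assuming a single header row; the filename and everything apart from the column positions mentioned in the question are placeholders:

import pandas as pd

df = pd.read_csv("train.txt", sep="\t")              # tab-delimited; header row handled automatically
train_ids = df.iloc[:, 9].tolist()                   # label column
train_vals = df.iloc[:, 12:].to_numpy(dtype=float)   # feature columns as a float array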