How to load batches of CSV files using tf.data and map

I have been searching for quite some time for an answer on how to go about this and can't find anything that works.

I am following a tutorial on using the tf.data API, found here. My scenario is very similar to the one in that tutorial (i.e., I have three directories containing all the training/validation/test files); however, the files are not images, they are spectrograms saved as CSVs.

I have found a couple of solutions for reading a CSV line by line, where each line is a training instance (e.g., How to *actually* read CSV data in TensorFlow?). My issue with that implementation is the required record_defaults parameter, since my CSVs are 500x200.

Here is what I was thinking:

import tensorflow as tf
import pandas as pd

def load_data(path, label):
    # This obviously doesn't work because path and label
    # are Tensors, but this is what I had in mind...
    data = pd.read_csv(path, index_col=0).values
    return data, label

X_train = tf.constant(training_files)   # training_files is a list of the file names
Y_train = tf.constant(training_labels)  # training_labels is a list of labels for each file

train_data = tf.data.Dataset.from_tensor_slices((X_train, Y_train))

# Here is where I thought I would do the mapping of 'load_data' over each batch
train_data = train_data.batch(64).map(load_data)

iterator = tf.data.Iterator.from_structure(train_data.output_types,
                                           train_data.output_shapes)
next_batch = iterator.get_next()
train_op = iterator.make_initializer(train_data)

I have only used TensorFlow's feed_dict in the past, but I need a different approach now that my data has grown too large to fit in memory.

Any thoughts? Thanks.

I was also looking for "how to load CSV files using tf.data" and the following example was very useful for me.

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/learn/iris.py

I hope it'll help you too.
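
For reference, when each row of a CSV is a separate example (which is not quite the whole-file-per-example case in the question), tf.data.experimental.make_csv_dataset gives a batched dataset directly. A minimal sketch, assuming CSVs with a header row and a column named 'label' (the file pattern and column name below are placeholders, not from the linked example):

import tensorflow as tf

# Each element is a (features_dict, label) pair, already batched
dataset = tf.data.experimental.make_csv_dataset(
    file_pattern='train/*.csv',
    batch_size=64,
    label_name='label',
    num_epochs=1,
    shuffle=True)

for features, labels in dataset.take(1):
    print(list(features.keys()), labels.shape)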

I use TensorFlow (2.0) tf.data to read my CSV dataset. I have several folders, one for each class, and each folder contains thousands of CSV files of data points. Below is the code I use for the data input pipeline. Hope this helps.

import tensorflow as tf
import numpy as np

def tf_parse_filename(filename):

    def parse_filename(filename_batch):
        data = []
        labels = []
        for filename in filename_batch:
            # Decode the filename tensor into a Python string
            filename_str = filename.numpy().decode()
            # Read the .csv file into a float32 NumPy array
            data_point = np.loadtxt(filename_str, delimiter=',').astype(np.float32)

            # Create a one-hot label (get_label and n_classes are defined elsewhere)
            current_label = get_label(filename)
            label = np.zeros(n_classes, dtype=np.float32)
            label[current_label] = 1.0

            data.append(data_point)
            labels.append(label)

        return np.stack(data), np.stack(labels)

    x, y = tf.py_function(parse_filename, [filename], [tf.float32, tf.float32])
    return x, y

AUTOTUNE = tf.data.experimental.AUTOTUNE

train_ds = tf.data.Dataset.from_tensor_slices(TRAIN_FILES)
train_ds = train_ds.batch(BATCH_SIZE, drop_remainder=True)
train_ds = train_ds.map(tf_parse_filename, num_parallel_calls=AUTOTUNE)
train_ds = train_ds.prefetch(buffer_size=AUTOTUNE)

# Train over epochs
for i in range(num_epochs):
    # Train on batches
    for x_train, y_train in train_ds:
        train_step(x_train, y_train)

print('Training done!')

"TRAIN_FILES" is a matrix (e.g. pandas dataframe) where the first column is the label of a data point and the second column is the path to the csv file containing the data point.

I suggest looking at this thread. It provides a complete example of how to use the Dataset API to read data from multiple CSV files.

Tensorflow Python reading 2 files
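
In case that thread moves, here is a minimal per-row sketch (not taken from the thread) using tf.data.experimental.CsvDataset over multiple files; the filenames and the three-column layout are placeholders:

import tensorflow as tf

# Hypothetical CSVs with two float feature columns and one int label column
files = ['file1.csv', 'file2.csv']
record_defaults = [tf.float32, tf.float32, tf.int32]

dataset = tf.data.experimental.CsvDataset(files, record_defaults, header=True)
dataset = dataset.map(lambda f1, f2, label: ((f1, f2), label)).batch(32)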

Addendum:

Not sure how relevant the problem is as of today. From the comments, @markdjthomas mentions that his problem is slightly different: he needs to read several rows at a time instead of one. The following example may come in handy there as well. Sharing it here in case anyone else needs it too...

import tensorflow as tf
import numpy as np
from tensorflow.contrib.data.python.ops import sliding

sequence = np.array([ [[1]], [[2]], [[3]], [[4]], [[5]], [[6]], [[7]], [[8]], [[9]] ])
labels = [1,0,1,0,1,0,1,0,1]

# create TensorFlow Dataset object
data = tf.data.Dataset.from_tensor_slices((sequence, labels))

# sliding window batch
window_size = 3
window_shift = 1
data = data.apply(sliding.sliding_window_batch(window_size=window_size, window_shift=window_shift))
data = data.shuffle(1000, reshuffle_each_iteration=False)
data = data.batch(3)
# Alternatively: iter = data.make_initializable_iterator()
iter = tf.data.Iterator.from_structure(data.output_types, data.output_shapes)
el = iter.get_next()
# create initialization ops 
init_op = iter.make_initializer(data)

NR_EPOCHS = 3
with tf.Session() as sess:
    for e in range(NR_EPOCHS):
        print("\nepoch: ", e, "\n")
        sess.run(init_op)
        print("1  ", sess.run(el))
        print("2  ", sess.run(el))
        print("3  ", sess.run(el))

And the output...

epoch:  0 

1   (array([[[[5]],

        [[6]],

        [[7]]],


       [[[4]],

        [[5]],

        [[6]]],


       [[[1]],

        [[2]],

        [[3]]]]), array([[1, 0, 1],
       [0, 1, 0],
       [1, 0, 1]], dtype=int32))
2   (array([[[[3]],

        [[4]],

        [[5]]],


       [[[2]],

        [[3]],

        [[4]]],


       [[[7]],

        [[8]],

        [[9]]]]), array([[1, 0, 1],
       [0, 1, 0],
       [1, 0, 1]], dtype=int32))
3   (array([[[[6]],

        [[7]],

        [[8]]]]), array([[0, 1, 0]], dtype=int32))

epoch:  1 

1   (array([[[[1]],

        [[2]],

        [[3]]],


       [[[6]],

        [[7]],

        [[8]]],


       [[[2]],

        [[3]],

        [[4]]]]), array([[1, 0, 1],
       [0, 1, 0],
       [0, 1, 0]], dtype=int32))
2   (array([[[[5]],

        [[6]],

        [[7]]],


       [[[3]],

        [[4]],

        [[5]]],


       [[[7]],

        [[8]],

        [[9]]]]), array([[1, 0, 1],
       [1, 0, 1],
       [1, 0, 1]], dtype=int32))
3   (array([[[[4]],

        [[5]],

        [[6]]]]), array([[0, 1, 0]], dtype=int32))

epoch:  2 

1   (array([[[[1]],

        [[2]],

        [[3]]],


       [[[5]],

        [[6]],

        [[7]]],


       [[[2]],

        [[3]],

        [[4]]]]), array([[1, 0, 1],
       [1, 0, 1],
       [0, 1, 0]], dtype=int32))
2   (array([[[[4]],

        [[5]],

        [[6]]],


       [[[3]],

        [[4]],

        [[5]]],


       [[[7]],

        [[8]],

        [[9]]]]), array([[0, 1, 0],
       [1, 0, 1],
       [1, 0, 1]], dtype=int32))
3   (array([[[[6]],

        [[7]],

        [[8]]]]), array([[0, 1, 0]], dtype=int32))
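
Note that tf.contrib was removed in TensorFlow 2.x, so sliding_window_batch is no longer available there. Roughly the same sliding-window behaviour can be sketched with Dataset.window plus flat_map (same window size and shift as above; this is an adaptation for illustration, not part of the original answer):

import tensorflow as tf
import numpy as np

sequence = np.array([[[1]], [[2]], [[3]], [[4]], [[5]], [[6]], [[7]], [[8]], [[9]]])
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1]

data = tf.data.Dataset.from_tensor_slices((sequence, labels))

# Each window is a pair of small datasets; zip and batch them back into tensors
window_size, window_shift = 3, 1
data = data.window(window_size, shift=window_shift, drop_remainder=True)
data = data.flat_map(lambda seq, lab: tf.data.Dataset.zip((seq.batch(window_size),
                                                           lab.batch(window_size))))
data = data.shuffle(1000).batch(3)

for seq_batch, label_batch in data:
    print(seq_batch.numpy().shape, label_batch.numpy())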

Comments
  • I don't understand the problem you describe with record_defaults. Can you elaborate?
  • @mikkola sure. As far as I can tell, reading the CSV line by line would require that I create a list of length 200 every time I wanted to read a file (record_defaults = [[0],[0],...,[0]]) and then do something like cols = tf.decode_csv(csv_row, record_defaults=record_defaults) and data = tf.stack(cols), which seemed like a lot of overhead for every file.
  • Ah, I see. Could still be worth a try? You only need to create one constant tensor to do that, and it can be shared between calls, right? Another option I have had success with is to read the whole file contents using tf.read_file, then split it appropriately (see tf.string_split) or directly interpret it as CSV using tf.decode_csv (a sketch of this whole-file approach is included after these comments).
  • A list of 200 tensors does not sound bad at all, and you can reuse the same tf.constant(0) tensor. I would definitely give it a try.
  • Link is 404. In the future, please post the relevant code.
  • Thanks for your input, but this isn't really what I am looking for, as each line is not a unique training example. The entire file is a single training example (i.e., a 2D array).
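
To illustrate the whole-file approach mentioned in the comments above, here is a rough sketch using the TF 2.x names (tf.io.read_file / tf.strings.split rather than tf.read_file / tf.string_split); it assumes plain numeric CSVs with no header or index column, which may not match the original files exactly:

import tensorflow as tf

def load_spectrogram(path, label):
    # Read the entire file as one string, split it into rows and columns,
    # then convert every cell to float32
    raw = tf.io.read_file(path)
    rows = tf.strings.split(raw, '\n')
    rows = tf.boolean_mask(rows, tf.strings.length(rows) > 0)  # drop a trailing blank line
    cells = tf.strings.split(rows, ',')  # RaggedTensor of shape [n_rows, n_cols]
    data = tf.strings.to_number(cells.to_tensor(), tf.float32)
    return data, label

# Map over individual files first, then batch the decoded 2-D arrays
train_data = (tf.data.Dataset.from_tensor_slices((training_files, training_labels))
              .map(load_spectrogram)
              .batch(64))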