Feeding .npy (numpy files) into tensorflow data pipeline

TensorFlow seems to lack a reader for ".npy" files. How can I read my data files into the new tensorflow.data.Dataset pipeline? My data doesn't fit in memory.

Each object is saved in a separate ".npy" file. Each file contains two different ndarrays as features and a scalar as their label.

Does your data fit into memory? If so, you can follow the instructions from the Consuming NumPy Arrays section of the docs:

Consuming NumPy arrays

If all of your input data fits in memory, the simplest way to create a Dataset from it is to convert it to tf.Tensor objects and use Dataset.from_tensor_slices().

# Load the training data into two NumPy arrays. Note that `np.load` only
# acts as a context manager for `.npz` archives (saved with `np.savez`),
# not for plain `.npy` files.
with np.load("/var/data/training_data.npz") as data:
  features = data["features"]
  labels = data["labels"]

# Assume that each row of `features` corresponds to the same row as `labels`.
assert features.shape[0] == labels.shape[0]

dataset = tf.data.Dataset.from_tensor_slices((features, labels))

In the case that the file doesn't fit into memory, it seems like the only recommended approach is to first convert the npy data into the TFRecord format, and then read it back with tf.data.TFRecordDataset, which streams records without fully loading the file into memory.

Here is a post with some instructions.
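If you go that route, the conversion itself is mostly boilerplate. Below is a minimal sketch (the function names and the feature keys 'data'/'shape'/'label' are my own choices, not from the linked post): each .npy file and its label are serialized into one tf.train.Example, and a parsing function turns the records back into tensors.

```python
import numpy as np
import tensorflow as tf

def npy_files_to_tfrecord(npy_paths, labels, tfrecord_path):
    """Serialize one Example per (.npy file, label) pair into a TFRecord."""
    with tf.io.TFRecordWriter(tfrecord_path) as writer:
        for path, label in zip(npy_paths, labels):
            arr = np.load(path).astype(np.float32)
            example = tf.train.Example(features=tf.train.Features(feature={
                'data': tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[arr.tobytes()])),
                'shape': tf.train.Feature(
                    int64_list=tf.train.Int64List(value=list(arr.shape))),
                'label': tf.train.Feature(
                    float_list=tf.train.FloatList(value=[float(label)])),
            }))
            writer.write(example.SerializeToString())

def tfrecord_dataset(tfrecord_path):
    """Stream the examples back without loading everything into memory."""
    spec = {
        'data': tf.io.FixedLenFeature([], tf.string),
        'shape': tf.io.VarLenFeature(tf.int64),
        'label': tf.io.FixedLenFeature([], tf.float32),
    }
    def parse(record):
        parsed = tf.io.parse_single_example(record, spec)
        shape = tf.sparse.to_dense(parsed['shape'])
        data = tf.reshape(tf.io.decode_raw(parsed['data'], tf.float32), shape)
        return data, parsed['label']
    return tf.data.TFRecordDataset(tfrecord_path).map(parse)
```

Storing the shape alongside the raw bytes means the arrays don't all need the same shape.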

FWIW, it seems crazy to me that TFRecordDataset cannot be pointed at a directory or at npy file name(s) directly, but it appears to be a limitation of plain TensorFlow.

If you can split the single large npy file into smaller files that each roughly represent one batch for training, then you could write a custom data generator in Keras that would yield only the data needed for the current batch.
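A minimal sketch of such a generator, using keras.utils.Sequence (the class name and the one-shard-per-batch file layout are hypothetical, just for illustration):

```python
import numpy as np
from tensorflow import keras

class NpyBatchSequence(keras.utils.Sequence):
    """One pair of .npy shards per batch; only the current batch is in memory."""

    def __init__(self, feature_paths, label_paths):
        assert len(feature_paths) == len(label_paths)
        self.feature_paths = list(feature_paths)
        self.label_paths = list(label_paths)

    def __len__(self):
        # Number of batches per epoch equals the number of shards.
        return len(self.feature_paths)

    def __getitem__(self, idx):
        # Load only the shard needed for this batch.
        x = np.load(self.feature_paths[idx])
        y = np.load(self.label_paths[idx])
        return x, y
```

An instance can be passed directly to model.fit(), and Keras will iterate over the shards batch by batch.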

In general, if your dataset cannot fit in memory, storing it as one single large npy file makes it very hard to work with. Preferably, you should reformat the data first, either as TFRecord or as multiple smaller npy files, and then use one of the methods above.

Follow-up question: if I use tf.read_file(filename) on an .npy file, then according to the documentation, this function returns a string Tensor (a byte array). How can I convert this into a Tensor representing the data in the NumPy array? Is there an equivalent of tf.image.decode_image() for decoding a NumPy array?

It is actually possible to read NPY files directly with TensorFlow instead of converting to TFRecords. The key pieces are tf.data.FixedLengthRecordDataset and tf.decode_raw (tf.io.decode_raw in TF 2.x), along with a look at the documentation of the NPY format. For simplicity, let's suppose you are given a float32 NPY file containing an array with shape (N, K), and you know the number of features K beforehand, as well as the fact that it is a float32 array. An NPY file is just a binary file with a small header followed by the raw array data (object arrays are different, but we're considering numbers here). In short, you can find the size of this header with a function like this:

def npy_header_offset(npy_path):
    """Return the byte offset at which the raw array data starts."""
    with open(str(npy_path), 'rb') as f:
        # Every NPY file starts with the magic string '\x93NUMPY'.
        if f.read(6) != b'\x93NUMPY':
            raise ValueError('Invalid NPY file.')
        version_major, version_minor = f.read(2)
        # Format version 1 stores the header length in 2 bytes,
        # version 2 in 4 bytes (both little-endian).
        if version_major == 1:
            header_len_size = 2
        elif version_major == 2:
            header_len_size = 4
        else:
            raise ValueError('Unknown NPY file version {}.{}.'.format(version_major, version_minor))
        header_len = sum(b << (8 * i) for i, b in enumerate(f.read(header_len_size)))
        header = f.read(header_len)
        # The header is padded with spaces and always terminated by a newline.
        if not header.endswith(b'\n'):
            raise ValueError('Invalid NPY file.')
        return f.tell()

With this you can create a dataset like this:

import tensorflow as tf

npy_file = 'my_file.npy'
num_features = ...
dtype = tf.float32
header_offset = npy_header_offset(npy_file)
dataset = tf.data.FixedLengthRecordDataset([npy_file], num_features * dtype.size, header_bytes=header_offset)

Each element of this dataset contains a long string of bytes representing a single example. You can now decode it to obtain an actual array:

dataset = dataset.map(lambda s: tf.decode_raw(s, dtype))

The elements will have an indeterminate shape, though, because TensorFlow does not keep track of the length of the strings. Since you know the number of features, you can enforce the shape yourself:

dataset = dataset.map(lambda s: tf.reshape(tf.decode_raw(s, dtype), (num_features,)))

Similarly, you can choose to perform this step after batching, or combine it in whatever way you feel like.
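Decoding after batching works because decode_raw vectorizes over a batch of equal-length strings, yielding a (batch, num_features) tensor in one call. A self-contained sketch (the file name 'demo.npy' and the tiny array are stand-ins for a real large file; the header offset is found here with NumPy's own parser rather than the function above, to keep the snippet independent):

```python
import numpy as np
import tensorflow as tf

# Create a small float32 NPY file to demonstrate with.
arr = np.arange(24, dtype=np.float32).reshape(6, 4)
np.save('demo.npy', arr)
num_features = 4

# Find where the raw data starts, using np.lib.format to parse the header.
with open('demo.npy', 'rb') as f:
    np.lib.format.read_magic(f)            # consume the magic and version
    np.lib.format.read_array_header_1_0(f) # consume the header itself
    header_offset = f.tell()

record_bytes = num_features * tf.float32.size
dataset = tf.data.FixedLengthRecordDataset(
    ['demo.npy'], record_bytes, header_bytes=header_offset)

# Batch the raw byte strings first, then decode each batch in one go.
dataset = dataset.batch(3)
dataset = dataset.map(lambda s: tf.io.decode_raw(s, tf.float32))
```

Each element of the resulting dataset is a (3, 4) float32 tensor.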

The limitation is that you have to know the number of features in advance. It is possible to extract it from the NumPy header, but that is a bit of a pain and very hard to do from within TensorFlow itself, so the file names would need to be known in advance. Another limitation is that, as written, the solution requires you to use either a single file per dataset or files that share the same header size, although if you know that all the arrays have the same shape and dtype, that should actually be the case.
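Outside of TensorFlow, though, extracting the shape is easy, because NumPy ships its own header parser in np.lib.format. A sketch (the function name npy_info is mine):

```python
import numpy as np

def npy_info(npy_path):
    """Return (shape, dtype, data_offset) of an NPY file without loading it."""
    with open(npy_path, 'rb') as f:
        major, minor = np.lib.format.read_magic(f)
        if major == 1:
            shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(f)
        else:
            shape, fortran_order, dtype = np.lib.format.read_array_header_2_0(f)
        # After reading the header, the file cursor sits at the raw data.
        return shape, dtype, f.tell()
```

This can be run once per file ahead of time to feed num_features and header_offset into the pipeline above.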

Admittedly, if one considers this kind of approach it may just be better to have a pure binary file without headers, and either hard code the number of features or read them from a different source...

You can do it with tf.py_func, see the example here. The parse function would simply decode the filename from bytes to string and call np.load.

Update: something like this:

def read_npy_file(item):
    data = np.load(item.decode())
    return data.astype(np.float32)

file_list = ['/foo/bar.npy', '/foo/baz.npy']

dataset = tf.data.Dataset.from_tensor_slices(file_list)

dataset = dataset.map(
        lambda item: tuple(tf.py_func(read_npy_file, [item], [tf.float32,])))
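Note that tf.py_func is TensorFlow 1.x API; in TensorFlow 2.x the same idea is written with tf.numpy_function. A sketch (the helper make_npy_dataset is my own wrapper; it also restores the static shape, which py_func-style ops lose, assuming num_features is known):

```python
import numpy as np
import tensorflow as tf

def read_npy_file(path):
    # Inside tf.numpy_function the filename arrives as a bytes object.
    return np.load(path.decode()).astype(np.float32)

def make_npy_dataset(file_list, num_features):
    dataset = tf.data.Dataset.from_tensor_slices(file_list)
    dataset = dataset.map(
        lambda path: tf.numpy_function(read_npy_file, [path], tf.float32))
    # numpy_function returns tensors of unknown shape; pin it back down.
    return dataset.map(lambda x: tf.ensure_shape(x, (num_features,)))
```

Each element of the returned dataset is then a float32 tensor of shape (num_features,), loaded lazily from disk.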

See also: Support NumPy for tensorflow-io · Issue #68 · tensorflow/io · GitHub.

Comments
  • I have seen that guide, but unfortunately my data doesn't fit in memory!
  • Thank you very much, but converting my numpy files to TFRecord is the last thing I want to do, since I have around 5,000,000 files and it would take a long time. I think I will go with the Keras generator idea. Thanks again!
  • Each file of your 5,000,000 files doesn't fit into memory?
  • "The parse function would simply decode the filename from bytes to string and call np.load." Can you please provide code for this?