## Why is TensorFlow's `tf.data` package slowing down my code?

I'm just learning to use TensorFlow's `tf.data` API, and I've found that it is slowing my code down a lot, measured in time per epoch. This is the opposite of what it's supposed to do, I thought. I wrote a simple linear regression program to test it out.

**TL;DR**: With 100,000 training examples, `tf.data` slows time per epoch down by about a factor of ten if you're using full-batch training, and worse if you use smaller batches. The opposite is true with 500 training examples.

**My question:** What is going on? Is my implementation flawed? Other sources I've read have `tf.data` improving speeds by about 30%.

```python
import tensorflow as tf
import numpy as np
import timeit
import os

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
tf.logging.set_verbosity(tf.logging.ERROR)

n_epochs = 10
input_dimensions_list = [10]

def function_to_approximate(x):
    return np.dot(x, random_covector).astype(np.float32) + np.float32(.01) * np.random.randn(1,1).astype(np.float32)

def regress_without_tfData(n_epochs, input_dimension, training_inputs, training_labels):
    tf.reset_default_graph()
    weights = tf.get_variable("weights", initializer=np.random.randn(input_dimension, 1).astype(np.float32))
    X = tf.placeholder(tf.float32, shape=(None, input_dimension), name='X')
    Y = tf.placeholder(tf.float32, shape=(None, 1), name='Y')
    prediction = tf.matmul(X, weights)
    loss = tf.reduce_mean(tf.square(tf.subtract(prediction, Y)))
    loss_op = tf.train.AdamOptimizer(.01).minimize(loss)
    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init)
        for _ in range(n_epochs):
            sess.run(loss_op, feed_dict={X: training_inputs, Y: training_labels})

def regress_with_tfData(n_epochs, input_dimension, training_inputs, training_labels, batch_size):
    tf.reset_default_graph()
    weights = tf.get_variable("weights", initializer=np.random.randn(input_dimension, 1).astype(np.float32))
    X, Y = data_set.make_one_shot_iterator().get_next()
    prediction = tf.matmul(X, weights)
    loss = tf.reduce_mean(tf.square(tf.subtract(prediction, Y)))
    loss_op = tf.train.AdamOptimizer(.01).minimize(loss)
    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init)
        while True:
            try:
                sess.run(loss_op)
            except tf.errors.OutOfRangeError:
                break

for input_dimension in input_dimensions_list:
    for data_size in [500, 100000]:
        training_inputs = np.random.randn(data_size, input_dimension).astype(np.float32)
        random_covector = np.random.randint(-5, 5, size=(input_dimension, 1))
        training_labels = function_to_approximate(training_inputs)
        print("Not using tf.data, with data size "
              "{}, input dimension {} and training with "
              "a full batch, it took an average of "
              "{} seconds to run {} epochs.\n".format(
                  data_size,
                  input_dimension,
                  timeit.timeit(
                      lambda: regress_without_tfData(
                          n_epochs, input_dimension,
                          training_inputs, training_labels
                      ),
                      number=3
                  ),
                  n_epochs))

for input_dimension in input_dimensions_list:
    for data_size, batch_size in [(500, 50), (500, 500), (100000, 50), (100000, 100000)]:
        training_inputs = np.random.randn(data_size, input_dimension).astype(np.float32)
        random_covector = np.random.randint(-5, 5, size=(input_dimension, 1))
        training_labels = function_to_approximate(training_inputs)
        data_set = tf.data.Dataset.from_tensor_slices((training_inputs, training_labels))
        data_set = data_set.repeat(n_epochs)
        data_set = data_set.batch(batch_size)
        print("Using tf.data, with data size "
              "{}, and input dimension {}, and training with "
              "batch size {}, it took an average of {} seconds "
              "to run {} epochs.\n".format(
                  data_size,
                  input_dimension,
                  batch_size,
                  timeit.timeit(
                      lambda: regress_with_tfData(
                          n_epochs, input_dimension,
                          training_inputs, training_labels,
                          batch_size
                      ),
                      number=3
                  )/3,
                  n_epochs))
```

This outputs for me:

Not using tf.data, with data size 500, input dimension 10 and training with a full batch, it took an average of 0.20243382899980134 seconds to run 10 epochs.

Not using tf.data, with data size 100000, input dimension 10 and training with a full batch, it took an average of 0.2431719040000644 seconds to run 10 epochs.

Using tf.data, with data size 500, and input dimension 10, and training with batch size 50, it took an average of 0.09512088866661846 seconds to run 10 epochs.

Using tf.data, with data size 500, and input dimension 10, and training with batch size 500, it took an average of 0.07286913600000844 seconds to run 10 epochs.

Using tf.data, with data size 100000, and input dimension 10, and training with batch size 50, it took an average of 4.421892363666605 seconds to run 10 epochs.

Using tf.data, with data size 100000, and input dimension 10, and training with batch size 100000, it took an average of 2.2555197536667038 seconds to run 10 epochs.

**Edit:** Fixed an important issue that Fred Guth pointed out. It didn't much affect the results, though.

That is because you are comparing apples with bananas.

On one hand, when using placeholders, you are providing a monolithic tensor as is. On the other hand, when using `Dataset`, you are slicing the tensor into individual samples. This is very different.

The equivalent of providing a monolithic placeholder tensor with the `Dataset` pipeline is to use `tf.data.Dataset.from_tensors`. **When I use `from_tensors` in your example, I get similar (actually smaller) computation times than with placeholders.**
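
As a minimal illustration of the difference (written against the modern, eager `tf.data` API, where a dataset can be iterated directly instead of through a one-shot iterator): `from_tensor_slices` slices along the first axis and yields one element per sample, while `from_tensors` wraps the whole tensor as a single element.

```python
import numpy as np
import tensorflow as tf

x = np.random.randn(500, 10).astype(np.float32)
y = np.random.randn(500, 1).astype(np.float32)

# from_tensor_slices: slices along the first axis -> 500 elements of shape (10,)
sliced = tf.data.Dataset.from_tensor_slices((x, y))

# from_tensors: wraps the whole arrays as ONE element of shape (500, 10)
monolithic = tf.data.Dataset.from_tensors((x, y))

print(int(sliced.cardinality()))      # 500
print(int(monolithic.cardinality()))  # 1
```

So a full-batch placeholder run is comparable to `from_tensors`, not to `from_tensor_slices` followed by `batch`, which first splits the data into 100,000 elements and then reassembles them.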

If you want to compare a more sophisticated pipeline using `from_tensor_slices`, you should use a fair comparison with placeholders: for example, shuffle your data, or add some preprocessing on your slices. I have no doubt you will observe the performance gain that makes people switch to this pipeline.
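
A sketch of what such a "fair" per-sample pipeline might look like (the shuffle buffer size and the toy normalization step are illustrative choices, not anything from the question, and it is again written against the eager `tf.data` API):

```python
import numpy as np
import tensorflow as tf

x = np.random.randn(500, 10).astype(np.float32)
y = np.random.randn(500, 1).astype(np.float32)

ds = (tf.data.Dataset.from_tensor_slices((x, y))
      .shuffle(buffer_size=500)   # reshuffle the samples
      # toy per-sample preprocessing step (standardize each sample)
      .map(lambda a, b: ((a - tf.reduce_mean(a)) / (tf.math.reduce_std(a) + 1e-8), b))
      .batch(50)
      .prefetch(1))

n_batches = sum(1 for _ in ds)
print(n_batches)  # 10 batches of 50
```

Replicating the shuffle and per-sample map with raw placeholders would require hand-written NumPy code each epoch, which is where the pipeline starts to pay for itself.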


I wanted to test the Dataset API, which seems really convenient for processing data. I did a lot of timing tests with this API on CPU, GPU, and multi-GPU setups, for small and large networks with different types of data.

First, your code seems fine to me. But I need to point out that your network is just one simple layer.

Now, the Dataset API is not well suited to your type of network, but rather to networks with a lot more complexity. Why? For several reasons, which I explain below (found in my quest to understand the Dataset API).

Firstly, on one hand the Dataset API **processes data batch by batch**, whereas on the other hand with placeholders **the data are preprocessed up front**. Therefore, if the data fit in your RAM, you can save time by preprocessing them. Here your data are just too "simple". If you want to test what I am saying, try to find a really, really big dataset to process. Nevertheless, the Dataset API can be tuned with prefetching. You can take a look at this tutorial, which explains really well why it is good to process data with prefetch.

Secondly, in my quest to use the Dataset API for multi-GPU training, I discovered that, as far as I know, **the old preprocessing way is faster than the Dataset API for small neural networks**. You can verify that by creating a simple stackable RNN which takes a sequence as input. You can try different stack sizes (I have tested 1, 2, 10 and 20). You will see that, using the Dataset API, on 1 GPU or on 4 GPUs, the time does not differ for small RNN stacks (1, 2 and 5).

To summarize, **the Dataset API is best suited to neural networks whose data cannot be preprocessed up front**. Depending on your task, it may be more convenient to preprocess the data, for example if you want to tweak your network in order to improve it. I agree that the Dataset API is really cool for batching, padding, and also convenient for shuffling large amounts of data, but it is also not well suited to multi-GPU training.


First:

You are recreating the dataset unnecessarily.

`data_set = tf.data.Dataset.from_tensor_slices((training_inputs, training_labels))`

Create the dataset prior to the loop and change the `regress_with_tfData` input signature to use a dataset instead of `training_inputs` and `training_labels`.
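
A minimal sketch of this restructuring (the `make_pipeline` name and signature are hypothetical, the training step itself is elided, and the snippet uses the eager `tf.data` API for brevity):

```python
import numpy as np
import tensorflow as tf

x = np.random.randn(500, 10).astype(np.float32)
y = np.random.randn(500, 1).astype(np.float32)

# Build the base dataset ONCE, outside any timed loop.
base = tf.data.Dataset.from_tensor_slices((x, y))

def make_pipeline(dataset, n_epochs, batch_size):
    # The function now receives the prebuilt dataset and only applies
    # the cheap repeat/batch transformations per run.
    return dataset.repeat(n_epochs).batch(batch_size)

pipeline = make_pipeline(base, n_epochs=10, batch_size=50)
n_batches = sum(1 for _ in pipeline)
print(n_batches)  # (500 / 50) * 10 = 100 batches
```

This way the expensive `from_tensor_slices` construction (which copies the NumPy arrays into the graph) no longer counts against each timed call.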

Second:

The problem here is that minibatches of size 50 or even 500 are too small to compensate for the cost of the `tf.data` pipeline's per-batch overhead. You should increase the minibatch size. Interestingly, you did so with a minibatch of size 100000, but then maybe it is too big (I am not certain of this; I think it would need more tests).

There are a couple of things you could try:

1) Increase the minibatch size to something like 10000 and see whether you get an improvement.

2) Change your pipeline to use an iterator, for example:

```python
data_set = tf.data.Dataset.from_tensor_slices((training_inputs, training_labels))
data_set = data_set.repeat(n_epochs)
data_set = data_set.batch(batch_size)
iterator = data_set.make_one_shot_iterator()
# ...
next_element = iterator.get_next()
```


One possible thing you are missing is a prefetch. Add a prefetch of 1 at the end of your data pipeline like so:

```python
data_set = tf.data.Dataset.from_tensor_slices((training_inputs, training_labels))
data_set = data_set.repeat(n_epochs)
data_set = data_set.batch(batch_size).prefetch(1)
```

Adding a prefetch of 1 at the end of your dataset pipeline means you try to fetch 1 batch of data while training is happening. This way you won't be waiting around while the batch is prepared; it should be ready to go as soon as each train iteration is done.
