How to shuffle a concatenated TensorFlow dataset


I have multiple TensorFlow datasets that have the same structure. I want to combine them into a single dataset using tf.data.Dataset.concatenate.

However, I found that when I shuffle the combined dataset, it is not shuffled across the whole dataset; the elements are only shuffled within each of the original datasets.

Is there a way to solve this?


When you concatenate two Datasets, you get the elements of the first, then the elements of the second. shuffle() only draws from a buffer of buffer_size elements at a time, so if you shuffle the result you will not get a good mix when your shuffle buffer is smaller than the size of your Dataset.
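To see why a small buffer fails to mix a concatenated dataset, here is a minimal pure-Python sketch of a fixed-size shuffle buffer. It is only a simplified model of the behaviour described above, not TensorFlow's implementation; `buffered_shuffle` and the toy data are hypothetical names chosen for illustration.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in an order produced by a fixed-size shuffle buffer.

    Hypothetical helper: a simplified model of buffer-based shuffling,
    not TensorFlow code.
    """
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)      # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]            # emit a random buffered element
        buffer[idx] = item           # replace it with the incoming one
    rng.shuffle(buffer)              # input exhausted: drain what is left
    yield from buffer

# Two toy "datasets" of 1000 elements each, concatenated.
concatenated = ["a"] * 1000 + ["b"] * 1000

small = list(buffered_shuffle(concatenated, buffer_size=100))
full = list(buffered_shuffle(concatenated, buffer_size=2000))

# With a buffer of 100, the first "b" is only read once 900 elements
# have been emitted, so the output starts with a long run of "a".
print(small[:900].count("b"))  # 0
# With a buffer that holds everything, "b" shows up early as well.
print(full[:900].count("b"))
```

In this model, only a buffer at least as large as the concatenated dataset gives every element a chance to appear anywhere in the output.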

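What you need instead is to interleave samples from your datasets. If you are using TF >= 1.9, the best way is to use the dedicated tf.contrib.data.choose_from_datasets function (available as tf.data.experimental.choose_from_datasets from TF 1.12; tf.contrib was removed in TF 2.0). An example straight from the docs: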

datasets = [tf.data.Dataset.from_tensors("foo").repeat(),
            tf.data.Dataset.from_tensors("bar").repeat(),
            tf.data.Dataset.from_tensors("baz").repeat()]

# Define a dataset containing `[0, 1, 2, 0, 1, 2, 0, 1, 2]`.
choice_dataset = tf.data.Dataset.range(3).repeat(3)

result = tf.contrib.data.choose_from_datasets(datasets, choice_dataset)

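If preserving the interleaving order and/or the ratio of samples from each dataset within a batch is important, it is probably better to shuffle the input datasets before combining them rather than shuffling the combined result.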
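As a rough illustration of what the choice dataset does in the example above, here is a pure-Python sketch of the selection logic. `choose_from` is a hypothetical stand-in written for this page, not a TensorFlow API, and it ignores everything TensorFlow does beyond picking the next element from the indexed source.

```python
import itertools

def choose_from(iterables, choices):
    """For each index in `choices`, yield the next element of that source.

    Hypothetical helper modelling the selection logic only.
    """
    iterators = [iter(it) for it in iterables]
    for index in choices:
        try:
            yield next(iterators[index])
        except StopIteration:
            return  # stop if the chosen source has run out

# Three endless sources, mirroring the repeated "foo"/"bar"/"baz" datasets.
sources = [itertools.repeat("foo"),
           itertools.repeat("bar"),
           itertools.repeat("baz")]

# Same pattern as range(3) repeated 3 times: 0, 1, 2, 0, 1, 2, 0, 1, 2.
choices = [0, 1, 2] * 3

result = list(choose_from(sources, choices))
print(result)
# ['foo', 'bar', 'baz', 'foo', 'bar', 'baz', 'foo', 'bar', 'baz']
```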

If you are using an earlier version of TF, you could rely on a combination of zip, flat_map and concatenate like this:

a = tf.data.Dataset.range(3).repeat()
b = tf.data.Dataset.range(100, 105).repeat()

value = (tf.data.Dataset
  .zip((a, b))
  .flat_map(lambda x, y: tf.data.Dataset.concatenate(
    tf.data.Dataset.from_tensors([x]),
    tf.data.Dataset.from_tensors([y])))
  .make_one_shot_iterator()
  .get_next())

sess = tf.InteractiveSession()

for _ in range(10):
  print(value.eval())
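The element order that the zip / flat_map / concatenate pipeline above produces can be previewed with a pure-Python sketch. This uses itertools rather than TensorFlow, so it only models the ordering; the TensorFlow version additionally wraps each value in a one-element vector.

```python
import itertools

a = itertools.cycle(range(3))          # 0, 1, 2, 0, 1, ...
b = itertools.cycle(range(100, 105))   # 100, 101, ..., 104, 100, ...

# zip pairs the two streams; chaining each pair flattens them again,
# which alternates one element from `a` with one element from `b`.
interleaved = itertools.chain.from_iterable(zip(a, b))

first_ten = list(itertools.islice(interleaved, 10))
print(first_ten)
# [0, 100, 1, 101, 2, 102, 0, 103, 1, 104]
```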



I'm not 100% sure, but you might want to look at the order in which you call the different operations on your dataset object. The behaviour of shuffle() can vary depending on that order. See also this question, which might be related.



What is your shuffle buffer size?

For example, if you have 3 datasets, each containing 1000 items, then you need to apply shuffle(3000) to randomize the order of all the items.

Here is an example:

This should shuffle all 3000 items:

dataset = dataset1.concatenate(dataset2).concatenate(dataset3)
dataset = dataset.shuffle(3000)

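However, this will not shuffle the whole dataset, because each dataset is shuffled on its own before concatenation, so every item of dataset1 still comes before every item of dataset2: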

dataset1 = dataset1.shuffle(1000)
dataset2 = dataset2.shuffle(1000)
dataset3 = dataset3.shuffle(1000)
dataset = dataset1.concatenate(dataset2).concatenate(dataset3)
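The difference between the two orderings can be checked with a small pure-Python sketch. It uses plain lists and the random module as a stand-in for datasets with a full-size shuffle buffer; the tagged tuples are hypothetical data made up for the illustration.

```python
import random

rng = random.Random(0)

# Three toy "datasets" of 1000 tagged items each.
dataset1 = [("ds1", i) for i in range(1000)]
dataset2 = [("ds2", i) for i in range(1000)]
dataset3 = [("ds3", i) for i in range(1000)]

# Concatenate first, then shuffle everything at once.
combined = dataset1 + dataset2 + dataset3
whole = rng.sample(combined, len(combined))

# Shuffle each dataset separately, then concatenate.
per_dataset = (rng.sample(dataset1, len(dataset1))
               + rng.sample(dataset2, len(dataset2))
               + rng.sample(dataset3, len(dataset3)))

sources_whole = {tag for tag, _ in whole[:1000]}
sources_per_dataset = {tag for tag, _ in per_dataset[:1000]}

# The first 1000 items of the fully shuffled list draw on several sources.
print(sorted(sources_whole))
# The first 1000 items of the per-dataset version all come from dataset1.
print(sources_per_dataset)  # {'ds1'}
```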



Starting from TensorFlow 1.9, you can also use the sample_from_datasets function.

For example, the following code

datasets = [tf.data.Dataset.from_tensors("foo").repeat(3).apply(tf.data.experimental.enumerate_dataset()).repeat(),
        tf.data.Dataset.from_tensors("bar").repeat(3).apply(tf.data.experimental.enumerate_dataset()).repeat(),
        tf.data.Dataset.from_tensors("baz").repeat(3).apply(tf.data.experimental.enumerate_dataset()).repeat()]

dataset = tf.data.experimental.sample_from_datasets(datasets) # from 1.12
# dataset = tf.contrib.data.sample_from_datasets(datasets) # between 1.9 and 1.12

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    for i in range(10):
        print(sess.run(next_element))

will print something like the following (the sampling is random, so the exact order varies between runs):

(0, b'bar')
(0, b'foo')
(1, b'bar')
(0, b'baz')
(2, b'bar')
(1, b'foo')
(1, b'baz')
(2, b'foo')
(2, b'baz')
(0, b'foo')
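The mixing seen in this output can be modelled with a short pure-Python sketch. `sample_from` is a hypothetical helper, not the TensorFlow function: it assumes uniform weights, finite sources, and simply stops once every source is exhausted.

```python
import random

def sample_from(iterables, seed=0):
    """Repeatedly pick a random non-exhausted source and yield its next element.

    Hypothetical helper modelling uniform random sampling between sources.
    """
    rng = random.Random(seed)
    iterators = [iter(it) for it in iterables]
    while iterators:
        chosen = rng.choice(iterators)
        try:
            yield next(chosen)
        except StopIteration:
            iterators.remove(chosen)  # this source is done; drop it

# Three enumerated toy sources, similar in shape to the output above.
sources = [[(i, name) for i in range(3)] for name in ("foo", "bar", "baz")]

mixed = list(sample_from(sources))
for element in mixed:
    print(element)
```

Each source keeps its own internal order (0, 1, 2), while the sources themselves are interleaved at random, which matches the pattern in the printed output above.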
