Hot questions on shuffling data for neural networks

Question:

I manually built a data generator that yields a tuple of [input, target] on each call. I set my generator to shuffle the training samples every epoch. Then I use fit_generator to call my generator, but I am confused by the "shuffle" argument of this function:

fit_generator(self, generator, steps_per_epoch=None, epochs=1, verbose=1, callbacks=None, validation_data=None, validation_steps=None, class_weight=None, max_queue_size=10, workers=1, use_multiprocessing=False, shuffle=True, initial_epoch=0)

From Keras API:

shuffle: Whether to shuffle the order of the batches at the beginning of each epoch. Only used with instances of Sequence (keras.utils.Sequence)

I thought "shuffle" should be the job of the generator. How can it shuffle the order of the batches when my custom generator decides which batch to be output in each iteration?


Answer:

As the documentation you quoted says, the shuffle argument is only relevant for a generator that implements keras.utils.Sequence.

If you are using a "simple" generator (such as keras.preprocessing.image.ImageDataGenerator, or your own custom non-Sequence generator), than that generator implements a method that return a single batch (using yield - you can learn more about it in this question). Therefore, only the generator itself controls what batch is returned.

keras.utils.Sequence was introduced to support multiprocessing:

Sequence are a safer way to do multiprocessing. This structure guarantees that the network will only train once on each sample per epoch which is not the case with generators.

To that end, you need to implement a method that returns a batch for a given batch index (which allows synchronization between multiple workers): __getitem__(self, idx). If you enable the shuffle argument, the __getitem__ method will be invoked with indices in a random order.

However, you may also set it to False and shuffle yourself by implementing the on_epoch_end method.
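A minimal sketch of the Sequence-based approach (illustrative names and shapes), shuffling the index order in on_epoch_end:

import numpy as np
from keras.utils import Sequence

class ShuffledSequence(Sequence):
    def __init__(self, x, y, batch_size=32):
        self.x, self.y = x, y
        self.batch_size = batch_size
        self.indices = np.arange(len(x))

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        # return batch number idx; with shuffle=True, Keras permutes the idx order
        batch = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        return self.x[batch], self.y[batch]

    def on_epoch_end(self):
        # reshuffle here yourself if you keep shuffle=False
        np.random.shuffle(self.indices)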

Question:

I convert my image data to Caffe's DB format (LevelDB or LMDB) using C++; as an example, I use this code for ImageNet.

Does the data need to be shuffled? Can I write all my positives and then all my negatives to the DB, so the labels look like 00000000111111111, or does the data need to be shuffled so the labels look like 010101010110101011010?

How does Caffe sample data from the DB? Is it true that it uses a random subset of all the data with size = batch_size?


Answer:

Should you shuffle the samples? Think about the learning process if you don't shuffle: Caffe sees only 0 samples at first. What do you expect the algorithm to deduce? Simply predict 0 all the time and everything is fine. If Caffe sees plenty of 0s before hitting the first 1, it will become very confident in always predicting 0, and it will be very difficult to move the model away from that point. On the other hand, if it constantly sees a mix of 0s and 1s, it learns meaningful features for separating the examples from the very beginning. Bottom line: shuffling the training samples is very advantageous, especially with SGD-based approaches.

AFAIK, Caffe does not randomly sample batch_size samples; rather, it goes sequentially over the input DB, batch_size samples at a time.
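Since Caffe reads the DB sequentially, the shuffling has to happen before (or while) the DB is written. A minimal sketch of shuffling paired paths and labels up front (the file names are hypothetical; the output is a convert_imageset-style listfile):

import random

# (image_path, label) pairs: all positives followed by all negatives
samples = [('pos_%05d.jpg' % i, 1) for i in range(1000)]
samples += [('neg_%05d.jpg' % i, 0) for i in range(1000)]

random.shuffle(samples)  # paths and labels move together, so pairs stay intact

with open('train_list.txt', 'w') as f:
    for path, label in samples:
        f.write('%s %d\n' % (path, label))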

TL;DR shuffle.

Question:

I am trying to train a CNN with images created during program execution. I have a game environment (not created by me) that generates screen images that depend on actions taken in the game. The actions are controlled by the learnt CNN.

These images are then pushed into a RandomShuffleQueue, from which mini-batches are dequeued and used to train the CNN on the correct action. I would like to do this (game play and training) asynchronously, where the game is being played and its screens are added to the RandomShuffleQueue in a separate thread than the one used to train the model. Here is a very simplified version of what I am trying to do.

import time

import tensorflow as tf
from game_env import game


experience = tf.RandomShuffleQueue(10000,
                                1000, tf.float32,
                                shapes = [32,32],  
                                name = 'experience_replay')

def perceive(game):
    rawstate = game.grab_screen()
    enq = experience.enqueue(rawstate)
    return enq

#create threads to play the game and collect experience
available_threads = 4
coord = tf.train.Coordinator()
experience_runner = tf.train.QueueRunner(experience,
                                [perceive(game()) for num in range(available_threads)])

sess = tf.Session()
sess.run(tf.initialize_all_variables())
enqueue_threads = experience_runner.create_threads(sess, coord = coord, start = True)

with sess.as_default():
    while(1):
        print sess.run(experience.dequeue())
        time.sleep(.5)

Meanwhile, the game_env looks like this:

import tensorflow as tf
class game(object): 
    def __init__(self):
        self.screen_size = [32,32]
        self.counter = 0

    def grab_screen(self):
        """current screen of the game"""
        self.counter += 1
        screen = self.counter*tf.ones(self.screen_size)
        return screen

As you can see, the game environment is really simple as of now: every time a screen grab is performed, a counter is incremented and an image filled with the counter (of the correct size) is returned.

It should be noted that I wrote the above class just for testing; in general, grab_screen can return any numpy nd-array. Moreover, the real environment is not written by me, so I can only call grab_screen and cannot make changes inside it.

Now, the problem is that the experience queue seems to hold only tensors of ones (i.e. the counter only gets updated once!).

Sample output:

I tensorflow/core/common_runtime/local_device.cc:40] Local device intra op parallelism threads: 4
I tensorflow/core/common_runtime/direct_session.cc:58] Direct session inter op parallelism threads: 4

[[ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 ...,
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]]

[[ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 ...,
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]]

[[ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 ...,
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]]

and so on. My question is: how do I dynamically create the input image to be enqueued to the RandomShuffleQueue like this? Thanks!


Answer:

The problem can be traced to this line, which defines the tf.train.QueueRunner:

experience_runner = tf.train.QueueRunner(
    experience, [perceive(game()) for num in range(available_threads)])

This creates four (available_threads) ops that, each time any of them runs, will enqueue a tensor filled with 1.0 to the experience queue. Stepping through what happens in the list comprehension should make this clearer. The following happens four times:

  1. A game object is constructed.
  2. It is passed to perceive().
  3. perceive() calls game.grab_screen() once, which increments the counter and returns the tensor 1 * tf.ones(self.screen_size).
  4. perceive() passes this tensor to experience.enqueue() and returns the resulting op.

The QueueRunner.create_threads() call creates one thread per enqueue op, and these run in an infinite loop (blocking when the queue reaches capacity).

To get the desired effect, you should use the feed mechanism and a placeholder to pass a different value for the grabbed screen each time you enqueue an experience. It depends on how your game class is implemented, but you probably also want to create a single instance of that class. Finally, it's not clear whether you want multiple enqueuing threads, but let's assume that game.grab_screen() is thread-safe and permits some concurrency. Given all this, a plausible version looks like the following (note that you'll need to create custom threads rather than a QueueRunner in order to use feeding):

import threading
import time

import tensorflow as tf
from game_env import game

experience = tf.RandomShuffleQueue(10000,
                                   1000, tf.float32,
                                   shapes=[32,32],  
                                   name='experience_replay')

screen_placeholder = tf.placeholder(tf.float32, [32, 32])
# You can create a single enqueue op and dequeued tensor, and reuse these from
# multiple threads.
enqueue_op = experience.enqueue(screen_placeholder)
dequeued_t = experience.dequeue()
# ...

init_op = tf.initialize_all_variables()

game_obj = game()

sess = tf.Session()
sess.run(init_op)  # actually run the initializer defined above
coord = tf.train.Coordinator()

# Define a custom thread for running the enqueue op that grabs a new
# screen in a loop and feeds it to the placeholder.
def enqueue_thread():
    with coord.stop_on_exception():
        while not coord.should_stop():
            screen_val = game_obj.grab_screen()
            # Run the same op, but feed a different value for the screen.
            sess.run(enqueue_op, feed_dict={screen_placeholder: screen_val}) 

available_threads = 4
for _ in range(available_threads):
    threading.Thread(target=enqueue_thread).start()


while True:
    # N.B. It's more efficient to reuse the same dequeue op in a loop.
    print sess.run(dequeued_t)
    time.sleep(0.5)

Question:

New to Tensorflow, I'm using neural networks to classify images. I've got a Tensor that contains images, of shape [N, 128, 128, 1] (N images of size 128x128 with 1 channel), and a Tensor of shape [N] that contains the labels of the images.

I want to shuffle it all and split it between training and testing tensors (let's say 80%-20%). I didn't find a way to 'zip' my tensors to associate each image with its label (in order to shuffle images and labels the same way). Is it possible? If not, how can I achieve that shuffling/splitting job?

Thanks for any help!


Answer:

Just use the same 'seed' keyword parameter value, say seed=8, in tf.random_shuffle for both the labels and the data.

ipdb> my_data = tf.convert_to_tensor([[1,1], [2,2], [3,3], [4,4],
                                      [5,5], [6,6], [7,7], [8,8]])
ipdb> my_labels = tf.convert_to_tensor([1,2,3,4,5,6,7,8])
ipdb> sess.run(tf.random_shuffle(my_data, seed=8))
array([[5, 5],
       [3, 3],
       [1, 1],
       [7, 7],
       [2, 2],
       [8, 8],
       [4, 4],
       [6, 6]], dtype=int32)
ipdb> sess.run(tf.random_shuffle(my_labels, seed=8))
array([5, 3, 1, 7, 2, 8, 4, 6], dtype=int32)

EDIT: if you need random shuffling at runtime, where batches, say, will be shuffled randomly but differently each time, you may use a trick like this:

# each time, the shuffling pattern will be different
indices = tf.random_shuffle(tf.range(8))
params = tf.convert_to_tensor([111, 222, 333, 444, 555, 666, 777, 888])
sess.run(tf.add(tf.gather(params, indices), tf.gather(params, indices) * 1000))
> array([555555, 444444, 666666, 222222, 111111, 888888, 333333, 777777], dtype=int32)

The numbers consisting of repeated digits show that both tf.gather ops consumed the same shuffled indices.
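Applied to the original question, the same trick shuffles images and labels consistently and then splits them 80/20. A minimal sketch, assuming images of shape [N, 128, 128, 1], labels of shape [N], and N known as a Python integer:

import tensorflow as tf

# images: [N, 128, 128, 1], labels: [N]; both assumed already defined
indices = tf.random_shuffle(tf.range(N))

shuffled_images = tf.gather(images, indices)  # same permutation for both...
shuffled_labels = tf.gather(labels, indices)  # ...so each image keeps its label

split = int(0.8 * N)  # 80% train, 20% test
train_images, test_images = shuffled_images[:split], shuffled_images[split:]
train_labels, test_labels = shuffled_labels[:split], shuffled_labels[split:]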

Question:

I have read some posts on various CV approaches. But what I don't understand is why shuffling the data leads to a significant increase in accuracy, and when it is correct to do it.

In my time series dataset of size 921 * 10080, each row is a time series of water temperature at a particular location in an area, and the last 2 columns are the labels for 2 groups, i.e. high risk (high bacteria level in the water) and low risk (low bacteria level in the water). Accuracy varies dramatically depending on the shuffle setting: around 75% with shuffle=True versus around 50% with shuffle=False in StratifiedKFold, as shown below:

n_folds = 5
skf = StratifiedKFold(n_splits=n_folds, shuffle=True)

The sklearn documentations states the following:

A note on shuffling

If the data ordering is not arbitrary (e.g. samples with the same class label are contiguous), shuffling it first may be essential to get a meaningful cross- validation result. However, the opposite may be true if the samples are not independently and identically distributed. For example, if samples correspond to news articles, and are ordered by their time of publication, then shuffling the data will likely lead to a model that is overfit and an inflated validation score: it will be tested on samples that are artificially similar (close in time) to training samples.

Some cross validation iterators, such as KFold, have an inbuilt option to shuffle the data indices before splitting them. Note that:

• This consumes less memory than shuffling the data directly.

• By default no shuffling occurs, including for the (stratified) K fold cross- validation performed by specifying cv=some_integer to cross_val_score, grid search, etc. Keep in mind that train_test_split still returns a random split.

• The random_state parameter defaults to None, meaning that the shuffling will be different every time KFold(..., shuffle=True) is iterated. However, GridSearchCV will use the same shuffling for each set of parameters validated by a single call to its fit method.

• To get identical results for each split, set random_state to an integer.

I am not sure if I interpret the documentation correctly- an explanation is much appreciated. Besides, I have a few questions:

1) Why is there such a huge improvement in accuracy after shuffling? Am I overfitting? When should I shuffle?

2) Given that all samples are collected from the same area, they are probably not independent. How does this affect shuffling? Is it still valid to shuffle?

3) Does shuffling separate the labels from their corresponding X data? (Answer update: no, shuffling does not separate labels from their corresponding X data.)

Thx


Answer:

Your question is quite tricky, and it is probably better placed here.

In my times series dataset of size 921 *10080 where each row is a time series of water temperature of a particular location in an area and the last column being the label with 2 groups

Aren't you working on a classification problem with time series features? You are using dependent variables (time series of the water temperature) to predict a label. For me this sounds risky, and I would assume there is not a good chance of predicting the label. Just one scenario to think about:

Location  Time1  Time2  Time3  Label
A         3      2      1      1
B         100    99     98     1
C         98     99     100    0

So in this example, label 1 is a time series that goes down and label 0 is a time series that goes up, but I would bet that every classifier will have trouble learning this without accounting for the trend component of your columns.
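If the trend really is the confound, a quick sanity check (not from the original answer, just a hedged suggestion) is to difference each series and see whether the cross-validated accuracy changes:

import numpy as np

# X: (n_samples, n_timesteps) array of temperature series;
# first-order differencing removes a per-row trend component
X_diff = np.diff(X, axis=1)  # shape (n_samples, n_timesteps - 1)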

To come back to your question, this can help you to understand shuffling: difference between StratifiedKFold and StratifiedShuffleSplit in sklearn
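For completeness, a minimal sketch of the flag in question (X and y stand in for your own features and labels): fixing random_state makes the shuffled folds reproducible, and because the split is expressed as row indices, every row of X stays paired with its label in y, which also answers question 3:

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for train_idx, test_idx in skf.split(X, y):
    # the same indices select rows of X and y, so labels never detach from data
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]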

Question:

I want to train a neural network using backpropagation, and I have a data set like this:

Should I shuffle the input data?


Answer:

Yes, and it should be shuffled at each iteration; e.g., quoting from {1}:

As for any stochastic gradient descent method (including the mini-batch case), it is important for efficiency of the estimator that each example or minibatch be sampled approximately independently. Because random access to memory (or even worse, to disk) is expensive, a good approximation, called incremental gradient (Bertsekas, 2010), is to visit the examples (or mini-batches) in a fixed order corresponding to their order in memory or disk (repeating the examples in the same order on a second epoch, if we are not in the pure online case where each example is visited only once). In this context, it is safer if the examples or mini-batches are first put in a random order (to make sure this is the case, it could be useful to first shuffle the examples). Faster convergence has been observed if the order in which the mini-batches are visited is changed for each epoch, which can be reasonably efficient if the training set holds in computer memory.

{1} Bengio, Yoshua. "Practical recommendations for gradient-based training of deep architectures." Neural Networks: Tricks of the Trade. Springer Berlin Heidelberg, 2012. 437-478.
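A minimal numpy sketch of that recommendation (model, x, and y are placeholders, and train_on_batch is a hypothetical Keras-style update step):

import numpy as np

def train(model, x, y, epochs=10, batch_size=32):
    n = len(x)
    for epoch in range(epochs):
        order = np.random.permutation(n)  # new visiting order every epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            model.train_on_batch(x[batch], y[batch])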

Question:

train_image_paths = [str(path) for path in list(train_path.glob('*/*.jpeg'))]
random.shuffle(train_image_paths)

Above is a sample code you can see.

I have the same question in this case too:

train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(64).shuffle(10000)

I don't understand why I need the shuffle in these cases.


Answer:

In the obvious case, shuffling is helpful if your training data is sorted by class label. By shuffling, you allow your model to "see" a wide range of data points belonging to different classes in the context of classification. If the model goes through sorted training data, it runs the risk of overfitting to certain classes. In short, shuffling helps reduce variance and ensures that the train, test, and validation sets are representative of the true distribution.
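One caveat about the second snippet in the question: because .batch(64) comes before .shuffle(10000), it is whole batches of 64 samples that get shuffled, not individual samples. If sample-level shuffling is intended, the usual order is to shuffle first and then batch:

import tensorflow as tf

train_dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
                 .shuffle(10000)  # shuffle individual samples...
                 .batch(64))      # ...then group them into batches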

Question:

Is shuffling done by setting the --shuffle flag as below, as found in create_imagenet.sh?

GLOG_logtostderr=1 $TOOLS/convert_imageset \
   --resize_height=$RESIZE_HEIGHT \
   --resize_width=$RESIZE_WIDTH \
   --shuffle \

I mean, I don't need to shuffle manually afterwards if the flag already does it. What about the labels: are they shuffled automatically in the generated lmdb file?


Answer:

Using the convert_imageset tool creates a copy of your training/validation data in a binary database file (either lmdb or leveldb format). The dataset encodes pairs of an example and its corresponding label. Therefore, when shuffling the dataset, the labels are shuffled together with the data, maintaining the correspondence between each example and its ground-truth label. There is no need to shuffle the data again during training.
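If you want to verify the correspondence yourself, a small sketch (assuming the lmdb Python package and Caffe's compiled protobuf bindings are available; the DB path follows the create_imagenet.sh defaults, so adjust it to yours) reads back the first few entries and prints their keys and labels:

import lmdb
from caffe.proto import caffe_pb2

env = lmdb.open('examples/imagenet/ilsvrc12_train_lmdb', readonly=True)
with env.begin() as txn:
    for i, (key, value) in enumerate(txn.cursor()):
        datum = caffe_pb2.Datum()
        datum.ParseFromString(value)  # each entry holds the image bytes and its label
        print(key, datum.label)
        if i >= 4:
            break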