Why does my training loss have regular spikes?

I'm training the Keras object detection model linked at the bottom of this question, although I believe my problem lies neither with Keras nor with the specific model I'm trying to train (SSD), but rather with the way the data is passed to the model during training.

Here is my problem (see image below): My training loss is decreasing overall, but it shows sharp regular spikes:

The unit on the x-axis is not training epochs, but tens of training steps. The spikes occur precisely once every 1390 training steps, which is exactly the number of training steps for one full pass over my training dataset.

The fact that the spikes always occur after each full pass over the training dataset makes me suspect that the problem is not with the model itself, but with the data it is being fed during the training.

I'm using the batch generator provided in the repository to generate batches during training. I checked the source code of the generator and it does shuffle the training dataset before each pass using sklearn.utils.shuffle.
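
For illustration, the behaviour in question looks roughly like this (a minimal sketch of a per-epoch shuffling generator, not the repository's actual code; the names are placeholders):

from sklearn.utils import shuffle

def batch_generator(images, labels, batch_size):
    # Reshuffle the dataset before every full pass, then yield consecutive
    # mini-batches. Note that the last batch of a pass may contain fewer
    # than batch_size samples if the dataset size is not a multiple of it.
    while True:
        images, labels = shuffle(images, labels)
        for i in range(0, len(images), batch_size):
            yield images[i:i + batch_size], labels[i:i + batch_size]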

I'm confused for two reasons:

  1. The training dataset is being shuffled before each pass.
  2. As you can see in this Jupyter notebook, I'm using the generator's ad-hoc data augmentation features, so the dataset should theoretically never be the same for any pass: all the augmentations are random.

I made some test predictions to see if the model is actually learning anything, and it is! The predictions get better over time, but of course the model is learning very slowly since those spikes seem to mess up the gradient every 1390 steps.

Any hints as to what this might be are greatly appreciated! I'm using the exact same Jupyter notebook that is linked above for my training, the only variable I changed is the batch size from 32 to 16. Other than that, the linked notebook contains the exact training process I'm following.

Here is a link to the repository that contains the model:

https://github.com/pierluigiferrari/ssd_keras

I've figured it out myself:

TL;DR:

Make sure your loss magnitude is independent of your mini-batch size.

The long explanation:

In my case the issue was Keras-specific after all.

Maybe the solution to this problem will be useful for someone at some point.

It turns out that Keras divides the loss by the mini-batch size. The important thing to understand here is that it's not the loss function itself that averages over the batch size, but rather the averaging happens somewhere else in the training process.

Why does this matter?

The model I am training, SSD, uses a rather complicated multi-task loss function that does its own averaging (not by the batch size, but by the number of ground truth bounding boxes in the batch). Now if the loss function already divides the loss by some number that is correlated with the batch size, and afterwards Keras divides by the batch size a second time, then all of a sudden the magnitude of the loss value starts to depend on the batch size (to be precise, it becomes inversely proportional to the batch size).
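
A toy calculation makes the effect concrete (the numbers are made up purely for illustration). Suppose every ground truth box contributes a per-box loss of 1.0 and every image contains 5 boxes. If the loss function already averages over the boxes and the framework then averages over the batch again, the reported value becomes inversely proportional to the batch size:

# Hypothetical numbers purely for illustration.
per_box_loss = 1.0
boxes_per_image = 5

for batch_size in (16, 7):  # a full batch vs. a smaller final batch
    n_boxes = batch_size * boxes_per_image
    summed_loss = per_box_loss * n_boxes
    loss_fn_output = summed_loss / n_boxes        # loss function averages over the boxes -> 1.0
    reported_loss = loss_fn_output / batch_size   # framework divides by the batch size again
    print(batch_size, reported_loss)              # 16 -> 0.0625, 7 -> ~0.1429

With these numbers, a final batch of 7 samples reports a loss roughly 16/7 ≈ 2.3 times larger than the full batches, which is exactly the kind of regular spike described above.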

Now usually the number of samples in your dataset is not an integer multiple of the batch size you choose, so the very last mini-batch of an epoch (here I implicitly define an epoch as one full pass over the dataset) will end up containing fewer samples than the batch size. This is what messes up the magnitude of the loss if it depends on the batch size, and in turn messes up the magnitude of the gradient. Since I'm using an optimizer with momentum, that messed-up gradient continues influencing the gradients of a few subsequent training steps, too.

Once I adjusted the loss function by multiplying the loss by the batch size (thus reverting Keras' subsequent division by the batch size), everything was fine: No more spikes in the loss.
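
For reference, the adjustment amounts to something like the following (a minimal sketch of a Keras/TensorFlow custom loss, not the repository's actual SSD loss; the loss computation itself is a stand-in, and only the final multiplication by the batch size is the relevant part):

import tensorflow as tf

def batch_size_corrected_loss(y_true, y_pred):
    # Stand-in for the real multi-task loss, which in the SSD case is already
    # normalized by the number of ground truth boxes in the batch.
    total_loss = tf.reduce_sum(tf.square(y_true - y_pred))

    # Keras divides the returned loss by the mini-batch size, so multiplying
    # by the batch size here cancels that division and keeps the loss
    # magnitude independent of how many samples the last batch contains.
    batch_size = tf.cast(tf.shape(y_pred)[0], total_loss.dtype)
    return total_loss * batch_size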

I would add gradient clipping, because this prevents spikes in the gradients from messing up the parameters during training.

Gradient Clipping is a technique to prevent exploding gradients in very deep networks, typically Recurrent Neural Networks.

Most frameworks allow you to add a gradient clipping parameter to your gradient-descent-based optimizer.
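
In Keras, for example, clipping is just an argument to the optimizer (clipnorm and clipvalue are standard Keras optimizer arguments; the threshold of 1.0 below is only an illustrative choice):

from keras.optimizers import Adam

# clipnorm rescales any gradient whose L2 norm exceeds the threshold;
# clipvalue would instead clip each gradient element to [-value, value].
optimizer = Adam(clipnorm=1.0)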

For anyone working in PyTorch, an easy solution to this specific problem is to tell the DataLoader to drop the last (incomplete) batch:

# drop_last=True discards the final, smaller batch of each epoch,
# so every batch the model sees has exactly batch_size samples.
train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size, shuffle=False,
                                           pin_memory=(torch.cuda.is_available()),
                                           num_workers=num_workers, drop_last=True)

Comments
  • This is hardly a minimal reproducible example, I think this question could go on a nice diet, which will increase the probability of getting an answer :)
  • @djk47463 I agree it's hardly a compact example, but how do you create a compact example if you have a complex object detection model and the problem could lie in any part of the model? Anyways, I've solved it myself, it was a Keras-specific issue after all. Maybe this will be useful for someone at some point.
  • Usually it is safer to just drop the last batch in such cases. Even if the loss is independent of the batch size, a very small batch is more likely to contain non-representative data and hence mess up the gradients. When using mini-batch gradient descent the loss landscape is not fixed but changes with every batch. This can cause the typical fluctuations in the loss if the batch size is too small. Only if the batch size is large enough to be representative of the whole data set will the loss stabilize. A small(er) batch at the end of the epoch can thus have a similarly negative effect.
  • @a_guest I'm not sure I agree, for three reasons. 1) Any typically used mini batch sizes are so small relative to the overall dataset that the loss manifold fluctuates extremely between mini batches no matter what. The mini batch does not need to be very representative of the whole dataset. 2) Any optimizer that uses momentum or a similar mechanism doesn't depend much on individual batches anyway. Only the average of many batches matters. 3) From an empirical perspective, training with mini batch size 1 or very small batch sizes works very well in practice.
  • To elaborate on the first argument: Whether the last batch has 32 or 7 samples, both are tiny relative to, and therefore neither will be representative of, your dataset of 50k samples (or 500k, or 5 million).
  • Gradient clipping might get the job done, but I'd argue it would not be a great idea in this case, because it would treat the symptom rather than the cause (the loss shouldn't be exploding in the first place). Besides, the solution has already been provided: In most cases like mine above, the problem will be that the loss magnitude depends on the batch size, which it shouldn't.