Distributing a Keras Model Across Multiple GPUs

I'm trying to create a very large Keras model and distribute it across multiple GPUs. To be clear, I'm not trying to put multiple copies of the same model on multiple GPUs; I'm trying to split one large model across multiple GPUs. I've been using the multi_gpu_model function in Keras, but judging by the many out-of-memory errors I've gotten, it seems to replicate the model rather than distribute it the way I'd like.

I looked into Horovod, but because I have a lot of Windows-specific logging tools running, I'm hesitant to use it.

This seems to leave only tf.estimator for me to use. It's not clear from the documentation, though, how I would use estimators to do what I'm trying to do. For example, which distribution strategy in tf.contrib.distribute would let me split the model across GPUs in the way I'm looking for?

Is what I'm seeking to do possible with estimators, and if so, which strategy should I use?

You can use the Estimator API. Convert your model with tf.keras.estimator.model_to_estimator:

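import tensorflow as tf  # TensorFlow 1.x; tf.contrib was removed in 2.x

# Let TensorFlow fall back to another device when an op cannot run where requested
session_config = tf.ConfigProto(allow_soft_placement=True)
# Note: MirroredStrategy copies the whole model onto each GPU
distribute = tf.contrib.distribute.MirroredStrategy(num_gpus=4)
run_config = tf.estimator.RunConfig(train_distribute=distribute, session_config=session_config)
your_network = tf.keras.estimator.model_to_estimator(keras_model=your_keras_model, config=run_config)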

Don't forget to compile the model before converting it.

When training a model with multiple GPUs, you can use the extra computing power effectively by increasing the batch size. To do single-host, multi-device synchronous training with a Keras model, you would use the tf.distribute.MirroredStrategy API. Here's how it works: instantiate a MirroredStrategy, optionally configuring which specific devices you want to use (by default the strategy will use all GPUs available).
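
A minimal sketch of the step just described, assuming TensorFlow 2.x and its bundled tf.keras. The layer sizes, input shape, optimizer, and loss are placeholders rather than values from the question. Keep in mind that this replicates the model on every device, so it does not by itself reduce per-GPU memory use.

```python
import tensorflow as tf

# With no arguments the strategy uses every GPU TensorFlow can see.
# To pin specific devices instead:
#   tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Variables must be created inside the scope so that they are mirrored.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(32,)),               # placeholder input size
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
```

Calling model.fit on a batched dataset afterwards would run one synchronous training step per batch across all replicas.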

You can manually assign different parts of your Keras model to different GPUs using the TensorFlow backend:

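import tensorflow as tf

with tf.device("/device:GPU:0"):
    # Create the first part of your neural network here
    pass

with tf.device("/device:GPU:1"):
    # Create the second part of your neural network here
    pass

# ...and so on for the remaining GPUs
n = 2  # placeholder: index of the last GPU
with tf.device("/device:GPU:{}".format(n)):
    # Create the nth part of your neural network here
    pass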

Beware: Communication delays between the CPU and multiple GPUs may add a substantial overhead to training.
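
A more concrete sketch of this kind of manual placement, assuming TensorFlow 2.x. The class name TwoStageModel and the layer sizes are hypothetical. Device names are looked up at run time and fall back to whatever is available, so the sketch also runs on a machine with fewer than two GPUs. Putting the tf.device scopes inside call is a choice made for this sketch; it is worth confirming the actual placement on your own setup, for example with tf.debugging.set_log_device_placement(True).

```python
import tensorflow as tf

# Pick two devices; fall back to the CPU (or reuse one GPU) when fewer exist.
gpus = [d.name for d in tf.config.list_logical_devices("GPU")]
device_a = gpus[0] if gpus else "/CPU:0"
device_b = gpus[1] if len(gpus) > 1 else device_a


class TwoStageModel(tf.keras.Model):
    """Hypothetical model whose two halves run on different devices."""

    def __init__(self):
        super().__init__()
        self.stage_a = tf.keras.layers.Dense(256, activation="relu")
        self.stage_b = tf.keras.layers.Dense(10)

    def call(self, inputs):
        # Ops executed inside each scope are requested on that device.
        with tf.device(device_a):
            hidden = self.stage_a(inputs)
        # The activations cross from device_a to device_b at this boundary.
        with tf.device(device_b):
            return self.stage_b(hidden)


model = TwoStageModel()
outputs = model(tf.zeros([4, 32]))  # one forward pass builds both stages
print(outputs.shape)
```

Every boundary between stages is a device-to-device copy of activations, which is where the communication overhead mentioned above comes from.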

Distributed training with Keras: tf.distribute.Strategy is a TensorFlow API to distribute training across multiple GPUs. When training a model with multiple GPUs, you can use the extra computing power effectively by increasing the batch size. In general, use the largest batch size that fits the GPU memory, and tune the learning rate accordingly.
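
A small sketch of the arithmetic behind that advice, in plain Python. The linear learning-rate scaling used here is a common heuristic and an assumption of this sketch, not something the quoted text prescribes; the function name and the numbers are illustrative only.

```python
def scale_for_replicas(per_gpu_batch, base_lr, num_gpus):
    """Return (global_batch, scaled_lr) for synchronous data-parallel training.

    Assumes linear learning-rate scaling, which is only a starting point
    and still needs tuning.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    global_batch = per_gpu_batch * num_gpus
    scaled_lr = base_lr * num_gpus
    return global_batch, scaled_lr


# Example values (hypothetical): 64 samples fit on one GPU, 4 GPUs available.
global_batch, scaled_lr = scale_for_replicas(per_gpu_batch=64, base_lr=0.001, num_gpus=4)
print(global_batch, scaled_lr)
```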

You need device parallelism. This section of the Keras FAQ provides an example of how to do this with Keras:

import tensorflow as tf
from tensorflow import keras

# Model where a shared LSTM is used to encode two different sequences in parallel
input_a = keras.Input(shape=(140, 256))
input_b = keras.Input(shape=(140, 256))

shared_lstm = keras.layers.LSTM(64)

# Process the first sequence on one GPU
with tf.device('/gpu:0'):
    encoded_a = shared_lstm(input_a)
# Process the next sequence on another GPU
with tf.device('/gpu:1'):
    encoded_b = shared_lstm(input_b)

# Concatenate results on CPU
with tf.device('/cpu:0'):
    merged_vector = keras.layers.concatenate([encoded_a, encoded_b], axis=-1)

Distributed training with TensorFlow: tf.distribute.MirroredStrategy distributes training workloads across multiple GPUs for tf.keras models. It does in-graph replication with synchronous training on many GPUs on one machine. Essentially, it copies all of the model's variables to each processor. Then it uses all-reduce to combine the gradients from all processors and applies the combined value to all copies of the model.
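
To make the all-reduce step concrete, here is a toy simulation in plain Python (no TensorFlow) under simplifying assumptions: a one-parameter linear model, a mean reduction, and hypothetical function names throughout. It is not how MirroredStrategy is implemented; it only illustrates why the copies stay identical.

```python
def local_gradient(weight, shard):
    """Gradient of mean squared error for y = weight * x on one replica's shard."""
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)


def mirrored_step(weights, shards, lr):
    """One synchronous step: local gradients, mean all-reduce, same update everywhere."""
    grads = [local_gradient(w, shard) for w, shard in zip(weights, shards)]
    combined = sum(grads) / len(grads)           # the all-reduce result
    return [w - lr * combined for w in weights]  # applied to every copy


# Two replicas start from identical copies and see different halves of the batch.
# The data follow y = 2 * x, so the weight should approach 2.
weights = [0.0, 0.0]
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
for _ in range(50):
    weights = mirrored_step(weights, shards, lr=0.01)
print(weights)
```

Because every replica applies the same combined gradient, the copies never diverge, which is also why each device must hold the full set of variables.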

Distributed training in tf.keras with Weights & Biases: keras.utils.multi_gpu_model replicates a model across GPUs (8 in the quoted example), after which the fit call is distributed over them. TensorFlow's distribution strategies make it much easier to scale heavy training workloads across multiple hardware accelerators, be it GPUs or even TPUs. That said, distributed training has long been a challenge, especially for neural network training.

Multi GPU in Keras: For device parallelism (also known as model parallelism), see the Keras FAQ. The tf.distribute.Strategy API provides an abstraction for distributing your training across multiple processing units. The goal is to allow users to enable distributed training using existing models and training code, with minimal changes.

Splitting a Keras model across multiple GPUs: Keras has a multi_gpu_model() function that distributes training across multiple GPUs on one machine. But, as stated in the documentation, this approach copies the graph onto each GPU, splits every batch among those GPUs, and then merges the results. In other words, a copy of the model is stored on every GPU.
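
A back-of-the-envelope sketch of why replication does not help with out-of-memory errors. It counts only float32 parameter storage (4 bytes each) and ignores activations, gradients, and optimizer state. The layer sizes, function names, and the contiguous split are hypothetical.

```python
BYTES_PER_FLOAT32 = 4


def replicated_bytes_per_gpu(layer_params, num_gpus):
    """Data parallelism: every GPU holds all parameters, whatever num_gpus is."""
    return sum(layer_params) * BYTES_PER_FLOAT32


def split_bytes_per_gpu(layer_params, num_gpus):
    """Model parallelism: consecutive layers are dealt out in contiguous chunks."""
    chunk = -(-len(layer_params) // num_gpus)  # ceiling division
    return [
        sum(layer_params[i:i + chunk]) * BYTES_PER_FLOAT32
        for i in range(0, len(layer_params), chunk)
    ]


# Hypothetical model: four layers with 50 million parameters each.
layer_params = [50_000_000, 50_000_000, 50_000_000, 50_000_000]
replicated = replicated_bytes_per_gpu(layer_params, num_gpus=4)
split = split_bytes_per_gpu(layer_params, num_gpus=4)
print(replicated)  # parameter bytes on each GPU when the model is replicated
print(split)       # parameter bytes per GPU when the layers are split up
```

With replication, adding GPUs leaves the per-GPU parameter footprint unchanged; only splitting the layers across devices shrinks it.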