Checkpointing keras model: TypeError: can't pickle _thread.lock objects

It seems like the error has occurred in the past in different contexts here, but I'm not dumping the model directly -- I'm using the ModelCheckpoint callback. Any idea what could be going wrong?

Information:

  • Keras version 2.0.8
  • Tensorflow version 1.3.0
  • Python 3.6

Minimal example to reproduce the error:

from keras.layers import Input, Lambda, Dense
from keras.models import Model
from keras.callbacks import ModelCheckpoint
from keras.optimizers import Adam
import tensorflow as tf
import numpy as np

x = Input(shape=(30,3))
low = tf.constant(np.random.rand(30, 3).astype('float32'))
high = tf.constant(1 + np.random.rand(30, 3).astype('float32'))
clipped_out_position = Lambda(lambda x, low, high: tf.clip_by_value(x, low, high),
                                      arguments={'low': low, 'high': high})(x)

model = Model(inputs=x, outputs=[clipped_out_position])
optimizer = Adam(lr=.1)
model.compile(optimizer=optimizer, loss="mean_squared_error")
checkpoint = ModelCheckpoint("debug.hdf", monitor="val_loss", verbose=1, save_best_only=True, mode="min")
training_callbacks = [checkpoint]
model.fit(np.random.rand(100, 30, 3), [np.random.rand(100, 30, 3)], callbacks=training_callbacks, epochs=50, batch_size=10, validation_split=0.33)

Error output:

Train on 67 samples, validate on 33 samples
Epoch 1/50
10/67 [===>..........................] - ETA: 0s - loss: 0.1627Epoch 00001: val_loss improved from inf to 0.17002, saving model to debug.hdf
Traceback (most recent call last):
  File "debug_multitask_inverter.py", line 19, in <module>
    model.fit(np.random.rand(100, 30, 3), [np.random.rand(100, 30, 3)], callbacks=training_callbacks, epochs=50, batch_size=10, validation_split=0.33)
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/site-packages/keras/engine/training.py", line 1631, in fit

▽
    validation_steps=validation_steps)
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/site-packages/keras/engine/training.py", line 1233, in _fit_loop
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/site-packages/keras/callbacks.py", line 73, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/site-packages/keras/callbacks.py", line 414, in on_epoch_end
    self.model.save(filepath, overwrite=True)
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/site-packages/keras/engine/topology.py", line 2556, in save
    save_model(self, filepath, overwrite, include_optimizer)
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/site-packages/keras/models.py", line 107, in save_model
    'config': model.get_config()
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/site-packages/keras/engine/topology.py", line 2397, in get_config
    return copy.deepcopy(config)
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/copy.py", line 215, in _deepcopy_list
    append(deepcopy(a, memo))
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/om/user/lnj/openmind_env/tensorflow-gpu/lib/python3.6/copy.py", line 169, in deepcopy
    rv = reductor(4)
TypeError: can't pickle _thread.lock objects

When saving a Lambda layer, the arguments passed to it are saved as well. In this case they contain two tf.Tensors, and it seems that Keras does not support serializing a tf.Tensor in the model config right now.

However, numpy arrays can be serialized without a problem. So instead of passing tf.Tensors in arguments, you can pass in numpy arrays and convert them into tf.Tensors inside the lambda function.

x = Input(shape=(30,3))
low = np.random.rand(30, 3)
high = 1 + np.random.rand(30, 3)
clipped_out_position = Lambda(lambda x, low, high: tf.clip_by_value(x, tf.constant(low, dtype='float32'), tf.constant(high, dtype='float32')),
                              arguments={'low': low, 'high': high})(x)

A problem with the lines above is that, when trying to load this model, you might see a NameError: name 'tf' is not defined. That's because TensorFlow is not imported in the file where the Lambda layer is reconstructed (core.py).

Changing tf to K.tf fixes the problem. You can also replace tf.constant() with K.constant(), which casts low and high to float32 tensors automatically.

from keras import backend as K
x = Input(shape=(30,3))
low = np.random.rand(30, 3)
high = 1 + np.random.rand(30, 3)
clipped_out_position = Lambda(lambda x, low, high: K.tf.clip_by_value(x, K.constant(low), K.constant(high)),
                              arguments={'low': low, 'high': high})(x)
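
With that change the checkpoint callback can serialize the model. A quick way to verify is a save/load round trip; this is a minimal sketch that assumes the imports and the model built from the snippets above, and uses the standard keras.models.load_model API:

from keras.models import load_model

model = Model(inputs=x, outputs=[clipped_out_position])
model.compile(optimizer='adam', loss='mean_squared_error')
model.save('debug.hdf')                # now succeeds: the config only holds numpy arrays
restored = load_model('debug.hdf')     # no NameError, since the lambda only references K
restored.predict(np.random.rand(1, 30, 3))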

To clarify: this is not a problem of Keras being unable to pickle a Tensor in a Lambda layer per se (other scenarios are possible, see below), but rather that the arguments of the Python function (here: a lambda) are serialized independently of the function itself, outside of the context of that lambda. This works for 'static' arguments but fails otherwise. To circumvent it, wrap the non-static function arguments in another function.

Here are a few workarounds:


  1. Use static variables, such as plain Python/numpy variables (as mentioned above):
low = np.random.rand(30, 3)
high = 1 + np.random.rand(30, 3)

x = Input(shape=(30,3))
clipped_out_position = Lambda(lambda x: tf.clip_by_value(x, low, high))(x)

  2. Use functools.partial to wrap your lambda function:
import functools

# `low` and `high` are the numpy arrays defined in workaround 1 above.
clip_by_value = functools.partial(
    tf.clip_by_value,
    clip_value_min=low,
    clip_value_max=high)

x = Input(shape=(30,3))
clipped_out_position = Lambda(lambda x: clip_by_value(x))(x)

  3. Use a closure to wrap your lambda function:
low = tf.constant(np.random.rand(30, 3).astype('float32'))
high = tf.constant(1 + np.random.rand(30, 3).astype('float32'))

def clip_by_value(t):
    return tf.clip_by_value(t, low, high)

x = Input(shape=(30,3))
clipped_out_position = Lambda(lambda x: clip_by_value(x))(x)

Notice: although you can sometimes drop the explicit lambda function and use this cleaner snippet instead:

clipped_out_position = Lambda(clip_by_value)(x)

the absence of the extra wrapping lambda (that is, lambda t: clip_by_value(t)) may still lead to the same problem when the function arguments are deep-copied, and should be avoided.


  4. Finally, you can wrap your model logic in separate Keras layers, which in this particular case may look a bit over-engineered:
x = Input(shape=(30,3))
low = Lambda(lambda t: tf.constant(np.random.rand(30, 3).astype('float32')))(x)
high = Lambda(lambda t: tf.constant(1 + np.random.rand(30, 3).astype('float32')))(x)
clipped_out_position = Lambda(lambda x: tf.clip_by_value(*x))((x, low, high))

Notice: the tf.clip_by_value(*x) in the last Lambda layer just unpacks the argument tuple; it can also be written more verbosely as tf.clip_by_value(x[0], x[1], x[2]).


As a side note: a scenario where your lambda function captures (part of) a class instance will also break serialization, due to late binding:

class MyModel:
    def __init__(self):
        self.low = np.random.rand(30, 3)
        self.high = 1 + np.random.rand(30, 3)

    def run(self):
        x = Input(shape=(30,3))
        clipped_out_position = Lambda(lambda x: tf.clip_by_value(x, self.low, self.high))(x)
        model = Model(inputs=x, outputs=[clipped_out_position])
        optimizer = Adam(lr=.1)
        model.compile(optimizer=optimizer, loss="mean_squared_error")
        checkpoint = ModelCheckpoint("debug.hdf", monitor="val_loss", verbose=1, save_best_only=True, mode="min")
        training_callbacks = [checkpoint]
        model.fit(np.random.rand(100, 30, 3), 
                 [np.random.rand(100, 30, 3)], callbacks=training_callbacks, epochs=50, batch_size=10, validation_split=0.33)

MyModel().run()

This can be solved by ensuring early binding with the default argument trick:

        (...)
        clipped_out_position = Lambda(lambda x, l=self.low, h=self.high: tf.clip_by_value(x, l, h))(x)
        (...)
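
For reference, here is a self-contained sketch of the early-binding version; it is a minimal example, and the build() method name and the float32 casts are illustrative rather than taken from the original code:

from keras.layers import Input, Lambda
from keras.models import Model
import tensorflow as tf
import numpy as np

class MyModel:
    def __init__(self):
        self.low = np.random.rand(30, 3).astype('float32')
        self.high = (1 + np.random.rand(30, 3)).astype('float32')

    def build(self):
        x = Input(shape=(30, 3))
        # Default argument values are evaluated once, at definition time, so the
        # lambda stores plain numpy arrays instead of a late-bound reference to `self`.
        clipped = Lambda(lambda t, l=self.low, h=self.high: tf.clip_by_value(t, l, h))(x)
        return Model(inputs=x, outputs=[clipped])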

Comments
  • Thank you so much for the excellent explanation + thorough answer!
  • Hi Yu-Yang, I just hit this bug. However, I cannot use an ndarray instead, because the tensor I'm passing to the lambda is K.shape(), which contains None. Do you have another solution?
  • Hi @Moyan, I have no solution right now, but I think you can always rebuild the model with the same code and use save_weights()/load_weights(). In this case, you'll need to set save_weights_only=True in ModelCheckpoint (see the sketch after these comments).
  • I figured it out yesterday. This exception is raised because we are trying to serialize a "wandering tf.Tensor". See my post below. Thank you for your answer, I wouldn't have solved this without your post.
  • I'm using method 3 (a closure to wrap the lambda function). It trains fine, but whenever I try to load the saved model I get a TypeError: 'str' object is not callable on the line where the lambda function is called. Do you know how to fix this?
  • @Gianluca Micchi I encountered the same problem: when I use a closure to avoid the Lambda save problem, I cannot load the model and get a TypeError: 'str' object is not callable.
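
Following up on the save_weights()/load_weights() suggestion above, here is a minimal sketch of that approach; build_model(), x_train and y_train are placeholders for your own model-building code and data:

from keras.callbacks import ModelCheckpoint

model = build_model()  # placeholder: rebuilds the architecture with your own code
checkpoint = ModelCheckpoint("weights.hdf5", monitor="val_loss", verbose=1,
                             save_best_only=True, save_weights_only=True, mode="min")
model.fit(x_train, y_train, callbacks=[checkpoint],
          epochs=50, batch_size=10, validation_split=0.33)

# Later: recreate the model with the same code and restore only the weights.
restored = build_model()
restored.load_weights("weights.hdf5")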