Remove Op:RestoreV2 when loading weights

Matthew_Krafczyk · May 23, 2022, 2:11am

I’m writing a machine learning library, and one capability I’m trying to support is storing an entire model inside a single zip file. When loading a TensorFlow model and preparing it for training, I need to unpack the weights into a temp directory, then load them from there into the Keras model. After this, the temp directory is removed.

I’m finding that when training starts, components such as the optimizer try to read their weights from the now-missing temp directory. This surprised me, because I thought Keras would load everything when I called model.load_weights or tf.train.Checkpoint(mdl).restore(checkpoint_path).expect_partial(). Apparently, Op:RestoreV2 ops are being placed in the graph, and they only try to read from disk when they are executed. Here’s an example of what I’m getting when fit is called:

Epoch 5/10000
2022-05-20 16:56:04.538743: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at save_restore_tensor.cc:182 : NOT_FOUND: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /tmp/tmpicur7bqg/checkpoints/ckpt-1
Exception encountered in context thread! pid: 494291
Traceback (most recent call last):
  File "/home/mkrafcz2/HAL_Projects/DRYML/src/dryml/context/process.py", line 33, in run
    super().run()
  File "/home/mkrafcz2/HAL_Projects/DRYML/venv_dryml/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/mkrafcz2/HAL_Projects/DRYML/src/dryml/context/process.py", line 513, in __call__
    self.final_call(f, ctx_ret_q, tune_report_q, checkpoint_req_q, checkpoint_ret_q, *args, **kwargs)
  File "/home/mkrafcz2/HAL_Projects/DRYML/src/dryml/context/process.py", line 499, in final_call
    res = f(*args, **kwargs, tune_report_q=tune_report_q, checkpoint_req_q=checkpoint_req_q, checkpoint_ret_q=checkpoint_ret_q)
  File "/home/mkrafcz2/HAL_Projects/DRYML/dense_layer_lib_2.py", line 266, in train_mnist_object
    model.train(train_ds, train_spec=train_state, train_callbacks=callbacks, verbose=2, batch_size=32*batch_multiplier)
  File "/home/mkrafcz2/HAL_Projects/DRYML/src/dryml/context/process.py", line 258, in wrapped_func
    res = f(*args, **kwargs)
  File "/home/mkrafcz2/HAL_Projects/DRYML/src/dryml/models/dry_pipe.py", line 35, in train
    step.train(last_val, *args, train_spec=train_spec, train_callbacks=train_callbacks, **kwargs)
  File "/home/mkrafcz2/HAL_Projects/DRYML/src/dryml/context/process.py", line 258, in wrapped_func
    res = f(*args, **kwargs)
  File "/home/mkrafcz2/HAL_Projects/DRYML/src/dryml/models/tf/tf_base.py", line 403, in train
    self.train_fn(self, data, *args, train_spec=train_spec, train_callbacks=train_callbacks, **kwargs)
  File "/home/mkrafcz2/HAL_Projects/DRYML/src/dryml/context/process.py", line 258, in wrapped_func
    res = f(*args, **kwargs)
  File "/home/mkrafcz2/HAL_Projects/DRYML/src/dryml/models/tf/tf_base.py", line 321, in __call__
    trainable.model.mdl.fit(
  File "/home/mkrafcz2/HAL_Projects/DRYML/venv_dryml/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/mkrafcz2/HAL_Projects/DRYML/venv_dryml/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py", line 1129, in autograph_handler
    raise e.ag_error_metadata.to_exception(e)
tensorflow.python.framework.errors_impl.NotFoundError: in user code:

    File "/home/mkrafcz2/HAL_Projects/DRYML/venv_dryml/lib/python3.8/site-packages/keras/engine/training.py", line 878, in train_function  *
        return step_function(self, iterator)
    File "/home/mkrafcz2/HAL_Projects/DRYML/venv_dryml/lib/python3.8/site-packages/keras/engine/training.py", line 867, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/home/mkrafcz2/HAL_Projects/DRYML/venv_dryml/lib/python3.8/site-packages/keras/engine/training.py", line 860, in run_step  **
        outputs = model.train_step(data)
    File "/home/mkrafcz2/HAL_Projects/DRYML/venv_dryml/lib/python3.8/site-packages/keras/engine/training.py", line 816, in train_step
        self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
    File "/home/mkrafcz2/HAL_Projects/DRYML/venv_dryml/lib/python3.8/site-packages/keras/optimizer_v2/optimizer_v2.py", line 532, in minimize
        return self.apply_gradients(grads_and_vars, name=name)
    File "/home/mkrafcz2/HAL_Projects/DRYML/venv_dryml/lib/python3.8/site-packages/keras/optimizer_v2/optimizer_v2.py", line 639, in apply_gradients
        self._create_all_weights(var_list)
    File "/home/mkrafcz2/HAL_Projects/DRYML/venv_dryml/lib/python3.8/site-packages/keras/optimizer_v2/optimizer_v2.py", line 828, in _create_all_weights
        _ = self.iterations
    File "/home/mkrafcz2/HAL_Projects/DRYML/venv_dryml/lib/python3.8/site-packages/keras/optimizer_v2/optimizer_v2.py", line 835, in __getattribute__
        return super(OptimizerV2, self).__getattribute__(name)
    File "/home/mkrafcz2/HAL_Projects/DRYML/venv_dryml/lib/python3.8/site-packages/keras/optimizer_v2/optimizer_v2.py", line 990, in iterations
        self._iterations = self.add_weight(
    File "/home/mkrafcz2/HAL_Projects/DRYML/venv_dryml/lib/python3.8/site-packages/keras/optimizer_v2/optimizer_v2.py", line 1192, in add_weight
        variable = self._add_variable_with_custom_getter(
    File "/home/mkrafcz2/HAL_Projects/DRYML/venv_dryml/lib/python3.8/site-packages/keras/engine/base_layer_utils.py", line 117, in make_variable
        return tf.compat.v1.Variable(

    NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /tmp/tmpicur7bqg/checkpoints/ckpt-1 [Op:RestoreV2]

Is there any way to force TensorFlow to actually read these values at the moment I load the model? Or am I stuck having to come up with some scheme to keep these weight directories around temporarily?
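For the second option, here is a rough sketch of what such a scheme might look like, using only the Python standard library. The ExtractedWeights class and the demo function are hypothetical names made up for illustration, and the file written into the zip is a placeholder rather than a real checkpoint. The idea is to tie the temp directory's lifetime to an object that stays alive for the whole training run, on the assumption that the error above comes from restores that run after the directory has been deleted.

```python
import tempfile
import zipfile
from pathlib import Path


class ExtractedWeights:
    """Hypothetical helper: unpack a zip into a temp directory that
    exists until close() is called or the `with` block exits."""

    def __init__(self, zip_path):
        self._tmp = tempfile.TemporaryDirectory()
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(self._tmp.name)
        # Directory containing the unpacked files.
        self.path = Path(self._tmp.name)

    def close(self):
        # Delete the temp directory and everything in it.
        self._tmp.cleanup()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False


def demo():
    """Build a placeholder zip, extract it, and report whether the
    extracted file exists inside and after the `with` block."""
    with tempfile.TemporaryDirectory() as workdir:
        zip_path = Path(workdir) / "model.zip"
        with zipfile.ZipFile(zip_path, "w") as zf:
            # Placeholder content, not a real TensorFlow checkpoint.
            zf.writestr("checkpoints/ckpt-1.index", b"placeholder")

        with ExtractedWeights(zip_path) as weights:
            ckpt_file = weights.path / "checkpoints" / "ckpt-1.index"
            existed_during = ckpt_file.exists()
            # Weight loading and the training call would go here,
            # while the directory is still on disk.
        exists_after = ckpt_file.exists()
    return existed_during, exists_after
```

With something like this, the weight loading and the fit call would both sit inside the with block, so any restore that runs late can still find its files, and cleanup only happens once training has returned.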

Amin_Jigari · May 24, 2022, 7:08am

I haven’t tried with the inception model. Do you have the model’s network structure with its names? You have to replicate the network and then load the weights and biases (the ckpt file) as Ryan explains. Maybe something has changed since Nov’15 and there’s a more straightforward approach now, I’m not sure.