
ValueError("Operation {} does not belong to given graph".format(op)) when running get walk ops functions #41

Open
NicholasMcElroy opened this issue Jun 23, 2021 · 5 comments


@NicholasMcElroy
Contributor

Hello,
I'm currently using your library to do some operations on the graph of a TensorFlow 2 model, and I'm having trouble figuring out the proper way to convert a tensor to either a gde.Node or gde.Tensor object for use with the library's functions. I'm converting my tensors as follows:
[screenshot of the tensor-conversion code]
gra is the name of my gde.Graph object, for reference. After converting the tensors this way, running get_backward_walk_ops on ys_g returns a placeholder operation, and running get_forward_walk_ops on xs_g raises ValueError("Operation {} does not belong to given graph".format(op)). Looking at the code in the util file, I see that this error is raised after a check that the op's graph attribute has a value, so I'm guessing my conversion never sets that attribute. How can I make sure it gets a value when converting? Any help is appreciated, thank you!
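
For reference, here's the kind of name-based lookup I suspect the library expects instead of direct Node construction; ys, xs, and graph here stand for my loss tensor, my variable list, and the original tf.Graph, and get_node_by_name is my guess at the lookup method, so please correct me if I'm wrong:

import graph_def_editor as gde

# Serialize the original TF graph into a gde.Graph
g = gde.Graph(graph.as_graph_def())

# Look up each op by name on the new graph, so that the returned
# objects already carry a reference to g (assumed method name)
ys_g = g.get_node_by_name(ys.op.name)
xs_g = [g.get_node_by_name(x.op.name) for x in xs]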

@frreiss
Member

frreiss commented Jun 25, 2021

Thanks for reaching out @NicholasMcElroy! Might you have a self-contained piece of Python code that reproduces the problem you are seeing?

@NicholasMcElroy
Contributor Author

NicholasMcElroy commented Jun 25, 2021

It's a bit complex, since this function uses variables from another script, but here's the snippet I'm working on:

def gradients(ys, xs, graph, grad_ys=None, **kwargs):
    # Serialize graph for use within this function
    g = gde.Graph(graph.as_graph_def())
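    # Convert the incoming TF tensors to gde.Node objects on the new
    # graph (this is the conversion in question)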
    xs_g = []
    for x in xs:
        xs_g.append(gde.Node(x, x.name, x.op, g=g))
    ys_g = gde.Node(ys, ys.name, ys.op, g=g)
    # Get a list of forward and backward operations
    ops_list = gde.make_list_of_op(g, allow_graph=True)
    back_ops = gde.get_backward_walk_ops(ys_g,
                                         inclusive=True)
    debug_print("back_ops: %s", back_ops)
    fwd_ops = gde.get_forward_walk_ops(xs_g,
                                       inclusive=True,
                                       within_ops=back_ops)

And here's where the function is called:

tf_g = tf.Graph()
with tf_g.as_default():
        args = parser.parse_args()
        enc = encoder.get_encoder(args.model_name, models_dir=args.models_dir)
        hparams = model.default_hparams()
        with open(os.path.join('models', args.model_name, 'hparams.json')) as f:
            hparams.override_from_dict(json.load(f))

        if args.sample_length > hparams.n_ctx:
            raise ValueError(
                "Can't get samples longer than window size: %s" % hparams.n_ctx)

        with tf.Session() as sess:
            # Fully static shape required to make memory accounting in
            # twremat accurate.
            train_context = tf.placeholder(tf.int32, [args.batch_size, 1024])
            train_context_in = randomize(train_context, hparams, args.noise)
            train_output = model.model(hparams=hparams, X=train_context_in)
            train_loss = tf.reduce_mean(
                tf.nn.sparse_softmax_cross_entropy_with_logits(
                    labels=train_context[:, 1:], logits=train_output['logits'][:, :-1]))

            if args.val_every > 0:
                val_context = tf.placeholder(tf.int32, [args.val_batch_size, None])
                val_output = model.model(hparams=hparams, X=val_context)
                val_loss = tf.reduce_mean(
                    tf.nn.sparse_softmax_cross_entropy_with_logits(
                        labels=val_context[:, 1:], logits=val_output['logits'][:, :-1]))
                val_loss_summary = tf.summary.scalar('val_loss', val_loss)

            sample_context = tf.placeholder(tf.int32, [args.batch_size, None])
            tf_sample = sample.sample_sequence(
                hparams=hparams,
                length=args.sample_length,
                context=sample_context,
                batch_size=args.batch_size,
                temperature=1.0,
                top_k=args.top_k,
                top_p=args.top_p)

            all_vars = [v for v in tf.trainable_variables() if 'model' in v.name]
            train_vars = [v for v in all_vars if '/h' in v.name] if args.only_train_transformer_layers else all_vars
            opt_grads = gradients(train_loss, train_vars, tf_g)

@frreiss
Member

frreiss commented Jul 1, 2021

Sorry, I'm still having trouble reproducing this. Could you provide a stack trace so I can see which of the calls from get_forward_walk_ops() to get_unique_graph() is triggering this error?

@NicholasMcElroy
Contributor Author

I've been messing around with it a bit, so the error I'm getting now is a little different. Here's the current stack trace:

Traceback (most recent call last):
  File "./traintest.py", line 325, in <module>
    main()
  File "./traintest.py", line 146, in main
    opt_grads = tensorgrader.gradients(train_loss, train_vars, tf_g)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py", line 889, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py", line 933, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py", line 764, in _initialize
    *args, **kwds))
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 3050, in _get_concrete_function_internal_garbage_collected
    graph_function, _ = self._maybe_define_function(args, kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 3444, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 3289, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py", line 999, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py", line 672, in wrapped_fn
    out = weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py", line 986, in wrapper
    raise e.ag_error_metadata.to_exception(e)
ValueError: in user code:

    /content/drive/MyDrive/nlp/tensorgrader.py:30 gradients  *
        fwd_ops = gde.get_forward_walk_ops(xs_n,
    /usr/local/lib/python3.7/dist-packages/graph_def_editor/select.py:466 get_forward_walk_ops  *
        for new_t in op.outputs:
    /usr/local/lib/python3.7/dist-packages/graph_def_editor/node.py:170 outputs
        raise ValueError("Outputs of {} have not been set".format(self))

    ValueError: Outputs of Node[<bound method BaseResourceVariable.value of <tf.Variable 'model/h11/attn/c_attn/w:0' shape=(1, 768, 2304) dtype=float32>>|name: "model/h11/attn/c_attn/w"
    op: "VarHandleOp"
    attr {
      key: "_class"
      value {
        list {
          s: "loc:@model/h11/attn/c_attn/w"
        }
      }
    }
    attr {
      key: "allowed_devices"
      value {
        list {
        }
      }
    }
    attr {
      key: "container"
      value {
        s: ""
      }
    }
    attr {
      key: "dtype"
      value {
        type: DT_FLOAT
      }
    }
    attr {
      key: "shape"
      value {
        shape {
          dim {
            size: 1
          }
          dim {
            size: 768
          }
          dim {
            size: 2304
          }
        }
      }
    }
    attr {
      key: "shared_name"
      value {
        s: "model/h11/attn/c_attn/w"
      }
    }
    ] have not been set

@frreiss
Member

frreiss commented Jul 30, 2021

Sorry for the delay in getting back to this.

The most recent stack trace seems to indicate that there's a problem in the conversion from protocol buffers to Node and Graph objects. I've added some defensive type checking code to the Node class's constructor that will hopefully catch the problem closer to its root cause. The code is currently in this branch: https://github.com/frreiss/graph_def_editor_fred/tree/node-type-check
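
The kind of check I mean looks roughly like the sketch below; the constructor parameters here are invented for illustration, and the exact code in the branch differs:

# Illustrative sketch of a defensive type check in a constructor;
# parameter names are hypothetical, not the branch's actual code.
class Node(object):
    def __init__(self, g, node_id, name):
        if not isinstance(name, str):
            # Catches mistakes like passing a Tensor or a bound method
            # where a node name string was expected, as in the trace above
            raise TypeError(
                "Expected string for node name, got {} of type {}".format(
                    name, type(name)))
        self._graph = g
        self._id = node_id
        self._name = name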

Could you try running your program against the code in that branch and seeing what error results?
