
ValueError("Operation {} does not belong to given graph".format(op)) when running get walk ops functions #41

Open
NicholasMcElroy opened this issue Jun 23, 2021 · 5 comments


@NicholasMcElroy
Contributor

Hello,
I'm currently using your library to do some operations on the graph of a TensorFlow 2 model, and I'm having trouble figuring out the proper way to convert a tensor to either a gde.Node or gde.Tensor object for use with the library's functions. I'm converting my tensors as follows:
[screenshot of the tensor-conversion code]
gra is the name of my gde.Graph object, for reference. After converting the tensors this way, running get_backward_walk_ops on ys_g returns a placeholder operation, and running get_forward_walk_ops on xs_g raises ValueError("Operation {} does not belong to given graph".format(op)). Looking at the code in the util file, I see that this error is raised after a check that the op's graph attribute has a value, so I'm guessing my conversion never sets that attribute. How can I make sure it gets a value when converting? Any help is appreciated, thank you!
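
For reference, here's the kind of name-based lookup I suspect the library expects instead of direct Node construction; ys, xs, and graph here stand for my loss tensor, my variable list, and the original tf.Graph, and get_node_by_name is my guess at the lookup method, so please correct me if I'm wrong:

import graph_def_editor as gde

# Serialize the original TF graph into a gde.Graph
g = gde.Graph(graph.as_graph_def())

# Look up each op by name on the new graph, so that the returned
# objects already carry a reference to g (assumed method name)
ys_g = g.get_node_by_name(ys.op.name)
xs_g = [g.get_node_by_name(x.op.name) for x in xs]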

@frreiss
Member

frreiss commented Jun 25, 2021

Thanks for reaching out @NicholasMcElroy! Might you have a self-contained piece of Python code that reproduces the problem you are seeing?

@NicholasMcElroy
Contributor Author

NicholasMcElroy commented Jun 25, 2021

It's a bit complex, since this function uses variables from another script, but here's the snippet I'm working on:

def gradients(ys, xs, graph, grad_ys=None, **kwargs):
    # Serialize graph for use within this function
    g = gde.Graph(graph.as_graph_def())
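    # Convert the incoming TF tensors to gde.Node objects on the new
    # graph (this is the conversion in question)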
    xs_g = []
    for x in xs:
        xs_g.append(gde.Node(x, x.name, x.op, g=g))
    ys_g = gde.Node(ys, ys.name, ys.op, g=g)
    # Get a list of forward and backward operations
    ops_list = gde.make_list_of_op(g, allow_graph=True)
    back_ops = gde.get_backward_walk_ops(ys_g,
                                         inclusive=True)
    debug_print("back_ops: %s", back_ops)
    fwd_ops = gde.get_forward_walk_ops(xs_g,
                                       inclusive=True,
                                       within_ops=back_ops)

And here's where the function is called:

tf_g = tf.Graph()
with tf_g.as_default():
        args = parser.parse_args()
        enc = encoder.get_encoder(args.model_name, models_dir=args.models_dir)
        hparams = model.default_hparams()
        with open(os.path.join('models', args.model_name, 'hparams.json')) as f:
            hparams.override_from_dict(json.load(f))

        if args.sample_length > hparams.n_ctx:
            raise ValueError(
                "Can't get samples longer than window size: %s" % hparams.n_ctx)

        with tf.Session() as sess:
            # Fully static shape required to make memory accounting in
            # twremat accurate.
            train_context = tf.placeholder(tf.int32, [args.batch_size, 1024])
            train_context_in = randomize(train_context, hparams, args.noise)
            train_output = model.model(hparams=hparams, X=train_context_in)
            train_loss = tf.reduce_mean(
                tf.nn.sparse_softmax_cross_entropy_with_logits(
                    labels=train_context[:, 1:], logits=train_output['logits'][:, :-1]))

            if args.val_every > 0:
                val_context = tf.placeholder(tf.int32, [args.val_batch_size, None])
                val_output = model.model(hparams=hparams, X=val_context)
                val_loss = tf.reduce_mean(
                    tf.nn.sparse_softmax_cross_entropy_with_logits(
                        labels=val_context[:, 1:], logits=val_output['logits'][:, :-1]))
                val_loss_summary = tf.summary.scalar('val_loss', val_loss)

            sample_context = tf.placeholder(tf.int32, [args.batch_size, None])
            tf_sample = sample.sample_sequence(
                hparams=hparams,
                length=args.sample_length,
                context=sample_context,
                batch_size=args.batch_size,
                temperature=1.0,
                top_k=args.top_k,
                top_p=args.top_p)

            all_vars = [v for v in tf.trainable_variables() if 'model' in v.name]
            train_vars = [v for v in all_vars if '/h' in v.name] if args.only_train_transformer_layers else all_vars
            opt_grads = gradients(train_loss, train_vars, tf_g)

@frreiss
Member

frreiss commented Jul 1, 2021

Sorry, I'm still having trouble reproducing this. Could you provide a stack trace so I can see which of the calls from get_forward_walk_ops() to get_unique_graph() is triggering this error?

@NicholasMcElroy
Contributor Author

I've been messing around with it a bit, so the error I'm getting now is a little different. Here's the current stack trace:

Traceback (most recent call last):
  File "./traintest.py", line 325, in <module>
    main()
  File "./traintest.py", line 146, in main
    opt_grads = tensorgrader.gradients(train_loss, train_vars, tf_g)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py", line 889, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py", line 933, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py", line 764, in _initialize
    *args, **kwds))
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 3050, in _get_concrete_function_internal_garbage_collected
    graph_function, _ = self._maybe_define_function(args, kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 3444, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 3289, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py", line 999, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py", line 672, in wrapped_fn
    out = weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py", line 986, in wrapper
    raise e.ag_error_metadata.to_exception(e)
ValueError: in user code:

    /content/drive/MyDrive/nlp/tensorgrader.py:30 gradients  *
        fwd_ops = gde.get_forward_walk_ops(xs_n,
    /usr/local/lib/python3.7/dist-packages/graph_def_editor/select.py:466 get_forward_walk_ops  *
        for new_t in op.outputs:
    /usr/local/lib/python3.7/dist-packages/graph_def_editor/node.py:170 outputs
        raise ValueError("Outputs of {} have not been set".format(self))

    ValueError: Outputs of Node[<bound method BaseResourceVariable.value of <tf.Variable 'model/h11/attn/c_attn/w:0' shape=(1, 768, 2304) dtype=float32>>|name: "model/h11/attn/c_attn/w"
    op: "VarHandleOp"
    attr {
      key: "_class"
      value {
        list {
          s: "loc:@model/h11/attn/c_attn/w"
        }
      }
    }
    attr {
      key: "allowed_devices"
      value {
        list {
        }
      }
    }
    attr {
      key: "container"
      value {
        s: ""
      }
    }
    attr {
      key: "dtype"
      value {
        type: DT_FLOAT
      }
    }
    attr {
      key: "shape"
      value {
        shape {
          dim {
            size: 1
          }
          dim {
            size: 768
          }
          dim {
            size: 2304
          }
        }
      }
    }
    attr {
      key: "shared_name"
      value {
        s: "model/h11/attn/c_attn/w"
      }
    }
    ] have not been set

@frreiss
Member

frreiss commented Jul 30, 2021

Sorry for the delay in getting back to this.

The most recent stack trace seems to indicate that there's a problem in the conversion from protocol buffers to Node and Graph objects. I've added some defensive type checking code to the Node class's constructor that will hopefully catch the problem closer to its root cause. The code is currently in this branch: https://github.com/frreiss/graph_def_editor_fred/tree/node-type-check
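
The kind of check I mean looks roughly like the sketch below; the constructor parameters here are invented for illustration, and the exact code in the branch differs:

# Illustrative sketch of a defensive type check in a constructor;
# parameter names are hypothetical, not the branch's actual code.
class Node(object):
    def __init__(self, g, node_id, name):
        if not isinstance(name, str):
            # Catches mistakes like passing a Tensor or a bound method
            # where a node name string was expected, as in the trace above
            raise TypeError(
                "Expected string for node name, got {} of type {}".format(
                    name, type(name)))
        self._graph = g
        self._id = node_id
        self._name = name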

Could you try running your program against the code in that branch and seeing what error results?
