Error in _compute_target_reward_value #325

Open
ArshamSol opened this issue Feb 17, 2025 · 1 comment
Labels
bug Something isn't working

Comments


ArshamSol commented Feb 17, 2025

Hi, I am trying to use Stochastic MuZero to play a game. Here is my config file:


from easydict import EasyDict
#from zoo.matchthree.envs.matchthree_wrapper import branch_size

#env_id = 'match3_lightzero'
action_space_size = 127
chance_space_size = 1

# ==============================================================
# begin of the most frequently changed config specified by the user
# ==============================================================
collector_env_num = 1
n_episode = 2
evaluator_env_num = 1
num_simulations = 25
update_per_collect = 100
batch_size = 256
max_env_step = 500
reanalyze_ratio = 0.
use_ture_chance_label_in_chance_encoder = True
#num_unroll_steps = 2

# ==============================================================
# end of the most frequently changed config specified by the user
# ==============================================================

matchthree_stochastic_muzero_config = dict(
    exp_name=f'data_stochastic_mz/match3_stochastic_muzero_ns{num_simulations}_upc{update_per_collect}_rer{reanalyze_ratio}_chance{chance_space_size}_seed0',
    env=dict(
        stop_value=int(1e6),
        #env_id=env_id,
        obs_shape=(9, 64, 64),
        collector_env_num=collector_env_num,
        evaluator_env_num=evaluator_env_num,
        n_evaluator_episode=evaluator_env_num,
        manager=dict(shared_memory=False, ),   
        channel_last=False,
    ),
    policy=dict(
        model=dict(
            observation_shape=(9, 64, 64),
            frame_stack_num=1,
            image_channel=9,
            action_space_size=action_space_size,
            chance_space_size=chance_space_size,
            downsample=True,
            self_supervised_learning_loss=True,  # default is False
            discrete_action_encoding_type='one_hot',
            norm_type='BN',
            support_scale=70,
            value_support_size=141,
            reward_support_size=141,
            #device='cpu',
            #categorical_distribution = False,
        ),
        model_path=None,
        cuda=True,
        gumbel_algo=False,
        mcts_ctree=True,
        env_type='not_board_games',
        game_segment_length=400,
        use_augmentation=True,
        update_per_collect=update_per_collect,
        batch_size=batch_size,
        optim_type='Adam',
        piecewise_decay_lr_scheduler=False,
        learning_rate=3e-3,
        discount_factor=0.997,
        num_simulations=num_simulations,
        reanalyze_ratio=reanalyze_ratio,
        ssl_loss_weight=2, 
        n_episode=n_episode,
        eval_freq=int(2e3),
        replay_buffer_size=int(1e6),
        collector_env_num=collector_env_num,
        evaluator_env_num=evaluator_env_num,
    ),
)
matchthree_stochastic_muzero_config = EasyDict(matchthree_stochastic_muzero_config)
main_config = matchthree_stochastic_muzero_config

matchthree_stochastic_muzero_create_config = dict(
    env=dict(
        type='match3_lightzero',
        import_names=['zoo.matchthree.envs.matchthree_lightzero_env'],
    ),
    env_manager=dict(type='subprocess'),
    policy=dict(
        type='stochastic_muzero',
        import_names=['lzero.policy.stochastic_muzero'],
    ),
)  
matchthree_stochastic_muzero_create_config = EasyDict(matchthree_stochastic_muzero_create_config)
create_config = matchthree_stochastic_muzero_create_config


if __name__ == "__main__":
    from lzero.entry import train_muzero
    train_muzero([main_config, create_config], seed=0, model_path=main_config.policy.model_path, max_env_step=max_env_step)

I then receive the following error from the game_buffer_muzero script:
value_list = value_list.reshape(-1) * (
ValueError: operands could not be broadcast together with shapes (6144,) (1536,)

Here is the Traceback:

Traceback (most recent call last):
  File "/home/arsham/miniconda3/envs/mlagents/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/arsham/miniconda3/envs/mlagents/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/arsham/.vscode/extensions/ms-python.debugpy-2025.0.1-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 71, in <module>
    cli.main()
  File "/home/arsham/.vscode/extensions/ms-python.debugpy-2025.0.1-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 501, in main
    run()
  File "/home/arsham/.vscode/extensions/ms-python.debugpy-2025.0.1-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 351, in run_file
    runpy.run_path(target, run_name="__main__")
  File "/home/arsham/.vscode/extensions/ms-python.debugpy-2025.0.1-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 310, in run_path
    return _run_module_code(code, init_globals, run_name, pkg_name=pkg_name, script_name=fname)
  File "/home/arsham/.vscode/extensions/ms-python.debugpy-2025.0.1-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 127, in _run_module_code
    _run_code(code, mod_globals, init_globals, mod_name, mod_spec, pkg_name, script_name)
  File "/home/arsham/.vscode/extensions/ms-python.debugpy-2025.0.1-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 118, in _run_code
    exec(code, run_globals)
  File "/home/arsham/LightZero/zoo/matchthree/config/matchthree_stochastic_muzero_config.py", line 104, in <module>
    train_muzero([main_config, create_config], seed=0, model_path=main_config.policy.model_path, max_env_step=max_env_step)
  File "/home/arsham/LightZero/lzero/entry/train_muzero.py", line 202, in train_muzero
    train_data = replay_buffer.sample(batch_size, policy)
  File "/home/arsham/LightZero/lzero/mcts/buffer/game_buffer_muzero.py", line 141, in sample
    batch_rewards, batch_target_values = self._compute_target_reward_value(
  File "/home/arsham/LightZero/lzero/mcts/buffer/game_buffer_muzero.py", line 485, in _compute_target_reward_value
    value_list = value_list.reshape(-1) * (
ValueError: operands could not be broadcast together with shapes (6144,) (1536,) 

The problem is that value_list has a shape of 4 × transition_batch_size.
I would be grateful for any guidance on how to resolve the issue.

puyuan1996 added the bug label on Feb 18, 2025
@puyuan1996
Collaborator

Hello,

Could you confirm whether any modifications have been made to components other than the config and the environment?

Additionally, could you provide detailed debugging information within _compute_target_reward_value()? Specifically, details such as m_output.latent_state.shape, transition_batch_size, and value_list.shape would be helpful.
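
For reference, a minimal sketch of the kind of temporary debug prints that could be added just above the failing line in _compute_target_reward_value (the variable names are assumed from the traceback and the request above, and may need adjusting to the local code):

# temporary debug output just before the line that raises the ValueError
print('transition_batch_size:', transition_batch_size)
print('m_output.latent_state.shape:', m_output.latent_state.shape)
print('value_list.shape:', value_list.shape)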

By default, transition_batch_size is calculated as game_segment_batch_size * (num_unroll_steps + 1), where num_unroll_steps defaults to 5 and game_segment_batch_size corresponds to batch_size in the config.
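
For illustration, plugging the posted config values into that formula (assuming the default num_unroll_steps, since the override in the config is commented out):

batch_size = 256               # game_segment_batch_size in this setup
num_unroll_steps = 5           # library default; the config's override is commented out
transition_batch_size = batch_size * (num_unroll_steps + 1)  # 256 * 6 = 1536
# 1536 matches the second operand shape in the error message; 6144 = 4 * 1536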

Thank you for your attention! If you have any further questions, please feel free to ask.
