
Error during training #17

Open
SebastianJanampa opened this issue Dec 10, 2024 · 4 comments

@SebastianJanampa

Hello,
I am having some problems with the training. I followed all the steps for dataset creation; the only difference is that many images were missing when I created the dataset.

I used the following command.

python -m siclib.train geocalib-pinhole-openpano --conf geocalib --mp float16 data.train_batch_size=2

and this is the error message:

[12/10/2024 12:05:23 siclib INFO] [E 0 | it 0] loss {up_angle_error 8.838E+01, up_angle_error_weighted 8.845E+01, up_angle_recall@1 5.371E-04, up_angle_recall@3 1.460E-03, up_angle_recall@5 2.495E-03, up_angle_recall@10 5.386E-03, latitude_angle_error 2.282E+01, latitude_angle_error_weighted 2.284E+01, latitude_angle_recall@1 2.848E-02, latitude_angle_recall@3 7.611E-02, latitude_angle_recall@5 1.233E-01, latitude_angle_recall@10 2.493E-01, roll_error 8.856E+01, pitch_error 1.019E+01, gravity_error 8.891E+01, vfov_error 8.574E+01, k1_error 0.000E+00, stop_at 1.000E+01, initial_up_cost 5.882E-01, initial_latitude_cost 5.897E-02, initial_cost 6.471E-01, final_up_cost 1.195E-01, final_latitude_cost 1.133E-03, final_cost 1.207E-01, up-l1-loss 1.706E+00, up_total 1.706E+00, latitude-l1-loss 3.987E-01, latitude_total 3.987E-01, perspective_total 2.105E+00, gravity 1.857E+00, focal 1.096E+01, dist 0.000E+00, param_total 1.281E+01, total 1.492E+01}
siclib.visualization.visualize_batch.make_perspective_figures
siclib.visualization.visualize_batch.make_perspective_figures
siclib.visualization.visualize_batch.make_perspective_figures
(0, 0)
Traceback (most recent call last):
  File "/home/sebastian/anaconda3/envs/geocalib/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/sebastian/anaconda3/envs/geocalib/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/sebastian/GeoCalib/siclib/train.py", line 751, in <module>
    main_worker(0, conf, output_dir, args)
  File "/home/sebastian/GeoCalib/siclib/train.py", line 687, in main_worker
    training(rank, conf, output_dir, args)
  File "/home/sebastian/GeoCalib/siclib/train.py", line 575, in training
    results, pr_metrics, figures = do_evaluation(
  File "/home/sebastian/anaconda3/envs/geocalib/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/sebastian/GeoCalib/siclib/train.py", line 152, in do_evaluation
    figures.append(locate(plot_fn)(pred, data))
  File "/home/sebastian/GeoCalib/siclib/visualization/visualize_batch.py", line 189, in make_perspective_figures
    figures |= make_camera_figure(pred, data, n_pairs)
  File "/home/sebastian/GeoCalib/siclib/visualization/visualize_batch.py", line 167, in make_camera_figure
    plot_latitudes(latitudes[i], is_radians=False, axes=ax[i, 1:])
  File "/home/sebastian/GeoCalib/siclib/visualization/viz2d.py", line 467, in plot_latitudes
    return plot_heatmaps(
  File "/home/sebastian/GeoCalib/siclib/visualization/viz2d.py", line 283, in plot_heatmaps
    contours = axes[i].contour(
  File "/home/sebastian/anaconda3/envs/geocalib/lib/python3.9/site-packages/matplotlib/__init__.py", line 1476, in inner
    return func(
  File "/home/sebastian/anaconda3/envs/geocalib/lib/python3.9/site-packages/matplotlib/axes/_axes.py", line 6659, in contour
    contours = mcontour.QuadContourSet(self, *args, **kwargs)
  File "/home/sebastian/anaconda3/envs/geocalib/lib/python3.9/site-packages/matplotlib/contour.py", line 813, in __init__
    kwargs = self._process_args(*args, **kwargs)
  File "/home/sebastian/anaconda3/envs/geocalib/lib/python3.9/site-packages/matplotlib/contour.py", line 1474, in _process_args
    x, y, z = self._contour_args(args, kwargs)
  File "/home/sebastian/anaconda3/envs/geocalib/lib/python3.9/site-packages/matplotlib/contour.py", line 1511, in _contour_args
    x, y = self._initialize_x_y(z)
  File "/home/sebastian/anaconda3/envs/geocalib/lib/python3.9/site-packages/matplotlib/contour.py", line 1587, in _initialize_x_y
    raise TypeError(f"Input z must be at least a (2, 2) shaped array, "
TypeError: Input z must be at least a (2, 2) shaped array, but has shape (0, 0)

@veichta
Collaborator

veichta commented Dec 12, 2024

Thank you for pointing this out. We have not trained using float16, and this might be causing some issues. I’m looking into this further, but in the meantime, a quick fix would be to comment out the visualization step causing the error at this line. This change will not affect the training process.
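
Concretely, an untested sketch of that quick fix, using only the names visible in your traceback (the surrounding code in do_evaluation may differ slightly), would be to guard the plotting call instead of deleting it:

```python
# siclib/train.py, do_evaluation (line 152 in the traceback above).
# Guard the plotting call so a failing figure does not abort evaluation;
# only the figure is skipped, metrics and training are unaffected.
try:
    figures.append(locate(plot_fn)(pred, data))
except TypeError as e:
    # matplotlib's contour() raises TypeError when the latitude map
    # collapses to shape (0, 0), as in the traceback above
    print(f"Skipping figure {plot_fn}: {e}")
```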

Please let me know if you encounter any further issues!

@SebastianJanampa
Author

Hello @veichta,

I made the changes you mentioned.
First, I tried without --mp float16, but the problem persisted.

What solved the problem was removing this line. After that, I was able to train using either float32 or float16.

I've got this warning:

[12/12/2024 10:49:44 siclib INFO] Setting epochs to 78 to match num_steps.
[12/12/2024 10:49:44 siclib INFO] Starting epoch 0
/home/sebastian/GeoCalib/siclib/train.py:476: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with autocast(enabled=args.mixed_precision is not None, dtype=mp_dtype):
[12/12/2024 10:49:59 siclib WARNING] NaN detected in gradient clipping. Skipping iteration.

Could you tell me what it means?

Also, regarding the batch size: the paper mentions that a batch size of 24 was used and that training ran on 2 GPUs. Is 24 the batch size per GPU or the total batch size?

Thanks a lot for your help.

@SebastianJanampa
Author

[UPDATE]
The warning disappears when I train the model only in fp32. Does this mean the proposed optimizer doesn't work well with fp16? Have you run any inference experiments with the model in fp16?

@veichta
Collaborator

veichta commented Dec 13, 2024

Thank you for the update!

The warning should not affect the training; it's just a deprecation warning indicating that the current autocast call might break in future PyTorch releases.
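
If you want to silence it, the line from your log can be updated to the new API, exactly as the warning suggests (untested sketch):

```python
# siclib/train.py (line 476 in your log): pass the device type as the first
# argument to torch.amp.autocast instead of using the deprecated
# torch.cuda.amp.autocast wrapper.
from torch import amp

with amp.autocast("cuda", enabled=args.mixed_precision is not None, dtype=mp_dtype):
    ...  # forward pass / loss computation as before
```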

As for the batch size, it is the total batch size, so with 2 GPUs each one handles 12 samples.

We did not run any experiments with float16 for training or inference, but any findings from your experiments with it would be very interesting!
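
If you do try it for inference, a rough, untested starting point could look like the following; it assumes the inference interface shown in the README (GeoCalib().calibrate on an image loaded with load_image), and the image path is a placeholder:

```python
import torch
from geocalib import GeoCalib  # inference interface as shown in the README

device = "cuda"
model = GeoCalib().to(device)

# "path/to/image.jpg" is a placeholder; use your own image
image = model.load_image("path/to/image.jpg").to(device)

# run calibration under float16 autocast and compare against a plain
# float32 run to see whether the optimizer degrades in half precision
with torch.autocast("cuda", dtype=torch.float16):
    result_fp16 = model.calibrate(image)
result_fp32 = model.calibrate(image)

print(result_fp16)
print(result_fp32)
```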
