
Error during training #17

Open
SebastianJanampa opened this issue Dec 10, 2024 · 4 comments

@SebastianJanampa

Hello,
I am having some problems with the training. I followed all the steps for dataset creation; the only difference is that many images were missing when I created the dataset.

I used the following command.

python -m siclib.train geocalib-pinhole-openpano --conf geocalib --mp float16 data.train_batch_size=2

and this is the error message:

[12/10/2024 12:05:23 siclib INFO] [E 0 | it 0] loss {up_angle_error 8.838E+01, up_angle_error_weighted 8.845E+01, up_angle_recall@1 5.371E-04, up_angle_recall@3 1.460E-03, up_angle_recall@5 2.495E-03, up_angle_recall@10 5.386E-03, latitude_angle_error 2.282E+01, latitude_angle_error_weighted 2.284E+01, latitude_angle_recall@1 2.848E-02, latitude_angle_recall@3 7.611E-02, latitude_angle_recall@5 1.233E-01, latitude_angle_recall@10 2.493E-01, roll_error 8.856E+01, pitch_error 1.019E+01, gravity_error 8.891E+01, vfov_error 8.574E+01, k1_error 0.000E+00, stop_at 1.000E+01, initial_up_cost 5.882E-01, initial_latitude_cost 5.897E-02, initial_cost 6.471E-01, final_up_cost 1.195E-01, final_latitude_cost 1.133E-03, final_cost 1.207E-01, up-l1-loss 1.706E+00, up_total 1.706E+00, latitude-l1-loss 3.987E-01, latitude_total 3.987E-01, perspective_total 2.105E+00, gravity 1.857E+00, focal 1.096E+01, dist 0.000E+00, param_total 1.281E+01, total 1.492E+01}
siclib.visualization.visualize_batch.make_perspective_figures
siclib.visualization.visualize_batch.make_perspective_figures
siclib.visualization.visualize_batch.make_perspective_figures
(0, 0)
Traceback (most recent call last):
  File "/home/sebastian/anaconda3/envs/geocalib/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/sebastian/anaconda3/envs/geocalib/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/sebastian/GeoCalib/siclib/train.py", line 751, in <module>
    main_worker(0, conf, output_dir, args)
  File "/home/sebastian/GeoCalib/siclib/train.py", line 687, in main_worker
    training(rank, conf, output_dir, args)
  File "/home/sebastian/GeoCalib/siclib/train.py", line 575, in training
    results, pr_metrics, figures = do_evaluation(
  File "/home/sebastian/anaconda3/envs/geocalib/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/sebastian/GeoCalib/siclib/train.py", line 152, in do_evaluation
    figures.append(locate(plot_fn)(pred, data))
  File "/home/sebastian/GeoCalib/siclib/visualization/visualize_batch.py", line 189, in make_perspective_figures
    figures |= make_camera_figure(pred, data, n_pairs)
  File "/home/sebastian/GeoCalib/siclib/visualization/visualize_batch.py", line 167, in make_camera_figure
    plot_latitudes(latitudes[i], is_radians=False, axes=ax[i, 1:])
  File "/home/sebastian/GeoCalib/siclib/visualization/viz2d.py", line 467, in plot_latitudes
    return plot_heatmaps(
  File "/home/sebastian/GeoCalib/siclib/visualization/viz2d.py", line 283, in plot_heatmaps
    contours = axes[i].contour(
  File "/home/sebastian/anaconda3/envs/geocalib/lib/python3.9/site-packages/matplotlib/__init__.py", line 1476, in inner
    return func(
  File "/home/sebastian/anaconda3/envs/geocalib/lib/python3.9/site-packages/matplotlib/axes/_axes.py", line 6659, in contour
    contours = mcontour.QuadContourSet(self, *args, **kwargs)
  File "/home/sebastian/anaconda3/envs/geocalib/lib/python3.9/site-packages/matplotlib/contour.py", line 813, in __init__
    kwargs = self._process_args(*args, **kwargs)
  File "/home/sebastian/anaconda3/envs/geocalib/lib/python3.9/site-packages/matplotlib/contour.py", line 1474, in _process_args
    x, y, z = self._contour_args(args, kwargs)
  File "/home/sebastian/anaconda3/envs/geocalib/lib/python3.9/site-packages/matplotlib/contour.py", line 1511, in _contour_args
    x, y = self._initialize_x_y(z)
  File "/home/sebastian/anaconda3/envs/geocalib/lib/python3.9/site-packages/matplotlib/contour.py", line 1587, in _initialize_x_y
    raise TypeError(f"Input z must be at least a (2, 2) shaped array, "
TypeError: Input z must be at least a (2, 2) shaped array, but has shape (0, 0)

@veichta
Collaborator

veichta commented Dec 12, 2024

Thank you for pointing this out. We have not trained using float16, and this might be causing some issues. I’m looking into this further, but in the meantime, a quick fix would be to comment out the visualization step causing the error at this line. This change will not affect the training process.
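
Concretely, an untested sketch of that quick fix, using only the names visible in your traceback (the surrounding code in do_evaluation may differ slightly), would be to guard the plotting call instead of deleting it:

```python
# siclib/train.py, do_evaluation (line 152 in the traceback above).
# Guard the plotting call so a failing figure does not abort evaluation;
# only the figure is skipped, metrics and training are unaffected.
try:
    figures.append(locate(plot_fn)(pred, data))
except TypeError as e:
    # matplotlib's contour() raises TypeError when the latitude map
    # collapses to shape (0, 0), as in the traceback above
    print(f"Skipping figure {plot_fn}: {e}")
```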

Please let me know if you encounter any further issues!

@SebastianJanampa
Author

Hello @veichta,

I made the changes you mentioned.
First, I tried without --mp float16, but the problem persisted.

What solved the problem was removing this line. After that, I was able to train using either float32 or float16.

I've got this warning:

[12/12/2024 10:49:44 siclib INFO] Setting epochs to 78 to match num_steps.
[12/12/2024 10:49:44 siclib INFO] Starting epoch 0
/home/sebastian/GeoCalib/siclib/train.py:476: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with autocast(enabled=args.mixed_precision is not None, dtype=mp_dtype):
[12/12/2024 10:49:59 siclib WARNING] NaN detected in gradient clipping. Skipping iteration.

Could you tell me what it means?

Also, regarding the batch size: the paper mentions that a batch size of 24 was used and that training ran on 2 GPUs. Is 24 the batch size per GPU or the total batch size?

Thanks a lot for your help.

@SebastianJanampa
Author

[UPDATE]
The warning disappears when I train the model only in fp32. Does this mean the proposed optimizer doesn't work well with fp16? Have you run any inference experiments with the model in fp16?

@veichta
Collaborator

veichta commented Dec 13, 2024

Thank you for the update!

The warning should not affect the training; it's just a deprecation warning indicating that the current autocast call might break in future PyTorch releases.
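
If you want to silence it, the line from your log can be updated to the new API, exactly as the warning suggests (untested sketch):

```python
# siclib/train.py (line 476 in your log): pass the device type as the first
# argument to torch.amp.autocast instead of using the deprecated
# torch.cuda.amp.autocast wrapper.
from torch import amp

with amp.autocast("cuda", enabled=args.mixed_precision is not None, dtype=mp_dtype):
    ...  # forward pass / loss computation as before
```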

As for the batch size, it is the total batch size, so with 2 GPUs each one handles 12 samples.

We did not run any experiments with float16 for training or inference, but any findings from your experiments with it would be very interesting!
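
If you do try it for inference, a rough, untested starting point could look like the following; it assumes the inference interface shown in the README (GeoCalib().calibrate on an image loaded with load_image), and the image path is a placeholder:

```python
import torch
from geocalib import GeoCalib  # inference interface as shown in the README

device = "cuda"
model = GeoCalib().to(device)

# "path/to/image.jpg" is a placeholder; use your own image
image = model.load_image("path/to/image.jpg").to(device)

# run calibration under float16 autocast and compare against a plain
# float32 run to see whether the optimizer degrades in half precision
with torch.autocast("cuda", dtype=torch.float16):
    result_fp16 = model.calibrate(image)
result_fp32 = model.calibrate(image)

print(result_fp16)
print(result_fp32)
```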
