
Error in training Object Detection using the given example #47

Closed
yoppy-tjhin opened this issue Sep 26, 2024 · 6 comments

@yoppy-tjhin

Hello,

My final goal is to train an object detection model on a custom dataset.
As a simple start, I tried to follow the model training guide given in stm32ai-modelzoo/object_detection/src.
The training config is attached.
my_config.txt

For the dataset, I took a few sample images from the Pascal VOC 2012 dataset (attached).
pascal_dataset.zip

But I got the following errors:
train_error_log.txt

What could be the problem?
Thank you for your guidance.

[INFO] : Starting training...
Epoch 1/1000
Error executing job with overrides: []
Traceback (most recent call last):
  File "G:\Kantor\Research\SLIFA\STMicro\ST_Model_Training\stm32ai-modelzoo\object_detection\src\stm32ai_main.py", line 234, in main
    process_mode(cfg,
  File "G:\Kantor\Research\SLIFA\STMicro\ST_Model_Training\stm32ai-modelzoo\object_detection\src\stm32ai_main.py", line 98, in process_mode
    train_glob(cfg, train_ds=train_ds, valid_ds=valid_ds, test_ds=test_ds, train_gen=train_gen, valid_gen=valid_gen)
  File "G:\Kantor\Research\SLIFA\STMicro\ST_Model_Training\stm32ai-modelzoo\object_detection\src\training\train.py", line 137, in train
    training_model.fit(train_gen,
  File "C:\Users\user\anaconda3\envs\stm32ai\lib\site-packages\mlflow\utils\autologging_utils\safety.py", line 552, in safe_patch_function
    patch_function.call(call_original, *args, **kwargs)
  File "C:\Users\user\anaconda3\envs\stm32ai\lib\site-packages\mlflow\utils\autologging_utils\safety.py", line 170, in call
    return cls().call(original, *args, **kwargs)
  File "C:\Users\user\anaconda3\envs\stm32ai\lib\site-packages\mlflow\utils\autologging_utils\safety.py", line 181, in call
    raise e
  File "C:\Users\user\anaconda3\envs\stm32ai\lib\site-packages\mlflow\utils\autologging_utils\safety.py", line 174, in call
    return self._patch_implementation(original, *args, **kwargs)
  File "C:\Users\user\anaconda3\envs\stm32ai\lib\site-packages\mlflow\utils\autologging_utils\safety.py", line 232, in _patch_implementation
    result = super().patch_implementation(original, *args, **kwargs)
  File "C:\Users\user\anaconda3\envs\stm32ai\lib\site-packages\mlflow\tensorflow\__init__.py", line 1255, in _patch_implementation
    history = original(inst, *args, **kwargs)
  File "C:\Users\user\anaconda3\envs\stm32ai\lib\site-packages\mlflow\utils\autologging_utils\safety.py", line 535, in call_original
    return call_original_fn_with_event_logging(_original_fn, og_args, og_kwargs)
  File "C:\Users\user\anaconda3\envs\stm32ai\lib\site-packages\mlflow\utils\autologging_utils\safety.py", line 470, in call_original_fn_with_event_logging
    original_fn_result = original_fn(*og_args, **og_kwargs)
  File "C:\Users\user\anaconda3\envs\stm32ai\lib\site-packages\mlflow\utils\autologging_utils\safety.py", line 532, in _original_fn
    original_result = original(*_og_args, **_og_kwargs)
  File "C:\Users\user\anaconda3\envs\stm32ai\lib\site-packages\keras\utils\traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Users\user\anaconda3\envs\stm32ai\lib\site-packages\keras\engine\training.py", line 1395, in fit
    raise ValueError('Unexpected result of train_function '
ValueError: Unexpected result of train_function (Empty logs). Please use Model.compile(..., run_eagerly=True), or tf.config.run_functions_eagerly(True) for more information of where went wrong, or file a issue/bug to `tf.keras`.
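
For what it's worth, the debugging switch that the error message points to can be enabled before training starts; a minimal sketch, assuming TensorFlow 2.x:

    import tensorflow as tf

    # Run tf.function-compiled training code eagerly so the underlying
    # exception surfaces instead of the generic "Empty logs" ValueError.
    tf.config.run_functions_eagerly(True)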


MCHSTM commented Sep 26, 2024

Hello,

Could you please share the versions of TensorFlow and Python that you have in your environment?
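
For example, you can print them from the environment you use to launch the training:

    import sys
    import tensorflow as tf

    # Report the interpreter and TensorFlow versions of the active environment.
    print("Python:", sys.version)
    print("TensorFlow:", tf.__version__)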

@yoppy-tjhin
Author

Hi,

I use Python 3.10.14.
I installed the packages with pip install -r requirements.txt from the repo.

Package Version


absl-py 2.1.0
alembic 1.13.3
antlr4-python3-runtime 4.9.3
appdirs 1.4.4
astunparse 1.6.3
attrs 24.2.0
audioread 3.0.1
blinker 1.8.2
cachetools 5.5.0
certifi 2024.8.30
cffi 1.17.1
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 2.2.1
cmaes 0.11.1
colorama 0.4.5
coloredlogs 15.0.1
colorlog 6.8.2
contourpy 1.3.0
cycler 0.12.1
databricks-cli 0.18.0
decorator 5.1.1
docker 6.1.3
entrypoints 0.4
fancycompleter 0.9.1
Flask 2.3.3
flatbuffers 24.3.25
fonttools 4.54.0
gast 0.6.0
gitdb 4.0.11
GitPython 3.1.43
google-auth 2.35.0
google-auth-oauthlib 0.4.6
google-pasta 0.2.0
greenlet 3.1.1
grpcio 1.66.1
h5py 3.11.0
humanfriendly 10.0
hydra-core 1.3.2
idna 3.10
imageio 2.35.1
imgaug 0.4.0
importlib-metadata 6.11.0
itsdangerous 2.2.0
Jinja2 3.1.4
joblib 1.2.0
keras 2.8.0
Keras-Preprocessing 1.1.2
kiwisolver 1.4.7
larq 0.13.3
lazy_loader 0.4
libclang 18.1.1
librosa 0.10.0.post2
llvmlite 0.43.0
Mako 1.2.4
Markdown 3.7
MarkupSafe 2.1.5
marshmallow 3.20.1
matplotlib 3.6.2
mlflow 2.3.0
mpmath 1.3.0
msgpack 1.1.0
munch 2.5.0
networkx 3.3
numba 0.60.0
numpy 1.23.4
oauthlib 3.2.2
omegaconf 2.3.0
onnx 1.12.0
onnxconverter-common 1.13.0
onnxruntime 1.14.1
opencv-python 4.6.0.66
opt-einsum 3.3.0
optuna 3.1.1
packaging 23.2
pandas 1.5.3
pillow 10.4.0
pip 24.2
pooch 1.6.0
protobuf 3.19.6
pyarrow 11.0.0
pyasn1 0.6.1
pyasn1_modules 0.4.1
pycparser 2.22
Pygments 2.18.0
PyJWT 2.9.0
pyparsing 3.1.4
pyreadline3 3.5.4
pyrepl 0.9.0
pyserial 3.5
python-dateutil 2.9.0.post0
pytz 2023.4
pywin32 306
PyYAML 6.0.2
querystring-parser 1.2.4
requests 2.28.2
requests-oauthlib 2.0.0
rsa 4.9
scikit-image 0.24.0
scikit-learn 1.2.2
scipy 1.13.1
seaborn 0.12.2
setuptools 75.1.0
shapely 2.0.6
six 1.16.0
skl2onnx 1.14.0
smmap 5.0.1
soundfile 0.12.1
soxr 0.5.0.post1
SQLAlchemy 2.0.35
sqlparse 0.5.1
sympy 1.13.3
tabulate 0.9.0
tensorboard 2.8.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tensorflow 2.8.3
tensorflow-estimator 2.8.0
tensorflow-io-gcs-filesystem 0.31.0
termcolor 2.4.0
terminaltables 3.1.10
threadpoolctl 3.5.0
tifffile 2024.9.20
tqdm 4.65.0
typing_extensions 4.12.2
urllib3 1.26.13
waitress 2.1.2
websocket-client 1.8.0
Werkzeug 3.0.4
wget 3.2
wheel 0.44.0
wmctrl 0.5
wrapt 1.16.0
xmlrunner 1.7.7
zipp 3.20.2


MCHSTM commented Sep 26, 2024

Hello,

The number of images in your dataset is smaller than the batch size you are using in the yaml file.

So either provide a bigger dataset or, simply for your test, reduce the batch size in the yaml file to 8, for example.
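
As a quick sanity check before launching training, you can compare the number of images in each split against the batch size. A minimal sketch (the dataset.training_path, dataset.validation_path and training.batch_size keys follow a typical user_config.yaml layout and may differ in your config):

    from pathlib import Path
    import yaml  # PyYAML is already in the requirements

    # Assumed config layout: dataset.training_path / dataset.validation_path
    # and training.batch_size, as in a typical user_config.yaml.
    cfg = yaml.safe_load(Path("user_config.yaml").read_text())
    batch_size = cfg["training"]["batch_size"]

    for split, folder in [("train", cfg["dataset"]["training_path"]),
                          ("valid", cfg["dataset"]["validation_path"])]:
        images = [p for p in Path(folder).iterdir()
                  if p.suffix.lower() in (".jpg", ".jpeg", ".png")]
        print(f"{split}: {len(images)} images, batch_size = {batch_size}")
        if len(images) < batch_size:
            print(f"  -> {split} split has fewer images than the batch size")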

Best Regards.

@yoppy-tjhin
Author

Hi Mahdi,

Thank you for your prompt reply.
The training is now working.

But at the end of training, I got the following error:

Epoch 998/1000
1/1 [==============================] - 1s 829ms/step - loss: 447.4241 - lr: 0.0010
Epoch 999/1000
1/1 [==============================] - 1s 708ms/step - loss: 467.6047 - lr: 0.0010
Epoch 1000/1000
1/1 [==============================] - 1s 707ms/step - loss: 611.9567 - lr: 0.0010
[2024-09-27 03:15:11,352][root][INFO] - creating /content/drive/My Drive/stm32ai/object_detection/src/experiments_outputs/mlruns/539758967607481009/c64ef0ac55364b49a7e27d8126bebb8b/artifacts/tensorboard_logs/metrics
[2024-09-27 03:15:11,356][root][INFO] - copying /content/drive/My Drive/stm32ai/object_detection/src/experiments_outputs/2024_09_27_02_59_58/logs/metrics/events.out.tfevents.1727406042.39729a34b9c6.1961.0.v2 -> /content/drive/My Drive/stm32ai/object_detection/src/experiments_outputs/mlruns/539758967607481009/c64ef0ac55364b49a7e27d8126bebb8b/artifacts/tensorboard_logs/metrics
[2024-09-27 03:15:11,365][root][INFO] - copying /content/drive/My Drive/stm32ai/object_detection/src/experiments_outputs/2024_09_27_02_59_58/logs/metrics/train_metrics.csv -> /content/drive/My Drive/stm32ai/object_detection/src/experiments_outputs/mlruns/539758967607481009/c64ef0ac55364b49a7e27d8126bebb8b/artifacts/tensorboard_logs/metrics
[2024-09-27 03:15:11,376][root][INFO] - creating /content/drive/My Drive/stm32ai/object_detection/src/experiments_outputs/mlruns/539758967607481009/c64ef0ac55364b49a7e27d8126bebb8b/artifacts/tensorboard_logs/train
[2024-09-27 03:15:11,379][root][INFO] - copying /content/drive/My Drive/stm32ai/object_detection/src/experiments_outputs/2024_09_27_02_59_58/logs/train/events.out.tfevents.1727406042.39729a34b9c6.1961.1.v2 -> /content/drive/My Drive/stm32ai/object_detection/src/experiments_outputs/mlruns/539758967607481009/c64ef0ac55364b49a7e27d8126bebb8b/artifacts/tensorboard_logs/train
[2024-09-27 03:15:11,397][root][INFO] - creating /content/drive/My Drive/stm32ai/object_detection/src/experiments_outputs/mlruns/539758967607481009/c64ef0ac55364b49a7e27d8126bebb8b/artifacts/tensorboard_logs/validation
[2024-09-27 03:15:11,401][root][INFO] - copying /content/drive/My Drive/stm32ai/object_detection/src/experiments_outputs/2024_09_27_02_59_58/logs/validation/events.out.tfevents.1727406050.39729a34b9c6.1961.2.v2 -> /content/drive/My Drive/stm32ai/object_detection/src/experiments_outputs/mlruns/539758967607481009/c64ef0ac55364b49a7e27d8126bebb8b/artifacts/tensorboard_logs/validation
Training runtime: 0:14:29
Error executing job with overrides: []
Traceback (most recent call last):
  File "/content/drive/My Drive/stm32ai/object_detection/src/./stm32ai_main.py", line 234, in main
    process_mode(cfg,
  File "/content/drive/My Drive/stm32ai/object_detection/src/./stm32ai_main.py", line 98, in process_mode
    train_glob(cfg, train_ds=train_ds, valid_ds=valid_ds, test_ds=test_ds, train_gen=train_gen, valid_gen=valid_gen)
  File "/content/drive/My Drive/stm32ai/object_detection/src/././training/train.py", line 155, in train
    inference_model.load_weights(best_weights_path)
  File "/usr/local/lib/python3.10/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.10/dist-packages/h5py/_hl/files.py", line 562, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
  File "/usr/local/lib/python3.10/dist-packages/h5py/_hl/files.py", line 235, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 102, in h5py.h5f.open
FileNotFoundError: [Errno 2] Unable to synchronously open file (unable to open file: name = '/content/drive/My Drive/stm32ai/object_detection/src/experiments_outputs/2024_09_27_02_59_58/saved_models/best_weights.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

I checked that the file best_weights.h5 is indeed not present in the saved_models directory.

@yoppy-tjhin
Author

Ah, it turns out the validation dataset was too small as well; lowering the batch size fixed it.
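
That would also explain the missing best_weights.h5: the best weights are typically written by a checkpoint callback that monitors the validation loss, so if the validation split never yields a full batch, val_loss is never logged and nothing is saved before load_weights() is called. A minimal illustration of that pattern (hypothetical file path and callback setup, not the model zoo's exact code):

    import tensorflow as tf

    # With save_best_only=True and monitor="val_loss", Keras skips saving
    # whenever val_loss is absent from the epoch logs, so the .h5 file is
    # never created if validation never runs.
    checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
        filepath="saved_models/best_weights.h5",
        monitor="val_loss",
        save_best_only=True,
        save_weights_only=True,
    )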


MCHSTM commented Sep 27, 2024

Glad it worked for you.
Closing the issue.
Best Regards.
