
Error in training Object Detection using the given example #47

Closed
yoppy-tjhin opened this issue Sep 26, 2024 · 6 comments

@yoppy-tjhin

Hello,

My final goal is to train an object detection model on a custom dataset.
As a simple start, I tried to follow the model training guide given in stm32ai-modelzoo/object_detection/src.
The training config is attached.
my_config.txt

For the dataset, I took a few sample images from the Pascal VOC 2012 dataset (attached).
pascal_dataset.zip

But I got the following errors:
train_error_log.txt

What could be the problem?
Thank you for your guidance.

[INFO] : Starting training...
Epoch 1/1000
Error executing job with overrides: []
Traceback (most recent call last):
  File "G:\Kantor\Research\SLIFA\STMicro\ST_Model_Training\stm32ai-modelzoo\object_detection\src\stm32ai_main.py", line 234, in main
    process_mode(cfg,
  File "G:\Kantor\Research\SLIFA\STMicro\ST_Model_Training\stm32ai-modelzoo\object_detection\src\stm32ai_main.py", line 98, in process_mode
    train_glob(cfg, train_ds=train_ds, valid_ds=valid_ds, test_ds=test_ds, train_gen=train_gen, valid_gen=valid_gen)
  File "G:\Kantor\Research\SLIFA\STMicro\ST_Model_Training\stm32ai-modelzoo\object_detection\src\training\train.py", line 137, in train
    training_model.fit(train_gen,
  File "C:\Users\user\anaconda3\envs\stm32ai\lib\site-packages\mlflow\utils\autologging_utils\safety.py", line 552, in safe_patch_function
    patch_function.call(call_original, *args, **kwargs)
  File "C:\Users\user\anaconda3\envs\stm32ai\lib\site-packages\mlflow\utils\autologging_utils\safety.py", line 170, in call
    return cls().call(original, *args, **kwargs)
  File "C:\Users\user\anaconda3\envs\stm32ai\lib\site-packages\mlflow\utils\autologging_utils\safety.py", line 181, in call
    raise e
  File "C:\Users\user\anaconda3\envs\stm32ai\lib\site-packages\mlflow\utils\autologging_utils\safety.py", line 174, in call
    return self._patch_implementation(original, *args, **kwargs)
  File "C:\Users\user\anaconda3\envs\stm32ai\lib\site-packages\mlflow\utils\autologging_utils\safety.py", line 232, in _patch_implementation
    result = super().patch_implementation(original, *args, **kwargs)
  File "C:\Users\user\anaconda3\envs\stm32ai\lib\site-packages\mlflow\tensorflow\__init__.py", line 1255, in _patch_implementation
    history = original(inst, *args, **kwargs)
  File "C:\Users\user\anaconda3\envs\stm32ai\lib\site-packages\mlflow\utils\autologging_utils\safety.py", line 535, in call_original
    return call_original_fn_with_event_logging(_original_fn, og_args, og_kwargs)
  File "C:\Users\user\anaconda3\envs\stm32ai\lib\site-packages\mlflow\utils\autologging_utils\safety.py", line 470, in call_original_fn_with_event_logging
    original_fn_result = original_fn(*og_args, **og_kwargs)
  File "C:\Users\user\anaconda3\envs\stm32ai\lib\site-packages\mlflow\utils\autologging_utils\safety.py", line 532, in _original_fn
    original_result = original(*_og_args, **_og_kwargs)
  File "C:\Users\user\anaconda3\envs\stm32ai\lib\site-packages\keras\utils\traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Users\user\anaconda3\envs\stm32ai\lib\site-packages\keras\engine\training.py", line 1395, in fit
    raise ValueError('Unexpected result of train_function '
ValueError: Unexpected result of train_function (Empty logs). Please use Model.compile(..., run_eagerly=True), or tf.config.run_functions_eagerly(True) for more information of where went wrong, or file a issue/bug to `tf.keras`.
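
For what it's worth, the debugging switch that the error message points to can be enabled before training starts; a minimal sketch, assuming TensorFlow 2.x:

    import tensorflow as tf

    # Run tf.function-compiled training code eagerly so the underlying
    # exception surfaces instead of the generic "Empty logs" ValueError.
    tf.config.run_functions_eagerly(True)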


MCHSTM commented Sep 26, 2024

Hello,

Could you please share the versions of TensorFlow and Python that you have in your environment?
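
For example, you can print them from the environment you use to launch the training:

    import sys
    import tensorflow as tf

    # Report the interpreter and TensorFlow versions of the active environment.
    print("Python:", sys.version)
    print("TensorFlow:", tf.__version__)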

@yoppy-tjhin
Author

Hi,

I use Python 3.10.14.
I installed the packages with pip install -r requirements.txt from the repo.

Package Version


absl-py 2.1.0
alembic 1.13.3
antlr4-python3-runtime 4.9.3
appdirs 1.4.4
astunparse 1.6.3
attrs 24.2.0
audioread 3.0.1
blinker 1.8.2
cachetools 5.5.0
certifi 2024.8.30
cffi 1.17.1
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 2.2.1
cmaes 0.11.1
colorama 0.4.5
coloredlogs 15.0.1
colorlog 6.8.2
contourpy 1.3.0
cycler 0.12.1
databricks-cli 0.18.0
decorator 5.1.1
docker 6.1.3
entrypoints 0.4
fancycompleter 0.9.1
Flask 2.3.3
flatbuffers 24.3.25
fonttools 4.54.0
gast 0.6.0
gitdb 4.0.11
GitPython 3.1.43
google-auth 2.35.0
google-auth-oauthlib 0.4.6
google-pasta 0.2.0
greenlet 3.1.1
grpcio 1.66.1
h5py 3.11.0
humanfriendly 10.0
hydra-core 1.3.2
idna 3.10
imageio 2.35.1
imgaug 0.4.0
importlib-metadata 6.11.0
itsdangerous 2.2.0
Jinja2 3.1.4
joblib 1.2.0
keras 2.8.0
Keras-Preprocessing 1.1.2
kiwisolver 1.4.7
larq 0.13.3
lazy_loader 0.4
libclang 18.1.1
librosa 0.10.0.post2
llvmlite 0.43.0
Mako 1.2.4
Markdown 3.7
MarkupSafe 2.1.5
marshmallow 3.20.1
matplotlib 3.6.2
mlflow 2.3.0
mpmath 1.3.0
msgpack 1.1.0
munch 2.5.0
networkx 3.3
numba 0.60.0
numpy 1.23.4
oauthlib 3.2.2
omegaconf 2.3.0
onnx 1.12.0
onnxconverter-common 1.13.0
onnxruntime 1.14.1
opencv-python 4.6.0.66
opt-einsum 3.3.0
optuna 3.1.1
packaging 23.2
pandas 1.5.3
pillow 10.4.0
pip 24.2
pooch 1.6.0
protobuf 3.19.6
pyarrow 11.0.0
pyasn1 0.6.1
pyasn1_modules 0.4.1
pycparser 2.22
Pygments 2.18.0
PyJWT 2.9.0
pyparsing 3.1.4
pyreadline3 3.5.4
pyrepl 0.9.0
pyserial 3.5
python-dateutil 2.9.0.post0
pytz 2023.4
pywin32 306
PyYAML 6.0.2
querystring-parser 1.2.4
requests 2.28.2
requests-oauthlib 2.0.0
rsa 4.9
scikit-image 0.24.0
scikit-learn 1.2.2
scipy 1.13.1
seaborn 0.12.2
setuptools 75.1.0
shapely 2.0.6
six 1.16.0
skl2onnx 1.14.0
smmap 5.0.1
soundfile 0.12.1
soxr 0.5.0.post1
SQLAlchemy 2.0.35
sqlparse 0.5.1
sympy 1.13.3
tabulate 0.9.0
tensorboard 2.8.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tensorflow 2.8.3
tensorflow-estimator 2.8.0
tensorflow-io-gcs-filesystem 0.31.0
termcolor 2.4.0
terminaltables 3.1.10
threadpoolctl 3.5.0
tifffile 2024.9.20
tqdm 4.65.0
typing_extensions 4.12.2
urllib3 1.26.13
waitress 2.1.2
websocket-client 1.8.0
Werkzeug 3.0.4
wget 3.2
wheel 0.44.0
wmctrl 0.5
wrapt 1.16.0
xmlrunner 1.7.7
zipp 3.20.2


MCHSTM commented Sep 26, 2024

Hello,

The number of images in your dataset is smaller than the batch size you are using in the yaml file.

So either provide a bigger dataset or, simply for your test, reduce the batch size in the yaml file to 8, for example.
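
As a quick sanity check before launching training, you can compare the number of images in each split against the batch size. A minimal sketch (the dataset.training_path, dataset.validation_path and training.batch_size keys follow a typical user_config.yaml layout and may differ in your config):

    from pathlib import Path
    import yaml  # PyYAML is already in the requirements

    # Assumed config layout: dataset.training_path / dataset.validation_path
    # and training.batch_size, as in a typical user_config.yaml.
    cfg = yaml.safe_load(Path("user_config.yaml").read_text())
    batch_size = cfg["training"]["batch_size"]

    for split, folder in [("train", cfg["dataset"]["training_path"]),
                          ("valid", cfg["dataset"]["validation_path"])]:
        images = [p for p in Path(folder).iterdir()
                  if p.suffix.lower() in (".jpg", ".jpeg", ".png")]
        print(f"{split}: {len(images)} images, batch_size = {batch_size}")
        if len(images) < batch_size:
            print(f"  -> {split} split has fewer images than the batch size")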

Best Regards.

@yoppy-tjhin
Author

Hi Mahdi,

Thank you for your prompt reply.
The training is now working.

But at the end of training, I got the following error:

Epoch 998/1000
1/1 [==============================] - 1s 829ms/step - loss: 447.4241 - lr: 0.0010
Epoch 999/1000
1/1 [==============================] - 1s 708ms/step - loss: 467.6047 - lr: 0.0010
Epoch 1000/1000
1/1 [==============================] - 1s 707ms/step - loss: 611.9567 - lr: 0.0010
[2024-09-27 03:15:11,352][root][INFO] - creating /content/drive/My Drive/stm32ai/object_detection/src/experiments_outputs/mlruns/539758967607481009/c64ef0ac55364b49a7e27d8126bebb8b/artifacts/tensorboard_logs/metrics
[2024-09-27 03:15:11,356][root][INFO] - copying /content/drive/My Drive/stm32ai/object_detection/src/experiments_outputs/2024_09_27_02_59_58/logs/metrics/events.out.tfevents.1727406042.39729a34b9c6.1961.0.v2 -> /content/drive/My Drive/stm32ai/object_detection/src/experiments_outputs/mlruns/539758967607481009/c64ef0ac55364b49a7e27d8126bebb8b/artifacts/tensorboard_logs/metrics
[2024-09-27 03:15:11,365][root][INFO] - copying /content/drive/My Drive/stm32ai/object_detection/src/experiments_outputs/2024_09_27_02_59_58/logs/metrics/train_metrics.csv -> /content/drive/My Drive/stm32ai/object_detection/src/experiments_outputs/mlruns/539758967607481009/c64ef0ac55364b49a7e27d8126bebb8b/artifacts/tensorboard_logs/metrics
[2024-09-27 03:15:11,376][root][INFO] - creating /content/drive/My Drive/stm32ai/object_detection/src/experiments_outputs/mlruns/539758967607481009/c64ef0ac55364b49a7e27d8126bebb8b/artifacts/tensorboard_logs/train
[2024-09-27 03:15:11,379][root][INFO] - copying /content/drive/My Drive/stm32ai/object_detection/src/experiments_outputs/2024_09_27_02_59_58/logs/train/events.out.tfevents.1727406042.39729a34b9c6.1961.1.v2 -> /content/drive/My Drive/stm32ai/object_detection/src/experiments_outputs/mlruns/539758967607481009/c64ef0ac55364b49a7e27d8126bebb8b/artifacts/tensorboard_logs/train
[2024-09-27 03:15:11,397][root][INFO] - creating /content/drive/My Drive/stm32ai/object_detection/src/experiments_outputs/mlruns/539758967607481009/c64ef0ac55364b49a7e27d8126bebb8b/artifacts/tensorboard_logs/validation
[2024-09-27 03:15:11,401][root][INFO] - copying /content/drive/My Drive/stm32ai/object_detection/src/experiments_outputs/2024_09_27_02_59_58/logs/validation/events.out.tfevents.1727406050.39729a34b9c6.1961.2.v2 -> /content/drive/My Drive/stm32ai/object_detection/src/experiments_outputs/mlruns/539758967607481009/c64ef0ac55364b49a7e27d8126bebb8b/artifacts/tensorboard_logs/validation
Training runtime: 0:14:29
Error executing job with overrides: []
Traceback (most recent call last):
  File "/content/drive/My Drive/stm32ai/object_detection/src/./stm32ai_main.py", line 234, in main
    process_mode(cfg,
  File "/content/drive/My Drive/stm32ai/object_detection/src/./stm32ai_main.py", line 98, in process_mode
    train_glob(cfg, train_ds=train_ds, valid_ds=valid_ds, test_ds=test_ds, train_gen=train_gen, valid_gen=valid_gen)
  File "/content/drive/My Drive/stm32ai/object_detection/src/././training/train.py", line 155, in train
    inference_model.load_weights(best_weights_path)
  File "/usr/local/lib/python3.10/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.10/dist-packages/h5py/_hl/files.py", line 562, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
  File "/usr/local/lib/python3.10/dist-packages/h5py/_hl/files.py", line 235, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 102, in h5py.h5f.open
FileNotFoundError: [Errno 2] Unable to synchronously open file (unable to open file: name = '/content/drive/My Drive/stm32ai/object_detection/src/experiments_outputs/2024_09_27_02_59_58/saved_models/best_weights.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

I checked that the file best_weights.h5 is indeed not present in the saved_models directory.

@yoppy-tjhin
Author

Ah, it turns out the validation dataset was too small as well; lowering the batch size fixed it.
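
That would also explain the missing best_weights.h5: the best weights are typically written by a checkpoint callback that monitors the validation loss, so if the validation split never yields a full batch, val_loss is never logged and nothing is saved before load_weights() is called. A minimal illustration of that pattern (hypothetical file path and callback setup, not the model zoo's exact code):

    import tensorflow as tf

    # With save_best_only=True and monitor="val_loss", Keras skips saving
    # whenever val_loss is absent from the epoch logs, so the .h5 file is
    # never created if validation never runs.
    checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
        filepath="saved_models/best_weights.h5",
        monitor="val_loss",
        save_best_only=True,
        save_weights_only=True,
    )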


MCHSTM commented Sep 27, 2024

Glad it worked for you.
Closing the issue.
Best Regards.
