(dl_venv) [vijayi@sparky01 mnist]$ python3 second_run.py
PyTorch version 2.1.2 available.
/data/vijayi/dl_venv/lib64/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
/data/vijayi/dl_venv/lib64/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
Ray processor params: None
Initializing new Ray cluster...
2024-04-08 16:19:22,199 INFO worker.py:1544 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
Model resume path '/data/vijayi/ludwig/examples/mnist/results/simple_image_experiment_single_model' exists, trying to resume training.

+------------------------+
| EXPERIMENT DESCRIPTION |
+------------------------+

| Experiment name  | simple_image_experiment                                                          |
| Model name       | single_model                                                                     |
| Output directory | /data/vijayi/ludwig/examples/mnist/results/simple_image_experiment_single_model |
| ludwig_version   | '0.10.2.dev'                                                                     |
| command          | 'second_run.py'                                                                  |
| commit_hash      | 'f28a3c5c45ef'                                                                   |
| random_seed      | 42                                                                               |
| data_format      | ""                                                                               |
| torch_version    | '2.1.2+cu121'                                                                    |
| compute          | {'num_nodes': 1}                                                                 |

+---------------+
| LUDWIG CONFIG |
+---------------+

User-specified config (with upgrades):

{ 'input_features': [ { 'encoder': { 'conv_layers': [ { 'filter_size': 3,
                                                        'num_filters': 32,
                                                        'pool_size': 2,
                                                        'pool_stride': 2},
                                                      { 'dropout': 0.4,
                                                        'filter_size': 3,
                                                        'num_filters': 64,
                                                        'pool_size': 2,
                                                        'pool_stride': 2}],
                                     'fc_layers': [{'dropout': 0.4, 'output_size': 128}],
                                     'type': 'stacked_cnn'},
                        'name': 'image_path',
                        'preprocessing': {'num_processes': 4},
                        'type': 'image'}],
  'ludwig_version': '0.10.2.dev',
  'output_features': [{'name': 'label', 'type': 'category'}],
  'trainer': {'early_stop': -1, 'epochs': 10}}

Full config saved to: /data/vijayi/ludwig/examples/mnist/results/simple_image_experiment_single_model/simple_image_experiment/model/model_hyperparameters.json
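For reference, the config block above maps onto roughly the following script. This is a minimal sketch only, assuming the run goes through ludwig.api.LudwigModel with the Ray backend; the dataset path, the backend argument, and the output_directory value are assumptions (the actual second_run.py is not shown in the log), only the config dict itself is taken from the output above.

# Hypothetical reconstruction of second_run.py -- not the actual script from the log.
from ludwig.api import LudwigModel

config = {
    "input_features": [
        {
            "name": "image_path",
            "type": "image",
            "preprocessing": {"num_processes": 4},
            "encoder": {
                "type": "stacked_cnn",
                "conv_layers": [
                    {"num_filters": 32, "filter_size": 3, "pool_size": 2, "pool_stride": 2},
                    {"num_filters": 64, "filter_size": 3, "pool_size": 2, "pool_stride": 2, "dropout": 0.4},
                ],
                "fc_layers": [{"output_size": 128, "dropout": 0.4}],
            },
        }
    ],
    "output_features": [{"name": "label", "type": "category"}],
    "trainer": {"epochs": 10, "early_stop": -1},
}

model = LudwigModel(config, backend="ray")   # backend="ray" is inferred from the Ray lines above
model.experiment(
    dataset="mnist_dataset.csv",             # placeholder; the real dataset argument is not in the log
    experiment_name="simple_image_experiment",
    model_name="single_model",
    output_directory="results",
)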
+---------------+
| PREPROCESSING |
+---------------+

Backend config has num_cpu not set. provision_preprocessing_workers() is a no-op in this case.
No cached dataset found at /data/vijayi/ludwig/examples/mnist/6f52999cf5fe11eebe14525400e88c5a.training.parquet. Preprocessing the dataset.
Using in_memory = False is not supported with dataframe data format. Using full dataframe
Building dataset (it may take a while)
Inferring num_channels from the first 20 images.
images with 1 channels: 20
Using 1 channels because it is the majority in sample. If an image with a different depth is read, will attempt to convert to 1 channels.
To explicitly set the number of channels, define num_channels in the preprocessing dictionary of the image input feature config.
2024-04-08 16:19:26.011 | INFO | daft.context:runner:78 - Using RayRunner
2024-04-08 16:19:26.011 | WARNING | daft.runners.ray_runner:__init__:616 - Ray has already been initialized, Daft will reuse the existing Ray context.
2024-04-08 16:19:26,035 INFO worker.py:1364 -- Connecting to existing Ray cluster at address: 172.16.26.38:36163...
2024-04-08 16:19:26,035 INFO worker.py:1382 -- Calling ray.init() again after it has already been called.
(single_partition_pipeline pid=1162538) PyTorch version 2.1.2 available.
(single_partition_pipeline pid=1162538) 2024-04-08 16:19:27.030 | INFO | logging:info:1446 - PyTorch version 2.1.2 available.
(single_partition_pipeline pid=1162538) /data/vijayi/dl_venv/lib64/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
(single_partition_pipeline pid=1162538) /data/vijayi/dl_venv/lib64/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
(single_partition_pipeline pid=1162538)   warn("The installed version of bitsandbytes was compiled without GPU support. "
Building dataset: DONE
2024-04-08 16:19:36.761 | INFO | logging:info:1446 - Building dataset: DONE
Validation set empty. If this is unintentional, please check the preprocessing configuration.
2024-04-08 16:19:36.998 | WARNING | logging:warning:1458 - Validation set empty. If this is unintentional, please check the preprocessing configuration.
Dataset Statistics
2024-04-08 16:19:37.020 | INFO | logging:info:1446 - Dataset Statistics
+----------+-------------+------------------+
| Dataset  | Size (Rows) | Size (In Memory) |
+----------+-------------+------------------+
| Training | 17          | 2.14 Kb          |
+----------+-------------+------------------+
| Test     | 3           | 387 b            |
+----------+-------------+------------------+
2024-04-08 16:19:37.020 | INFO | logging:info:1446 - +----------+-------------+------------------+
| Dataset  | Size (Rows) | Size (In Memory) |
+----------+-------------+------------------+
| Training | 17          | 2.14 Kb          |
+----------+-------------+------------------+
| Test     | 3           | 387 b            |
+----------+-------------+------------------+
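As the preprocessing hint above suggests, the channel count can be pinned instead of being inferred from a 20-image sample. A minimal sketch of that change, assuming the same image_path feature; the value 1 simply mirrors what was inferred here and is not in the original config.

# Hypothetical input-feature fragment: pin num_channels so Ludwig does not infer it from a sample.
input_feature = {
    "name": "image_path",
    "type": "image",
    "preprocessing": {
        "num_processes": 4,
        "num_channels": 1,  # explicit channel depth, per the preprocessing hint in the log
    },
}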
+-------+
| MODEL |
+-------+
2024-04-08 16:19:37.021 | INFO | logging:info:1446 - +-------+
2024-04-08 16:19:37.021 | INFO | logging:info:1446 - | MODEL |
2024-04-08 16:19:37.021 | INFO | logging:info:1446 - +-------+
Warnings and other logs:
2024-04-08 16:19:37.021 | INFO | logging:info:1446 - Warnings and other logs:
(TrainTrainable pid=1163129) PyTorch version 2.1.2 available.
(TrainTrainable pid=1163129) /data/vijayi/dl_venv/lib64/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
(TrainTrainable pid=1163129) /data/vijayi/dl_venv/lib64/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
(TrainTrainable pid=1163129)   warn("The installed version of bitsandbytes was compiled without GPU support. "
(RayTrainWorker pid=1163252) 2024-04-08 16:19:51,505 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=1163252) [W socket.cpp:436] [c10d] The server socket cannot be initialized on [::]:35115 (errno: 97 - Address family not supported by protocol).
(RayTrainWorker pid=1163252) [W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [sparky01.elevo.ai]:35115 (errno: 97 - Address family not supported by protocol).
(RayTrainWorker pid=1163252) PyTorch version 2.1.2 available.
(RayTrainWorker pid=1163252) /data/vijayi/dl_venv/lib64/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
(RayTrainWorker pid=1163252) /data/vijayi/dl_venv/lib64/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
(RayTrainWorker pid=1163252)   warn("The installed version of bitsandbytes was compiled without GPU support. "
(RayTrainWorker pid=1163252) Using DDP strategy
(RayTrainWorker pid=1163252) Tuning batch size...
(RayTrainWorker pid=1163252) 2024-04-08 16:19:58,446 INFO bulk_executor.py:39 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches()]
(RayTrainWorker pid=1163252) Tuning batch size...
(RayTrainWorker pid=1163252) Exploring batch_size=1
(RayTrainWorker pid=1163252) Throughput at batch_size=1: 193.69650 samples/s
(RayTrainWorker pid=1163252) Exploring batch_size=2
(RayTrainWorker pid=1163252) Throughput at batch_size=2: 334.46067 samples/s
(RayTrainWorker pid=1163252) Batch size 4 is invalid, must be less than or equal to 20.0% dataset size (3 samples of 17) and less than or equal to max batch size 128
(RayTrainWorker pid=1163252) Selected batch_size=2
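With only 17 training rows, the search above stops at batch_size=2 because of the 20% dataset-size cap. If the exploration pass is unwanted, one option is to pin the batch size in the trainer section instead of leaving it on auto. A minimal sketch, assuming Ludwig's trainer config accepts an explicit batch_size; the value 2 mirrors what the tuner selected and is not in the original config.

# Hypothetical trainer fragment: skip automatic batch-size tuning by fixing batch_size.
trainer_config = {
    "epochs": 10,
    "early_stop": -1,
    "batch_size": 2,  # fixed value instead of the default auto-tuning
}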
(TrainTrainable pid=1163464) PyTorch version 2.1.2 available.
(TrainTrainable pid=1163464) /data/vijayi/dl_venv/lib64/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
(TrainTrainable pid=1163464) /data/vijayi/dl_venv/lib64/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
(TrainTrainable pid=1163464)   warn("The installed version of bitsandbytes was compiled without GPU support. "
(RayTrainWorker pid=1163579) 2024-04-08 16:20:17,484 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=1163579) [W socket.cpp:436] [c10d] The server socket cannot be initialized on [::]:56691 (errno: 97 - Address family not supported by protocol).
(RayTrainWorker pid=1163579) [W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [sparky01.elevo.ai]:56691 (errno: 97 - Address family not supported by protocol).
(RayTrainWorker pid=1163579) PyTorch version 2.1.2 available.
(RayTrainWorker pid=1163579) /data/vijayi/dl_venv/lib64/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
(RayTrainWorker pid=1163579) /data/vijayi/dl_venv/lib64/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
(RayTrainWorker pid=1163579)   warn("The installed version of bitsandbytes was compiled without GPU support. "
(RayTrainWorker pid=1163579) Using DDP strategy
(RayTrainWorker pid=1163579) Tuning batch size...
(RayTrainWorker pid=1163579) 2024-04-08 16:20:24,285 INFO bulk_executor.py:39 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches()]
(RayTrainWorker pid=1163579) Tuning batch size...
(RayTrainWorker pid=1163579) Exploring batch_size=1
(RayTrainWorker pid=1163579) Throughput at batch_size=1: 1151.64854 samples/s
(RayTrainWorker pid=1163579) Exploring batch_size=2
(RayTrainWorker pid=1163579) Throughput at batch_size=2: 1408.66633 samples/s
(RayTrainWorker pid=1163579) Batch size 4 is invalid, must be less than or equal to 20.0% dataset size (3 samples of 17) and less than or equal to max batch size 128
(RayTrainWorker pid=1163579) Selected batch_size=2

+----------+
| TRAINING |
+----------+
2024-04-08 16:20:27.069 | INFO | logging:info:1446 - +----------+
2024-04-08 16:20:27.070 | INFO | logging:info:1446 - | TRAINING |
2024-04-08 16:20:27.070 | INFO | logging:info:1446 - +----------+
2024-04-08 16:20:27.070 | INFO | logging:info:1446 -
(TrainTrainable pid=1163760) PyTorch version 2.1.2 available.
(TrainTrainable pid=1163760) /data/vijayi/dl_venv/lib64/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
(TrainTrainable pid=1163760) /data/vijayi/dl_venv/lib64/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
(TrainTrainable pid=1163760)   warn("The installed version of bitsandbytes was compiled without GPU support. "
(TorchTrainer pid=1163760) /data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/data/_internal/pipelined_dataset_iterator.py:126: UserWarning: session.get_dataset_shard returns a ray.data.DatasetIterator instead of a DatasetPipeline as of Ray v2.3. Use iter_torch_batches(), to_tf(), or iter_batches() to iterate over one epoch. See https://docs.ray.io/en/latest/data/api/dataset_iterator.html for full DatasetIterator docs.
(TorchTrainer pid=1163760)   warnings.warn(
(TorchTrainer pid=1163760) /data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/data/_internal/bulk_dataset_iterator.py:108: UserWarning: session.get_dataset_shard returns a ray.data.DatasetIterator instead of a Dataset as of Ray v2.3. Use iter_torch_batches(), to_tf(), or iter_batches() to iterate over one epoch. See https://docs.ray.io/en/latest/data/api/dataset_iterator.html for full DatasetIterator docs.
(TorchTrainer pid=1163760)   warnings.warn(
(RayTrainWorker pid=1163899) 2024-04-08 16:20:41,111 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=1163899) [W socket.cpp:436] [c10d] The server socket cannot be initialized on [::]:53855 (errno: 97 - Address family not supported by protocol).
(RayTrainWorker pid=1163899) [W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [sparky01.elevo.ai]:53855 (errno: 97 - Address family not supported by protocol).
(RayTrainWorker pid=1163899) PyTorch version 2.1.2 available.
(RayTrainWorker pid=1163899) /data/vijayi/dl_venv/lib64/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
(RayTrainWorker pid=1163899) /data/vijayi/dl_venv/lib64/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
(RayTrainWorker pid=1163899)   warn("The installed version of bitsandbytes was compiled without GPU support. "
(RayTrainWorker pid=1163899) Using DDP strategy
(RayTrainWorker pid=1163899) /data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/data/_internal/bulk_dataset_iterator.py:108: UserWarning: session.get_dataset_shard returns a ray.data.DatasetIterator instead of a Dataset as of Ray v2.3. Use iter_torch_batches(), to_tf(), or iter_batches() to iterate over one epoch. See https://docs.ray.io/en/latest/data/api/dataset_iterator.html for full DatasetIterator docs.
(RayTrainWorker pid=1163899)   warnings.warn(
(RayTrainWorker pid=1163899) 2024-04-08 16:20:48,031 INFO bulk_executor.py:39 -- Executing DAG InputDataBuffer[Input] -> AllToAllOperator[randomize_block_order]
(RayTrainWorker pid=1163899) Loading progress tracker for model: /data/vijayi/ludwig/examples/mnist/results/simple_image_experiment_single_model/model/training_progress.json
(RayTrainWorker pid=1163899) Successfully loaded model weights from /data/vijayi/ludwig/examples/mnist/results/simple_image_experiment_single_model/model/training_checkpoints/latest.ckpt.
(RayTrainWorker pid=1163899) Resuming training from previous run.
(RayTrainWorker pid=1163899) 2024-04-08 16:20:48,063 INFO bulk_executor.py:39 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches()]
(RayTrainWorker pid=1163899) 2024-04-08 16:20:48,109 INFO bulk_executor.py:39 -- Executing DAG InputDataBuffer[Input] -> AllToAllOperator[random_shuffle]
(RayTrainWorker pid=1163899) 2024-04-08 16:20:48,134 INFO bulk_executor.py:39 -- Executing DAG InputDataBuffer[Input] -> AllToAllOperator[random_shuffle]
(RayTrainWorker pid=1163899) 2024-04-08 16:20:48,138 INFO bulk_executor.py:39 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches()]
(RayTrainWorker pid=1163899) 2024-04-08 16:20:48,154 INFO bulk_executor.py:39 -- Executing DAG InputDataBuffer[Input] -> AllToAllOperator[random_shuffle]
Training:  50%|█████████████████████████                         | 45/90 [00:00
(RayTrainWorker pid=1163899) ... -- Executing DAG InputDataBuffer[Input] -> AllToAllOperator[random_shuffle]
(RayTrainWorker pid=1163899) 2024-04-08 16:20:49,710 INFO bulk_executor.py:39 -- Executing DAG InputDataBuffer[Input] -> AllToAllOperator[random_shuffle]
2024-04-08 16:20:49,902 WARNING worker.py:1866 -- Traceback (most recent call last):
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/data/dataset_pipeline.py", line 226, in iter_batches
    blocks_owned_by_consumer = self._peek()._plan.execute()._owned_by_consumer
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/data/dataset_pipeline.py", line 1319, in _peek
    first_dataset_gen = next(dataset_iter)
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/data/dataset_pipeline.py", line 732, in __next__
    raise StopIteration
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 850, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 902, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 857, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 861, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 803, in ray._raylet.execute_task.function_executor
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/_private/function_manager.py", line 674, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 466, in _resume_span
    return method(self, *_args, **_kwargs)
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/train/_internal/worker_group.py", line 31, in __execute
    raise skipped from exception_cause(skipped)
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/data/vijayi/ludwig/ludwig/backend/ray.py", line 501, in <lambda>
    lambda config: train_fn(**config),
  File "/data/vijayi/ludwig/ludwig/backend/ray.py", line 215, in train_fn
    results = trainer.train(train_shard, val_shard, test_shard, return_state_dict=True, **kwargs)
  File "/data/vijayi/ludwig/ludwig/distributed/base.py", line 157, in wrapped
    res = fn(*args, **kwargs)
  File "/data/vijayi/ludwig/ludwig/trainers/trainer.py", line 1038, in train
    batcher.set_epoch(progress_tracker.epoch, progress_tracker.batch_size)
  File "/data/vijayi/ludwig/ludwig/data/dataset/ray.py", line 355, in set_epoch
    self._fetch_next_epoch()
  File "/data/vijayi/ludwig/ludwig/data/dataset/ray.py", line 380, in _fetch_next_epoch
    self._fetch_next_batch()
  File "/data/vijayi/ludwig/ludwig/data/dataset/ray.py", line 389, in _fetch_next_batch
    self._next_batch = next(self.dataset_batch_iter)
  File "/data/vijayi/ludwig/ludwig/data/dataset/ray.py", line 469, in async_read
    raise batch
  File "/data/vijayi/ludwig/ludwig/data/dataset/ray.py", line 454, in producer
    for batch in pipeline.iter_batches(prefetch_blocks=0, batch_size=batch_size, batch_format="pandas"):
RuntimeError: generator raised StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 733, in dump
    return Pickler.dump(self, obj)
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 826, in reducer_override
    if sys.version_info[:2] < (3, 7) and _is_parametrized_type_hint(
RecursionError: maximum recursion depth exceeded in comparison

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 1166, in ray._raylet.task_execution_handler
  File "python/ray/_raylet.pyx", line 1072, in ray._raylet.execute_task_with_cancellation_handler
  File "python/ray/_raylet.pyx", line 805, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 972, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 611, in ray._raylet.store_task_errors
  File "python/ray/_raylet.pyx", line 2524, in ray._raylet.CoreWorker.store_task_outputs
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/_private/serialization.py", line 450, in serialize
    return self._serialize_to_msgpack(value)
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/_private/serialization.py", line 405, in _serialize_to_msgpack
    value = value.to_bytes()
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/exceptions.py", line 32, in to_bytes
    serialized_exception=pickle.dumps(self),
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 88, in dumps
    cp.dump(obj)
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 739, in dump
    raise pickle.PicklingError(msg) from e
_pickle.PicklingError: Could not pickle object as excessively deep recursion required.
An unexpected internal error occurred while the worker was executing a task.
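The root failure above is pipeline.iter_batches() raising StopIteration inside Ludwig's producer generator; since Python 3.7 (PEP 479) a StopIteration escaping a generator frame is converted into "RuntimeError: generator raised StopIteration", which is the error the worker then fails to pickle. A minimal, self-contained illustration of that language behavior (not Ray- or Ludwig-specific; the function names are made up for the example):

# Minimal illustration of PEP 479: a StopIteration leaking out of a generator body
# is re-raised as RuntimeError("generator raised StopIteration"), mirroring the
# RuntimeError frames in the traceback above.
def empty_source():
    raise StopIteration  # stands in for the exhausted DatasetPipeline iterator

def producer():
    # The StopIteration from empty_source() escapes this generator frame,
    # and Python 3.7+ converts it into a RuntimeError.
    for item in empty_source():
        yield item

try:
    next(producer())
except RuntimeError as err:
    print(err)  # -> "generator raised StopIteration"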
2024-04-08 16:20:49,904 WARNING worker.py:1866 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker.
RayTask ID: ffffffffffffffff13b210ebcacca6c8973345ac01000000
Worker ID: 7c42f52700436f7a298a8b909da6899599280854f4aa300dcc6d8b09
Node ID: 9c5e100a9096e00525c868250b43d0102c1c284b5865b2a2bb0ec2d5
Worker IP address: 172.16.26.38
Worker port: 33323
Worker PID: 1163899
Worker exit type: SYSTEM_ERROR
Worker exit detail: Worker exits unexpectedly. Worker exits with an exit code None.
Traceback (most recent call last):
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/data/dataset_pipeline.py", line 226, in iter_batches
    blocks_owned_by_consumer = self._peek()._plan.execute()._owned_by_consumer
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/data/dataset_pipeline.py", line 1319, in _peek
    first_dataset_gen = next(dataset_iter)
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/data/dataset_pipeline.py", line 732, in __next__
    raise StopIteration
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 850, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 902, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 857, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 861, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 803, in ray._raylet.execute_task.function_executor
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/_private/function_manager.py", line 674, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 466, in _resume_span
    return method(self, *_args, **_kwargs)
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/train/_internal/worker_group.py", line 31, in __execute
    raise skipped from exception_cause(skipped)
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/data/vijayi/ludwig/ludwig/backend/ray.py", line 501, in <lambda>
    lambda config: train_fn(**config),
  File "/data/vijayi/ludwig/ludwig/backend/ray.py", line 215, in train_fn
    results = trainer.train(train_shard, val_shard, test_shard, return_state_dict=True, **kwargs)
  File "/data/vijayi/ludwig/ludwig/distributed/base.py", line 157, in wrapped
    res = fn(*args, **kwargs)
  File "/data/vijayi/ludwig/ludwig/trainers/trainer.py", line 1038, in train
    batcher.set_epoch(progress_tracker.epoch, progress_tracker.batch_size)
  File "/data/vijayi/ludwig/ludwig/data/dataset/ray.py", line 355, in set_epoch
    self._fetch_next_epoch()
  File "/data/vijayi/ludwig/ludwig/data/dataset/ray.py", line 380, in _fetch_next_epoch
    self._fetch_next_batch()
  File "/data/vijayi/ludwig/ludwig/data/dataset/ray.py", line 389, in _fetch_next_batch
    self._next_batch = next(self.dataset_batch_iter)
  File "/data/vijayi/ludwig/ludwig/data/dataset/ray.py", line 469, in async_read
    raise batch
  File "/data/vijayi/ludwig/ludwig/data/dataset/ray.py", line 454, in producer
    for batch in pipeline.iter_batches(prefetch_blocks=0, batch_size=batch_size, batch_format="pandas"):
RuntimeError: generator raised StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 733, in dump
    return Pickler.dump(self, obj)
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 826, in reducer_override
    if sys.version_info[:2] < (3, 7) and _is_parametrized_type_hint(
RecursionError: maximum recursion depth exceeded in comparison

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 1166, in ray._raylet.task_execution_handler
  File "python/ray/_raylet.pyx", line 1072, in ray._raylet.execute_task_with_cancellation_handler
  File "python/ray/_raylet.pyx", line 805, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 972, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 611, in ray._raylet.store_task_errors
  File "python/ray/_raylet.pyx", line 2524, in ray._raylet.CoreWorker.store_task_outputs
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/_private/serialization.py", line 450, in serialize
    return self._serialize_to_msgpack(value)
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/_private/serialization.py", line 405, in _serialize_to_msgpack
    value = value.to_bytes()
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/exceptions.py", line 32, in to_bytes
    serialized_exception=pickle.dumps(self),
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 88, in dumps
    cp.dump(obj)
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 739, in dump
    raise pickle.PicklingError(msg) from e
_pickle.PicklingError: Could not pickle object as excessively deep recursion required.
An unexpected internal error occurred while the worker was executing a task.
(TorchTrainer pid=1163760) 2024-04-08 16:20:49,905 INFO utils.py:57 -- Worker 0 has failed.
(RayTrainWorker pid=1163899) 2024-04-08 16:20:49,900 ERROR worker.py:772 -- Worker exits with an exit code None.
(RayTrainWorker pid=1163899) Traceback (most recent call last):
(RayTrainWorker pid=1163899)   File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/data/dataset_pipeline.py", line 226, in iter_batches
(RayTrainWorker pid=1163899)     blocks_owned_by_consumer = self._peek()._plan.execute()._owned_by_consumer
(RayTrainWorker pid=1163899)   File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/data/dataset_pipeline.py", line 1319, in _peek
(RayTrainWorker pid=1163899)     first_dataset_gen = next(dataset_iter)
(RayTrainWorker pid=1163899)   File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/data/dataset_pipeline.py", line 732, in __next__
(RayTrainWorker pid=1163899)     raise StopIteration
(RayTrainWorker pid=1163899) StopIteration
(RayTrainWorker pid=1163899)
(RayTrainWorker pid=1163899) The above exception was the direct cause of the following exception:
(RayTrainWorker pid=1163899)
(RayTrainWorker pid=1163899) Traceback (most recent call last):
(RayTrainWorker pid=1163899)   File "python/ray/_raylet.pyx", line 850, in ray._raylet.execute_task
(RayTrainWorker pid=1163899)   File "python/ray/_raylet.pyx", line 902, in ray._raylet.execute_task
(RayTrainWorker pid=1163899)   File "python/ray/_raylet.pyx", line 857, in ray._raylet.execute_task
(RayTrainWorker pid=1163899)   File "python/ray/_raylet.pyx", line 861, in ray._raylet.execute_task
(RayTrainWorker pid=1163899)   File "python/ray/_raylet.pyx", line 803, in ray._raylet.execute_task.function_executor
(RayTrainWorker pid=1163899)   File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/_private/function_manager.py", line 674, in actor_method_executor
(RayTrainWorker pid=1163899)     return method(__ray_actor, *args, **kwargs)
(RayTrainWorker pid=1163899)   File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 466, in _resume_span
(RayTrainWorker pid=1163899)     return method(self, *_args, **_kwargs)
(RayTrainWorker pid=1163899)   File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/train/_internal/worker_group.py", line 31, in __execute
(RayTrainWorker pid=1163899)     raise skipped from exception_cause(skipped)
(RayTrainWorker pid=1163899)   File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
(RayTrainWorker pid=1163899)     train_func(*args, **kwargs)
(RayTrainWorker pid=1163899)   File "/data/vijayi/ludwig/ludwig/backend/ray.py", line 501, in <lambda>
(RayTrainWorker pid=1163899)     lambda config: train_fn(**config),
(RayTrainWorker pid=1163899)   File "/data/vijayi/ludwig/ludwig/backend/ray.py", line 215, in train_fn
(RayTrainWorker pid=1163899)     results = trainer.train(train_shard, val_shard, test_shard, return_state_dict=True, **kwargs)
(RayTrainWorker pid=1163899)   File "/data/vijayi/ludwig/ludwig/distributed/base.py", line 157, in wrapped
(RayTrainWorker pid=1163899)     res = fn(*args, **kwargs)
(RayTrainWorker pid=1163899)   File "/data/vijayi/ludwig/ludwig/trainers/trainer.py", line 1038, in train
(RayTrainWorker pid=1163899)     batcher.set_epoch(progress_tracker.epoch, progress_tracker.batch_size)
(RayTrainWorker pid=1163899)   File "/data/vijayi/ludwig/ludwig/data/dataset/ray.py", line 355, in set_epoch
(RayTrainWorker pid=1163899)     self._fetch_next_epoch()
(RayTrainWorker pid=1163899)   File "/data/vijayi/ludwig/ludwig/data/dataset/ray.py", line 380, in _fetch_next_epoch
(RayTrainWorker pid=1163899)     self._fetch_next_batch()
(RayTrainWorker pid=1163899)   File "/data/vijayi/ludwig/ludwig/data/dataset/ray.py", line 389, in _fetch_next_batch
(RayTrainWorker pid=1163899)     self._next_batch = next(self.dataset_batch_iter)
(RayTrainWorker pid=1163899)   File "/data/vijayi/ludwig/ludwig/data/dataset/ray.py", line 469, in async_read
(RayTrainWorker pid=1163899)     raise batch
(RayTrainWorker pid=1163899)   File "/data/vijayi/ludwig/ludwig/data/dataset/ray.py", line 454, in producer
(RayTrainWorker pid=1163899)     for batch in pipeline.iter_batches(prefetch_blocks=0, batch_size=batch_size, batch_format="pandas"):
(RayTrainWorker pid=1163899) RuntimeError: generator raised StopIteration
(RayTrainWorker pid=1163899)
(RayTrainWorker pid=1163899) During handling of the above exception, another exception occurred:
(RayTrainWorker pid=1163899)
(RayTrainWorker pid=1163899) Traceback (most recent call last):
(RayTrainWorker pid=1163899)   File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 733, in dump
(RayTrainWorker pid=1163899)     return Pickler.dump(self, obj)
(RayTrainWorker pid=1163899)   File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 826, in reducer_override
(RayTrainWorker pid=1163899)     if sys.version_info[:2] < (3, 7) and _is_parametrized_type_hint(
(RayTrainWorker pid=1163899) RecursionError: maximum recursion depth exceeded in comparison
(RayTrainWorker pid=1163899)
(RayTrainWorker pid=1163899) The above exception was the direct cause of the following exception:
(RayTrainWorker pid=1163899)
(RayTrainWorker pid=1163899) Traceback (most recent call last):
(RayTrainWorker pid=1163899)   File "python/ray/_raylet.pyx", line 1166, in ray._raylet.task_execution_handler
(RayTrainWorker pid=1163899)   File "python/ray/_raylet.pyx", line 1072, in ray._raylet.execute_task_with_cancellation_handler
(RayTrainWorker pid=1163899)   File "python/ray/_raylet.pyx", line 805, in ray._raylet.execute_task
(RayTrainWorker pid=1163899)   File "python/ray/_raylet.pyx", line 972, in ray._raylet.execute_task
(RayTrainWorker pid=1163899)   File "python/ray/_raylet.pyx", line 611, in ray._raylet.store_task_errors
(RayTrainWorker pid=1163899)   File "python/ray/_raylet.pyx", line 2524, in ray._raylet.CoreWorker.store_task_outputs
(RayTrainWorker pid=1163899)   File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/_private/serialization.py", line 450, in serialize
(RayTrainWorker pid=1163899)     return self._serialize_to_msgpack(value)
(RayTrainWorker pid=1163899)   File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/_private/serialization.py", line 405, in _serialize_to_msgpack
(RayTrainWorker pid=1163899)     value = value.to_bytes()
(RayTrainWorker pid=1163899)   File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/exceptions.py", line 32, in to_bytes
(RayTrainWorker pid=1163899)     serialized_exception=pickle.dumps(self),
(RayTrainWorker pid=1163899)   File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 88, in dumps
(RayTrainWorker pid=1163899)     cp.dump(obj)
(RayTrainWorker pid=1163899)   File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 739, in dump
(RayTrainWorker pid=1163899)     raise pickle.PicklingError(msg) from e
(RayTrainWorker pid=1163899) _pickle.PicklingError: Could not pickle object as excessively deep recursion required.
(RayTrainWorker pid=1163899) An unexpected internal error occurred while the worker was executing a task.
(RayTrainWorker pid=1163899) Traceback (most recent call last):
(RayTrainWorker pid=1163899)   File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/data/dataset_pipeline.py", line 226, in iter_batches
(RayTrainWorker pid=1163899)     blocks_owned_by_consumer = self._peek()._plan.execute()._owned_by_consumer
(RayTrainWorker pid=1163899)   File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/data/dataset_pipeline.py", line 1319, in _peek
(RayTrainWorker pid=1163899)     first_dataset_gen = next(dataset_iter)
(RayTrainWorker pid=1163899)   File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/data/dataset_pipeline.py", line 732, in __next__
(RayTrainWorker pid=1163899)     raise StopIteration
(RayTrainWorker pid=1163899) StopIteration
(RayTrainWorker pid=1163899)
(RayTrainWorker pid=1163899) The above exception was the direct cause of the following exception:
(RayTrainWorker pid=1163899)
(RayTrainWorker pid=1163899) Traceback (most recent call last):
(RayTrainWorker pid=1163899)   File "python/ray/_raylet.pyx", line 850, in ray._raylet.execute_task
(RayTrainWorker pid=1163899)   File "python/ray/_raylet.pyx", line 902, in ray._raylet.execute_task
(RayTrainWorker pid=1163899)   File "python/ray/_raylet.pyx", line 857, in ray._raylet.execute_task
(RayTrainWorker pid=1163899)   File "python/ray/_raylet.pyx", line 861, in ray._raylet.execute_task
(RayTrainWorker pid=1163899)   File "python/ray/_raylet.pyx", line 803, in ray._raylet.execute_task.function_executor
(RayTrainWorker pid=1163899)   File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/_private/function_manager.py", line 674, in actor_method_executor
(RayTrainWorker pid=1163899)     return method(__ray_actor, *args, **kwargs)
(RayTrainWorker pid=1163899)   File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 466, in _resume_span
(RayTrainWorker pid=1163899)     return method(self, *_args, **_kwargs)
(RayTrainWorker pid=1163899)   File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/train/_internal/worker_group.py", line 31, in __execute
(RayTrainWorker pid=1163899)     raise skipped from exception_cause(skipped)
(RayTrainWorker pid=1163899)   File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
(RayTrainWorker pid=1163899)     train_func(*args, **kwargs)
(RayTrainWorker pid=1163899)   File "/data/vijayi/ludwig/ludwig/backend/ray.py", line 501, in <lambda>
(RayTrainWorker pid=1163899)     lambda config: train_fn(**config),
(RayTrainWorker pid=1163899)   File "/data/vijayi/ludwig/ludwig/backend/ray.py", line 215, in train_fn
(RayTrainWorker pid=1163899)     results = trainer.train(train_shard, val_shard, test_shard, return_state_dict=True, **kwargs)
(RayTrainWorker pid=1163899)   File "/data/vijayi/ludwig/ludwig/distributed/base.py", line 157, in wrapped
(RayTrainWorker pid=1163899)     res = fn(*args, **kwargs)
(RayTrainWorker pid=1163899)   File "/data/vijayi/ludwig/ludwig/trainers/trainer.py", line 1038, in train
(RayTrainWorker pid=1163899)     batcher.set_epoch(progress_tracker.epoch, progress_tracker.batch_size)
(RayTrainWorker pid=1163899)   File "/data/vijayi/ludwig/ludwig/data/dataset/ray.py", line 355, in set_epoch
(RayTrainWorker pid=1163899)     self._fetch_next_epoch()
(RayTrainWorker pid=1163899)   File "/data/vijayi/ludwig/ludwig/data/dataset/ray.py", line 380, in _fetch_next_epoch
(RayTrainWorker pid=1163899)     self._fetch_next_batch()
(RayTrainWorker pid=1163899)   File "/data/vijayi/ludwig/ludwig/data/dataset/ray.py", line 389, in _fetch_next_batch
(RayTrainWorker pid=1163899)     self._next_batch = next(self.dataset_batch_iter)
(RayTrainWorker pid=1163899)   File "/data/vijayi/ludwig/ludwig/data/dataset/ray.py", line 469, in async_read
(RayTrainWorker pid=1163899)     raise batch
(RayTrainWorker pid=1163899)   File "/data/vijayi/ludwig/ludwig/data/dataset/ray.py", line 454, in producer
(RayTrainWorker pid=1163899)     for batch in pipeline.iter_batches(prefetch_blocks=0, batch_size=batch_size, batch_format="pandas"):
(RayTrainWorker pid=1163899) RuntimeError: generator raised StopIteration
(RayTrainWorker pid=1163899)
(RayTrainWorker pid=1163899) During handling of the above exception, another exception occurred:
(RayTrainWorker pid=1163899)
(RayTrainWorker pid=1163899) Traceback (most recent call last):
(RayTrainWorker pid=1163899)   File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 733, in dump
(RayTrainWorker pid=1163899)     return Pickler.dump(self, obj)
(RayTrainWorker pid=1163899)   File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 826, in reducer_override
(RayTrainWorker pid=1163899)     if sys.version_info[:2] < (3, 7) and _is_parametrized_type_hint(
(RayTrainWorker pid=1163899) RecursionError: maximum recursion depth exceeded in comparison
(RayTrainWorker pid=1163899)
(RayTrainWorker pid=1163899) The above exception was the direct cause of the following exception:
(RayTrainWorker pid=1163899)
(RayTrainWorker pid=1163899) Traceback (most recent call last):
(RayTrainWorker pid=1163899)   File "python/ray/_raylet.pyx", line 1166, in ray._raylet.task_execution_handler
(RayTrainWorker pid=1163899)   File "python/ray/_raylet.pyx", line 1072, in ray._raylet.execute_task_with_cancellation_handler
(RayTrainWorker pid=1163899)   File "python/ray/_raylet.pyx", line 805, in ray._raylet.execute_task
(RayTrainWorker pid=1163899)   File "python/ray/_raylet.pyx", line 972, in ray._raylet.execute_task
(RayTrainWorker pid=1163899)   File "python/ray/_raylet.pyx", line 611, in ray._raylet.store_task_errors
(RayTrainWorker pid=1163899)   File "python/ray/_raylet.pyx", line 2524, in ray._raylet.CoreWorker.store_task_outputs
(RayTrainWorker pid=1163899)   File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/_private/serialization.py", line 450, in serialize
(RayTrainWorker pid=1163899)     return self._serialize_to_msgpack(value)
(RayTrainWorker pid=1163899)   File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/_private/serialization.py", line 405, in _serialize_to_msgpack
(RayTrainWorker pid=1163899)     value = value.to_bytes()
(RayTrainWorker pid=1163899)   File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/exceptions.py", line 32, in to_bytes
(RayTrainWorker pid=1163899)     serialized_exception=pickle.dumps(self),
(RayTrainWorker pid=1163899)   File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 88, in dumps
(RayTrainWorker pid=1163899)     cp.dump(obj)
(RayTrainWorker pid=1163899)   File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 739, in dump
(RayTrainWorker pid=1163899)     raise pickle.PicklingError(msg) from e
(RayTrainWorker pid=1163899) _pickle.PicklingError: Could not pickle object as excessively deep recursion required.
(RayTrainWorker pid=1163899)
(RayTrainWorker pid=1163899) During handling of the above exception, another exception occurred:
(RayTrainWorker pid=1163899)
(RayTrainWorker pid=1163899) Traceback (most recent call last):
(RayTrainWorker pid=1163899)   File "python/ray/_raylet.pyx", line 1207, in ray._raylet.task_execution_handler
(RayTrainWorker pid=1163899) SystemExit
2024-04-08 16:20:50,103 WARNING worker.py:1866 -- Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 850, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 902, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 857, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 861, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 803, in ray._raylet.execute_task.function_executor
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/_private/function_manager.py", line 674, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 466, in _resume_span
    return method(self, *_args, **_kwargs)
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/tune/trainable/trainable.py", line 368, in train
    raise skipped from exception_cause(skipped)
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 337, in entrypoint
    return self._trainable_func(
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 466, in _resume_span
    return method(self, *_args, **_kwargs)
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/train/base_trainer.py", line 505, in _trainable_func
    super()._trainable_func(self._merged_config, reporter, checkpoint_dir)
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 654, in _trainable_func
    output = fn()
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/train/base_trainer.py", line 415, in train_func
    trainer.training_loop()
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/train/data_parallel_trainer.py", line 395, in training_loop
    self._report(training_iterator)
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/train/data_parallel_trainer.py", line 342, in _report
    for results in training_iterator:
  File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/train/trainer.py", line 134, in __next__
    next_results = self._run_with_error_handling(self._fetch_next_result)
File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/train/trainer.py", line 97, in _run_with_error_handling return func() File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/train/trainer.py", line 168, in _fetch_next_result results = self._backend_executor.get_next_results() File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/train/_internal/backend_executor.py", line 442, in get_next_results results = self.get_with_failure_handling(futures) File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/train/_internal/backend_executor.py", line 531, in get_with_failure_handling self._increment_failures() File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/train/_internal/backend_executor.py", line 593, in _increment_failures raise failure File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure ray.get(object_ref) File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper return func(*args, **kwargs) File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/_private/worker.py", line 2382, in get raise value ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. class_name: RayTrainWorker actor_id: 13b210ebcacca6c8973345ac01000000 pid: 1163899 namespace: 193b17b0-77f4-4b89-a867-d6bc259e214d ip: 172.16.26.38 The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker exits unexpectedly. Worker exits with an exit code None. Traceback (most recent call last): File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/data/dataset_pipeline.py", line 226, in iter_batches blocks_owned_by_consumer = self._peek()._plan.execute()._owned_by_consumer File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/data/dataset_pipeline.py", line 1319, in _peek first_dataset_gen = next(dataset_iter) File "/data/vijayi/dl_venv/lib64/python3.8/site-packages/ray/data/dataset_pipeline.py", line 732, in __next__ raise StopIteration StopIteration The above exception was the direct cause of the following exception: Traceback (most recent call last): File "python/ray/_raylet.pyx", line 850, in ray._raylet.execute_task File "python/ray/_raylet.pyx", line 902, in ray._raylet.execute_task