[BUG] preprocess.sh 1 criteo failed with 'Schema' object has no attribute 'write' #435

Open
SeekPoint opened this issue Dec 9, 2023 · 1 comment
@SeekPoint

root@0da37f00b33c:/share/criteo_data# bash preprocess.sh 1 criteo_tmp nvt 0 0 0
Warning: existing criteo_tmp is erased
Preprocessing script: NVTabular
Getting the first few examples from the uncompressed dataset...
Counting the number of samples in day_1 dataset...
The first 45840617 examples will be used in day_1 dataset.
Shuffling dataset...
Preprocessing...
Splitting into 36672493-sample training, 4584062-sample val, and 4584062-sample test datasets...
/usr/local/lib/python3.10/dist-packages/merlin/dtypes/mappings/tf.py:52: UserWarning: Tensorflow dtype mappings did not load successfully due to an error: No module named 'tensorflow'
warn(f"Tensorflow dtype mappings did not load successfully due to an error: {exc.msg}")
/usr/local/lib/python3.10/dist-packages/merlin/dtypes/mappings/torch.py:43: UserWarning: PyTorch dtype mappings did not load successfully due to an error: No module named 'torch'
warn(f"PyTorch dtype mappings did not load successfully due to an error: {exc.msg}")
2023-12-08 17:36:24,676 NVTabular processing
2023-12-08 17:36:24,699 To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2023-12-08 17:36:24,702 State start
2023-12-08 17:36:24,712 Scheduler at: tcp://127.0.0.1:40579
2023-12-08 17:36:24,712 dashboard at: 127.0.0.1:8787
2023-12-08 17:36:24,781 Start Nanny at: 'tcp://127.0.0.1:40985'
/usr/local/lib/python3.10/dist-packages/merlin/dtypes/mappings/tf.py:52: UserWarning: Tensorflow dtype mappings did not load successfully due to an error: No module named 'tensorflow'
warn(f"Tensorflow dtype mappings did not load successfully due to an error: {exc.msg}")
/usr/local/lib/python3.10/dist-packages/merlin/dtypes/mappings/torch.py:43: UserWarning: PyTorch dtype mappings did not load successfully due to an error: No module named 'torch'
warn(f"PyTorch dtype mappings did not load successfully due to an error: {exc.msg}")
2023-12-08 17:36:26,969 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-12-08 17:36:26,969 Creating preload: dask_cuda.initialize
2023-12-08 17:36:26,969 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2023-12-08 17:36:26,969 Import preload module: dask_cuda.initialize
2023-12-08 17:36:26,993 Run preload setup: dask_cuda.initialize
2023-12-08 17:36:27,086 Start worker at: tcp://127.0.0.1:46779
2023-12-08 17:36:27,086 Listening to: tcp://127.0.0.1:46779
2023-12-08 17:36:27,086 Worker name: 0
2023-12-08 17:36:27,086 dashboard at: 127.0.0.1:38661
2023-12-08 17:36:27,086 Waiting to connect to: tcp://127.0.0.1:40579
2023-12-08 17:36:27,086 -------------------------------------------------
2023-12-08 17:36:27,086 Threads: 1
2023-12-08 17:36:27,086 Memory: 251.56 GiB
2023-12-08 17:36:27,087 Local Directory: /tmp/dask-worker-space/worker-1fd2e7px
2023-12-08 17:36:27,087 Starting Worker plugin PreImport-047eaac1-e793-4dd5-b59e-bd5028f4f909
2023-12-08 17:36:27,087 Starting Worker plugin RMMSetup-c63a5516-1d21-4b3e-8187-a1f883e8fb5e
2023-12-08 17:36:27,087 Starting Worker plugin CPUAffinity-1188f5ac-00e8-4229-9bea-18511f88e776
2023-12-08 17:36:27,087 -------------------------------------------------
2023-12-08 17:36:27,097 Register worker <WorkerState 'tcp://127.0.0.1:46779', name: 0, status: init, memory: 0, processing: 0>
2023-12-08 17:36:27,103 Starting worker compute stream, tcp://127.0.0.1:46779
2023-12-08 17:36:27,103 Starting established connection to tcp://127.0.0.1:48730
2023-12-08 17:36:27,103 Registered to: tcp://127.0.0.1:40579
2023-12-08 17:36:27,103 -------------------------------------------------
2023-12-08 17:36:27,104 Starting established connection to tcp://127.0.0.1:40579
2023-12-08 17:36:27,142 Receive client connection: Client-508fa7e6-95f0-11ee-8237-0242ac110004
2023-12-08 17:36:27,143 Starting established connection to tcp://127.0.0.1:48732
2023-12-08 17:36:27,147 Run out-of-band function 'reinitialize'
/usr/local/lib/python3.10/dist-packages/merlin/core/utils.py:361: FutureWarning: The client argument is deprecated from DaskExecutor and will be removed in a future version of NVTabular. By default, a global client in the same python context will be detected automatically, and merlin.utils.set_dask_client (as well as Distributed and Serial) can be used for explicit control.
warnings.warn(
2023-12-08 17:36:45,511 Preprocessing
2023-12-08 17:36:45,863 Train Datasets Preprocessing.....
2023-12-08 17:36:46,095 Run out-of-band function 'clean_worker_cache'
2023-12-08 17:36:46,332 Run out-of-band function 'clean_worker_cache'
2023-12-08 17:37:07,465 Run out-of-band function 'clean_worker_cache'
2023-12-08 17:37:08,855 Run out-of-band function 'clean_worker_cache'
Traceback (most recent call last):
  File "/share/criteo_data/criteo_script/preprocess_nvt.py", line 418, in <module>
    process_NVT(args)
  File "/share/criteo_data/criteo_script/preprocess_nvt.py", line 216, in process_NVT
    workflow.transform(train_ds_iterator).to_hugectr(
  File "/usr/local/lib/python3.10/dist-packages/merlin/io/dataset.py", line 1070, in to_hugectr
    self.schema.write(output_path)
AttributeError: 'Schema' object has no attribute 'write'
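
The traceback suggests that the installed `Schema` class does not expose the `write` method that `to_hugectr` calls, which may point to a version mismatch between packages in the container. A minimal diagnostic sketch for collecting that information follows. The helper names `installed_versions` and `has_method` are hypothetical, and the distribution names `merlin-core`, `nvtabular`, and `dask-cuda` are assumptions about how the packages were installed.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def has_method(obj, name):
    """True if obj has a callable attribute called name."""
    return callable(getattr(obj, name, None))


if __name__ == "__main__":
    # Distribution names are assumptions; adjust to match the environment.
    report = installed_versions(["merlin-core", "nvtabular", "dask-cuda"])
    for pkg, ver in report.items():
        print(f"{pkg}: {ver or 'not installed'}")
```

Inside `preprocess_nvt.py`, a check such as `has_method(dataset.schema, "write")` on the transformed dataset (the variable name `dataset` is illustrative) would confirm whether the method is missing before `to_hugectr` is called. Including the printed versions in the report may help the maintainers reproduce the failure.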

@jershi425
Collaborator

Hi @SeekPoint, thanks for the feedback. We will fix this issue and let you know.

@jershi425 jershi425 self-assigned this Dec 14, 2023