Error reproducing competition results #32

Open
ndkulkarni opened this issue Apr 25, 2019 · 2 comments

@ndkulkarni

I am trying to reproduce the competition results based on the instructions in the README.

  1. I download and unzip the files from the Kaggle competition into the data/ folder

  2. I run the command python make_features.py data/vars --add_days=63 which creates the following pickle files: 2017-08-15_2017-09-11.pkl, all.pkl, train_2.pkl and the directory vars/ in the data/ folder

  3. I run the trainer python trainer.py --name s32 --hparam_set=s32 --n_models=3 --name s32 --no_eval --no_forward_split --asgd_decay=0.99 --max_steps=11500 --save_from_step=10500 and receive the following error:

UnknownError (see above for traceback): CUDNN_STATUS_EXECUTION_FAILED in tensorflow/stream_executor/cuda/cuda_dnn.cc(944): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'

I am using a p3.2xlarge AWS instance running the Deep Learning AMI, with Python 3.6.5 and tensorflow-gpu==1.12.0

If I downgrade to TF-GPU 1.10, I still get the same error.

How can I resolve this?
Full output from train command

@limu1928

I have the same problem. Did you figure it out?

@limu1928

Simply launching a new instance fixes it.
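
Since a fresh instance made the error go away, the failure may have come from the state of the GPU on the old instance rather than from the code. That is only a guess, and nothing in this thread confirms it. For anyone who wants to check before rerunning trainer.py, here is a minimal sketch that prints the Python version and what nvidia-smi reports for each GPU. It assumes nvidia-smi is on the PATH; gpu_report is a hypothetical helper name and is not part of this repo.

```python
import platform
import shutil
import subprocess


def gpu_report():
    """Return the Python version and, when nvidia-smi is available, one line per GPU."""
    report = {"python": platform.python_version(), "gpus": None}

    # Assumption: the NVIDIA driver tools are installed and nvidia-smi is on the PATH.
    smi = shutil.which("nvidia-smi")
    if smi is None:
        return report

    try:
        result = subprocess.run(
            [
                smi,
                "--query-gpu=name,driver_version,memory.used,memory.total",
                "--format=csv,noheader",
            ],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,  # text mode; works on Python 3.6
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return report

    if result.returncode == 0:
        report["gpus"] = [
            line.strip() for line in result.stdout.splitlines() if line.strip()
        ]
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(key, "=", value)
```

If memory.used is already high before training starts, another process may still be holding the GPU. A "gpus = None" result means nvidia-smi was not found or did not run successfully.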
