Kernel dying when trying to train model on AWS SageMaker #2395
Replies: 3 comments
-
@shubhamsharma1609
Did you mean that the training job was launched from a SageMaker-hosted notebook instance?
-
I did not use the SageMaker Python SDK. I installed keras-unet, and with a higher batch size (30) I don't get any error, but the kernel dies. I am only using the compute power of AWS SageMaker and nothing else from the SDK.
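For context, a minimal sketch of the kind of training run being described, with a smaller batch size, might look like the following. This is not the poster's actual code: the `custom_unet` builder from the keras-unet package, the three-channel 512x512 input shape, the placeholder NumPy arrays, and the batch size of 8 are all assumptions made for illustration.

```python
# Minimal sketch of keras-unet training with a reduced batch size.
# Assumptions: keras-unet's custom_unet builder, 3-channel 512x512 inputs,
# binary masks, and placeholder zero arrays standing in for the real data.
import numpy as np
from keras_unet.models import custom_unet

images = np.zeros((32, 512, 512, 3), dtype="float32")  # placeholder images
masks = np.zeros((32, 512, 512, 1), dtype="float32")   # placeholder binary masks

model = custom_unet(input_shape=(512, 512, 3), num_classes=1)
model.compile(optimizer="adam", loss="binary_crossentropy")

# A single-K80 ml.p2.xlarge has roughly 12 GB of GPU memory, and activation
# memory grows with batch size, so a batch of 30 images at 512x512 can exceed
# it. Starting small and increasing until memory runs out is a common approach.
model.fit(images, masks, batch_size=8, epochs=10)
```

The batch size that actually fits depends on the number of filters and layers in the U-Net, so it will likely need tuning for the real model and data.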
-
I have solved the problem by reducing the batch size; the notebook now runs through each epoch. I do have a follow-up question: the Jupyter session lasts 12 hours, but my training job will take longer than that. Can you please guide me on how to increase the Jupyter session duration so that it does not stop training? Also, can I close my browser without stopping the instance, and later reconnect to the Jupyter notebook to check that training is still running?
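The thread does not answer the session-length question, but for long runs it is common to write checkpoints and per-epoch metrics to disk, so that progress can be inspected later and training resumed from the last saved weights if the session is interrupted. The sketch below reuses the same assumptions as the training example above (keras-unet model, placeholder data); the file names are arbitrary, and the callback imports assume the model is built on tf.keras — with standalone Keras they would come from `keras.callbacks` instead.

```python
# Sketch: persist progress to disk so a long run can be monitored and resumed.
# Same assumptions as the training sketch above; file names are placeholders.
import numpy as np
from keras_unet.models import custom_unet
from tensorflow.keras.callbacks import ModelCheckpoint, CSVLogger

images = np.zeros((32, 512, 512, 3), dtype="float32")
masks = np.zeros((32, 512, 512, 1), dtype="float32")

model = custom_unet(input_shape=(512, 512, 3), num_classes=1)
model.compile(optimizer="adam", loss="binary_crossentropy")

callbacks = [
    # Save the best weights seen so far after each epoch.
    ModelCheckpoint("unet_best.h5", monitor="loss", save_best_only=True),
    # Append per-epoch metrics to a CSV that can be read from a new session
    # to confirm training is still progressing.
    CSVLogger("training_log.csv", append=True),
]

model.fit(images, masks, batch_size=8, epochs=50, callbacks=callbacks)

# After an interruption, weights can be reloaded before calling fit() again:
# model.load_weights("unet_best.h5")
```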
-
I am trying to train with a batch size of 30 images of size 512x512 on an ml.p2.xlarge instance, using the conda_amazonei_tensorflow_p36 notebook kernel. The kernel dies without giving any error. I am training with the keras-unet package available through pip.
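A kernel that dies without a Python traceback, where the batch size is the only thing changing, often points to a memory problem. One way to confirm this is to poll `nvidia-smi` from a terminal or a second notebook on the same instance while training runs. The snippet below is a sketch and assumes an NVIDIA GPU instance (such as ml.p2.xlarge) with `nvidia-smi` on the PATH; the polling interval and count are arbitrary.

```python
# Sketch: poll GPU memory from a second notebook or a terminal on the same
# instance while training runs elsewhere. Assumes an NVIDIA GPU instance
# with nvidia-smi available (e.g., ml.p2.xlarge).
import subprocess
import time

def gpu_memory_snapshot():
    """Return 'used, total' GPU memory as reported by nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader"]
    )
    return out.decode().strip()

# If usage climbs to the card's limit just before the kernel dies, the batch
# size (and hence per-batch activation memory) is the likely culprit.
for _ in range(10):
    print(gpu_memory_snapshot())
    time.sleep(10)
```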