Description
PyTorch transfer learning and TorchServe model load failure.
I followed the transfer learning tutorial as shown here (https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html) and in this example (https://github.com/pytorch/tutorials/blob/master/beginner_source/transfer_learning_tutorial.py), then added a model-save step at the end of the Python script (https://www.dropbox.com/s/jru4p9hbbazm7zn/transfer_learning_tutorial.py?dl=0).
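For context, the save step looks roughly like this, assuming a state_dict save of the tutorial's 2-class fine-tuned ResNet-18 (the exact code I used is in the Dropbox script above):

import torch
import torch.nn as nn
from torchvision import models

# Rebuild the tutorial's architecture: ResNet-18 with the final fc layer
# replaced by a 2-class head (ants vs. bees).
model_conv = models.resnet18(pretrained=True)
model_conv.fc = nn.Linear(model_conv.fc.in_features, 2)

# ... fine-tuning loop from the tutorial runs here ...

# Save only the weights; torch-model-archiver then needs a --model-file
# that defines the same architecture.
torch.save(model_conv.state_dict(), "transferlearningmodel.pth")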
I exported the model and converted it to .mar format (https://www.dropbox.com/s/m29a1h1y0u6haa8/model_save.tar?dl=0) using this archiver command:
torch-model-archiver --model-name transferlearningmodel --version 1.0 --model-file ~/torchserve/serve/examples/image_classifier/resnet_18/model.py --serialized-file ~/pytorch/tutorials/beginner_source/model_save/transferlearningmodel.pth --export-path model_save --extra-files ~/torchserve/serve/examples/image_classifier/index_to_name.json --handler image_classifier
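Note that --model-file above points at the stock resnet_18 example's model.py. A model file that matches the fine-tuned checkpoint would need the same 2-class head, roughly like this (a sketch, not the actual example file):

from torchvision.models.resnet import ResNet, BasicBlock

class ImageClassifier(ResNet):
    # ResNet-18 configuration (BasicBlock, [2, 2, 2, 2]) with num_classes=2
    # so the fc layer shape matches the saved state_dict.
    def __init__(self):
        super(ImageClassifier, self).__init__(BasicBlock, [2, 2, 2, 2], num_classes=2)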
Once the archiver finished producing the .mar file, I started TorchServe:
(base) MacBook-Pro:~/pytorch/tutorials/beginner_source quantum-fusion$ torchserve --start --ncs --model-store ~/pytorch/tutorials/beginner_source/model_save --models transferlearningmodel.mar
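After starting the server, registration can be checked through the management API, roughly like this (a sketch, assuming TorchServe's default management port 8081 on localhost):

# Query the list-models endpoint of the management API and pretty-print it.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8081/models") as resp:
    print(json.dumps(json.loads(resp.read()), indent=2))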
The model did not load successfully; see the errors below (Load model failed: transferlearningmodel, error: Worker died):
2020-08-16 16:07:01,994 [DEBUG] W-9013-transferlearningmodel_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException
at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1668)
at java.base/java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:435)
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:129)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
at java.base/java.lang.Thread.run(Thread.java:832)
2020-08-16 16:07:01,994 [WARN ] W-9013-transferlearningmodel_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: transferlearningmodel, error: Worker died.
2020-08-16 16:07:01,994 [DEBUG] W-9013-transferlearningmodel_1.0 org.pytorch.serve.wlm.WorkerThread - W-9013-transferlearningmodel_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2020-08-16 16:07:01,994 [INFO ] W-9013-transferlearningmodel_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9013-transferlearningmodel_1.0-stdout
2020-08-16 16:07:01,994 [INFO ] W-9013-transferlearningmodel_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9013-transferlearningmodel_1.0-stderr
2020-08-16 16:07:01,994 [WARN ] W-9013-transferlearningmodel_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9013-transferlearningmodel_1.0-stderr
2020-08-16 16:07:01,994 [WARN ] W-9013-transferlearningmodel_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9013-transferlearningmodel_1.0-stdout
2020-08-16 16:07:01,994 [INFO ] W-9013-transferlearningmodel_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9013 in 55 seconds.
2020-08-16 16:07:02,003 [INFO ] KQueueEventLoopGroup-4-29 org.pytorch.serve.wlm.WorkerThread - 9004 Worker disconnected. WORKER_STARTED
2020-08-16 16:07:02,003 [DEBUG] W-9004-transferlearningmodel_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2020-08-16 16:07:02,003 [DEBUG] W-9004-transferlearningmodel_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException
at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1668)
at java.base/java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:435)
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:129)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
at java.base/java.lang.Thread.run(Thread.java:832)
2020-08-16 16:07:02,004 [WARN ] W-9004-transferlearningmodel_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: transferlearningmodel, error: Worker died.
2020-08-16 16:07:02,004 [DEBUG] W-9004-transferlearningmodel_1.0 org.pytorch.serve.wlm.WorkerThread - W-9004-transferlearningmodel_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2020-08-16 16:07:02,004 [INFO ] W-9004-transferlearningmodel_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9004-transferlearningmodel_1.0-stdout
2020-08-16 16:07:02,004 [INFO ] W-9004-transferlearningmodel_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9004-transferlearningmodel_1.0-stderr
2020-08-16 16:07:02,004 [WARN ] W-9004-transferlearningmodel_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9004-transferlearningmodel_1.0-stderr
2020-08-16 16:07:02,004 [WARN ] W-9004-transferlearningmodel_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9004-transferlearningmodel_1.0-stdout
2020-08-16 16:07:02,004 [INFO ] W-9004-transferlearningmodel_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9004 in 55 seconds.
(base) MacBook-Pro:~/pytorch/tutorials/beginner_source quantum-fusion$ torchserve --stop
2020-08-16 16:07:25,194 [INFO ] main org.pytorch.serve.ModelServer - Torchserve stopped.
TorchServe has stopped.
2020-08-16 16:07:22,985 [INFO ] KQueueEventLoopGroup-2-2 org.pytorch.serve.ModelServer - Management model server stopped.
2020-08-16 16:07:22,985 [INFO ] KQueueEventLoopGroup-2-1 org.pytorch.serve.ModelServer - Inference model server stopped.