Training the FCN network on a GPU in the cloud raises this error: FileNotFoundError: [Errno 2] No such file or directory srun: error: gpu03: task 0: Exited with exit code 1 #801
Comments
Did you manage to solve this? I'm running into the same problem.
Solved. The directory location was still set incorrectly and needed to be changed.
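Do you mean the dataset location? My dataset sits in the root directory of my personal account, at the same level as the program directory, but I still get this error when I run it.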
It also includes the path of the file you are running. I suggest switching to absolute paths and trying again.
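A minimal sketch of the absolute-path suggestion, in case it helps. The function name `resolve_data_root` and the `../data` layout are hypothetical and not taken from the repository; the idea is only to resolve a relative dataset path against the script's own folder instead of the directory `srun` happens to start the job in, and to fail early if the folder is missing.

```python
from pathlib import Path


def resolve_data_root(data_path: str, script_file: str) -> Path:
    """Return an absolute dataset path (hypothetical helper).

    A relative `data_path` is resolved against the folder containing
    `script_file`, not against the current working directory.
    """
    base = Path(script_file).resolve().parent
    root = Path(data_path).expanduser()
    if not root.is_absolute():
        root = base / root
    root = root.resolve()
    # Fail before training starts rather than in the middle of an epoch.
    if not root.is_dir():
        raise FileNotFoundError(f"dataset root not found: {root}")
    return root
```

Inside a training script this could be called as `resolve_data_root(args.data_path, __file__)`, assuming the script exposes a `data_path` argument.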
I've solved it. The cause was indeed the location of the dataset files. Thanks.
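This is the full error output. I searched online, and many posts describe it as an inter-process communication problem. How can it be solved? Which parts of the code should I change?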
Epoch: [0] [ 0/366] eta: 0:31:04 lr: 0.000000 loss: 2.1887 (2.1887) time: 5.0952 data: 0.7384
Epoch: [0] [ 10/366] eta: 0:15:24 lr: 0.000003 loss: 0.5890 (2.3867) time: 2.5974 data: 0.0681
Epoch: [0] [ 20/366] eta: 0:14:01 lr: 0.000006 loss: 0.2813 (1.7838) time: 2.2994 data: 0.0011
Epoch: [0] [ 30/366] eta: 0:13:21 lr: 0.000009 loss: 2.2992 (1.4588) time: 2.2671 data: 0.0010
Epoch: [0] [ 40/366] eta: 0:12:44 lr: 0.000011 loss: 1.2415 (1.4418) time: 2.2521 data: 0.0010
Epoch: [0] [ 50/366] eta: 0:12:14 lr: 0.000014 loss: 1.4934 (1.4652) time: 2.2295 data: 0.0010
Epoch: [0] [ 60/366] eta: 0:11:49 lr: 0.000017 loss: 0.5944 (1.4093) time: 2.2702 data: 0.0010
Epoch: [0] [ 70/366] eta: 0:11:23 lr: 0.000019 loss: 0.6704 (1.4132) time: 2.2722 data: 0.0010
Epoch: [0] [ 80/366] eta: 0:11:04 lr: 0.000022 loss: 0.3548 (1.3494) time: 2.3282 data: 0.0010
Epoch: [0] [ 90/366] eta: 0:10:39 lr: 0.000025 loss: 0.3015 (1.2649) time: 2.3509 data: 0.0011
Epoch: [0] [100/366] eta: 0:10:14 lr: 0.000028 loss: 0.6640 (1.2471) time: 2.2596 data: 0.0011
Epoch: [0] [110/366] eta: 0:09:51 lr: 0.000030 loss: 2.1179 (1.2050) time: 2.2716 data: 0.0010
Epoch: [0] [120/366] eta: 0:09:27 lr: 0.000033 loss: 2.0124 (1.2004) time: 2.3035 data: 0.0010
Epoch: [0] [130/366] eta: 0:09:04 lr: 0.000036 loss: 1.1753 (1.1981) time: 2.2837 data: 0.0010
Epoch: [0] [140/366] eta: 0:08:39 lr: 0.000039 loss: 2.3567 (1.2141) time: 2.2321 data: 0.0010
Epoch: [0] [150/366] eta: 0:08:18 lr: 0.000041 loss: 0.5729 (1.1973) time: 2.3115 data: 0.0010
Epoch: [0] [160/366] eta: 0:07:54 lr: 0.000044 loss: 0.4893 (1.2001) time: 2.3283 data: 0.0011
Epoch: [0] [170/366] eta: 0:07:30 lr: 0.000047 loss: 0.7241 (1.1839) time: 2.2304 data: 0.0011
Epoch: [0] [180/366] eta: 0:07:06 lr: 0.000050 loss: 1.3635 (1.1723) time: 2.2145 data: 0.0010
Traceback (most recent call last):
File "/public/home/2023020919/FCN/train.py", line 206, in <module>
main(args)
File "/public/home/2023020919/FCN/train.py", line 141, in main
mean_loss, lr = train_one_epoch(model, optimizer, train_loader, device, epoch,
File "/public/home/2023020919/FCN/train_utils/train_and_evals.py", line 42, in train_one_epoch
for image, target in metric_logger.log_every(data_loader, print_freq, header):
File "/public/home/2023020919/FCN/train_utils/distrributed_utils.py", line 189, in log_every
for obj in iterable:
File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
data = self._next_data()
File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
idx, data = self._get_data()
File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1295, in _get_data
success, data = self._try_get_data()
File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 495, in rebuild_storage_fd
fd = df.detach()
File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/multiprocessing/resource_sharer.py", line 86, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/multiprocessing/connection.py", line 502, in Client
c = SocketClient(address)
File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/multiprocessing/connection.py", line 630, in SocketClient
s.connect(address)
FileNotFoundError: [Errno 2] No such file or directory
srun: error: gpu03: task 0: Exited with exit code 1
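The traceback ends in the multiprocessing socket code, so it does not say which file, if any, the dataset could not open. One way to check the dataset-location explanation given in this thread is to index the dataset directly in the main process, without DataLoader workers, so that any FileNotFoundError is raised with its own path. This is only a sketch: `first_failing_index` is a hypothetical helper, and it assumes a map-style dataset (one that supports `len()` and integer indexing).

```python
from typing import Optional, Tuple


def first_failing_index(dataset) -> Optional[Tuple[int, FileNotFoundError]]:
    """Load every sample in the current process (hypothetical helper).

    Returns (index, exception) for the first sample whose loading raises
    FileNotFoundError, or None if every sample loads.
    """
    for idx in range(len(dataset)):
        try:
            dataset[idx]  # triggers the dataset's own file reads
        except FileNotFoundError as exc:
            return idx, exc
    return None
```

If this returns `None`, every sample can be read and the dataset paths are probably not the cause. Otherwise the returned exception should name the missing path.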