
YOLO - Fixing the "This might be caused by insufficient shared memory" Error When Training a Model

Author: hangge | 2022-11-29 08:10

1. Problem Description

I recently set up a deep learning environment in Docker, but training YOLO inside the container fails with the following error:
AutoAnchor: 4.13 anchors/target, 1.000 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Plotting labels to runs/train/exp2/labels.jpg...
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/reductions.py", line 328, in reduce_storage
    fd, size = storage._share_fd_()
RuntimeError: falseINTERNAL ASSERT FAILED at "../aten/src/ATen/MapAllocator.cpp":300, please report a bug to PyTorch. unable to write to file </torch_1123_0>
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
DataLoader worker (pid 942) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit
....

2. Cause

    PyTorch uses shared memory (/dev/shm) to pass data between DataLoader worker processes. When the training batch size is set too large, shared memory runs out and this error is raised, because Docker caps a container's shm at only 64 MB by default.
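You can verify the limit from inside the container; with Docker's default setting, df reports a 64 MB shm mount (a quick check, assuming a standard Linux base image):
# check the size of the shared memory mount inside the container
df -h /dev/shm
# Filesystem      Size  Used Avail Use% Mounted on
# shm              64M     0   64M   0% /dev/shm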

3. Solutions

(1) One approach is to add the --shm-size flag to the container's start command to increase the shm size (note: no spaces around the =):
docker run --shm-size=256m .....
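As a fuller sketch, a GPU training container with a larger shm allocation might be started like this (the image name yolov5:latest is a placeholder, and 8g is just an example; size it to your batch size and number of DataLoader workers):
# allocate 8 GB of shared memory for the container (placeholder image name)
docker run --gpus all --shm-size=8g -it yolov5:latest bash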

(2) Another approach is to add --ipc=host to the start command, which lets the container share the host's shared memory.
docker run --ipc=host .....
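For example (again with a placeholder image name):
# share the host's IPC namespace, so /dev/shm inside the container is the host's
docker run --gpus all --ipc=host -it yolov5:latest bash
Note that --ipc=host removes IPC isolation between the container and the host, so if isolation matters, prefer the explicit --shm-size approach.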