TensorFlow Object Detection with multiple GPUs

Last edited by AngelZheng on 2018-5-19 16:19

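I ran into a problem while training the ssd_mobileNet_v1 model with the Google TensorFlow Object Detection API. When I train on a single 1050Ti GPU, training runs without problems and completes the default 200000 steps. When I train on two 1050Ti GPUs, however, training fails after a certain number of steps with a "Nan in summary histogram" error. What is going on?

TensorFlow-GPU version: 1.6.0
The models repository is the master branch, freshly cloned from GitHub today (2018/05/15).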

Here is the key part of the error:
INFO:tensorflow:Error reported to Coordinator: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_2_depthwise/BatchNorm/moving_variance

Here is the full error output:

INFO:tensorflow:Error reported to Coordinator: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_2_depthwise/BatchNorm/moving_variance
         [[Node: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_2_depthwise/BatchNorm/moving_variance = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](ModelVars/FeatureExtractor/MobilenetV1/Conv2d_2_depthwise/BatchNorm/moving_variance/tag, FeatureExtractor/MobilenetV1/Conv2d_2_depthwise/BatchNorm/moving_variance/read)]]
         [[Node: Loss/assert_equal_5/Assert/Assert/_1184 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_6302_Loss/assert_equal_5/Assert/Assert", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Caused by op 'ModelVars/FeatureExtractor/MobilenetV1/Conv2d_2_depthwise/BatchNorm/moving_variance', defined at:
  File "object_detection/train.py", line 184, in <module>
    tf.app.run()
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "object_detection/train.py", line 180, in main
    graph_hook_fn=graph_rewriter_fn)
  File "/home/jovyan/Appendix/tensorflow_models/research/object_detection/trainer.py", line 338, in train
    model_var.op.name, model_var))
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/summary/summary.py", line 193, in histogram
    tag=tag, values=values, name=scope)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/ops/gen_logging_ops.py", line 189, in _histogram_summary
    "HistogramSummary", tag=tag, values=values, name=name)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
    op_def=op_def)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_2_depthwise/BatchNorm/moving_variance
         [[Node: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_2_depthwise/BatchNorm/moving_variance = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](ModelVars/FeatureExtractor/MobilenetV1/Conv2d_2_depthwise/BatchNorm/moving_variance/tag, FeatureExtractor/MobilenetV1/Conv2d_2_depthwise/BatchNorm/moving_variance/read)]]
         [[Node: Loss/assert_equal_5/Assert/Assert/_1184 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_6302_Loss/assert_equal_5/Assert/Assert", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1361, in _do_call
    return fn(*args)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
    target_list, status, run_metadata)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_2_depthwise/BatchNorm/moving_variance
         [[Node: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_2_depthwise/BatchNorm/moving_variance = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](ModelVars/FeatureExtractor/MobilenetV1/Conv2d_2_depthwise/BatchNorm/moving_variance/tag, FeatureExtractor/MobilenetV1/Conv2d_2_depthwise/BatchNorm/moving_variance/read)]]
         [[Node: Loss/assert_equal_5/Assert/Assert/_1184 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_6302_Loss/assert_equal_5/Assert/Assert", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
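
The error says that the histogram summary for the BatchNorm moving_variance variable received a NaN value. As a rough illustration of that kind of check, here is a minimal plain-Python sketch that reports which named values contain a non-finite number. It is not part of the Object Detection API: the function name find_non_finite and the sample values are hypothetical, and it assumes each variable has already been flattened into a list of floats.

```python
import math


def find_non_finite(named_values):
    """Return the names whose values contain NaN or infinity.

    named_values maps a variable name to a flat list of floats
    (a hypothetical snapshot, not something the API hands out directly).
    """
    bad = []
    for name, values in named_values.items():
        if any(not math.isfinite(v) for v in values):
            bad.append(name)
    return bad


# Hypothetical snapshot of two variables; the numbers are made up.
snapshot = {
    "BatchNorm/moving_mean": [0.1, -0.3, 0.7],
    "BatchNorm/moving_variance": [1.0, float("nan"), 0.9],
}
print(find_non_finite(snapshot))
```

Running a check like this on values saved every few steps could help narrow down when a variable first turns into NaN, which is earlier than the point where the summary op fails.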

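I searched Stack Overflow for related questions. The answers there suggest the cause may be a dataset that is too small or a problem with the label file.

In my tests, training on a single 1050Ti GPU ran without problems, so the dataset is fine.

Does the TensorFlow Object Detection API not support multi-GPU training, or does it need separate configuration? I noticed that many of the model descriptions in the models repository use a single GPU as the baseline (although that card is somewhat more powerful).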



2 replies

滴血森卡 (TF豆豆), posted on 2018-5-15 23:12:16
The part under "Here is the full error output" doesn't look complete, buddy.
AngelZheng (TF荚荚), posted on 2018-5-19 16:18:18
Last edited by AngelZheng on 2018-5-19 16:22

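Hmm, I found the cause: the multi-GPU training parameters have to be passed when running the training command. Because multiple GPUs are used, you need to specify the number of ws and ps (in the command I used, the flags --num_clones and --ps_tasks):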

    python object_detection/train.py --train_dir='data' --pipeline_config_path='personal.config' --num_clones=2 --ps_tasks=1
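
For anyone scripting this, here is a minimal sketch that assembles the same command for a given number of GPUs. The helper name build_train_command is hypothetical. The four flags are the ones from the command in this post, and setting --num_clones to the number of GPUs is an assumption based on my two-GPU case, not something I have checked against the API documentation.

```python
import shlex


def build_train_command(train_dir, pipeline_config_path, num_clones, ps_tasks=1):
    """Build the train.py argument list with the multi-GPU flags from this post."""
    if num_clones < 1:
        raise ValueError("num_clones must be at least 1")
    return [
        "python",
        "object_detection/train.py",
        "--train_dir=" + train_dir,
        "--pipeline_config_path=" + pipeline_config_path,
        "--num_clones={}".format(num_clones),  # assumed: one clone per GPU
        "--ps_tasks={}".format(ps_tasks),
    ]


# Same values as the command in this post: two GPUs, one ps task.
cmd = build_train_command("data", "personal.config", num_clones=2)
print(" ".join(shlex.quote(part) for part in cmd))
```

The list form can be passed directly to subprocess.run, which avoids the quoting mistakes that are easy to make when a long command is split across lines.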


I found this in an issue while browsing GitHub:
https://github.com/tensorflow/models/issues/1972