This post was last edited by AngelZheng on 2018-5-19 16:19
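I ran into a problem while training the ssd_mobileNet_v1 model with the Google TensorFlow Object Detection API. With a single 1050Ti GPU, training runs without problems and completes the default 200000 steps. With two 1050Ti GPUs, however, training fails after a certain number of steps with a "Nan in summary histogram" error. What could be causing this?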
TensorFlow-GPU version: 1.6.0
Models repository: Master branch, freshly cloned from GitHub today (2018/05/15)
The key line of the error is: INFO:tensorflow:Error reported to Coordinator: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_2_depthwise/BatchNorm/moving_variance
The full error output follows:
INFO:tensorflow:Error reported to Coordinator: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_2_depthwise/BatchNorm/moving_variance
         [[Node: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_2_depthwise/BatchNorm/moving_variance = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](ModelVars/FeatureExtractor/MobilenetV1/Conv2d_2_depthwise/BatchNorm/moving_variance/tag, FeatureExtractor/MobilenetV1/Conv2d_2_depthwise/BatchNorm/moving_variance/read)]]
         [[Node: Loss/assert_equal_5/Assert/Assert/_1184 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_6302_Loss/assert_equal_5/Assert/Assert", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Caused by op 'ModelVars/FeatureExtractor/MobilenetV1/Conv2d_2_depthwise/BatchNorm/moving_variance', defined at:
  File "object_detection/train.py", line 184, in <module>
    tf.app.run()
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "object_detection/train.py", line 180, in main
    graph_hook_fn=graph_rewriter_fn)
  File "/home/jovyan/Appendix/tensorflow_models/research/object_detection/trainer.py", line 338, in train
    model_var.op.name, model_var))
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/summary/summary.py", line 193, in histogram
    tag=tag, values=values, name=scope)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/ops/gen_logging_ops.py", line 189, in _histogram_summary
    "HistogramSummary", tag=tag, values=values, name=name)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
    op_def=op_def)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_2_depthwise/BatchNorm/moving_variance
         [[Node: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_2_depthwise/BatchNorm/moving_variance = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](ModelVars/FeatureExtractor/MobilenetV1/Conv2d_2_depthwise/BatchNorm/moving_variance/tag, FeatureExtractor/MobilenetV1/Conv2d_2_depthwise/BatchNorm/moving_variance/read)]]
         [[Node: Loss/assert_equal_5/Assert/Assert/_1184 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_6302_Loss/assert_equal_5/Assert/Assert", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1361, in _do_call
    return fn(*args)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
    target_list, status, run_metadata)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_2_depthwise/BatchNorm/moving_variance
         [[Node: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_2_depthwise/BatchNorm/moving_variance = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](ModelVars/FeatureExtractor/MobilenetV1/Conv2d_2_depthwise/BatchNorm/moving_variance/tag, FeatureExtractor/MobilenetV1/Conv2d_2_depthwise/BatchNorm/moving_variance/read)]]
         [[Node: Loss/assert_equal_5/Assert/Assert/_1184 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_6302_Loss/assert_equal_5/Assert/Assert", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
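The error output above reports that a histogram summary received NaN values for a BatchNorm moving_variance variable. It does not say which step first produced the NaN or whether other variables are affected. One way to narrow this down might be to scan a snapshot of variable values for non-finite entries. The following is only a minimal sketch in plain Python, assuming the values have already been exported as lists of floats; the helper name `find_non_finite` and the sample data are hypothetical and are not part of TensorFlow or the Object Detection API.

```python
import math


def find_non_finite(named_values):
    """Return the names whose values contain NaN or +/-infinity.

    named_values: dict mapping a variable name to an iterable of floats.
    (Hypothetical helper for illustration; not a TensorFlow API.)
    """
    bad = []
    for name, values in named_values.items():
        # math.isfinite is False for NaN and for +/-infinity.
        if any(not math.isfinite(v) for v in values):
            bad.append(name)
    return bad


# Made-up sample snapshot; real values would come from a checkpoint or a session run.
snapshot = {
    "Conv2d_1_depthwise/BatchNorm/moving_variance": [0.9, 1.1, 1.0],
    "Conv2d_2_depthwise/BatchNorm/moving_variance": [1.0, float("nan"), 0.8],
}

print(find_non_finite(snapshot))
```

Running such a check on snapshots from several training steps could show whether the NaN first appears in this variable or spreads from somewhere else.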
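I searched Stack Overflow for similar questions, and the answers there suggest the cause may be a dataset that is too small or a problem with the label files.
In my tests, however, training on a single 1050Ti GPU runs without problems, so I believe the dataset is fine.
Does the TensorFlow Object Detection API not support multi-GPU training, or does it need separate configuration for it? I notice that many of the model descriptions in the models repository assume a single GPU (although that card is somewhat more powerful).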