2017-02-16 51 views

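An error occurs when running, in Cloud Shell, the sample code that Google's @SlavenBilac posted for training and classifying images with Google Cloud Machine Learning and Cloud Dataflow. The sample code for retraining with the Google Cloud ML service, started from Cloud Shell, fails with an error.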

The code gets stuck at global_step/sec: 0

INFO 2017-02-16 06:28:36 -0600  master-replica-0    Start master session 538be2b71d17c4dc with config: 
ERROR 2017-02-16 06:28:36 -0600  master-replica-0    device_filters: "/job:ps" 
ERROR 2017-02-16 06:28:36 -0600  master-replica-0    device_filters: "/job:master/task:0" 
INFO 2017-02-16 06:28:39 -0600  master-replica-0    global_step/sec: 0 
INFO 2017-02-16 06:30:39 -0600  master-replica-0    global_step/sec: 0 
INFO 2017-02-16 06:32:39 -0600  master-replica-0    global_step/sec: 0 
INFO 2017-02-16 06:34:39 -0600  master-replica-0    global_step/sec: 0 
INFO 2017-02-16 06:36:39 -0600  master-replica-0    global_step/sec: 0 
INFO 2017-02-16 06:38:39 -0600  master-replica-0    global_step/sec: 0 
INFO 2017-02-16 06:40:39 -0600  master-replica-0    global_step/sec: 0 
<keeps repeating until I kill the job> 

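Based on the answer from Google's @JoshGC to a similar question, I created a brand-new Google Cloud account (with a new billing account, a new project, and so on), ran the Cloud Shell setup script and the other steps to set up the environment, and then ran the sample code against the sample flowers data. The error still occurred (shown below), so I do not think the cause is related to my data or my account configuration.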

How can I modify the files from GoogleCloudPlatform/cloudml-samples/flowers to avoid this error?

Excerpts:

Running the sample code

[email protected]:~/google-cloud-ml/samples/flowers$ ./sample.sh 

Your active configuration is: [cloudshell-18758] 
Using job id: flowers_cfinley3_20170216_045347 

Preprocessing appears to be OK

python trainer/preprocess.py \ 
    --input_dict "$DICT_FILE" \ 
    --input_path "gs://cloud-ml-data/img/flower_photos/train_set.csv" \ 
    --output_path "${GCS_PATH}/preprocess/train" \ 
    --cloud 

Training starts

gcloud beta ml jobs submit training "$JOB_ID" \ 
    --module-name trainer.task \ 
    --package-path trainer \ 
    --staging-bucket "$BUCKET" \ 
    --region us-central1 \ 
    -- \ 
    --output_path "${GCS_PATH}/training" \ 
    --eval_data_paths "${GCS_PATH}/preproc/eval*" \ 
    --train_data_paths "${GCS_PATH}/preproc/train*" 
Job [flowers_cfinley3_20170216_045347] submitted successfully. 
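
One thing worth ruling out from the two excerpts above: as posted, preprocessing writes to "${GCS_PATH}/preprocess/train" while training reads "${GCS_PATH}/preproc/train*". Whether that reflects the actual script or a slip in copying the excerpt is not clear from the post. The following is a minimal Python sketch of a sanity check for this kind of mismatch; the helper names and the `GCS_PATH` value are hypothetical and only illustrate the comparison.

```python
def literal_prefix(glob_pattern):
    """Return the part of a glob pattern before its first wildcard character."""
    for i, ch in enumerate(glob_pattern):
        if ch in "*?[":
            return glob_pattern[:i]
    return glob_pattern


def could_match(output_prefix, glob_pattern):
    """True if files whose names start with output_prefix could match glob_pattern."""
    literal = literal_prefix(glob_pattern)
    return output_prefix.startswith(literal) or literal.startswith(output_prefix)


# Placeholder standing in for ${GCS_PATH}; not a real bucket.
GCS_PATH = "gs://example-bucket/user/job"

written = GCS_PATH + "/preprocess/train"  # --output_path in the preprocessing excerpt
read = GCS_PATH + "/preproc/train*"       # --train_data_paths in the training excerpt

print(could_match(written, read))                      # the paths as posted
print(could_match(GCS_PATH + "/preproc/train", read))  # the same prefix on both sides
```

If the first check fails for the real script, the training job would be given a pattern that matches no files, which is consistent with a job that never advances.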

Training gets stuck at global_step/sec: 0

INFO 2017-02-16 06:24:48 -0600  unknown_task   Validating job requirements... 
INFO 2017-02-16 06:24:48 -0600  unknown_task   Job creation request has been successfully validated. 
INFO 2017-02-16 06:24:48 -0600  unknown_task   Job flowers_cfinley3_20170216_045347 is queued. 
INFO 2017-02-16 06:24:55 -0600  unknown_task   Waiting for job to be provisioned. 
INFO 2017-02-16 06:24:55 -0600  unknown_task   Waiting for TensorFlow to start. 
INFO 2017-02-16 06:28:27 -0600  master-replica-0    Running task with arguments: --cluster={"master": ["master-9a431abe8e-0:2222"]} --task={"type": "master", "index": 0} --job={ 
INFO 2017-02-16 06:28:27 -0600  master-replica-0     "package_uris": ["gs://wordthree-wordfour-7654321-ml/flowers_cfinley3_20170216_045347/edafa5c7debed9fc8612af3c0dd33d145e23502e/trainer-0.1.tar.gz"], 
INFO 2017-02-16 06:28:27 -0600  master-replica-0     "python_module": "trainer.task", 
INFO 2017-02-16 06:28:27 -0600  master-replica-0     "args": ["--output_path", "gs://wordthree-wordfour-7654321-ml/cfinley3/flowers_cfinley3_20170216_045347/training", "--eval_data_paths", "gs://wordthree-wordfour-7654321-ml/cfinley3/flowers_cfinley3_20170216_045347/preproc/eval*", "--train_data_paths", "gs://wordthree-wordfour-7654321-ml/cfinley3/flowers_cfinley3_20170216_045347/preproc/train*"], 
INFO 2017-02-16 06:28:27 -0600  master-replica-0     "region": "us-central1" 
INFO 2017-02-16 06:28:27 -0600  master-replica-0    } --beta 
INFO 2017-02-16 06:28:28 -0600  master-replica-0    Running module trainer.task. 
INFO 2017-02-16 06:28:28 -0600  master-replica-0    Running command: gsutil -q cp gs://wordthree-wordfour-7654321-ml/flowers_cfinley3_20170216_045347/edafa5c7debed9fc8612af3c0dd33d145e23502e/trainer-0.1.tar.gz trainer-0.1.tar.gz 
INFO 2017-02-16 06:28:29 -0600  master-replica-0    Installing the package: gs://wordthree-wordfour-7654321-ml/flowers_cfinley3_20170216_045347/edafa5c7debed9fc8612af3c0dd33d145e23502e/trainer-0.1.tar.gz 
INFO 2017-02-16 06:28:29 -0600  master-replica-0    Running command: pip install --user --upgrade --force-reinstall trainer-0.1.tar.gz 
INFO 2017-02-16 06:28:29 -0600  master-replica-0    Processing ./trainer-0.1.tar.gz 
INFO 2017-02-16 06:28:30 -0600  master-replica-0    Building wheels for collected packages: trainer 
INFO 2017-02-16 06:28:30 -0600  master-replica-0     Running setup.py bdist_wheel for trainer: started 
INFO 2017-02-16 06:28:30 -0600  master-replica-0    creating '/tmp/tmpn9HeiIpip-wheel-/trainer-0.1-cp27-none-any.whl' and adding '.' to it 
INFO 2017-02-16 06:28:30 -0600  master-replica-0    adding 'trainer/model.py' 
INFO 2017-02-16 06:28:30 -0600  master-replica-0    adding 'trainer/__init__.py' 
INFO 2017-02-16 06:28:30 -0600  master-replica-0    adding 'trainer/util.py' 
INFO 2017-02-16 06:28:30 -0600  master-replica-0    adding 'trainer/preprocess.py' 
INFO 2017-02-16 06:28:30 -0600  master-replica-0    adding 'trainer-0.1.dist-info/DESCRIPTION.rst' 
INFO 2017-02-16 06:28:30 -0600  master-replica-0    adding 'trainer-0.1.dist-info/metadata.json' 
INFO 2017-02-16 06:28:30 -0600  master-replica-0    adding 'trainer-0.1.dist-info/top_level.txt' 
INFO 2017-02-16 06:28:30 -0600  master-replica-0    adding 'trainer-0.1.dist-info/METADATA' 
INFO 2017-02-16 06:28:30 -0600  master-replica-0    adding 'trainer-0.1.dist-info/RECORD' 
INFO 2017-02-16 06:28:30 -0600  master-replica-0     Running setup.py bdist_wheel for trainer: finished with status 'done' 
INFO 2017-02-16 06:28:30 -0600  master-replica-0     Stored in directory: /root/.cache/pip/wheels/e8/0c/c7/b77d64796dbbac82503870c4881d606fa27e63942e07c75f0e 
INFO 2017-02-16 06:28:30 -0600  master-replica-0    Successfully built trainer 
INFO 2017-02-16 06:28:30 -0600  master-replica-0    Installing collected packages: trainer 
INFO 2017-02-16 06:28:30 -0600  master-replica-0    Successfully installed trainer-0.1 
INFO 2017-02-16 06:28:31 -0600  master-replica-0    Running command: python -m trainer.task --output_path gs://wordthree-wordfour-7654321-ml/cfinley3/flowers_cfinley3_20170216_045347/training --eval_data_paths gs://wordthree-wordfour-7654321-ml/cfinley3/flowers_cfinley3_20170216_045347/preproc/eval* --train_data_paths gs://wordthree-wordfour-7654321-ml/cfinley3/flowers_cfinley3_20170216_045347/preproc/train* 
INFO 2017-02-16 06:28:34 -0600  master-replica-0    Original job data: {u'package_uris': [u'gs://wordthree-wordfour-7654321-ml/flowers_cfinley3_20170216_045347/edafa5c7debed9fc8612af3c0dd33d145e23502e/trainer-0.1.tar.gz'], u'args': [u'--output_path', u'gs://wordthree-wordfour-7654321-ml/cfinley3/flowers_cfinley3_20170216_045347/training', u'--eval_data_paths', u'gs://wordthree-wordfour-7654321-ml/cfinley3/flowers_cfinley3_20170216_045347/preproc/eval*', u'--train_data_paths', u'gs://wordthree-wordfour-7654321-ml/cfinley3/flowers_cfinley3_20170216_045347/preproc/train*'], u'python_module': u'trainer.task', u'region': u'us-central1'} 
INFO 2017-02-16 06:28:34 -0600  master-replica-0    setting eval batch size to 100 
INFO 2017-02-16 06:28:34 -0600  master-replica-0    Starting master/0 
INFO 2017-02-16 06:28:34 -0600  master-replica-0    Initialize GrpcChannelCache for job master -> {0 -> localhost:2222} 
INFO 2017-02-16 06:28:34 -0600  master-replica-0    Started server with target: grpc://localhost:2222 
WARNING 2017-02-16 06:28:35 -0600  master-replica-0    From /root/.local/lib/python2.7/site-packages/trainer/task.py:211 in run_training.: merge_all_summaries (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30. 
WARNING 2017-02-16 06:28:35 -0600  master-replica-0    Instructions for updating: 
WARNING 2017-02-16 06:28:35 -0600  master-replica-0    Please switch to tf.summary.merge_all. 
WARNING 2017-02-16 06:28:35 -0600  master-replica-0    From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/logging_ops.py:270 in merge_all_summaries.: merge_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30. 
WARNING 2017-02-16 06:28:35 -0600  master-replica-0    Instructions for updating: 
WARNING 2017-02-16 06:28:35 -0600  master-replica-0    Please switch to tf.summary.merge. 
INFO 2017-02-16 06:28:36 -0600  master-replica-0    Start master session 538be2b71d17c4dc with config: 
ERROR 2017-02-16 06:28:36 -0600  master-replica-0    device_filters: "/job:ps" 
ERROR 2017-02-16 06:28:36 -0600  master-replica-0    device_filters: "/job:master/task:0" 
INFO 2017-02-16 06:28:39 -0600  master-replica-0    global_step/sec: 0 
INFO 2017-02-16 06:30:39 -0600  master-replica-0    global_step/sec: 0 
INFO 2017-02-16 06:32:39 -0600  master-replica-0    global_step/sec: 0 
INFO 2017-02-16 06:34:39 -0600  master-replica-0    global_step/sec: 0 
INFO 2017-02-16 06:36:39 -0600  master-replica-0    global_step/sec: 0 
INFO 2017-02-16 06:38:39 -0600  master-replica-0    global_step/sec: 0 
INFO 2017-02-16 06:40:39 -0600  master-replica-0    global_step/sec: 0 
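
The stall in the log above can also be spotted automatically instead of by eye. The following is a minimal Python sketch, assuming the log lines are available as plain text; the function names and the threshold of three reports are hypothetical choices, not part of the sample code or of any Google tool.

```python
import re

# Matches the rate reported in lines such as "global_step/sec: 0".
STEP_RATE = re.compile(r"global_step/sec:\s*(\d+(?:\.\d+)?)")


def trailing_zero_reports(log_lines):
    """Count the consecutive zero global_step/sec reports at the end of the log."""
    count = 0
    for line in log_lines:
        match = STEP_RATE.search(line)
        if match is None:
            continue  # not a step-rate report
        if float(match.group(1)) == 0.0:
            count += 1
        else:
            count = 0  # progress was made, so restart the count
    return count


def looks_stalled(log_lines, threshold=3):
    """Treat the job as stalled after `threshold` zero reports in a row (assumed value)."""
    return trailing_zero_reports(log_lines) >= threshold


# Lines copied from the log in the question.
log = [
    "INFO 2017-02-16 06:28:36 -0600  master-replica-0    Start master session 538be2b71d17c4dc with config:",
    "INFO 2017-02-16 06:28:39 -0600  master-replica-0    global_step/sec: 0",
    "INFO 2017-02-16 06:30:39 -0600  master-replica-0    global_step/sec: 0",
    "INFO 2017-02-16 06:32:39 -0600  master-replica-0    global_step/sec: 0",
]
print(looks_stalled(log))
```

A check like this only reports that no steps are completing; it does not say why.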

Answers

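See this similar question. Check your input data files and make sure they are not empty. Empty data files can cause this behavior, because TF will wait for data forever.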
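
One way to carry out this check is to list the files behind the train_data_paths and eval_data_paths patterns with `gsutil ls -l` and look for zero-byte objects, or for a pattern that matches nothing at all. The following is a minimal Python sketch that parses such a listing; the function name and the sample listing (bucket, file names, sizes) are made up for illustration. Note that a compressed file with no records is usually not zero bytes, so this is only a rough first check.

```python
def find_empty_objects(listing):
    """Return (object_count, empty_urls) parsed from `gsutil ls -l` output text."""
    total = 0
    empty = []
    for line in listing.splitlines():
        parts = line.split()
        # Object lines look like: <size> <timestamp> <gs:// URL>.
        # The TOTAL summary line and anything else is skipped.
        if len(parts) < 3 or not parts[0].isdigit() or not parts[-1].startswith("gs://"):
            continue
        total += 1
        if int(parts[0]) == 0:
            empty.append(parts[-1])
    return total, empty


# Hypothetical listing, shaped like the output of `gsutil ls -l`.
SAMPLE = """\
   1048576  2017-02-16T12:00:00Z  gs://example-bucket/preproc/train-00000-of-00002.tfrecord.gz
         0  2017-02-16T12:00:00Z  gs://example-bucket/preproc/train-00001-of-00002.tfrecord.gz
TOTAL: 2 objects, 1048576 bytes (1 MiB)
"""

count, empty = find_empty_objects(SAMPLE)
print(count, empty)
```

A count of zero means the pattern matched no files, which would leave the trainer waiting for input just as empty files would.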

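Thank you for your response. Before posting my question, I read that question and checked my input files, which are not empty. As recently as February 11, the sample code ran without error in my Cloud Shell. To rule out my data or account configuration as the cause, on February 15 (the day before I posted my question) I created a brand-new Google Cloud account and used it only to run the sample code in Cloud Shell on February 16, with the sample flowers data published by Google. The error from that run is the one posted in my question. Do you think a recent change broke the sample code for retraining on the flowers data in Cloud Shell?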

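Good news. I noticed today that the sample code was updated on February 21 (five days after I posted my question), so I ran the code again and did not encounter the error. Marking this as the answer, thank you.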