
How to load a checkpoint file and continue training with a slightly different graph structure

While training my graph, I realized I had forgotten to add dropout to it. But I have already been training for a long time and have several checkpoints. So can I load a checkpoint, add a dropout layer, and then continue training? My code currently looks like this:

import os
import tensorflow as tf

# create the graph
vgg_fcn = fcn8_vgg_ours.FCN8VGG()
with tf.name_scope("content_vgg"):
    vgg_fcn.build(batch_images, train=True, debug=True)
labels = tf.placeholder("int32", [None, HEIGHT, WIDTH])
# do something
...
#####
init_glb = tf.global_variables_initializer()
init_loc = tf.local_variables_initializer()
sess.run(init_glb)
sess.run(init_loc)
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
ckpt_dir = "./checkpoints"
if not os.path.exists(ckpt_dir):
    os.makedirs(ckpt_dir)
ckpt = tf.train.get_checkpoint_state(ckpt_dir)
start = 0
if ckpt and ckpt.model_checkpoint_path:
    # checkpoint paths end in "-<epoch>", so recover the epoch to resume from
    start = int(ckpt.model_checkpoint_path.split("-")[1])
    print("start by epoch: %d" % start)
    saver = tf.train.Saver()
    saver.restore(sess, ckpt.model_checkpoint_path)
last_save_epoch = start
# continue training

So if I change the structure of FCN8VGG (add some dropout layers), will restoring from the meta file replace the graph I have just created? If so, how can I change the structure and continue training without having to train from scratch again?
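For concreteness, this is roughly what I am imagining (a sketch only: build_with_dropout, the placeholders, and the layer sizes are all made up; since tf.nn.dropout creates no new variables, I assume the old variables can still be matched by name):

import tensorflow as tf

# Sketch: rebuild the graph with dropout inserted, then restore just the
# variables that already exist in the old checkpoint. tf.nn.dropout creates
# no new variables, so every checkpointed variable should still match by name.
keep_prob = tf.placeholder(tf.float32)

def build_with_dropout(images):  # hypothetical stand-in for FCN8VGG.build
    features = tf.contrib.layers.fully_connected(images, num_outputs=64)
    features = tf.nn.dropout(features, keep_prob=keep_prob)  # the new layer
    return tf.contrib.layers.fully_connected(features, num_outputs=10,
                                             activation_fn=None)

images = tf.placeholder(tf.float32, [None, 784])
logits = build_with_dropout(images)

ckpt = tf.train.get_checkpoint_state("./checkpoints")
checkpoint_names = set(name for name, _ in
                       tf.contrib.framework.list_variables(ckpt.model_checkpoint_path))
restorable = [v for v in tf.global_variables() if v.op.name in checkpoint_names]
saver = tf.train.Saver(restorable)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # covers any brand-new variables
    saver.restore(sess, ckpt.model_checkpoint_path)  # then overwrite the matches
    # ... continue training from here ...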


There is a transfer learning tutorial on the official site showing how to modify the last layer of a model, but I found no example of adding layers; the graph_editor in contrib may be of some help

Answers


Here's a simple example of initializing a new model's variables from another model's checkpoint. Note that things are much simpler if you can just pass a variable_scope to init_from_checkpoint, but here I'm assuming the original model was not designed with restoring in mind.
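For reference, that simpler case would look something like the following sketch (it assumes the original model had been built under a variable scope, which is not the case below; the scope name and checkpoint path are made up):

import tensorflow as tf

# Sketch of the simple case: if the original model lived under a variable
# scope, the whole scope can be mapped at once. "feature_extractor" is a
# hypothetical scope name.
with tf.Graph().as_default():
    fake_input = tf.constant([[1., 2., 3., 4.]])
    with tf.variable_scope("feature_extractor"):
        hidden = tf.contrib.layers.fully_connected(fake_input, num_outputs=5)
    # Map every checkpoint variable under "feature_extractor/" onto the
    # identically named variable in the current graph.
    tf.contrib.framework.init_from_checkpoint(
        './feature_extractor_checkpoint',
        {'feature_extractor/': 'feature_extractor/'})
    with tf.Session() as session:
        session.run(tf.global_variables_initializer())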

First define a simple model with some variables, and do some training:

import tensorflow as tf

def first_model():
    with tf.Graph().as_default():
        fake_input = tf.constant([[1., 2., 3., 4.],
                                  [5., 6., 7., 8.]])
        layer_one_output = tf.contrib.layers.fully_connected(
            inputs=fake_input, num_outputs=5, activation_fn=None)
        layer_two_output = tf.contrib.layers.fully_connected(
            inputs=layer_one_output, num_outputs=1, activation_fn=None)
        target = tf.constant([[10.], [-3.]])
        loss = tf.reduce_sum((layer_two_output - target) ** 2)
        train_op = tf.train.AdamOptimizer(0.01).minimize(loss)
        init_op = tf.global_variables_initializer()
        saver = tf.train.Saver()
        with tf.Session() as session:
            session.run(init_op)
            for i in range(1000):
                _, evaled_loss = session.run([train_op, loss])
                if i % 100 == 0:
                    print(i, evaled_loss)
            saver.save(session, './first_model_checkpoint')

Running first_model(), training looks fine and we get first_model_checkpoint written:

0 109.432 
100 0.0812649 
200 8.97705e-07 
300 9.64064e-11 
400 9.09495e-13 
500 0.0 
600 0.0 
700 0.0 
800 0.0 
900 0.0 

Next, we can define a completely new model in a different graph, and initialize the variables it shares with first_model from that checkpoint:

def second_model():
    previous_variables = [
        var_name for var_name, _
        in tf.contrib.framework.list_variables('./first_model_checkpoint')]
    with tf.Graph().as_default():
        fake_input = tf.constant([[1., 2., 3., 4.],
                                  [5., 6., 7., 8.]])
        layer_one_output = tf.contrib.layers.fully_connected(
            inputs=fake_input, num_outputs=5, activation_fn=None)
        # Add a batch_norm layer, which creates some new variables. Replacing this
        # with tf.identity should verify that the model one variables are faithfully
        # restored (i.e. the loss should be the same as at the end of model_one
        # training).
        batch_norm_output = tf.contrib.layers.batch_norm(layer_one_output)
        layer_two_output = tf.contrib.layers.fully_connected(
            inputs=batch_norm_output, num_outputs=1, activation_fn=None)
        target = tf.constant([[10.], [-3.]])
        loss = tf.reduce_sum((layer_two_output - target) ** 2)
        train_op = tf.train.AdamOptimizer(0.01).minimize(loss)
        # We're done defining variables, now work on initializers. First figure out
        # which variables in the first model checkpoint map to variables in this
        # model.
        restore_map = {variable.op.name: variable for variable in tf.global_variables()
                       if variable.op.name in previous_variables}
        # Set initializers for first_model variables to restore them from the
        # first_model checkpoint.
        tf.contrib.framework.init_from_checkpoint(
            './first_model_checkpoint', restore_map)
        # For new variables, global_variables_initializer will initialize them
        # normally. For variables in restore_map, they will be initialized from the
        # checkpoint.
        init_op = tf.global_variables_initializer()
        saver = tf.train.Saver()
        with tf.Session() as session:
            session.run(init_op)
            for i in range(10):
                _, evaled_loss = session.run([train_op, loss])
                print(i, evaled_loss)
            saver.save(session, './second_model_checkpoint')

In this case, previous_variables looks like:

['beta1_power', 'beta2_power', 'fully_connected/biases', 'fully_connected/biases/Adam', 'fully_connected/biases/Adam_1', 'fully_connected/weights', 'fully_connected/weights/Adam', 'fully_connected/weights/Adam_1', 'fully_connected_1/biases', 'fully_connected_1/biases/Adam', 'fully_connected_1/biases/Adam_1', 'fully_connected_1/weights', 'fully_connected_1/weights/Adam', 'fully_connected_1/weights/Adam_1'] 

Note that since we did not use any variable scopes, the naming depends on the order in which layers are defined. If the names change, you need to construct restore_map manually.
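For example, if the second layer were moved under a new scope in the new graph, the mapping would have to be written out by hand, something like this sketch (the "renamed_output" scope is made up):

import tensorflow as tf

# Sketch: the first layer keeps its checkpoint names, but the second layer now
# lives under a new scope, so its checkpoint names must be mapped by hand.
with tf.Graph().as_default():
    fake_input = tf.constant([[1., 2., 3., 4.],
                              [5., 6., 7., 8.]])
    layer_one_output = tf.contrib.layers.fully_connected(
        inputs=fake_input, num_outputs=5, activation_fn=None)
    with tf.variable_scope("renamed_output"):
        layer_two_output = tf.contrib.layers.fully_connected(
            inputs=layer_one_output, num_outputs=1, activation_fn=None)
    by_name = {v.op.name: v for v in tf.global_variables()}
    restore_map = {
        # unchanged names map to themselves
        'fully_connected/weights': by_name['fully_connected/weights'],
        'fully_connected/biases': by_name['fully_connected/biases'],
        # checkpoint names (keys) mapped onto the renamed variables (values)
        'fully_connected_1/weights': by_name['renamed_output/fully_connected/weights'],
        'fully_connected_1/biases': by_name['renamed_output/fully_connected/biases'],
    }
    tf.contrib.framework.init_from_checkpoint(
        './first_model_checkpoint', restore_map)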

If we run second_model, the loss jumps up initially because the batch_norm layer has not been trained:

0 38.5976 
1 36.4033 
2 33.3588 
3 29.8555 
4 26.169 
5 22.5185 
6 19.0838 
7 16.0096 
8 13.4035 
9 11.3298 

However, replacing batch_norm with tf.identity verifies that the previously trained variables have been restored.
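That sanity check is a one-line change inside second_model (a sketch):

# Bypass batch_norm entirely; the first loss printed by second_model should
# then match the final loss from first_model's training.
batch_norm_output = tf.identity(layer_one_output)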


Thank you very much!
