2017-02-09
1

Double-batching TensorFlow input data

I'm implementing a convnet for token classification of string data. I need to pull the string data from a TFRecord, shuffle-batch it, then perform some processing that expands the data, and batch it again. Is this possible with two shuffle_batch operations?

Here's what I need to do:

  1. Enqueue the filenames into a filename queue
  2. Feed each serialized Example into a shuffle_batch
  3. When I dequeue each example from the shuffled batch, I need to replicate it according to its sequence length, pairing each copy with a position vector. This creates multiple examples from each original example in the first batch, and I need to batch them again.
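The steps above can be sketched in plain Python, ignoring the TF queue machinery for a moment. `double_batch` and `expand` are hypothetical names for illustration only; `expand` stands in for the per-example replication in step 3:

```python
import random

def double_batch(examples, batch_size, expand, seed=0):
    """Sketch of the two-stage pipeline: shuffle the raw examples,
    expand each one into several derived examples, shuffle the
    pooled results, then emit fixed-size batches."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)                      # first shuffle (shuffle_batch #1)
    pool = [row for ex in shuffled for row in expand(ex)]
    rng.shuffle(pool)                          # second shuffle (shuffle_batch #2)
    return [pool[i:i + batch_size] for i in range(0, len(pool), batch_size)]

# One derived example per token, tagged with the token's index:
expand = lambda sent: [(sent, i) for i in range(len(sent))]
batches = double_batch([['a', 'b'], ['c']], batch_size=2, expand=expand)
```

The difficulty in TF is that step 3 produces a variable number of rows per example, which is exactly what the second batching stage has to absorb.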

Of course, one workaround is to preprocess the data before loading it into TF, but that would take up far more disk space than necessary.

DATA

Here is some sample data. I have two "Examples". Each Example contains a tokenized sentence and a label for each token:

sentences = [ 
      ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', '.'], 
      ['then', 'the', 'lazy', 'dog', 'slept', '.'] 
      ] 
sent_labels = [ 
      ['O', 'O', 'O', 'ANIMAL', 'O', 'O', 'O', 'O', 'ANIMAL', 'O'], 
      ['O', 'O', 'O', 'ANIMAL', 'O', 'O'] 
      ] 

Each "Example" now has features like the following (some reduction for clarity):

features { 
    feature { 
    key: "labels" 
    value { 
     bytes_list { 
     value: "O" 
     value: "O" 
     value: "O" 
     value: "ANIMAL" 
     ... 
     } 
    } 
    } 

    feature { 
    key: "sentence" 
    value { 
     bytes_list { 
     value: "the" 
     value: "quick" 
     value: "brown" 
     value: "fox" 
     ... 
     } 
    } 
    } 
} 

TRANSFORMATION

After batching the sparse data, I receive a sentence as a list of tokens:

['the', 'quick', 'brown', 'fox', ...]

I need to first pad the list to a predetermined SEQ_LEN, then insert position indices into each example, rotating the positions so that the token I want to classify is at position 0 and every other token's position is relative to position 0:

[ 
['the', 0 , 'quick', 1 , 'brown', 2 , 'fox', 3 , 'PAD', 4 ], # classify 'the' 
['the', -1, 'quick', 0 , 'brown', 1 , 'fox', 2 , 'PAD', 3 ], # classify 'quick' 
['the', -2, 'quick', -1, 'brown', 0 , 'fox', 1 , 'PAD', 2 ], # classify 'brown' 
['the', -3, 'quick', -2, 'brown', -1, 'fox', 0 , 'PAD', 1 ], # classify 'fox' 
] 
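As a concrete, non-TF sketch of this transformation, the padding and position rotation can be written in plain Python. `replicate_with_positions` is a hypothetical helper name (the real pipeline does this with TF ops):

```python
def replicate_with_positions(tokens, seq_len, pad='PAD'):
    """Pad a sentence to seq_len, then emit one row per real token,
    interleaving each token with its position relative to the token
    being classified. PAD tokens never get a row of their own."""
    padded = tokens + [pad] * (seq_len - len(tokens))
    return [
        [x for j, tok in enumerate(padded) for x in (tok, j - i)]
        for i in range(len(tokens))
    ]

rows = replicate_with_positions(['the', 'quick', 'brown', 'fox'], seq_len=5)
```

Note the output has shape (sent_len, SEQ_LEN * 2): the row count depends on the sentence, but the row width is constant, which matters for the second batching stage.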

BATCHING AND REBATCHING THE DATA

Here is a simplified version of what I'm trying to do:

# Enqueue the Filenames and serialize 
filenames =[outfilepath] 
fq = tf.train.string_input_producer(filenames, num_epochs=num_epochs, shuffle=True, name='FQ') 
reader = tf.TFRecordReader() 
key, serialized_example = reader.read(fq) 

# Dequeue Examples of batch_size == 1. Because all examples are Sparse Tensors, do 1 at a time 
initial_batch = tf.train.shuffle_batch([serialized_example], batch_size=1, capacity=capacity, min_after_dequeue=min_after_dequeue) 


# Parse Sparse Tensors, make into single dense Tensor 
# ['the', 'quick', 'brown', 'fox'] 
parsed = tf.parse_example(initial_batch, features=feature_mapping) 
dense_tensor_sentence = tf.sparse_tensor_to_dense(parsed['sentence'], default_value='<PAD>') 
sent_len = tf.shape(dense_tensor_sentence)[1] 

SEQ_LEN = 5 
NUM_PADS = SEQ_LEN - sent_len 
#['the', 'quick', 'brown', 'fox', 'PAD'] 
padded_sentence = pad(dense_tensor_sentence, NUM_PADS) 

# make sent_len X SEQ_LEN copy of sentence, position vectors 
#[ 
# ['the', 0 , 'quick', 1 , 'brown', 2 , 'fox', 3, 'PAD', 4 ] 
# ['the', -1, 'quick', 0 , 'brown', 1 , 'fox', 2 'PAD', 3 ] 
# ['the', -2, 'quick', -1, 'brown', 0 , 'fox', 1 'PAD', 2 ] 
# ['the', -3, 'quick', -2, 'brown', -1, 'fox', 0 'PAD', 1 ] 
# NOTE: There is no row where PAD is with a position 0, because I don't 
# want to classify the PAD token 
#] 
examples_with_positions = replicate_and_insert_positions(padded_sentence) 

# While my SEQ_LEN will be constant, the sent_len will not. Therefore, 
#I don't know the number of rows, but I can guarantee the number of 
# columns. shape = (?,SEQ_LEN) 

dynamic_input = final_reshape(examples_with_positions) # shape = (?, SEQ_LEN) 

# Try Random Shuffle Queue: 

# Rebatch <-- This is where the problem is 
#reshape_concat.set_shape((None, SEQ_LEN)) 

random_queue = tf.RandomShuffleQueue(10000, 50, [tf.int64], shapes=(SEQ_LEN,)) 
random_queue.enqueue_many(dynamic_input) 
batch = random_queue.dequeue_many(4) 


init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer(), tf.initialize_all_tables()) 

sess = create_session() 
sess.run(init_op) 

#tf.get_default_graph().finalize() 
coord = tf.train.Coordinator() 
threads = tf.train.start_queue_runners(sess=sess, coord=coord) 

try: 
    i = 0 
    while True: 
        print sess.run(batch) 
        i += 1 
except tf.errors.OutOfRangeError as e: 
    print "No more inputs." 

EDIT

I'm now trying to use a RandomShuffleQueue. On each enqueue, I want to enqueue a batch with shape (None, SEQ_LEN). I've modified the code above to reflect this.

I no longer get complaints about the input shapes, but the enqueue does hang at sess.run(batch).

+1

Just trying to understand: the second time you batch, you want to split those position matrices across multiple sentences, right? Won't those have different lengths, in which case batching them in one dense tensor would be impossible? –

+0

Sorry, I forgot to mention that I pad each input to a constant SEQ_LEN. I've rewritten the code example, which will hopefully clarify things. I take a sentence, pad it, then tile and reshape it so that each token is concatenated with a position vector. The input to the second batching will be shape=(sent_len, SEQ_LEN). But because I don't know sent_len, I can't use QueueRunners – Neal

+1

In that case, is 'enqueue_many' what you want? Then you'd batch (sent_len_1 + sent_len_2 + ..., SEQ_LEN). The batch dimension of 'enqueue_many' shouldn't need static shape information (just make sure the remaining dimensions have static shape information). –
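The point in this comment can be illustrated without TF: with enqueue_many=True, the leading (row) dimension of each enqueued tensor is flattened into the queue, so sentences of different lengths contribute different numbers of rows, and only the per-row shape (SEQ_LEN) has to be static. A plain-Python sketch (illustrative only, hypothetical helper name):

```python
import random

def enqueue_many_then_batch(matrices, batch_size, seed=0):
    """Flatten each (sent_len, SEQ_LEN) matrix along its leading
    dimension into a single pool of rows -- what enqueue_many does --
    then draw one shuffled batch of rows."""
    pool = [row for m in matrices for row in m]
    random.Random(seed).shuffle(pool)
    return pool[:batch_size]

# Two "sentences" of different lengths; each row is already a fixed width:
a = [['the', 0], ['the', -1]]   # sent_len = 2
b = [['dog', 0]]                # sent_len = 1
batch = enqueue_many_then_batch([a, b], batch_size=2)
```

This is why only the element shape, not the number of rows per sentence, needs to be known up front.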

ANSWER

1

I was approaching the whole problem incorrectly. I wrongly assumed I had to define the complete shape of the batch when feeding into tf.train.shuffle_batch, when in fact I only needed to define the shape of each element I was enqueueing, and set enqueue_many=True.

Here is the corrected code:

single_batch=1 
input_batch_size = 64 
min_after_dequeue = 10 
capacity = min_after_dequeue + 3 * input_batch_size 
num_epochs=2 
SEQ_LEN = 10 
filenames =[outfilepath] 

fq = tf.train.string_input_producer(filenames, num_epochs=num_epochs, shuffle=True) 
reader = tf.TFRecordReader() 
key, serialized_example = reader.read(fq) 

# Dequeue examples of batch_size == 1. Because all examples are Sparse Tensors, do 1 at a time 
first_batch = tf.train.shuffle_batch([serialized_example], batch_size=single_batch, capacity=capacity, min_after_dequeue=min_after_dequeue) 

# Get a single sentence and preprocess it shape=(sent_len) 
single_sentence = tf.parse_example(first_batch, features=feature_mapping) 

# Preprocess Sentence. shape=(sent_len, SEQ_LEN * 2). Each row is example 
processed_inputs = preprocess(single_sentence) 

# Re batch 
input_batch = tf.train.shuffle_batch([processed_inputs], 
       batch_size=input_batch_size, 
       capacity=capacity, min_after_dequeue=min_after_dequeue, 
       shapes=[SEQ_LEN * 2], enqueue_many=True) #<- This is the fix 


init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer(), tf.initialize_all_tables()) 

sess = create_session() 
sess.run(init_op) 

#tf.get_default_graph().finalize() 
coord = tf.train.Coordinator() 
threads = tf.train.start_queue_runners(sess=sess, coord=coord) 

try: 
    i = 0 
    while True: 
        print i 
        print sess.run(input_batch) 
        i += 1 
except tf.errors.OutOfRangeError as e: 
    print "No more inputs."