TFlearn - VocabularyProcessor ignores parts of the given vocabulary

I am using TFlearn's VocabularyProcessor to map documents to integer arrays. However, I can't seem to initialize the VocabularyProcessor with my own vocabulary. The documentation says that I can provide a vocabulary when creating the VocabularyProcessor:

vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length, vocabulary=vocab)

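However, when I create the VocabularyProcessor this way, my documents are not transformed correctly. I provide the vocabulary as a dictionary, with the word indices as values: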

vocab={'hello':3, '.':5, 'world':20}

The sentences are provided as follows:

sentences = ['hello summer .', 'summer is here .', ...]

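It is important that the VocabularyProcessor uses the given indices to transform the documents, because each index refers to a particular word embedding. When calling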

list(vocab_processor.transform(['hello world .', 'hello']))

the output is

[array([ 3, 20, 0]), array([3, 0, 0])]

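So the sentences were not transformed according to the provided vocabulary, which maps '.' to 5. How do I correctly provide the vocabulary to the VocabularyProcessor?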

Source

2017-09-29 Lemon

Let's run a few experiments to answer your question:

from tensorflow.contrib import learn  # assumed import for `learn`; TF 1.x only

vocab = {'hello': 3, '.': 5, 'world': 20, '/': 10}
sentences = ['hello world ./hello', 'hello']

vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length=6, vocabulary=vocab)
list(vocab_processor.transform(sentences))

The output of this snippet is:

[array([ 3, 20, 3, 0, 0, 0]), array([3, 0, 0, 0, 0, 0])]

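As you can see, neither the space (' ') nor the dot ('.') is actually tokenized. So in your code, TensorFlow recognizes only two words and pads with one extra zero to reach max_document_length=3. To have them tokenized, you can write your own tokenizer function. Example code follows.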

def my_func(iterator):
    # Split each document on single spaces; every piece becomes one token.
    return (x.split(" ") for x in iterator)

vocab = {'hello': 3, '.': 5, 'world': 20, '/': 10}
# Punctuation is space-separated here so that split(" ") isolates '.' and '/'.
sentences = ['hello world . / hello', 'hello']

vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length=6, vocabulary=vocab, tokenizer_fn=my_func)

list(vocab_processor.transform(sentences))

Now the output of the snippet is:

[array([ 3, 20, 5, 10, 3, 0]), array([3, 0, 0, 0, 0, 0])]

which is the desired output. I hope this makes things clear.
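
Splitting on spaces only isolates punctuation that is surrounded by spaces. As a rough sketch of an alternative, a regex-based tokenizer can emit each punctuation mark as its own token even when it is glued to a word. The names PUNCT_AWARE_RE, punct_tokenizer and lookup_ids are hypothetical and not part of TFlearn; lookup_ids is only a pure-Python stand-in for the id lookup and zero padding, so the snippet runs without TensorFlow.

```python
import re

# Hypothetical pattern (not from TFlearn): a run of word characters,
# or any single character that is neither a word character nor whitespace.
PUNCT_AWARE_RE = re.compile(r"\w+|[^\w\s]", re.UNICODE)

def punct_tokenizer(iterator):
    """Yield one list of tokens per input string, keeping punctuation."""
    for value in iterator:
        yield PUNCT_AWARE_RE.findall(value)

def lookup_ids(tokens, vocab, max_document_length):
    """Stand-in for the id lookup: unknown tokens map to 0, then pad with 0."""
    ids = [vocab.get(tok, 0) for tok in tokens[:max_document_length]]
    return ids + [0] * (max_document_length - len(ids))

vocab = {'hello': 3, '.': 5, 'world': 20, '/': 10}
sentences = ['hello world ./hello', 'hello']

token_lists = list(punct_tokenizer(sentences))
id_rows = [lookup_ids(tokens, vocab, 6) for tokens in token_lists]
print(token_lists)
print(id_rows)
```

Because punct_tokenizer takes an iterator of strings and yields one token list per string, it has the same shape as my_func and could presumably be passed as tokenizer_fn in the same way.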

Your next point of confusion may be which values get tokenized by default. Let me post the original source here so that there is no ambiguity:

import re

TOKENIZER_RE = re.compile(r"[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|[\'\w\-]+",
                          re.UNICODE)

def tokenizer(iterator):
    """Tokenizer generator.

    Args:
      iterator: Input iterator with strings.

    Yields:
      array of tokens per each value in the input.
    """
    for value in iterator:
        yield TOKENIZER_RE.findall(value)

But my suggestion would be: write your own function and be confident in it.
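
To see what the default pattern does with the sentences in this thread, the regex can be run on its own with the standard re module. The pattern below is copied from the source quoted above; the variable names question_tokens and answer_tokens are just for illustration.

```python
import re

# Pattern copied from the default tokenizer source quoted above.
TOKENIZER_RE = re.compile(
    r"[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|[\'\w\-]+", re.UNICODE)

# '.' and '/' are not in the character class [\'\w\-], so they never
# become tokens and are silently skipped.
question_tokens = TOKENIZER_RE.findall('hello world .')
answer_tokens = TOKENIZER_RE.findall('hello world ./hello')
print(question_tokens)  # ['hello', 'world']
print(answer_tokens)    # ['hello', 'world', 'hello']
```

This is why the dot never reaches the vocabulary lookup: it is dropped before any index is assigned.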

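Also, I would like to point out something you may have missed (hopefully not). If you use only the transform() function, your min_frequency parameter will have no effect, because the processor has never been fit to the data. Try to see the effect in the following code: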

for i in range(6):
    vocab_processor = learn.preprocessing.VocabularyProcessor(
        max_document_length=7, min_frequency=i)
    # transform() without a prior fit(): min_frequency is never applied
    tokens = vocab_processor.transform(["a b c d e f", "a b c d e", "a b c", "a b", "a"])
    print(list(tokens)[0])

Output:

[1 2 3 4 5 6 0] 
[1 2 3 4 5 6 0] 
[1 2 3 4 5 6 0] 
[1 2 3 4 5 6 0] 
[1 2 3 4 5 6 0] 
[1 2 3 4 5 6 0]

Now compare with slightly different code that uses fit_transform():

for i in range(6): 
    vocab_processor = learn.preprocessing.VocabularyProcessor(
     max_document_length=7, min_frequency=i) 
    tokens = vocab_processor.fit_transform(["a b c d e f","a b c d e","a b c" , "a b", "a"]) 
    print(list(tokens)[0])

Output:

[1 2 3 4 5 6 0] 
[1 2 3 4 5 0 0] 
[1 2 3 0 0 0 0] 
[1 2 0 0 0 0 0] 
[1 0 0 0 0 0 0] 
[0 0 0 0 0 0 0]
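
The reason fitting matters can be sketched in plain Python: frequencies have to be counted over the whole corpus before rare tokens can be dropped. This is only a simplified model, not the library's implementation. The helper names fit_vocab and to_ids are hypothetical, and the rule "keep a token only if its count is greater than min_frequency" is an assumption inferred from the output above.

```python
from collections import Counter

def fit_vocab(docs, min_frequency=0):
    """Count tokens over all docs and assign ids to the frequent ones."""
    counts = Counter(tok for doc in docs for tok in doc.split())
    vocab = {}
    # most_common() orders by count; ties keep first-seen order.
    for tok, count in counts.most_common():
        if count > min_frequency:  # assumed threshold, inferred from the output
            vocab[tok] = len(vocab) + 1  # id 0 is kept for unknown/padding
    return vocab

def to_ids(doc, vocab, max_document_length):
    """Map tokens to ids (0 if unknown) and pad with zeros."""
    ids = [vocab.get(tok, 0) for tok in doc.split()[:max_document_length]]
    return ids + [0] * (max_document_length - len(ids))

docs = ["a b c d e f", "a b c d e", "a b c", "a b", "a"]
first_rows = [to_ids(docs[0], fit_vocab(docs, i), 7) for i in range(6)]
for row in first_rows:
    print(row)
```

Under these assumptions the printed rows follow the same pattern as the fit_transform() output: each increase of min_frequency removes the next-rarest tokens.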

Source

2017-10-30 04:22:25

This should work:

processor = learn.preprocessing.VocabularyProcessor(
    max_document_length=4, 
    vocabulary={'hello':2, 'world':20}) 

list(processor.transform(['world hello'])) 
>> [array([20, 2, 0, 0])]

Note that the output of this method has shape (1, max_document_length), which is why the last two positions are padded with zeros.

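Update: regarding '.' in your vocabulary, I think it is not recognized as a token by the processor's default tokenizer (hence the 0 that is returned). The default tokenizer uses a very simple regex to do the real work (identifying tokens). See it here. To solve this, I think you should supply your own tokenizer to VocabularyProcessor by passing the 4th argument, tokenizer_fn, to its constructor.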

Source

2017-09-29 23:51:04 greeness

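That is exactly what I am doing. However, with vocab={'hello':3, '.':5, 'world':20} and list(processor.transform(['hello world .', 'hello'])), the output is [array([3, 20, 0]), array([3, 0, 0])]. So the sentences are not transformed according to the provided vocabulary, which maps '.' to 5. – Lemon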

The default tokenizer used by the processor probably does not treat the dot as a valid token. Does it work for normal words? – greeness

Yes, it works for words. However, in my case punctuation marks are also "words" and should be handled accordingly. – Lemon
