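Is there a simpler way to create a dictionary from strings and then vectorize the strings? (Python)

My question about building a dictionary from strings is a bit more multilingual/NLP-oriented than "Creating a dictionary from a string".

Given a list of sentences as strings, is there a simpler way to build a dictionary of unique words and then vectorize the sentences? I know there are external libraries that do this, such as gensim, but I want to avoid them. This is what I have been doing: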

from itertools import chain 

def getKey(dic, value): 
    return [k for k,v in sorted(dic.items()) if v == value] 

# vectorize returns a list of tuples; each tuple is made up of 
# (<position of word in dictionary>, <number of times it occurs in sentence>) 
def vectorize(sentence, dictionary): # is there simpler way to do this? 
    vector = [] 
    for word in sentence.split(): 
        word_count = sentence.lower().split().count(word) 
        dic_pos = getKey(dictionary, word)[0] 
        vector.append((dic_pos,word_count)) 
    return vector 

s1 = "this is is a foo" 
s2 = "this is a a bar" 
s3 = "that 's a foobar" 

uniq = list(set(chain(" ".join([s1,s2,s3]).split()))) # is there simpler way for this? 
dictionary = {} 
for i in range(len(uniq)): # can this be done with dict(list_comprehension)? 
    dictionary[i] = uniq[i] 

v1 = vectorize(s1, dictionary) 
v2 = vectorize(s2, dictionary) 
v3 = vectorize(s3, dictionary) 

print(v1) 
print(v2) 
print(v3)

Source

2013-03-13 alvas

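I don't know what your end goal is, but I can point out a few problems: you make a **set**, which you turn into a **list**, which you turn into a **dictionary**, and then you look up **values** in the **dictionary** instead of **keys**. On top of that, every one of those lookups goes through a sorted **list** of items that you rebuild **for each query!** – 2013-03-14 00:10:05

Here you go: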

from itertools import chain, count 

s1 = "this is is a foo" 
s2 = "this is a a bar" 
s3 = "that 's a foobar" 

# convert each sentence into a list of words, because the lists 
# will be used twice, to build the dictionary and to vectorize 
w1, w2, w3 = all_ws = [s.split() for s in [s1, s2, s3]] 

# chain the lists and turn into a set, and then a list, of unique words 
index_to_word = list(set(chain(*all_ws))) 

# build the inverse mapping of index_to_word, by pairing it with a counter 
word_to_index = dict(zip(index_to_word, count())) 

# create the vectors of word indices and of word count for each sentence 
v1 = [(word_to_index[word], w1.count(word)) for word in w1] 
v2 = [(word_to_index[word], w2.count(word)) for word in w2] 
v3 = [(word_to_index[word], w3.count(word)) for word in w3] 

print(v1) 
print(v2) 
print(v3)

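Things to remember:

Dictionaries should only be traversed from key to value; if you need to go the other way, create (and keep updated) two dictionaries, one the inverse mapping of the other, as I did above;
if you need a dictionary whose keys are consecutive integers, just use a list (thanks Jeff);
never compute the same thing twice (see the split() versions of the sentences); if you will need a result again later, save it in a variable;
use list comprehensions wherever possible, for better performance, conciseness, and readability.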
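The points above can be sketched as one small helper. This is only an illustration, not part of the original answer: the name `build_vocabulary` is made up, and `sorted()` is an added assumption so that the index assigned to each word is reproducible (a bare set has no defined order).

```python
def build_vocabulary(sentences):
    """Return (wordlists, index_to_word, word_to_index) for whitespace-split sentences."""
    # Split each sentence exactly once and keep the result for reuse.
    wordlists = [sentence.split() for sentence in sentences]
    # Consecutive integer keys: a plain list is enough for index -> word.
    index_to_word = sorted({word for words in wordlists for word in words})
    # Inverse mapping, so lookups always go from key to value.
    word_to_index = {word: idx for idx, word in enumerate(index_to_word)}
    return wordlists, index_to_word, word_to_index


wordlists, index_to_word, word_to_index = build_vocabulary(
    ["this is is a foo", "this is a a bar", "that 's a foobar"]
)
print(index_to_word[word_to_index["foo"]])  # foo
```

Both directions are then ordinary constant-time lookups: `index_to_word[i]` and `word_to_index[word]`.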

Source

2013-03-14 00:19:41 Tobia

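+1 for some great Pythonic demonstrations of building sets and dictionaries. – 2013-03-14 00:25:33

index_to_word should be a list, since we know the keys are all positional. Better memory use and lookup time, same lookup syntax. – 2013-03-14 00:29:56

Good catch! Thanks – Tobia 2013-03-14 00:34:08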

If you are trying to count the occurrences of words in a sentence, use collections.Counter.
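For example, a minimal sketch using one of the sentences from the question (the variable names are just for illustration):

```python
from collections import Counter

sentence = "this is is a foo"
counts = Counter(sentence.split())  # maps each word to its number of occurrences

print(counts["is"])   # 2
print(counts["bar"])  # 0: a missing word counts as zero instead of raising KeyError
```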

Problems with your code:

uniq = list(set(chain(" ".join([s1,s2,s3]).split()))) # is there simpler way for this? 
dictionary = {} 
for i in range(len(uniq)): # can this be done with dict(list_comprehension)? 
    dictionary[i] = uniq[i]

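All the part above does is create a dictionary indexed by arbitrary numbers (they come from iterating over a set, which has no concept of an index). You then use that dictionary to look things up by value rather than by key, with this function:

def getKey(dic, value): 
    return [k for k,v in sorted(dic.items()) if v == value]

which completely ignores the spirit of dictionary access.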

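Also, the idea behind vectorize is unclear. What are you trying to achieve with this function? You asked for a simpler version of vectorize, but did not tell us what it is supposed to do.

Source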

2013-03-13 23:58:01 thkang

There are several questions in your code, so let's answer them one at a time.

uniq = list(set(chain(" ".join([s1,s2,s3]).split()))) # is there simpler way for this?

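For one thing, it is probably conceptually simpler (although just as verbose) to split() the separate strings, instead of joining them together and then splitting the result.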

uniq = list(set(chain(*map(str.split, (s1, s2, s3)))))

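Beyond that: it looks like you always work with these word lists rather than the actual sentences, which is why you end up splitting in multiple places. Why not split them all once, up at the top?

Meanwhile, instead of passing s1, s2, and s3 around explicitly, why not put them in a collection? You can keep the results in a collection too.

So: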

sentences = (s1, s2, s3) 
wordlists = [sentence.split() for sentence in sentences] 

uniq = list(set(chain.from_iterable(wordlists))) 

# ... 

vectors = [vectorize(sentence, dictionary) for sentence in sentences] 
for vector in vectors: 
    print(vector)

dictionary = {} 
for i in range(len(uniq)): # can this be done with dict(list_comprehension)? 
    dictionary[i] = uniq[i]

You could do this as dict() on a list comprehension but, more simply, use a dict comprehension. And while you're at it, use enumerate instead of the for i in range(len(uniq)) bit.

dictionary = {idx: word for (idx, word) in enumerate(uniq)}

This replaces the whole # ... part above.
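To make the comparison concrete, here is a small sketch of the options side by side. The sample `uniq` list is made up for illustration, and the last form, `dict(enumerate(uniq))`, is an extra shortcut not mentioned in the answer itself.

```python
uniq = ["this", "is", "a", "foo"]  # hypothetical sample word list

# dict() applied to a list comprehension of (index, word) pairs
via_dict_call = dict([(idx, word) for idx, word in enumerate(uniq)])

# the dict comprehension suggested above
via_comprehension = {idx: word for idx, word in enumerate(uniq)}

# enumerate already yields (index, word) pairs, so dict() can consume it directly
via_enumerate = dict(enumerate(uniq))

print(via_dict_call == via_comprehension == via_enumerate)  # True
```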

Meanwhile, if you want a reverse dictionary lookup, this is not the way to do it:

def getKey(dic, value): 
    return [k for k,v in sorted(dic.items()) if v == value]

Instead, create an inverse dictionary, mapping each value to a list of its keys.

from collections import defaultdict 

def invert_dict(dic): 
    # map each value to the list of keys that hold it 
    d = defaultdict(list) 
    for k, v in dic.items(): 
        d[v].append(k) 
    return d

Then, instead of calling the getKey function, just do a normal lookup in the inverse dictionary.
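A short usage sketch, with a made-up four-word dictionary standing in for the one built from `uniq`:

```python
from collections import defaultdict


def invert_dict(dic):
    # Map each value to the list of keys that hold it.
    inverse = defaultdict(list)
    for key, value in dic.items():
        inverse[value].append(key)
    return inverse


dictionary = {0: "this", 1: "is", 2: "a", 3: "foo"}  # hypothetical sample
positions = invert_dict(dictionary)  # built once, reused for every lookup

dic_pos = positions["foo"][0]  # replaces getKey(dictionary, "foo")[0]
print(dic_pos)  # 3
```

The inverse is built once, so each lookup is a plain dictionary access instead of a sort and scan of every item.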

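If you need to alternate between modifications and lookups, you probably want some kind of bidirectional dictionary that manages its own inverse dictionary. There are plenty of recipes for this on ActiveState, and there may be some modules on PyPI, but it is not hard to write yourself. In any case, you don't seem to need that here.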
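For what it's worth, a bidirectional dictionary can be sketched in a few lines. This is a hypothetical minimal version, not one of the ActiveState recipes: the class name `BiDict` and the method `key_for` are invented here, and it assumes a strict one-to-one mapping.

```python
class BiDict:
    """Hypothetical two-way mapping that keeps a forward and an inverse dict in sync."""

    def __init__(self):
        self._forward = {}
        self._inverse = {}

    def __setitem__(self, key, value):
        # Drop stale pairs first so both directions stay one-to-one.
        if key in self._forward:
            del self._inverse[self._forward[key]]
        if value in self._inverse:
            del self._forward[self._inverse[value]]
        self._forward[key] = value
        self._inverse[value] = key

    def __getitem__(self, key):
        return self._forward[key]

    def key_for(self, value):
        return self._inverse[value]

    def __len__(self):
        return len(self._forward)


words = BiDict()
words[0] = "this"
words[1] = "is"
words[1] = "a"  # reassigning key 1 also removes "is" from the inverse side
print(words.key_for("a"))  # 1
```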

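Finally, there's your vectorize function.

As mentioned above, the first thing to do is to have it take a word list instead of a sentence that it has to split.

And there is no reason to re-split the sentence after calling lower; just use a map or generator expression over the word list.

In fact, I'm not sure why you are calling lower here at all, when your dictionary is built from the original-case versions. I'm guessing that is a bug, and that you want to call lower when building the dictionary as well. That is one of the advantages of building the word lists up front in a single, easy-to-find place: you only need to change one line: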

wordlists = [sentence.lower().split() for sentence in sentences]

Now you already have something a bit simpler:

def vectorize(wordlist, dictionary): 
    vector = [] 
    for word in wordlist: 
        word_count = wordlist.count(word) 
        dic_pos = getKey(dictionary, word)[0] 
        vector.append((dic_pos,word_count)) 
    return vector

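Meanwhile, you may recognize that vector = [] … for word in wordlist … vector.append is exactly what a list comprehension is for. But how do you turn three lines of code into a list comprehension? Easy: refactor them into a function. So: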

def vectorize(wordlist, dictionary): 
    def vectorize_word(word): 
        word_count = wordlist.count(word) 
        dic_pos = getKey(dictionary, word)[0] 
        return (dic_pos,word_count) 
    return [vectorize_word(word) for word in wordlist]
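Putting the earlier suggestions together, one possible end result looks like the sketch below. It goes a step beyond the code above and rests on a few assumptions: it takes a word-to-index dictionary (here called `word_to_index`) in place of `getKey`, it counts with `collections.Counter` rather than `wordlist.count`, and it sorts the unique words so the indices are reproducible.

```python
from collections import Counter


def vectorize(wordlist, word_to_index):
    counts = Counter(wordlist)  # one counting pass instead of count() per word
    return [(word_to_index[word], counts[word]) for word in wordlist]


sentences = ("this is is a foo", "this is a a bar", "that 's a foobar")
wordlists = [sentence.lower().split() for sentence in sentences]
uniq = sorted(set(word for words in wordlists for word in words))
word_to_index = {word: idx for idx, word in enumerate(uniq)}

vectors = [vectorize(words, word_to_index) for words in wordlists]
for vector in vectors:
    print(vector)
```

Like the original, this emits one tuple per word occurrence, so a repeated word appears more than once in its vector.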

Source

2013-03-14 00:07:45 abarnert

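Well, it looks like you want:

A dictionary that returns a position value for each token.
The number of times a token is found in a collection.

So you can do: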

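import bisect 

uniq.sort() # Sort it, since order didn't seem to matter 

def getPosition(value): 
    position = bisect.bisect_left(uniq, value) # O(log n) query 
    # bisect_left returns len(uniq) when value sorts after every element 
    if position == len(uniq) or uniq[position] != value: 
        raise IndexError 
    return position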

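To get O(1) lookup time instead, you can build your set and then insert each of its values into a dictionary, paired with a consecutive index. This is much less memory-efficient, but it gives you a hashed O(1) lookup. Tobia posted a good code example of this while I was writing, so see that answer.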
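A rough sketch of that dictionary-based alternative, using a made-up sample vocabulary (the names `position_of` and `get_position` are only for illustration):

```python
uniq = sorted({"this", "is", "a", "foo", "bar"})  # hypothetical sample vocabulary

# Pay the memory cost once: word -> position in uniq.
position_of = {word: idx for idx, word in enumerate(uniq)}


def get_position(value):
    # Hash lookup, constant time on average; raises KeyError for unknown words.
    return position_of[value]


print(get_position("foo"))  # 2
```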

Source

2013-03-14 00:24:56
