Fast way to create a DataFrame of word/tag pair counts

I have a large file of word/tag pairs like this, which I want to turn into a pandas DataFrame:

This/DT gene/NN called/VBN gametocide/NN

Now I want to put these pairs, together with their counts, into a DataFrame like this:

     | DT | NN | ...
This |  1 |  0 |
gene |  0 |  1 |
...

I tried doing this by counting the pairs in a dictionary and then putting them into a DataFrame:

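from collections import defaultdict
import pandas as pd

# Read the whole file and split it into "word/tag" tokens
with open("data.txt", "r") as file:
    train = file.read()
words = train.split()

# Count how often each "word/tag" token occurs
data = defaultdict(int)
for i in words:
    data[i] += 1

matrixB = pd.DataFrame()

# Fill the DataFrame one cell at a time (likely the slow part)
for elem, count in data.items():
    word, tag = elem.split('/')
    matrixB.loc[tag, word] = count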

But this takes a very long time (the file has 300,000 of them). Is there a faster way to do this?


2016-03-01 maxmijn

What was wrong with the answer you got to your other question?

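from collections import Counter
import pandas as pd

with open('data.txt') as f:
    train = f.read()
# Count each (word, tag) tuple
c = Counter(tuple(x.split('/')) for x in train.split())
# Tuple keys become a two-level index; unstack moves the tags into columns
s = pd.Series(c)
# Missing combinations become NaN; fill with 0 and cast back to integers
df = s.unstack().fillna(0).astype(int)

print(df)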

which produces

            DT  NN  VBN
This         1   0    0
called       0   0    1
gametocide   0   1    0
gene         0   1    0
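As a possible variation on the approach above, the same table can be sketched with pd.crosstab, which counts co-occurrences of two columns directly. The column names "word" and "tag" are illustrative assumptions, and the sketch assumes every token contains at least one '/'.

```python
import pandas as pd

text = "This/DT gene/NN called/VBN gametocide/NN"

# Split each token on its last '/' so the tag is always the final field
# (assumes every token contains at least one '/').
pairs = [token.rsplit("/", 1) for token in text.split()]

# "word" and "tag" are illustrative column names, not from the question.
pairs_df = pd.DataFrame(pairs, columns=["word", "tag"])

# crosstab counts how often each (word, tag) combination occurs;
# combinations that never occur are filled with 0.
counts = pd.crosstab(pairs_df["word"], pairs_df["tag"])
print(counts)
```

Whether this is faster than the Counter version on a 300,000-pair file would need to be measured; it mainly avoids the separate fillna step.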


2016-03-01 17:53:49 Alex

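Nothing at all, I was just still testing everything before I saw your answer. This helped me a lot, thank you very much! – maxmijn

Awesome - glad it helped! – Alex

I thought this question was very similar... why did you post it twice?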

>>> from collections import Counter
>>> import pandas as pd
>>> text = "This/DT gene/NN called/VBN gametocide/NN"
>>> pd.Series(Counter(tuple(pair.split('/')) for pair in text.split())).unstack().fillna(0).astype(int)
            DT  NN  VBN
This         1   0    0
called       0   0    1
gametocide   0   1    0
gene         0   1    0
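For a large file, a minimal sketch of the same idea that reads line by line instead of loading everything at once might look like the following. The helper names count_pairs and to_frame are hypothetical, and the sketch assumes every token contains at least one '/'.

```python
from collections import Counter
import io

import pandas as pd


def count_pairs(lines):
    """Count (word, tag) pairs from an iterable of text lines (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        # rsplit on the last '/' keeps words that themselves contain a slash intact
        counts.update(tuple(tok.rsplit("/", 1)) for tok in line.split())
    return counts


def to_frame(counts):
    """Turn the pair counts into a word-by-tag table (hypothetical helper)."""
    index = pd.MultiIndex.from_tuples(list(counts.keys()), names=["word", "tag"])
    s = pd.Series(list(counts.values()), index=index)
    # fill_value=0 fills missing combinations and keeps the counts as integers
    return s.unstack(fill_value=0)


# io.StringIO stands in for an open file here; with a real file this would be
# `with open("data.txt") as f: frame = to_frame(count_pairs(f))`.
sample = io.StringIO("This/DT gene/NN\ncalled/VBN gene/NN\n")
frame = to_frame(count_pairs(sample))
print(frame)
```

The counting still happens in plain Python, so the main saving is memory; the DataFrame is built in a single step at the end rather than cell by cell.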


2016-03-01 17:53:39 Alexander
