2017-08-26 132 views

I have this code that I've been trying to optimize for a while. How can I avoid the for loops and iterate over a pandas DataFrame properly?

My DataFrame is read from a CSV file with 2 columns, where the second column contains text. It looks like this:

[Image: screenshot of the DataFrame]

I have a function summarize(text, n) that takes a single text and an integer as input.

from collections import defaultdict
from nltk.tokenize import sent_tokenize, word_tokenize

def summarize(text, n):
    sents = sent_tokenize(text)  # split the text into sentences
    # The text must contain at least as many sentences as the requested summary length
    assert n <= len(sents)
    list_sentences = [word_tokenize(s.lower()) for s in sents]  # word-tokenized sentences
    frequency = calculate_freq(list_sentences)  # word frequencies across all the sentences
    ranking = defaultdict(int)
    for i, sent in enumerate(list_sentences):
        for w in sent:
            if w in frequency:
                ranking[i] += frequency[w]
    # Call the rank function to get the indices of the highest-ranked sentences
    sents_idx = rank(ranking, n)
    # Return the best choices
    return [sents[j] for j in sents_idx]

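To summarize all the texts, I first iterate through my DataFrame and build a list of all the texts, then iterate over that list again to send them one by one to summarize() and collect the summaries. These for loops make my code really slow, but I haven't been able to find a more efficient way, and I would greatly appreciate any suggestions.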

import pandas as pd

data = pd.read_csv('dataframe.csv')

text = data.iloc[:, 2]  # select the text column by position
list_of_strings = []
for t in text:
    list_of_strings.append(t)  # build a list of all the texts

our_summary = []
for s in list_of_strings:
    for f in summarize(s, 1):
        our_summary.append(f)

ours = pd.DataFrame({"our_summary": our_summary})

Edit: The other two functions are:

from heapq import nlargest

# our_stopwords, min_freq and max_freq are module-level settings not shown in this post

def calculate_freq(list_sentences):
    frequency = defaultdict(int)
    for sentence in list_sentences:
        for word in sentence:
            if word not in our_stopwords:
                frequency[word] += 1

    # We want to filter out the words with frequency below 0.1 or above 0.9 (once normalized)
    max_word = float(max(frequency.values())) if frequency else 1
    # Loop over a copy of the keys: deleting from a dict while iterating over it fails in Python 3
    for w in list(frequency):
        frequency[w] = frequency[w] / max_word  # normalize
        if frequency[w] <= min_freq or frequency[w] >= max_freq:
            del frequency[w]  # filter
    return frequency


def rank(ranking, n):
    # return the indices of the n sentences with the highest ranking
    return nlargest(n, ranking, key=ranking.get)
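
To illustrate what `rank` returns, here is a small sketch with made-up scores (the `ranking` values below are invented for the example, not taken from real reviews):

```python
from heapq import nlargest

# Made-up scores: sentence index -> summed word frequency
ranking = {0: 0.5, 1: 2.0, 2: 1.2}

# Iterating over a dict yields its keys; key=ranking.get orders them by score,
# so the result is a list of sentence indices, best first.
top = nlargest(2, ranking, key=ranking.get)
print(top)  # [1, 2]
```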

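Input text: The recipes are easy and the dogs love them. I would buy this book again and again. The only thing is that the recipes don't tell you how many treats they make, but I suppose that's because you could make them different sizes. Great buy!
Output text: I would buy this book again and again.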

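Instead of this code, could you provide some sample text and the expected output?

You might want to look at pandas.DataFrame.apply

'summarize()' calls another function. Can you include it, along with example input and output? – roganjosh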

Answers

1

Have you tried something like this?

import pandas as pd

# Test data
df = pd.DataFrame({'ASIN': [0, 1], 'Summary': ['This is the first text', 'Second text']})

# Example function
def summarize(text, n=5):
    """A very basic summary"""
    return (text[:n] + '..') if len(text) > n else text

# Applying the function to the text
df['Result'] = df['Summary'].map(summarize)

#    ASIN                 Summary   Result
# 0     0  This is the first text  This ..
# 1     1             Second text  Secon..
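
For the function in the question, which needs `n` and returns a list of sentences, a minimal sketch might look like the following. The `summarize` below is a hypothetical stand-in (it splits on periods and keeps the first `n` sentences) so that the snippet runs without NLTK. The column name `reviewText` and the sample reviews are assumptions for illustration.

```python
import pandas as pd

# Made-up sample data; the column name 'reviewText' is an assumption.
data = pd.DataFrame({
    'reviewText': [
        'The recipes are easy. The dogs love them. Great buy.',
        'Too short. Would not buy again.',
    ]
})

def summarize(text, n):
    """Hypothetical stand-in: return the first n sentences as a list."""
    sents = [s.strip() + '.' for s in text.split('.') if s.strip()]
    return sents[:n]

# Series.apply forwards extra keyword arguments to the function,
# so n can be passed without writing an explicit loop.
summaries = data['reviewText'].apply(summarize, n=1)

# Each result is a list of sentences; explode() flattens the lists into
# one row per sentence, like the nested loop in the question.
ours = pd.DataFrame({'our_summary': summaries.explode().reset_index(drop=True)})
print(ours)
```

Note that `apply` still calls `summarize` once per row, so this mainly tidies the code. If tokenization dominates the runtime, the speed-up may be small.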

Works like a charm. Thanks a bunch. – saremisona

0

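Long story short...

I'm going to assume that, since you are performing a text frequency analysis, the order of the reviewText entries doesn't matter. If that is the case: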

Mega_String = ' '.join(data['reviewText']) 

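This should concatenate all the strings in the reviewText column into one big string, with each review separated by a space.

You can then pass this result to your function.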
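
As a rough sketch of that idea, assuming the column is called `reviewText`: the snippet below uses plain `str.split` and `collections.Counter` as stand-ins for the NLTK tokenizer and `calculate_freq` from the question, and the sample reviews are made up. It produces one frequency table for the whole corpus rather than one per review, which may or may not suit the per-review ranking in the question.

```python
from collections import Counter

import pandas as pd

# Made-up sample data; the column name 'reviewText' is an assumption.
data = pd.DataFrame({'reviewText': ['Great book', 'great recipes', None]})

# Drop missing reviews and force str so that ' '.join does not fail on them.
reviews = data['reviewText'].dropna().astype(str)
mega_string = ' '.join(reviews)

# Stand-in for tokenizing plus calculate_freq: count words once for the
# whole corpus, then normalize by the most frequent word.
counts = Counter(mega_string.lower().split())
max_count = max(counts.values()) if counts else 1
frequency = {word: c / max_count for word, c in counts.items()}
print(frequency)
```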