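How can I avoid for loops and iterate over a pandas DataFrame properly?

I have been trying to optimize this code for a while. How can I avoid the for loops and iterate over a pandas DataFrame properly?

My DataFrame is loaded from a CSV file with 2 columns, the second of which contains text.

I have a function summarize(text, n) that takes a single text and an integer as input.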

from collections import defaultdict
# sent_tokenize and word_tokenize are assumed to be NLTK's tokenizers
from nltk.tokenize import sent_tokenize, word_tokenize

def summarize(text, n):
    sents = sent_tokenize(text)  # split the text into sentences
    # Make sure the review has at least as many sentences as the requested summary length
    assert n <= len(sents)
    list_sentences = [word_tokenize(s.lower()) for s in sents]  # word-tokenized sentences
    frequency = calculate_freq(list_sentences)  # word frequencies over all the sentences
    ranking = defaultdict(int)
    for i, sent in enumerate(list_sentences):
        for w in sent:
            if w in frequency:
                ranking[i] += frequency[w]
    # Call the rank function to get the indices of the highest-ranking sentences
    sents_idx = rank(ranking, n)
    # Return the best choices
    return [sents[j] for j in sents_idx]

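So, to summarize() all the texts, I first iterate over my DataFrame and build a list of all the texts. I then iterate over that list again and send the texts one by one to summarize() to get a summary of each. These for loops make my code really slow, but I have not been able to find a way to make it more efficient. I would greatly appreciate any suggestions.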

import pandas as pd

data = pd.read_csv('dataframe.csv')

text = data.iloc[:, 2]  # select the column holding the texts
list_of_strings = []
for t in text:
    list_of_strings.append(t)  # creating a list of all the texts

our_summary = []
for s in list_of_strings:
    for f in summarize(s, 1):
        our_summary.append(f)

ours = pd.DataFrame({"our_summary": our_summary})

Edit: the other two functions are:

from collections import defaultdict
from heapq import nlargest

# our_stopwords, min_freq and max_freq are globals defined elsewhere (not shown in the question)
def calculate_freq(list_sentences):
    frequency = defaultdict(int)
    for sentence in list_sentences:
        for word in sentence:
            if word not in our_stopwords:
                frequency[word] += 1

    # We want to filter out the words with frequency below 0.1 or above 0.9 (once normalized)
    if frequency:
        max_word = float(max(frequency.values()))
    else:
        max_word = 1
    for w in list(frequency):  # iterate over a copy of the keys so entries can be deleted
        frequency[w] = frequency[w] / max_word  # normalize
        if frequency[w] <= min_freq or frequency[w] >= max_freq:
            del frequency[w]  # filter
    return frequency


def rank(ranking, n):
    # return the n sentences with the highest ranking
    return nlargest(n, ranking, key=ranking.get)
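
For reference, `nlargest(n, ranking, key=ranking.get)` iterates over the dict's keys and returns the `n` keys with the largest values, highest first. A small sketch with made-up scores (the numbers are purely illustrative and do not come from the question's data):

```python
from heapq import nlargest

# Hypothetical scores: sentence index -> summed word frequency
ranking = {0: 1.5, 1: 3.2, 2: 0.7, 3: 2.9}

def rank(ranking, n):
    # Iterating a dict yields its keys; key=ranking.get compares them by value
    return nlargest(n, ranking, key=ranking.get)

top_two = rank(ranking, 2)
print(top_two)  # [1, 3]
```

The result is a list of sentence indices, which is why summarize() looks the sentences up again with `sents[j]`.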

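Input text: The recipes are easy and the dogs love them. I would buy this book again and again. The only problem is that the recipes don't tell you how many treats they make, but I guess that's because you can make them in different sizes. Great buy!

Output text: I would buy this book again and again.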

Source

2017-08-26 saremisona

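Instead of this code, could you provide some sample text and the expected output? –

You might want to look at pandas.DataFrame.apply –

`summarize()` calls another function. Could you include it, along with example input and output? – roganjosh

Have you tried something like this?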

import pandas as pd

# Test data
df = pd.DataFrame({'ASIN': [0, 1], 'Summary': ['This is the first text', 'Second text']})

# Example function
def summarize(text, n=5):
    """A very basic summary"""
    return (text[:n] + '..') if len(text) > n else text

# Applying the function to the text
df['Result'] = df['Summary'].map(summarize)

#    ASIN                 Summary   Result
# 0     0  This is the first text  This ..
# 1     1             Second text  Secon..
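
Note that `Series.map` calls the function with a single argument, so `n` falls back to its default here. To pass `n` explicitly, as the `summarize(s, 1)` call in the question does, `Series.apply` forwards extra keyword arguments to the function. Below is a minimal sketch using a stand-in `summarize` that, like the one in the question, returns a list of sentences. The naive split on `'.'` and the sample rows are assumptions for illustration only, not the NLTK-based function.

```python
import pandas as pd

def summarize(text, n):
    """Stand-in: return the first n 'sentences' as a list (naive split on '.')."""
    sents = [s.strip() for s in text.split('.') if s.strip()]
    return sents[:n]

# Made-up sample rows
df = pd.DataFrame({'Summary': ['First one. Second one.', 'Only one.']})

# Extra keyword arguments to Series.apply are forwarded to summarize
lists = df['Summary'].apply(summarize, n=1)

# Each cell holds a list; explode() flattens it to one sentence per row,
# which mirrors the nested loop that fills our_summary in the question
ours = lists.explode().reset_index(drop=True).to_frame('our_summary')
print(ours['our_summary'].tolist())  # ['First one', 'Only one']
```

This only removes the explicit loops. `summarize` is still called once per row, so any speed-up is likely to be modest if the tokenizing dominates the runtime.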

Source

2017-08-26 17:10:46 Romain

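Works like a charm. Thanks a bunch. – saremisona

Long story short...

I am going to assume that, since you are performing a text frequency analysis, the order of the reviewText entries does not matter. If that is the case: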

Mega_String = ' '.join(data['reviewText'])

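This should concatenate all the strings in the reviewText column into one big string, with the reviews separated by spaces.

You can then pass this result to your function.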
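
A small sketch of what the join produces, using made-up rows in place of the real reviewText column. Two caveats are worth keeping in mind: the result is a single string, so passing it to summarize() would pick sentences from the whole corpus rather than one summary per review, and `str.join` raises a `TypeError` on missing values, so those would need to be dropped or filled first.

```python
import pandas as pd

# Made-up rows standing in for the reviewText column
data = pd.DataFrame({'reviewText': ['Great book.', 'Dogs love it.', 'Easy recipes.']})

Mega_String = ' '.join(data['reviewText'])
print(Mega_String)  # Great book. Dogs love it. Easy recipes.

# A column with a missing value: drop it before joining
with_gap = pd.Series(['Great book.', None, 'Easy recipes.'])
joined = ' '.join(with_gap.dropna())
print(joined)  # Great book. Easy recipes.
```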

Source

2017-08-26 17:16:35 Yeile
