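I have this code that I have been trying to optimize for a while. How can I avoid the for loops and iterate through a pandas DataFrame properly?
My dataframe is a CSV file with 2 columns, where the second column contains the texts. It looks like the one in the image:
I have a function summarize(text, n) that takes a single text and an integer as input.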
    from collections import defaultdict
    from nltk.tokenize import sent_tokenize, word_tokenize

    def summarize(text, n):
        sents = sent_tokenize(text)  # split the text into sentences
        # Check that the review has at least as many sentences as the requested summary length
        assert n <= len(sents)
        list_sentences = [word_tokenize(s.lower()) for s in sents]  # word-tokenized sentences
        frequency = calculate_freq(list_sentences)  # word frequencies over all the sentences
        ranking = defaultdict(int)
        for i, sent in enumerate(list_sentences):
            for w in sent:
                if w in frequency:
                    ranking[i] += frequency[w]
        # Call the rank function to get the indices of the highest-ranking sentences
        sents_idx = rank(ranking, n)
        # Return the best choices
        return [sents[j] for j in sents_idx]
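To make the scoring step concrete, here is a small trace on made-up data. The tokenized sentences and the frequency table are hypothetical, and the selection assumes `rank` picks indices with `heapq.nlargest`, as the helper shown further down does.

```python
from collections import defaultdict
from heapq import nlargest

# Hypothetical inputs: tokenized sentences and an already-filtered frequency table
list_sentences = [
    ["recipes", "are", "easy"],
    ["i", "would", "buy", "this", "book", "again"],
    ["great", "buy"],
]
frequency = {"buy": 0.75, "book": 0.5, "recipes": 0.5, "easy": 0.25, "great": 0.25}

# Same accumulation as in summarize(): a sentence's score is the sum
# of the weights of its words that appear in the frequency table
ranking = defaultdict(int)
for i, sent in enumerate(list_sentences):
    for w in sent:
        if w in frequency:
            ranking[i] += frequency[w]

# Iterating a dict yields its keys, so nlargest returns sentence indices, best first
best = nlargest(2, ranking, key=ranking.get)
print(dict(ranking))  # {0: 0.75, 1: 1.25, 2: 1.0}
print(best)           # [1, 2]
```

The indices come back ordered by score, not by position in the text, so a summary with n > 1 may list sentences out of their original order.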
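So, to summarize() all the texts, I first iterate through my dataframe and build a list of all the texts, and then I iterate over that list again to send the texts one by one to summarize() and get a summary for each. These for loops make my code really slow, but I haven't been able to find a way to make it more efficient, and I would greatly appreciate any suggestions.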
    import pandas as pd

    data = pd.read_csv('dataframe.csv')
    text = data.iloc[:,2]  # select the text column by position
    list_of_strings = []
    for t in text:
        list_of_strings.append(t)  # creating a list of all the texts
    our_summary = []
    for s in list_of_strings:
        for f in summarize(s, 1):
            our_summary.append(f)
    ours = pd.DataFrame({"our_summary": our_summary})
Edit: the other two functions are:
    def calculate_freq(list_sentences):
        frequency = defaultdict(int)
        for sentence in list_sentences:
            for word in sentence:
                if word not in our_stopwords:
                    frequency[word] += 1
        # We want to filter out the words with frequency below 0.1 or above 0.9 (once normalized)
        if frequency.values():
            max_word = float(max(frequency.values()))
        else:
            max_word = 1
        # Iterate over a copy of the keys: deleting from a dict while iterating it fails in Python 3
        for w in list(frequency):
            frequency[w] = frequency[w]/max_word  # normalize
            if frequency[w] <= min_freq or frequency[w] >= max_freq:
                del frequency[w]  # filter
        return frequency
    from heapq import nlargest

    def rank(ranking, n):
        # return the indices of the n sentences with the highest ranking
        return nlargest(n, ranking, key=ranking.get)
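As a rough illustration of what the normalize-and-filter step in `calculate_freq` does, here is a sketch on toy data. The thresholds 0.1 and 0.9 are taken from the comment in the function; the stopword set and the sentences are hypothetical, and `collections.Counter` is swapped in for the `defaultdict` counting.

```python
from collections import Counter

# Assumed values: thresholds from the comment in calculate_freq;
# the stopword set is hypothetical
min_freq, max_freq = 0.1, 0.9
our_stopwords = {"the", "a", "."}

list_sentences = [
    ["the", "dog", "loves", "the", "treats", "."],
    ["the", "dog", "eats", "a", "treat", "."],
    ["dog", "treats", "."],
]

# Count every non-stopword across all sentences in one pass
counts = Counter(w for sent in list_sentences for w in sent if w not in our_stopwords)

# Normalize by the highest count, keeping only words strictly between the thresholds
max_count = max(counts.values(), default=1)
frequency = {
    w: c / max_count
    for w, c in counts.items()
    if min_freq < c / max_count < max_freq
}

print(sorted(frequency))   # ['eats', 'loves', 'treat', 'treats']
print("dog" in frequency)  # False
```

Two things stand out in this sketch. The most frequent word always normalizes to 1.0, so with a 0.9 upper threshold it is always dropped ("dog" here). And building a new dict, rather than deleting keys during iteration, sidesteps the RuntimeError that Python 3 raises when a dict changes size while being iterated.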
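Input text: The recipes are easy and the dogs love them. I would buy this book again and again. The only thing is that the recipes don't tell you how many treats they make, but I suppose that's because you can make them in different sizes. Great buy!
Output text: I would buy this book again and again.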
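Instead of this code, could you provide some sample text and the expected output? –
You might want to take a look at pandas.DataFrame.apply –
`summarize()` calls another function. Can you include the input and output for this example? – roganjosh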
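The `apply` suggestion in the comments might look something like the sketch below. It is only an illustration: `summarize_stub` is a hypothetical stand-in for the question's `summarize()` (it just returns the first n sentences, so the example runs without NLTK), and the column names `id` and `text` are assumptions.

```python
import re

import pandas as pd


def summarize_stub(text, n):
    # Hypothetical stand-in for summarize(): returns the first n sentences
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return sentences[:n]


# Hypothetical two-column frame; the column names are assumptions
data = pd.DataFrame({
    "id": [1, 2],
    "text": [
        "Great book. I would buy it again.",
        "Too long. Not for me. Skipped it.",
    ],
})

# One pass over the text column, with no intermediate list;
# the stub returns a list, so take its single element for n = 1
summaries = data["text"].apply(lambda t: summarize_stub(t, 1)[0])
ours = pd.DataFrame({"our_summary": summaries})

print(ours["our_summary"].tolist())  # ['Great book.', 'Too long.']
```

Note that `Series.apply` still calls the function once per row in Python, so this mostly removes the intermediate list and the nested loop rather than the per-text cost. If most of the time is spent inside `summarize()` itself, which seems likely given the tokenization, profiling that function would probably be the more useful next step.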