2017-05-30 162 views

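Check if a string is in a pandas DataFrame column and create a new DataFrame

I am trying to check whether a string is in a pandas column. I tried two approaches, but both seem to match substrings.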

itemName = "eco drum ecommerce" 
words = self.itemName.split(" ") 
df.columns = ['key','word','umbrella', 'freq'] 
df = df.dropna() 
df = df.loc[df['word'].isin(words)] 

I also tried it this way, but this matches substrings as well:

words = self.itemName.split(" ") 
words = '|'.join(words) 
df.columns = ['key','word','umbrella', 'freq'] 
df = df.dropna() 
df = df.loc[df['word'].str.contains(words, case=False)] 

The search string looks like this: "eco drum"

Then I did this:

words = self.itemName.split(" ") 
words = '|'.join(words) 

which ends up as:

eco|drum 

The "word" column was shown in a screenshot (not reproduced here).

Thank you. Is there a way to do this without matching substrings?

Answers


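You have the right idea. By default, .str.contains treats the pattern as a regular expression (regex=True). So all you need to do is add anchors to your regex pattern; for example, "ball" becomes "^ball$".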

import pandas as pd

df = pd.DataFrame({"key": ["largeball", "ball", "john", "smallball", "Ball"]})
# The anchors force a whole-cell match, so only "ball" and "Ball" remain
print(df.loc[df["key"].str.contains(r"^ball$", case=False)])

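More specifically for your question: because you want to search for several words, you have to build the regex pattern that you pass to contains.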

import pandas as pd

# Create dataframe
df = pd.DataFrame({"word": ["ecommerce", "ecommerce", "ecommerce", "ecommerce", "eco", "drum"]})
# Create regex pattern: each word is anchored so it must match the whole cell
word = "eco drum"
words = word.split(" ")
words = "|".join("^{}$".format(word) for word in words)
# Find matches in dataframe (only the "eco" and "drum" rows)
print(df.loc[df['word'].str.contains(words, case=False)])

In the line words = "|".join("^{}$".format(word) for word in words), the argument passed to join is a generator expression. Given ['eco', 'drum'] it produces this pattern: ^eco$|^drum$
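As a side note, here is a minimal sketch of two alternatives to hand-written anchors. It assumes a pandas version that provides Series.str.fullmatch (1.1 or later), and the sample rows are made up for illustration. Escaping each word with re.escape guards against regex metacharacters in the search terms, and the lowercase isin comparison avoids regular expressions entirely.

```python
import re

import pandas as pd

# Hypothetical sample data, not the asker's real column
df = pd.DataFrame({"word": ["ecommerce", "eco", "drum", "Drum", "drums"]})
words = "eco drum".split()

# Option 1: escape each word and group the alternatives, then require
# the whole cell to match (fullmatch supplies the anchoring).
pattern = "(?:" + "|".join(re.escape(w) for w in words) + ")"
by_regex = df[df["word"].str.fullmatch(pattern, case=False)]

# Option 2: no regex at all; compare lowercased values for exact membership.
by_isin = df[df["word"].str.lower().isin([w.lower() for w in words])]

print(by_regex["word"].tolist())
print(by_isin["word"].tolist())
```

Both selections keep "eco", "drum" and "Drum" and drop "ecommerce" and "drums". Which one reads better is a matter of taste; the isin version is probably the simpler choice when no real pattern matching is needed.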

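Hey @the-realtom, I'm not at my desktop right now, so I'll try it when I get home. So you're saying that in this case, where the regex pattern is a variable, I would do something like df = df.loc[df['word'].str.contains("^words$", case=False)]? Thanks, this looks like the right track. – PythonRookie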

Hey @the-realtom, I tried this, but the new pandas DataFrame is empty: df = df.loc[df['word'].str.contains('^words$', case=False)] – PythonRookie

I updated my answer. I had assumed words was a string containing a single word? –
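For readers following the comments, a small sketch of why the quoted pattern '^words$' produced an empty result: inside quotes, words is literal text, not the variable. The sample rows are made up for illustration, and a "words" row is included only to show what the literal pattern actually matches.

```python
import pandas as pd

# Hypothetical sample data; the "words" row exists only for demonstration
df = pd.DataFrame({"word": ["ecommerce", "eco", "drum", "words"]})
words = "eco drum".split()

# The quoted string is a literal pattern: it matches only the text "words"
literal = df[df["word"].str.contains("^words$", case=False)]

# Building the pattern from the variable anchors each search word separately
pattern = "|".join("^{}$".format(w) for w in words)
from_variable = df[df["word"].str.contains(pattern, case=False)]

print(literal["word"].tolist())
print(from_variable["word"].tolist())
```

Without a row whose text is exactly "words", the literal pattern selects nothing, which matches the empty DataFrame reported in the comment.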
