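Filter a DataFrame by dictionary values while assigning the dictionary keys to matching rows?

I have a DataFrame with a 'Links' column containing the URLs of several thousand online articles. There is one URL per observation.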

import pandas as pd

urls_list = ['http://www.ajc.com/news/world/atlan...', 
      'http://www.seattletimes.com/sports/...', 
      'https://www.cjr.org/q_and_a/washing...', 
      'https://www.washingtonpost.com/grap...', 
      'https://www.nytimes.com/2017/09/01/...', 
      'http://www.oregonlive.com/silicon-f...'] 

df = pd.DataFrame(urls_list,columns=['Links'])

I also have a dictionary with publication names as keys and domain names as values.

urls_dict = dict({'Atlanta Journal-Constitution':'ajc.com', 
        'The Washington Post':'washingtonpost.com', 
        'The New York Times':'nytimes.com'})

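I want to filter the DataFrame to keep only the observations whose 'Links' value contains one of the domains in the dictionary values, and at the same time assign the matching publication name from the dictionary keys to a new column, 'Publication'. My plan was to use the code below to build the 'Publication' column, then drop the None values from that column to filter the DataFrame after the fact.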

pub_list = []

for row in df['Links']:
    for k, v in urls_dict.items():
        if row.find(v) > -1:
            publication = k
        else:
            publication = None
        pub_list.append(publication)

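However, the list pub_list that I get back, although it appears to do what I intended, is three times as long as my DataFrame. Can someone suggest how to fix the code above? Or, alternatively, suggest a cleaner solution that can both (1) filter the 'Links' column of the DataFrame using the dictionary values (domain names) and (2) create a new 'Publication' column from the dictionary keys (publication names)? (Note that df is created here with a single column for brevity; the actual file will have many columns, so I need to be able to specify which column to filter on.)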
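The list comes out three times as long because pub_list.append(publication) sits inside the inner loop, so it runs once per dictionary entry (three entries here) rather than once per row. Below is a minimal sketch of one possible fix, assuming the first matching domain should win. The helper name match_publication and the variable filtered are made up for illustration, and only three of the URLs above are reused to keep it short.

```python
import pandas as pd

# A subset of the question's (truncated) URLs.
urls_list = ['http://www.ajc.com/news/world/atlan...',
             'http://www.seattletimes.com/sports/...',
             'https://www.washingtonpost.com/grap...']
df = pd.DataFrame(urls_list, columns=['Links'])

urls_dict = {'Atlanta Journal-Constitution': 'ajc.com',
             'The Washington Post': 'washingtonpost.com',
             'The New York Times': 'nytimes.com'}


def match_publication(url):
    """Hypothetical helper: return the first publication whose domain
    occurs in `url`, or None when no domain matches."""
    for name, domain in urls_dict.items():
        if domain in url:
            return name
    return None


# Exactly one entry per row, so the list lines up with the DataFrame.
pub_list = [match_publication(url) for url in df['Links']]
df['Publication'] = pub_list

# Drop the rows that matched no domain.
filtered = df.dropna(subset=['Publication'])
```

Because the column name is passed explicitly ('Links'), the same pattern should carry over to a DataFrame with many columns.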

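Edit: I want to add some clarification in light of RagingRoosevelt's answer. I'd like to avoid using merge because some domains may not match exactly. For example, with ajc.com I'd also like to capture myajc.com, and with washingtonpost.com I'd like to get subdomains such as live.washingtonpost.com. So I'm hoping for a "find a substring within a string" type of solution using str.contains(), find(), or the in operator.

Source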

2017-09-01 dmitriys

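I was able to figure it out using a nested dictionary comprehension (or, alternatively, a nested list comprehension), plus some extra DataFrame manipulation to clean up the column and drop the empty rows.

Using a nested dictionary comprehension (more precisely, a dictionary comprehension nested inside a list comprehension):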

df['Publication'] = [{k: k for k,v in urls_dict.items() if v in row} for row in df['Links']] 

# Format the 'Publication' column to get rid of duplicate 'key' values 
df['Publication'] = df['Publication'].astype(str).str.strip('{}').str.split(':',expand=True)[0] 

# Remove blank rows from 'Publication' column 
df = df[df['Publication'] != '']

Similarly, using a nested list comprehension:

# First converting dict to a list of lists 
urls_list_of_lists = list(map(list,urls_dict.items())) 

# Nested list comprehension using 'in' operator 
df['Publication'] = [[item[0] for item in urls_list_of_lists if item[1] in row] for row in df['Links']] 

# Format the 'Publication' column to get rid of duplicate brackets 
df['Publication'] = df['Publication'].astype(str).str.strip('[]') 

# Remove blank rows from 'Publication' column 
df = df[df['Publication'] != '']
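Both versions above go through astype(str), which leaves the quote characters from the dict or list repr inside the 'Publication' values. As a possible alternative (a sketch, not the method used above), the same substring matching can be done with Series.str.extract and a pattern built from the dictionary values. The names domain_to_name, pattern, matched_domain, and result are illustrative, and the live.washingtonpost.com URL is made up to show subdomain matching.

```python
import re

import pandas as pd

df = pd.DataFrame({'Links': [
    'http://www.ajc.com/news/world/atlan...',
    'http://www.seattletimes.com/sports/...',
    'https://live.washingtonpost.com/example',  # made-up URL for illustration
]})

urls_dict = {'Atlanta Journal-Constitution': 'ajc.com',
             'The Washington Post': 'washingtonpost.com',
             'The New York Times': 'nytimes.com'}

# Invert the mapping so a matched domain can be looked up by name.
domain_to_name = {domain: name for name, domain in urls_dict.items()}

# One alternation pattern; re.escape keeps the dots literal.
pattern = '(' + '|'.join(re.escape(d) for d in domain_to_name) + ')'

# First matching domain per row, NaN where nothing matches.
matched_domain = df['Links'].str.extract(pattern, expand=False)

# Translate domains to publication names, then keep only matched rows.
df['Publication'] = matched_domain.map(domain_to_name)
result = df[df['Publication'].notna()]
```

Since this is a plain substring search, it should also pick up hosts such as myajc.com, at the cost of matching the domain anywhere in the URL, including the path.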

Source

2017-09-26 17:16:01 dmitriys

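Here's what I would do:

Use DataFrame.apply to add a new column to your DataFrame containing only the domain.
Use DataFrame.merge (with the how='inner' option) to merge the two DataFrames on the domain field.

Using loops to work with DataFrames is a bit dirty. If you're just iterating over columns or rows, there is generally a DataFrame method that does the same thing more cleanly.

If you'd like, I can expand on this with an example.

Edit: Here's what that looks like. Note that I'm using a fairly horrible regular expression to capture the domain.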

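import re

def domain_extract(row):
    # Capture the host without the scheme or a leading "www."
    s = row['Links']
    # [A-Za-z0-9.-] instead of [A-z0-9.]: the A-z range also matches
    # punctuation such as "[" and "_", and hyphens were not allowed
    p = r'(?:(?:\w+)?(?::\/\/)(?:www\.)?)?([A-Za-z0-9.-]+)\/.*'
    m = re.match(p, s)
    if m is not None:
        return m.group(1)
    else:
        return None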

df['Domain'] = df.apply(domain_extract, axis=1) 

dfo = pd.DataFrame({'Name': ['Atlanta Journal-Constitution', 'The Washington Post', 'The New York Times'], 'Domain': ['ajc.com', 'washingtonpost.com', 'nytimes.com']}) 

df.merge(dfo, on='Domain', how='inner')[['Links', 'Domain', 'Name']]
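If the regular expression feels too fragile, the host can probably be pulled out with the standard library's urllib.parse.urlparse instead. The sketch below is one way to do that while also allowing the looser matches mentioned in the question's edit (myajc.com, live.washingtonpost.com) by testing whether the host ends with a known domain. It swaps the merge for a per-row lookup, so it is a variation on the approach above rather than the same technique. The helper name publication_for, the variable matched, and the two example URLs marked as made up are assumptions for illustration.

```python
from urllib.parse import urlparse

import pandas as pd

df = pd.DataFrame({'Links': [
    'http://www.ajc.com/news/world/atlan...',
    'http://www.seattletimes.com/sports/...',
    'https://live.washingtonpost.com/example',  # made-up URL for illustration
    'http://www.myajc.com/example',             # made-up URL for illustration
]})

urls_dict = {'Atlanta Journal-Constitution': 'ajc.com',
             'The Washington Post': 'washingtonpost.com',
             'The New York Times': 'nytimes.com'}


def publication_for(url):
    """Hypothetical helper: return the publication whose domain the URL's
    host ends with, or None when there is no such domain."""
    host = urlparse(url).hostname or ''
    for name, domain in urls_dict.items():
        if host.endswith(domain):
            return name
    return None


df['Name'] = df['Links'].apply(publication_for)

# Keep only the rows that matched a known publication.
matched = df.dropna(subset=['Name'])
```

Matching on the end of the host is deliberately loose: it accepts myajc.com, but it would equally accept any other host that happens to end in ajc.com.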

Source

2017-09-01 21:08:24 RagingRoosevelt

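Thanks. This works, but I'd like to avoid using 'merge' because some domains may not be an _exact_ match. For example, with 'ajc.com' I'd also like to capture 'myajc.com', and with 'washingtonpost.com' I'd like to get subdomains such as 'live.washingtonpost.com' as well. So I'm hoping to find a "find a substring within a string" type of solution with 'str.contains()' or 'find()' for added flexibility. – dmitriys

It looks like it should be possible to do a fuzzy-match merge: https://stackoverflow.com/questions/13636848/is-it-possible-to-do-fuzzy-match-merge-with-python-pandas – RagingRoosevelt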
