Extracting tags in tandem with Python + BeautifulSoup to build a list of lists

I'm fairly new to Python/BeautifulSoup and was wondering whether I could get some guidance on how to accomplish the following.

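I have a web page whose HTML is structured as follows:

1) A block of code inside a single tag that contains all the image names (NAME1, NAME2, NAME3).

2) A block of code inside tags that hold the image URLs.

3) A date that appears once on the page. I store it in the 'date' variable (this has already been extracted).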

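From the code, I am trying to extract a list of lists of the form [['image1', 'url1', 'date'], ['image2', 'url2', 'date']], which I will later convert into dictionaries (via dict(zip(labels, values))) and insert into a MySQL table.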
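To make the target concrete, here is a minimal sketch of the list of lists and of the later dict(zip(labels, values)) step. The labels list and the placeholder values are assumptions for illustration only; the actual MySQL column names are not given in the post.

```python
# Hypothetical column labels; the real MySQL column names are not given in the post.
labels = ['name', 'url', 'date']

# The target list of lists, filled with placeholder values.
rows = [
    ['image1', 'url1', '2017-10-12'],
    ['image2', 'url2', '2017-10-12'],
]

# One dict per row, pairing each label with the value in the same position.
records = [dict(zip(labels, values)) for values in rows]
print(records)
```

Each dict in records then maps a column label to the value for one image.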

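All I have figured out so far is how to extract two lists, one containing all the image names and one containing all the URLs. Any ideas on how to get what I am trying to accomplish?

A few things to keep in mind:

1) The number of images always varies, along with the number of names (1:1).

2) The date always appears exactly once.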
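Since names and URLs are expected to pair up one to one, and zip() silently stops at the shorter input, a length check before pairing may help catch a page that parsed unevenly. This is only a sketch; the helper name build_rows is hypothetical and not part of the original code.

```python
def build_rows(names, urls, date):
    """Pair names with URLs position by position and append the shared date."""
    if len(names) != len(urls):
        # zip() would silently drop the unmatched items, so fail loudly instead.
        raise ValueError(
            'expected one URL per name, got %d names and %d URLs'
            % (len(names), len(urls))
        )
    return [[n, u, date] for n, u in zip(names, urls)]


rows = build_rows(
    ['NAME1', 'NAME2'],
    ['www.image1.com/1.jpg', 'www.image2.com/2.jpg'],
    '2017-10-12',
)
print(rows)
```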

P.S. Also, if there is a more elegant way to extract the data with bs4, please let me know!

from bs4 import BeautifulSoup 
name = [] 
url = [] 
date = '2017-10-12' 

text = '<div class="tabs"> <ul><li> NAME1</li><li> NAME2</li><li> NAME3</li> </ul> <div><div><div class="img-wrapper"><img alt="" src="www.image1.com/1.jpg" title="image1.jpg"></img> </div> <center><a class="button print" href="javascript: w=window.open("www.image1.com/1.jpg); w.print();"> Print</a> </center></div><div> <div class="img-wrapper"><img alt="" src="www.image2.com/2.jpg" title="image2.jpg"></img> </div> <center><a class="button print" href="javascript: w=window.open("www.image2.com/2.jpg"); w.print();">Print</a> </center></div><div> <div class="img-wrapper"><img alt="" src="www.image1.com/3.jpg" title="image3.jpg"></img></div> <center><a class="button print" href="javascript: w=window.open("www.image1.com/3.jpg"); w.print();"> Print</a> </center></div> </div></div>' 
soup = BeautifulSoup(text, 'lxml')
# print(soup.prettify())

# Collect the image URLs from each img-wrapper div
for wrapper in soup.find_all('div', attrs={'class': 'img-wrapper'}):
    for img in wrapper.find_all('img', src=True):
        url.append(img['src'])

# Collect the image names from the <li> items
for ultag in soup.find_all('ul'):
    for litag in ultag.find_all('li'):
        name.append(litag.text)

print(url)
print(name)


2017-10-12 FlyingZebra1

`values = [list(i) + [date] for i in zip(name, url)]` Something like this? –

OMG, yes. Why did you tuck this away in a comment instead of posting it as an answer? Thanks! – FlyingZebra1

Busy at the moment, but you can post it as an answer yourself and accept it; it may be useful to other readers. –

Here is another possible way to pull out the URLs and names:

url = [tag.get('src') for tag in soup.find_all('img')] 
name = [tag.text.strip() for tag in soup.find_all('li')] 

print(url) 
# ['www.image1.com/1.jpg', 'www.image2.com/2.jpg', 'www.image1.com/3.jpg'] 

print(name) 
# ['NAME1', 'NAME2', 'NAME3']

As for building the final list, here is something functionally similar to what @tmadam suggested:

print([pair + [date] for pair in list(map(list, zip(url, name)))]) 
# [['www.image1.com/1.jpg', 'NAME1', '2017-10-12'], 
# ['www.image2.com/2.jpg', 'NAME2', '2017-10-12'], 
# ['www.image1.com/3.jpg', 'NAME3', '2017-10-12']]

Note that map is used quite rarely these days, and in some places its use is discouraged outright.

Alternatively:

n = len(url) 
print(list(map(list, zip(url, name, [date] * n)))) 
# [['www.image1.com/1.jpg', 'NAME1', '2017-10-12'], ['www.image2.com/2.jpg', 'NAME2', '2017-10-12'], ['www.image1.com/3.jpg', 'NAME3', '2017-10-12']]
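Because map is discouraged in some places, the same rows can probably also be built with a plain list comprehension. The sketch below hardcodes the url and name values from the output shown earlier so that it runs on its own; in practice they would come from the soup.

```python
# Sample values copied from the printed output earlier in this answer.
url = ['www.image1.com/1.jpg', 'www.image2.com/2.jpg', 'www.image1.com/3.jpg']
name = ['NAME1', 'NAME2', 'NAME3']
date = '2017-10-12'

# Unpack each (url, name) pair and build the inner list directly, without map.
rows = [[u, n, date] for u, n in zip(url, name)]
print(rows)
```

Swapping u and n inside the inner list gives the name-first order the question asked for.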


2017-10-12 14:21:45

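Thanks for the alternative bs4 solution, as well as for answering the original question! Looks like I need more practice to write more efficient bs4 code. – FlyingZebra1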
