2016-12-26 126 views

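Extracting text between dynamic HTML tags using Python BeautifulSoup

I need to extract the text between HTML tags. I use BeautifulSoup to extract the data and store the text in variables for further processing. Later I found that the text I need comes from two different tags, but it still has to be stored in the same variable. I have included my earlier code and a sample of the HTML below. Please help me get the final result shown in the expected output.

Sample HTML text: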

<DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">1 of 80 DOCUMENTS</SPAN></P> 
<DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Financial Times (London, England)</SPAN></P> 
<DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2015 The Financial Times Ltd.<BR>All Rights Reserved<BR>Please do not cut and paste FT articles and redistribute by email or post to the web.</SPAN></P> 

<DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">80 of 80 DOCUMENTS</SPAN></P> 
</DIV> 
<BR><DIV CLASS="c3"><P CLASS="c1"><SPAN CLASS="c2">Financial Times (London,England)</SPAN></P> 
</DIV> 
<DIV CLASS="c3"><P CLASS="c1"><SPAN CLASS="c2">Copyright 1990 The Financial Times Limited</SPAN></P> 
</DIV> 

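From the HTML above, I need to store the document counts (1 of 80 DOCUMENTS, 80 of 80 DOCUMENTS) in a single variable. The other fields follow the same pattern. I wrote this code for div.c0: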

    from bs4 import BeautifulSoup

    # response is assumed to hold the HTML text fetched earlier
    soup = BeautifulSoup(response, 'html.parser')
    docpublicationcpyright = soup.select('div.c0')

    # Each document is expected to contribute three blocks:
    # document count, publication, copyright
    list1 = [b.text.strip() for b in docpublicationcpyright]
    doccountvalues = list1[0::3]
    publicationvalues = list1[1::3]
    copyrightvalues = list1[2::3]
    documentcount = doccountvalues

    publicationpaper = publicationvalues

Any help would be greatly appreciated.


Can anyone help me with this? – Mho


Please post a sample of the output you want. –


Sample output: documentcount = [1 of 80 DOCUMENTS, 80 of 80 DOCUMENTS], publicationpaper = [Financial Times (London, England), Financial Times (London,England)] – Mho

Answers

1

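The sample HTML is not well-formed. For example, the first DIV element is missing its closing tag. Even with this kind of HTML, you can still scrape the required data by matching the text with a regular expression.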
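To illustrate the point about the missing closing tag, here is a minimal sketch. The `html` fragment is a hypothetical, shortened version of the sample in the question, and the variable names are made up for this example. It suggests that `html.parser` treats the second DIV as a child of the unclosed first one, while a regex match on the SPAN text is unaffected by the nesting.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical two-document fragment, shortened from the sample in the question.
html = (
    '<DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">1 of 80 DOCUMENTS</SPAN></P>\n'
    '<DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">80 of 80 DOCUMENTS</SPAN></P>\n'
    '</DIV>'
)
soup = BeautifulSoup(html, 'html.parser')

divs = soup.select('div.c0')
# The first DIV is never closed, so the second DIV is parsed as its child.
nested = divs[1].parent is divs[0]
# The outer DIV's text therefore also contains the inner DIV's text.
div_texts = [d.get_text(strip=True) for d in divs]

# Matching on the SPAN text does not depend on how the DIVs nest.
pattern = re.compile(r'of [0-9]+ DOCUMENTS')
counts = [s.get_text(strip=True) for s in soup.find_all('span', string=pattern)]
```

This is why slicing the result of `soup.select('div.c0')` in steps of three can give unexpected values on this input: the outer DIV's text swallows everything nested inside it.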

I wrote some sample code based on the HTML posted in the question. It extracts all three required fields:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(response, 'html.parser')

# Find every SPAN whose text looks like "N of M DOCUMENTS".
# string= is the current name for the older text= argument.
documentElements = soup.find_all('span', string=re.compile(r'of [0-9]+ DOCUMENTS'))
documentCountList = []
publicationPaperList = []
documentPublicationCopyrightList = []
for elem in documentElements:
    documentCountList.append(elem.get_text().strip())
    sibling = elem.parent.find_next_sibling('div')
    if sibling:
        # Unclosed DIV: publication and copyright are nested in the next sibling DIV
        spans = sibling.find_all('span')
        publicationPaperList.append(spans[0].get_text().strip())
        documentPublicationCopyrightList.append(spans[1].get_text().strip())
    else:
        # Closed DIV: the next two DIVs in document order hold the two fields
        publicationDiv = elem.parent.parent.find_next('div')
        publicationPaperList.append(publicationDiv.get_text().strip())
        documentPublicationCopyrightList.append(publicationDiv.find_next('div').get_text().strip())

print(documentCountList)
print(publicationPaperList)
print(documentPublicationCopyrightList)

For the sample HTML, the output looks like this:

[u'1 of 80 DOCUMENTS', u'80 of 80 DOCUMENTS'] 
[u'Financial Times (London, England)', u'Financial Times (London,England)'] 
[u'Copyright 2015 The Financial Times Ltd.All Rights ReservedPlease do not cut and paste FT articles and redistribute by email or post to the web.', u'Copyright 1990 The Financial Times Limited'] 
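As a possible alternative, and only as a sketch: since every field in the sample sits in a `span.c2`, the spans could be collected in document order and sliced in steps of three, regardless of whether the enclosing DIV is `c0` or `c3`. This assumes each document always has exactly three such spans in the same order, which the sample suggests but does not guarantee. The `html` string is a shortened copy of the sample with the first copyright text trimmed, and `span_texts` is a name introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample HTML from the question (copyright text trimmed).
html = '''<DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">1 of 80 DOCUMENTS</SPAN></P>
<DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Financial Times (London, England)</SPAN></P>
<DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2015 The Financial Times Ltd.</SPAN></P>

<DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">80 of 80 DOCUMENTS</SPAN></P>
</DIV>
<BR><DIV CLASS="c3"><P CLASS="c1"><SPAN CLASS="c2">Financial Times (London,England)</SPAN></P>
</DIV>
<DIV CLASS="c3"><P CLASS="c1"><SPAN CLASS="c2">Copyright 1990 The Financial Times Limited</SPAN></P>
</DIV>
'''

soup = BeautifulSoup(html, 'html.parser')

# All SPANs in document order, regardless of the enclosing DIV class.
span_texts = [s.get_text(strip=True) for s in soup.select('span.c2')]

# Assumption: exactly three SPANs per document, always in this order.
documentcount = span_texts[0::3]
publicationpaper = span_texts[1::3]
copyrightvalues = span_texts[2::3]
```

If a document can have a missing or extra span, the slices drift out of step, so the regex-anchored loop above is the safer choice in that case.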

Thank you. It works well. – Mho
