Beautiful Soup does not scrape all visible website data (Python 3)

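My problem is that I am trying to scrape a number of different websites and save all of their visible text to a .txt file. Unfortunately, I am not getting all of the available text from these sites. I have posted a working example of my code below: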

import requests
from bs4 import BeautifulSoup

urls = ['https://www304.americanexpress.com/credit-card/compare']

with open('thisisanew.txt', 'w', encoding='utf-8') as outfile:
    for url in urls:
        website = requests.get(url)
        # Name the parser explicitly so the result does not depend on what is installed.
        soup = BeautifulSoup(website.content, 'html.parser')
        # Collect the text of every <p> element; text outside <p> tags is not captured.
        text = [''.join(s.findAll(text=True)) for s in soup.findAll('p')]
        for item in text:
            print(item, file=outfile)

If you test this code, you get the following data:

Ratings & Reviews for this card are currently not available 
Ratings & Reviews for this card are currently not available 
Ratings & Reviews for this card are currently not available 
All users of our online services subject to Privacy Statement and agree to be bound by Terms of etc...

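How exactly do I get the rest of the visible data on this page? Based on my research, I am fairly sure it has to do with my soup.findAll('p') argument, but I don't know what to add to get the rest of the data.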

2014-09-02 user3682157

Instead of looking for paragraphs, get .text from body:

print(soup.body.text, file=outfile)

If you want to keep script tag contents out of the result, you can find all of the top-level tags (see recursive=False) and join their text:

print(''.join([element.text for element in soup.body.find_all(lambda tag: tag.name != 'script', recursive=False)]))
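
A variation on the same idea, offered as a minimal sketch rather than part of the original answer: remove the script, style and noscript subtrees wherever they occur in the tree, then read the text of what remains. The sample markup and the function name text_without_scripts are assumptions made up for illustration, and the sketch uses the built-in html.parser so it runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
<div><h1>Compare cards</h1><script>var NAV = {};</script></div>
<p>Ratings &amp; Reviews</p>
<style>p { color: red; }</style>
</body></html>
"""


def text_without_scripts(html):
    """Return the body text after dropping script, style and noscript subtrees."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(["script", "style", "noscript"]):
        tag.decompose()  # remove the tag and everything inside it
    root = soup.body or soup  # fall back to the whole document if there is no <body>
    # One text fragment per line, with surrounding whitespace stripped.
    return root.get_text(separator="\n", strip=True)


if __name__ == "__main__":
    print(text_without_scripts(SAMPLE_HTML))
```

Unlike filtering only the direct children of body, this also drops script tags nested inside other elements. It still cannot tell whether the remaining elements are actually rendered by a browser.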

2014-09-02 04:35:44 alecxe

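Hi alecxe, I thought of that, but it gives me all of the data on the page, and much of it is useless (e.g. if (NAV == null || typeof(NAV) == "undefined") { var NAV = new Object() } NAV.RWD = { body: document.getElementsByTagName ). Is there a compromise between these two approaches? – user3682157 2014-09-02 04:45:49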

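@user3682157 Well, right, but you cannot easily and reliably tell whether an element is "visible" or not with BeautifulSoup. At the very least, you can skip the script tags. Alternatively, you could switch to Selenium, which really knows what is visible and what is not. – alecxe 2014-09-02 04:50:53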

@user3682157 I have updated the answer to include skipping the script tag contents. – alecxe 2014-09-02 11:55:35
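
One possible compromise between the two approaches discussed in the comments, sketched under stated assumptions: walk every text node, skip comments and text inside tags that are never rendered, and skip anything under an element marked hidden inline. The names NON_VISIBLE_PARENTS, looks_hidden and visible_strings, and the sample markup, are hypothetical. This is only a heuristic: it sees the hidden attribute and inline display:none styles, not external stylesheets or JavaScript, so it is not a substitute for a real browser such as Selenium drives.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Tags whose text is not rendered as page content.
NON_VISIBLE_PARENTS = {"script", "style", "noscript", "head", "title", "meta", "[document]"}

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><head><title>Cards</title></head><body>
<!-- tracking -->
<h1>Compare cards</h1>
<div style="display: none"><p>Hidden offer</p></div>
<p hidden>Also hidden</p>
<script>var NAV = {};</script>
<p>Ratings &amp; Reviews</p>
</body></html>
"""


def looks_hidden(tag):
    """Heuristic check of inline markers only; external CSS and JavaScript are not seen."""
    style = (tag.get("style") or "").replace(" ", "").lower()
    return tag.has_attr("hidden") or "display:none" in style


def visible_strings(html):
    """Return the non-empty text fragments that are probably visible."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    for node in soup.find_all(string=True):
        if isinstance(node, Comment):
            continue  # HTML comments are never displayed
        if node.parent.name in NON_VISIBLE_PARENTS:
            continue  # script/style/head text is not page content
        if any(looks_hidden(parent) for parent in node.parents):
            continue  # some ancestor is hidden inline
        text = node.strip()
        if text:
            result.append(text)
    return result


if __name__ == "__main__":
    for line in visible_strings(SAMPLE_HTML):
        print(line)
```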
