2017-08-02 74 views

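Dynamic text scraping

I have followed many tutorials on scraping JavaScript-rendered pages, but I cannot manage to extract the numbers from this table: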

http://www.wsj.com/mdc/public/npage/2_3023_creditdervs.html

I tried the latest Sentdex tutorial, using this code:

import sys

import bs4 as bs
from PyQt5.QtCore import QUrl
from PyQt5.QtWebEngineWidgets import QWebEnginePage
from PyQt5.QtWidgets import QApplication


class Page(QWebEnginePage):
    """Load a URL in a Qt web engine page and keep the rendered HTML."""

    def __init__(self, url):
        # A QApplication must exist before any Qt web object is created.
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.html = ''
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))
        self.app.exec_()  # blocks until Callable() calls quit()

    def _on_load_finished(self):
        # toHtml() is asynchronous and returns nothing; the HTML arrives in Callable().
        self.toHtml(self.Callable)
        print('Load finished')

    def Callable(self, html_str):
        self.html = html_str
        self.app.quit()


def main():
    page = Page('http://www.wsj.com/mdc/public/npage/2_3023_creditdervs.html')
    soup = bs.BeautifulSoup(page.html, 'html.parser')
    tableSup = soup.find_all("td", {"class": "col2 yellowBack"})
    print(tableSup)


if __name__ == '__main__':
    main()

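It looks like I am off target. People always talk about scripts tied to text that appears in the page source but then disappears from the tags Beautiful Soup returns, yet I cannot find any script associated with the values in the main table of the page above.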

Any advice on where I should direct my research?

Answer


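Note that the table you want to scrape is inside an iframe. You should request that iframe's URL directly and then scrape the table from the response. I found the iframe URL by simply inspecting the element in the browser. An example using requests: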

from bs4 import BeautifulSoup
import requests

# URL of the iframe that actually holds the table (found by inspecting the page).
iframe = "https://web.apps.markit.com/WMXAXLP?YYY2220_zJkhPN/sWPxwhzYw8K4DcqW07HfIQykbYMaXf8fTzWQEqN6Sq2pe6I0o/TehV5qd"
html = requests.get(iframe).text
soup = BeautifulSoup(html, 'html.parser')

# Select the cells whose class attribute is exactly "col2 yellowBack".
column = soup.find_all("td", {"class": "col2 yellowBack"})
values = [cell.string for cell in column]

It looks like you are interested in the values from that column, so values is the desired output:

>>> values 
['56.37', '107.75', 'n.a.', '95.99', 'n.a.', '56.00', '52.32', '234.85', '81.21', '40.72', '76.29', '19.90', 'n.a.', '92.41', '12.83', '62.19', '78.28', '60.51', '4995.58', '92.99', '67.56', '175.24', '58.71', '82.14', '57.75', '46.86', '22.95', '70.06', '150.16', '6793.46', '31.07', '34.31', '50.39'] 
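
The iframe URL above was found by inspecting the page by hand. As a rough sketch, the same lookup could be done in code, assuming the outer page's static HTML contains an ordinary `<iframe src="...">` tag (this has not been checked against the WSJ page, and an iframe injected by JavaScript would not be found this way). The names `IframeCollector`, `find_iframe_sources`, and `sample_html` are made up for illustration, and the standard-library parser is used only to keep the sketch dependency-free:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeCollector(HTMLParser):
    """Collect the src attribute of every <iframe> tag in a document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative URLs against the page that embeds the iframe.
                self.sources.append(urljoin(self.base_url, src))


def find_iframe_sources(html, base_url):
    """Return the absolute URLs of all iframes found in the given HTML."""
    collector = IframeCollector(base_url)
    collector.feed(html)
    return collector.sources


# Hypothetical outer page, standing in for the HTML of the embedding page.
sample_html = """
<html><body>
  <h1>Credit derivatives</h1>
  <iframe src="/embedded/table.html"></iframe>
  <iframe src="https://example.com/other"></iframe>
</body></html>
"""
print(find_iframe_sources(sample_html, "http://example.com/page.html"))
```

In practice the HTML would come from something like `requests.get(page_url).text`, and each returned URL could then be requested and parsed as in the example above.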

Great! Thank you very much. I noticed