I'm having problems running dryscrape on an Ubuntu 16.04 server (a clean install on DigitalOcean); the aim is to scrape JavaScript-populated websites.
I followed the dryscrape installation instructions from here:
apt-get update
apt-get install qt5-default libqt5webkit5-dev build-essential \
python-lxml python-pip xvfb
pip install dryscrape
I then ran the Python script below, which I found here, along with a test HTML page at the same link (the page reports whether HTML or JS was rendered).
Python
import dryscrape
from bs4 import BeautifulSoup
session = dryscrape.Session()
my_url = 'http://www.example.com/scrape.php'
session.visit(my_url)
response = session.body()
soup = BeautifulSoup(response)
soup.find(id="intro-text")
HTML - scrape.php
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Javascript scraping test</title>
</head>
<body>
<p id='intro-text'>No javascript support</p>
<script>
document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
</script>
</body>
</html>
When I do this, I can't seem to get the expected data back; instead, it just errors.
I'm wondering if there's anything obvious I'm missing?
Note: I've been through a lot of installation guides/threads and can't seem to get it working. I've also tried using Selenium, but couldn't get that working either. Many thanks.
Output
Traceback (most recent call last):
File "js.py", line 3, in <module>
session = dryscrape.Session()
File "/usr/local/lib/python2.7/dist-packages/dryscrape/session.py", line 22, in __init__
self.driver = driver or DefaultDriver()
File "/usr/local/lib/python2.7/dist-packages/dryscrape/driver/webkit.py", line 30, in __init__
super(Driver, self).__init__(**kw)
File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 230, in __init__
self.conn = connection or ServerConnection()
File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 507, in __init__
self._sock = (server or get_default_server()).connect()
File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 450, in get_default_server
_default_server = Server()
File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 424, in __init__
raise NoX11Error("Could not connect to X server. "
webkit_server.NoX11Error: Could not connect to X server. Try calling dryscrape.start_xvfb() before creating a session.
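The traceback points at the root cause: on a headless server there is no X display for webkit_server to attach to. As a stdlib-only illustration (the helper function below is hypothetical, though the DISPLAY environment variable it checks is the standard X11 mechanism), this is roughly the condition that fails:

```python
import os

def x_display_available():
    """Report whether an X display appears to be available.

    On a headless server (such as a fresh DigitalOcean droplet), DISPLAY
    is normally unset, so webkit_server raises NoX11Error. Calling
    dryscrape.start_xvfb() launches a virtual framebuffer (Xvfb) and
    sets DISPLAY so the WebKit process can connect to it.
    """
    return bool(os.environ.get("DISPLAY"))
```

If this returns False, call dryscrape.start_xvfb() before creating the Session, which is exactly what the working script below does.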
Working script
import dryscrape
from bs4 import BeautifulSoup
dryscrape.start_xvfb()
session = dryscrape.Session()
my_url = 'https://www.example.com/scrape.php'
session.visit(my_url)
response = session.body()
soup = BeautifulSoup(response, "html.parser")
print soup.find(id="intro-text").text
Thanks for this — I've added the updated/working python script to the bottom of my answer. The only extra thing I needed was to specify the html parser in 'soup = BeautifulSoup(response, "html.parser")'. I spent 4 hours yesterday reading and trying to fix this, so I really appreciate the help. – denski