2014-10-31 48 views

How to scrape a script in a web page in Python

For example, I have the web page http://www.amazon.com/dp/1597805483.

I want to use XPath to scrape this sentence: "Of all the sports played across the globe, none has more curses and superstitions than baseball, America’s national pastime."

import requests
from lxml import html

url = "http://www.amazon.com/dp/1597805483"
page = requests.get(url)
tree = html.fromstring(page.text)
feature_bullets = tree.xpath('//*[@id="iframeContent"]/div/text()')
print(feature_bullets)

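Nothing is returned by the code above. The reason is that the XPath as interpreted by the browser does not match the page source. But I don't know how to get the XPath from the source code.

Answers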


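There are a lot of things involved in constructing the page you are scraping.

As for the description specifically, the underlying HTML is constructed inside a JavaScript function: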

<script type="text/javascript"> 

    P.when('DynamicIframe').execute(function (DynamicIframe) { 
     var BookDescriptionIframe = null, 
       bookDescEncodedData = "%3Cdiv%3E%3CB%3EA%20Fantastic%20Anthology%20Combining%20the%20Love%20of%20Science%20Fiction%20with%20Our%20National%20Pastime%3C%2FB%3E%3CBR%3E%3CBR%3EOf%20all%20the%20sports%20played%20across%20the%20globe%2C%20none%20has%20more%20curses%20and%20superstitions%20than%20baseball%2C%20America%26%238217%3Bs%20national%20pastime.%3Cbr%3E%3CBR%3E%3CI%3EField%20of%20Fantasies%3C%2FI%3E%20delves%20right%20into%20that%20superstition%20with%20short%20stories%20written%20by%20several%20key%20authors%20about%20baseball%20and%20the%20supernatural.%20%20Here%20you%27ll%20encounter%20ghostly%20apparitions%20in%20the%20stands%2C%20a%20strangely%20charming%20vampire%20double-play%20combination%2C%20one%20fan%20who%20can%20call%20every%20shot%20and%20another%20who%20can%20see%20the%20past%2C%20a%20sad%20alternate-reality%20for%20the%20game%27s%20most%20famous%20player%2C%20unlikely%20appearances%20on%20the%20field%20by%20famous%20personalities%20from%20Stephen%20Crane%20to%20Fidel%20Castro%2C%20a%20hilariously%20humble%20teenage%20phenom%2C%20and%20much%20more.%20In%20this%20wonderful%20anthology%20are%20stories%20from%20such%20award-winning%20writers%20as%3A%3CBR%3E%3CBR%3EStephen%20King%20and%20Stewart%20O%26%238217%3BNan%3Cbr%3EJack%20Kerouac%3CBR%3EKaren%20Joy%20Fowler%3CBR%3ERod%20Serling%3CBR%3EW.%20P.%20Kinsella%3CBR%3EAnd%20many%20more%21%3CBR%3E%3CBR%3ENever%20has%20a%20book%20combined%20the%20incredible%20with%20great%20baseball%20fiction%20like%20%3CI%3EField%20of%20Fantasies%3C%2FI%3E.%20This%20wide-ranging%20collection%20reaches%20from%20some%20of%20the%20earliest%20classics%20from%20the%20pulp%20era%20and%20baseball%27s%20golden%20age%2C%20all%20the%20way%20to%20material%20appearing%20here%20for%20the%20first%20time%20in%20a%20print%20edition.%20Whether%20you%20love%20the%20game%20or%20just%20great%20fiction%2C%20these%20stories%20will%20appeal%20to%20all%2C%20as%20the%20writers%20in%20this%20anthology%20bring%20great%20storytelling%20of%20the%20strange%20and%20supernatural%20to%20the%20plate%2C%20inning%20after%20inning.%3CBR%3E%3C%2Fdiv%3E", 
       bookDescriptionAvailableHeight, 
       minBookDescriptionInitialHeight = 112, 
       options = {}; 
    ... 

</script> 
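
To see what that value holds, here is a minimal sketch that decodes two pieces copied from the string above: its opening prefix and one short fragment. It assumes Python 3 and uses only the standard library; the variable names are illustrative. The point is that the data is encoded twice: percent-encoding on the outside, HTML entities such as &#8217; on the inside.

```python
from html import unescape
from urllib.parse import unquote

# Opening prefix of the bookDescEncodedData value (copied from the script above).
encoded_prefix = (
    "%3Cdiv%3E%3CB%3EA%20Fantastic%20Anthology%20Combining%20the%20Love%20of"
    "%20Science%20Fiction%20with%20Our%20National%20Pastime%3C%2FB%3E%3CBR%3E%3CBR%3E"
)
# A short fragment from the middle of the same value.
encoded_fragment = "America%26%238217%3Bs%20national%20pastime."

# Step 1: percent-decoding turns %3C into "<", %20 into " ", and so on.
decoded_prefix = unquote(encoded_prefix)
decoded_fragment = unquote(encoded_fragment)

# Step 2: what is left is HTML, so entities like &#8217; still need resolving.
# An HTML parser does this automatically; html.unescape does it for plain strings.
text_fragment = unescape(decoded_fragment)

print(decoded_prefix)
print(decoded_fragment)
print(text_fragment)
```

This is why the approaches below unquote the value first and only then hand it to an HTML parser.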

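The idea here is to get the text of the script tag, extract the description value with a regular expression, unquote the HTML, parse it with lxml.html and get the .text_content():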

import re
from urllib.parse import unquote  # Python 2: from urlparse import unquote

from lxml import html
import requests

url = "http://rads.stackoverflow.com/amzn/click/1597805483"
# Send a browser-like User-Agent header with the request
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36'})
tree = html.fromstring(page.content)

# Locate the script tag that defines bookDescEncodedData
script = tree.xpath('//script[contains(., "bookDescEncodedData")]')[0]
match = re.search(r'bookDescEncodedData = "(.*?)",', script.text)
if match:
    # Percent-decode the value, parse it as HTML and print its text
    description_html = html.fromstring(unquote(match.group(1)))
    print(description_html.text_content())

Prints:

A Fantastic Anthology Combining the Love of Science Fiction with Our National Pastime. 
Of all the sports played across the globe, none has more curses and superstitions than baseball, America’s national pastime.Field of Fantasies delves right into that superstition with short stories written by several key authors about baseball and the supernatural. 
Here you'll encounter ghostly apparitions in the stands, a strangely charming vampire double-play combination, one fan who can call every shot and another who can see the past, a sad alternate-reality for the game's most famous player, unlikely appearances on the field by famous personalities from Stephen Crane to Fidel Castro, a hilariously humble teenage phenom, and much more. 
In this wonderful anthology are stories from such award-winning writers as:Stephen King and Stewart O’NanJack KerouacKaren Joy FowlerRod SerlingW. P. KinsellaAnd many more!Never has a book combined the incredible with great baseball fiction like Field of Fantasies. 
This wide-ranging collection reaches from some of the earliest classics from the pulp era and baseball's golden age, all the way to material appearing here for the first time in a print edition. Whether you love the game or just great fiction, these stories will appeal to all, as the writers in this anthology bring great storytelling of the strange and supernatural to the plate, inning after inning. 
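
In the output above some lines run together ("pastime.Field", "O’NanJack Kerouac") because the <br> tags contribute no text of their own. One possible way to keep the breaks without a browser is to emit a newline for each <br> while collecting text. The sketch below is an assumption-laden illustration, not part of the original answer: it uses the standard library's html.parser instead of lxml, the class and function names are made up, and the sample is a shortened piece of the decoded description wrapped in a div.

```python
from html.parser import HTMLParser


class TextWithBreaks(HTMLParser):
    """Collect the text of an HTML fragment, emitting a newline per <br>."""

    def __init__(self):
        # convert_charrefs=True resolves entities such as &#8217; for us.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <BR> and <br> both arrive as "br".
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment):
    """Return the text of `fragment` with <br> tags turned into newlines."""
    parser = TextWithBreaks()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


# Shortened sample taken from the decoded description.
sample = (
    "<div>Stephen King and Stewart O&#8217;Nan<br>Jack Kerouac"
    "<BR>Karen Joy Fowler<BR>Rod Serling</div>"
)
print(html_to_text(sample))
```

In the lxml version, html_to_text(unquote(match.group(1))) could stand in for the html.fromstring(...).text_content() call if line breaks matter.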

A similar solution, but using BeautifulSoup:

import re
from urllib.parse import unquote  # Python 2: from urlparse import unquote

from bs4 import BeautifulSoup
import requests

url = "http://rads.stackoverflow.com/amzn/click/1597805483"
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36'})
# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(page.content, "html.parser")

# Find the script tag mentioning bookDescEncodedData; the guard skips tags whose text is None
script = soup.find('script', string=lambda x: x and 'bookDescEncodedData' in x)
match = re.search(r'bookDescEncodedData = "(.*?)",', script.string)
if match:
    description_html = BeautifulSoup(unquote(match.group(1)), "html.parser")
    print(description_html.text)

Alternatively, you can take a high-level approach and use a real browser with the help of selenium:

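from selenium import webdriver
from selenium.webdriver.common.by import By

url = "http://rads.stackoverflow.com/amzn/click/1597805483"

driver = webdriver.Firefox()
driver.get(url)

# The description is rendered inside an iframe, so switch into it first
# (find_element_by_id was removed in Selenium 4; use find_element(By.ID, ...))
iframe = driver.find_element(By.ID, 'bookDesc_iframe')
driver.switch_to.frame(iframe)

print(driver.find_element(By.ID, 'iframeContent').text)

# quit() closes the window and also ends the browser session
driver.quit()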

This produces better-formatted output:

A Fantastic Anthology Combining the Love of Science Fiction with Our National Pastime 

Of all the sports played across the globe, none has more curses and superstitions than baseball, America’s national pastime. 

Field of Fantasies delves right into that superstition with short stories written by several key authors about baseball and the supernatural. Here you'll encounter ghostly apparitions in the stands, a strangely charming vampire double-play combination, one fan who can call every shot and another who can see the past, a sad alternate-reality for the game's most famous player, unlikely appearances on the field by famous personalities from Stephen Crane to Fidel Castro, a hilariously humble teenage phenom, and much more. In this wonderful anthology are stories from such award-winning writers as: 

Stephen King and Stewart O’Nan 
Jack Kerouac 
Karen Joy Fowler 
Rod Serling 
W. P. Kinsella 
And many more! 

Never has a book combined the incredible with great baseball fiction like Field of Fantasies. This wide-ranging collection reaches from some of the earliest classics from the pulp era and baseball's golden age, all the way to material appearing here for the first time in a print edition. Whether you love the game or just great fiction, these stories will appeal to all, as the writers in this anthology bring great storytelling of the strange and supernatural to the plate, inning after inning. 
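Which tool do you use to find the xpath? – so3 2014-10-31 18:41:00

@so3 Chrome developer tools and the brain developer tools :) The xpath is pretty simple, as you can see: I'm just checking the text inside the script tag. – alecxe 2014-10-31 18:43:26

But Chrome developer tools don't give you the original source code – so3 2014-10-31 20:20:04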
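
On that last point: the Elements panel in developer tools shows the DOM after scripts have run, while "View Page Source" (or the raw response body that requests receives) shows the original markup. A rough way to check which of the two an id or variable name belongs to is to search the raw text for it. The helper below is a hypothetical sketch; its name and the tiny sample string are made up for illustration, and in practice page.text from the requests call would be passed in.

```python
def find_in_source(source, needle, context=30):
    """Return a short snippet of `source` around each occurrence of `needle`."""
    snippets = []
    start = source.find(needle)
    while start != -1:
        lo = max(0, start - context)
        hi = min(len(source), start + len(needle) + context)
        snippets.append(source[lo:hi])
        start = source.find(needle, start + 1)
    return snippets


# Made-up miniature of the raw page source, for illustration only.
raw = '<script>var a = null, bookDescEncodedData = "%3Cdiv%3E", h = 112;</script>'

# The id seen in developer tools is absent from the raw source...
print(find_in_source(raw, "iframeContent"))
# ...while the JavaScript variable that carries the description is present.
print(find_in_source(raw, "bookDescEncodedData"))
```

An empty result for an id that is visible in developer tools suggests the element is created by JavaScript, which is the situation in the question.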