XPath not working as I expect it to

Hopefully you don't need the whole set of code here, but I have a problem: I'm parsing HTML with XPath, and I'm not getting what I'd expect.

# here is the current set of tags I'm interested in 
html = '''<div style="padding-top: 10px; clear: both; width: 100%;"> 
     <a href="http://www.amazon.com/review/R41M1I2K413NG/ref=cm_aya_cmt?ie=UTF8&ASIN=B013IZY7RU#wasThisHelpful" ><img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/communities/discussion_boards/comment-sm._CB192250344_.gif" width="16" alt="Comment" hspace="3" align="absmiddle" height="16" border="0" /></a>&nbsp;<a href="http://www.amazon.com/review/R41M1I2K413NG/ref=cm_aya_cmt?ie=UTF8&ASIN=B013IZY7RU#wasThisHelpful" >Comment</a>&nbsp;|&nbsp;<a href="http://www.amazon.com/review/R41M1I2K413NG/ref=cm_cr_rdp_perm" >Permalink</a>'''

I'm trying to get the href value of the first a tag, which is a long URL. To do that, I use the following code:

from lxml import etree 
import StringIO 

parser = etree.HTMLParser(encoding="utf-8") 
tree = etree.parse(StringIO.StringIO(html), parser) 

style = 'padding-top: 10px; clear: both; width: 100%;' 
xpath = "//div[@style='%s']" % style 
xpath += "/a[1]/@href" 

# use the XPath expression above to pull out the href value 
tree.xpath(xpath) 


['http://www.amazon.com/review/R41M1I2K413NG/ref=cm_aya_cmt?ie=UTF8&ASIN=B013IZY7RU#wasThisHelpful']

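This works when I pull out the part I'm working with and paste it in as a string. It does not behave the same way with the tree I build from a requests.get() call, and I can't figure out why. What it returns is: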

['http://www.amazon.com/review/R41M1I2K413NG']

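And I don't understand why. I realize I'm shooting in the dark here, but I'm hoping someone has run into a "truncated attribute in the XPath return value" problem before.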

Edit:

Below is the full code I'm currently using, but it does not work. It returns the truncated value shown above.

from lxml import etree 
import requests 
import StringIO 
from requests.packages.urllib3.util.retry import Retry 
from requests.adapters import HTTPAdapter 


session = requests.Session() 
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504]) 
session.mount('http://www.amazon.com', HTTPAdapter(max_retries=retries)) 
parser = etree.HTMLParser(encoding="utf-8")  # same encoding as in the first snippet

url = "http://www.amazon.com/gp/cdp/member-reviews/ARPJ98Y7U8K5H?ie=UTF8&display=public&page=3&sort_by=MostRecentReview" 
page = session.get(url, timeout=5) 
tree = etree.parse(StringIO.StringIO(page.text), parser) 

style = 'padding-top: 10px; clear: both; width: 100%;' 
xpath = "//div[@style='%s']" % style 
xpath += "/a[1]/@href" 

# use the XPath expression above to pull out the href value 
tree.xpath(xpath)

Edit 2:

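This does work, for some reason. Instead of creating a session object, submitting the get request with it, and then passing the result to the parser, simply passing the url string to the parser works: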

url = "http://www.amazon.com/gp/cdp/member-reviews/ARPJ98Y7U8K5H?ie=UTF8&display=public&page=3&sort_by=MostRecentReview" 


tree = etree.parse(url, parser) 



for e in tree.xpath("//div[@style='padding-top: 10px; clear: both; width: 100%;']/a[1]/@href"): 
    print e

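As I understand it, when looping over multiple URLs a session object keeps the connection alive between requests, which speeds up the process. If I use the etree.parse(url, parser) approach, I'm worried I'll lose that efficiency.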

Source

2016-07-29 Ryan Erwin

How can we reproduce this? Please show us the exact code that returns the truncated attribute value. – mzjn

What URL are you using when you call 'requests.get()'? – Markus

http://www.amazon.com/gp/cdp/member-reviews/ARPJ98Y7U8K5H?ie=UTF8&display=public&page=3&sort_by=MostRecentReview –

Using the URL you provided, the following Python code:

url = "http://www.amazon.com/gp/cdp/member-reviews/ARPJ98Y7U8K5H?ie=UTF8&display=public&page=3&sort_by=MostRecentReview" 

from lxml import etree 
parser = etree.HTMLParser(encoding="utf-8") 
tree = etree.parse(url, parser) 

for e in tree.xpath("//div[@style='padding-top: 10px; clear: both; width: 100%;']/a[1]/@href"): 
    print e

results in the following output:

> python ~/test.py 

http://www.amazon.com/review/RM8YYCQ57K2CL/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B00J9PAZIO#wasThisHelpful 
http://www.amazon.com/review/R41M1I2K413NG/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B013IZY7RU#wasThisHelpful 
http://www.amazon.com/review/R3DT6VUDGIT9SK/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B000VYD0MA#wasThisHelpful 
http://www.amazon.com/review/RGFW1JM4151MW/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B00TQQN5G0#wasThisHelpful 
http://www.amazon.com/review/R3I9FFX0MVF1BW/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B0048A7NF8#wasThisHelpful 
http://www.amazon.com/review/R24TTSQY34VME8/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B0115ZHH68#wasThisHelpful 
http://www.amazon.com/review/R3C49WWMNQZ007/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B00ABAWHJ6#wasThisHelpful 
http://www.amazon.com/review/R37724EHW829NB/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B00TO5Y3FK#wasThisHelpful 
http://www.amazon.com/review/RQKGM5FRXVYSX/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B0051QUWKG#wasThisHelpful 
http://www.amazon.com/review/R1DW61PMGUDMDJ/ref=cm_aya_cmt/159-5911033-5890330?ie=UTF8&ASIN=B000N8Q2P6#wasThisHelpful

Using the sample code you provided results in:

http://www.amazon.com/review/RM8YYCQ57K2CL 
http://www.amazon.com/review/R41M1I2K413NG 
http://www.amazon.com/review/R3DT6VUDGIT9SK 
http://www.amazon.com/review/RGFW1JM4151MW 
http://www.amazon.com/review/R3I9FFX0MVF1BW 
http://www.amazon.com/review/R24TTSQY34VME8 
http://www.amazon.com/review/R3C49WWMNQZ007 
http://www.amazon.com/review/R37724EHW829NB 
http://www.amazon.com/review/RQKGM5FRXVYSX 
http://www.amazon.com/review/R1DW61PMGUDMDJ

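This is because none of the URLs in the HTML page returned by session.get() carry any GET parameters: either the server does not return URLs with GET parameters in this case, or requests strips the GET parameters.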
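One way to tell these two possibilities apart is to look at the raw response body before any parser touches it. The sketch below is only an illustration: the helper name hrefs_for_review and the sample markup are made up for this example, and it deliberately uses a plain regular expression instead of lxml so that parsing cannot be the cause of any difference.

```python
import re


def hrefs_for_review(raw_html, review_id):
    """Return every href value in raw_html that mentions review_id.

    Hypothetical helper: it scans the unparsed markup with a regular
    expression, so whatever it returns is exactly what the server sent.
    """
    pattern = r'href="([^"]*/review/%s[^"]*)"' % re.escape(review_id)
    return re.findall(pattern, raw_html)


# Made-up sample standing in for page.text from session.get(url).
sample = (
    '<a href="http://www.amazon.com/review/R41M1I2K413NG">Comment</a>'
    '<a href="http://www.amazon.com/review/R41M1I2K413NG/ref=cm_cr_rdp_perm">'
    'Permalink</a>'
)

for href in hrefs_for_review(sample, "R41M1I2K413NG"):
    # A "?" in the href means the query string is present in the raw markup.
    print(href, "?" in href)
```

If the hrefs in page.text are already short, the markup that requests received differs from what etree.parse(url, parser) downloaded, and the XPath expression is not at fault.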

Source

2016-07-29 13:55:26 Markus

Yes, that's exactly what I was doing... I'll have to come back to it with fresh eyes. Thanks for your help. –

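So, when I use 'etree.parse(url, parser)', it works. But if I first get the HTML from 'session.get(url)' and pass the 'text' attribute, as in 'etree.parse(page.text, parser)', I get incorrect results. I'd like to use 'session.get()' b/c it helps keep the connection alive between requests. –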
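For keeping the session while still parsing with lxml, note that etree.parse() expects a file name, URL, or file-like object, so markup that has already been downloaded is usually handed to etree.fromstring() instead. The following is a minimal sketch under a few assumptions: comment_links is a name invented here, session is expected to be a requests.Session (or anything with a compatible get()), and the utf-8 encoding is simply carried over from the question.

```python
from lxml import etree

XPATH = ("//div[@style='padding-top: 10px; clear: both; width: 100%;']"
         "/a[1]/@href")


def comment_links(session, url, timeout=5):
    """Fetch url through an existing session and return the matching hrefs.

    Hypothetical helper: `session` is assumed to be a requests.Session,
    so the underlying connection can be reused across calls.
    """
    page = session.get(url, timeout=timeout)
    # page.content is the undecoded byte string; the parser decodes it
    # using the encoding given here (assumed utf-8, as in the question).
    parser = etree.HTMLParser(encoding="utf-8")
    root = etree.fromstring(page.content, parser)
    return [str(href) for href in root.xpath(XPATH)]
```

With requests installed, one session created by requests.Session() can be passed to comment_links() for each URL in a loop. This only removes the string-handling step as a variable: if the server sends shorter links to requests than to lxml's own downloader, the returned values will still be short.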
