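Hopefully you won't need the whole set of code here, but I have a problem: I'm parsing HTML with XPath and not getting what I'd expect. In short, the XPath is not working the way I expect it to.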
# here is the current set of tags I'm interested in
html = '''<div style="padding-top: 10px; clear: both; width: 100%;">
<a href="http://www.amazon.com/review/R41M1I2K413NG/ref=cm_aya_cmt?ie=UTF8&ASIN=B013IZY7RU#wasThisHelpful" ><img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/communities/discussion_boards/comment-sm._CB192250344_.gif" width="16" alt="Comment" hspace="3" align="absmiddle" height="16" border="0" /></a> <a href="http://www.amazon.com/review/R41M1I2K413NG/ref=cm_aya_cmt?ie=UTF8&ASIN=B013IZY7RU#wasThisHelpful" >Comment</a> | <a href="http://www.amazon.com/review/R41M1I2K413NG/ref=cm_cr_rdp_perm" >Permalink</a>'''
I'm trying to get the href value of the first a tag, which is a long URL. To do that, I'm using the following code:
from lxml import etree
import StringIO
parser = etree.HTMLParser(encoding="utf-8")
tree = etree.parse(StringIO.StringIO(html), parser)
style = 'padding-top: 10px; clear: both; width: 100%;'
xpath = "//div[@style='%s']" % style
xpath += "/a[1]/@href"
# use the XPath expression above to pull out the href value
tree.xpath(xpath)
['http://www.amazon.com/review/R41M1I2K413NG/ref=cm_aya_cmt?ie=UTF8&ASIN=B013IZY7RU#wasThisHelpful']
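This works when I pull out the part I'm working with and paste it in as a string. It does not behave the same way with the tree I build from a requests.get() call, and I can't figure out why. That tree returns: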
['http://www.amazon.com/review/R41M1I2K413NG']
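I don't understand why. I realize I'm taking a shot in the dark here, but I'm hoping someone has run into this "XPath returns a truncated attribute value" problem before.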
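One way to narrow this down might be to pull the same attribute out of the downloaded markup with a second, independent parser. If that parser also reports the short URL, the server most likely sent different markup to this client; if it reports the full URL, the difference is more likely in how the text is handed to lxml. The block below is only a sketch under assumptions: it uses the Python 3 standard library (`html.parser`; in Python 2 the module is called `HTMLParser`), the names `FirstLinkAfterDiv` and `first_links` are made up for illustration, and it approximates the XPath by taking the first `a` that follows a matching `div` rather than the first `a` child.

```python
from html.parser import HTMLParser

STYLE = "padding-top: 10px; clear: both; width: 100%;"


class FirstLinkAfterDiv(HTMLParser):
    """Record the href of the first <a> that follows each matching <div>."""

    def __init__(self, style):
        super().__init__()
        self.style = style
        self.expecting_link = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "div":
            # Arm the collector only for divs with the exact style string.
            self.expecting_link = attributes.get("style") == self.style
        elif tag == "a" and self.expecting_link:
            self.hrefs.append(attributes.get("href"))
            self.expecting_link = False


def first_links(markup, style=STYLE):
    """Return the hrefs found by FirstLinkAfterDiv in a markup string."""
    collector = FirstLinkAfterDiv(style)
    collector.feed(markup)
    collector.close()
    return collector.hrefs


# A shortened copy of the snippet from the question.
sample = (
    '<div style="padding-top: 10px; clear: both; width: 100%;">'
    '<a href="http://www.amazon.com/review/R41M1I2K413NG/ref=cm_aya_cmt'
    '?ie=UTF8&ASIN=B013IZY7RU#wasThisHelpful">Comment</a> | '
    '<a href="http://www.amazon.com/review/R41M1I2K413NG/ref=cm_cr_rdp_perm">'
    'Permalink</a></div>'
)
print(first_links(sample))
```

Calling `first_links(page.text)` on the real response and comparing it with the lxml result would show which side the truncation comes from.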
Edit:
Here is the full code I'm currently using, which does not work. It returns the truncated value shown above.
from lxml import etree
import requests
import StringIO
from requests.packages.urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
session = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
session.mount('http://www.amazon.com', HTTPAdapter(max_retries=retries))
# `encoding` was never defined in this snippet; "utf-8" matches the first example
parser = etree.HTMLParser(encoding="utf-8")
url = "http://www.amazon.com/gp/cdp/member-reviews/ARPJ98Y7U8K5H?ie=UTF8&display=public&page=3&sort_by=MostRecentReview"
page = session.get(url, timeout=5)
tree = etree.parse(StringIO.StringIO(page.text), parser)
style = 'padding-top: 10px; clear: both; width: 100%;'
xpath = "//div[@style='%s']" % style
xpath += "/a[1]/@href"
# use the XPath expression above to pull out the href value
tree.xpath(xpath)
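Edit 2:
For some reason, this does work. Instead of creating a session object, submitting a get request with it, and passing the result to the parser, simply passing the url string to etree.parse() together with the parser works: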
url = "http://www.amazon.com/gp/cdp/member-reviews/ARPJ98Y7U8K5H?ie=UTF8&display=public&page=3&sort_by=MostRecentReview"
tree = etree.parse(url, parser)
for e in tree.xpath("//div[@style='padding-top: 10px; clear: both; width: 100%;']/a[1]/@href"):
    print(e)
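As I understand it, when looping over multiple URLs a session object keeps the connection alive, which speeds things up. If I use the etree.parse(url, parser) approach, I'm worried I'll lose that efficiency.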
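If the aim is to keep the session's connection reuse, one arrangement that might be worth trying is to keep fetching with `session.get()` and hand the response body to lxml as bytes (`page.content`) rather than as decoded text (`page.text`), so that the parser's own `encoding` setting does the decoding. Whether this changes the truncated result is not established here; this is only a sketch. The names `first_review_links` and `crawl` are hypothetical, and `session` is assumed to be a `requests.Session` (or anything with a compatible `get()` method).

```python
from lxml import etree

LINK_XPATH = (
    "//div[@style='padding-top: 10px; clear: both; width: 100%;']"
    "/a[1]/@href"
)


def first_review_links(html_bytes, encoding="utf-8"):
    """Parse raw response bytes and return the first href in each matching div."""
    parser = etree.HTMLParser(encoding=encoding)
    root = etree.fromstring(html_bytes, parser)
    return root.xpath(LINK_XPATH)


def crawl(session, urls, timeout=5):
    """Fetch every URL through one session so connections can be reused."""
    results = {}
    for url in urls:
        page = session.get(url, timeout=timeout)
        # page.content is the undecoded body; page.text is already decoded.
        results[url] = first_review_links(page.content)
    return results
```

With the session from the question this would be called as `crawl(session, [url])`, which keeps the retry adapter and the pooled connection in play while lxml still does the parsing.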
How can we reproduce this? Please show us the exact code that returns the truncated attribute value. – mzjn
What URL are you using when you call `requests.get()`? – Markus
http://www.amazon.com/gp/cdp/member-reviews/ARPJ98Y7U8K5H?ie=UTF8&display=public&page=3&sort_by=MostRecentReview –