2015-04-02

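lxml.html: Error reading file; failed to load external entity

This is my first question on StackOverflow. I looked here for a solution to my problem, but after trying several of the suggested fixes I still cannot get my code to work.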

I want to parse a movie trailer URL from YouTube using lxml.html:

import urllib

import lxml.html
from lxml.etree import XPath


def get_youtube_trailer(selected_movie):
    results = {}

    # Create the url for the YouTube query in order to find the movie trailer
    title = selected_movie
    t = {'search_query': title + ' movie trailer'}
    query_youtube = urllib.urlencode(t)
    search_url_youtube = 'https://www.youtube.com/results?' + query_youtube

    # Define the XPath for the YouTube movie trailer link
    movie_trailer_xpath = XPath('//ol[@class="item-section"]/li[1]/div/div/div[2]/h3/a/@href')

    # Parse the YouTube html code (this is the call that raises the IOError)
    html = lxml.html.parse(search_url_youtube)

    # Add the movie trailer to our results
    results['movie_trailer'] = 'https://www.youtube.com' + movie_trailer_xpath(html)[0]
    return results

I get the following error:

File "C:/Users/Aleks/Google Drive/Udacity - Full Stack Web Dev Nanodegree/Lessons/Lesson 3a (Make Classes) - Movie Website/models.py", line 163, in <module> 
print get_youtube_trailer("titanic") 

File "C:/Users/Aleks/Google Drive/Udacity - Full Stack Web Dev Nanodegree/Lessons/Lesson 3a (Make Classes) - Movie Website/models.py", line 157, in get_youtube_trailer 
html = lxml.html.parse(search_url_youtube) 
File "C:\Python27\lib\site-packages\lxml\html\__init__.py", line 788, in parse 
return etree.parse(filename_or_url, parser, base_url=base_url, **kw) 
File "lxml.etree.pyx", line 3301, in lxml.etree.parse (src\lxml\lxml.etree.c:72453) 
File "parser.pxi", line 1791, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:105915) 
File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src\lxml\lxml.etree.c:106214) 
File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src\lxml\lxml.etree.c:105213) 
File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src\lxml\lxml.etree.c:100163) 
File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:94286) 
File "parser.pxi", line 690, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:95722) 
File "parser.pxi", line 618, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:94754) 
IOError: Error reading file 'https://www.youtube.com/results?search_query=titanic+movie+trailer': failed to load external entity "https://www.youtube.com/results?search_query=titanic+movie+trailer" 

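I don't know what I'm doing wrong here, because I have parsed information from other websites in exactly the same way and it worked.

Any help would be greatly appreciated. Thank you!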

Answers


SSL/TLS is not supported by libxml2. Use Python's urllib2 instead.

If you try any URL of the form http://<blah>.<blah> you won't have any trouble, but HTTPS is not supported here. There are also problems with redirects.

Try

from urllib2 import urlopen 
import lxml.html 
tree = lxml.html.parse(urlopen('https://google.com')) 

For more information, see this.
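For readers on Python 3, where urllib2 no longer exists, the same idea can be sketched with urllib.request: Python performs the HTTPS request and lxml only parses the response. This is a minimal sketch, not a tested fix for YouTube specifically. The helper names build_search_url and fetch_tree, the timeout, and the User-Agent header are assumptions added for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_search_url(title):
    """Return the YouTube results URL used in the question for a movie title."""
    query = urlencode({"search_query": title + " movie trailer"})
    return "https://www.youtube.com/results?" + query


def fetch_tree(url, timeout=10):
    """Download url over HTTPS in Python, then hand the response to lxml."""
    # lxml is imported lazily so build_search_url() works without it installed.
    import lxml.html

    # Assumption: a browser-like User-Agent, since some sites reject urllib's default.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        # urlopen handles TLS and redirects; lxml only parses the file-like object.
        return lxml.html.parse(response)
```

Calling fetch_tree(build_search_url("titanic")) should return a parsed tree instead of raising the IOError, provided the request itself succeeds. Whether the XPath from the question still matches the returned markup is a separate matter.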


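Solution

Well, there is a workaround. Try Selenium, and if you don't want a browser window, run Selenium in headless mode. It works fine; I tried it myself.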
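A rough sketch of the headless Selenium approach, assuming the selenium package and a local Chrome installation are available. The function names fetch_rendered_html and first_watch_url are hypothetical, and the regular expression assumes that trailer links appear in the page source as /watch?v= followed by an 11-character video ID, which the thread does not confirm.

```python
import re


def fetch_rendered_html(url):
    """Load url in headless Chrome and return the rendered page source."""
    # Imported lazily: requires the selenium package and a local Chrome install.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # run without opening a browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process


def first_watch_url(page_source):
    """Return the first YouTube watch URL found in page_source, or None."""
    # Assumption: video links look like /watch?v=<11-character id>.
    match = re.search(r"/watch\?v=[\w-]{11}", page_source)
    if match is None:
        return None
    return "https://www.youtube.com" + match.group(0)
```

The two pieces would be combined as first_watch_url(fetch_rendered_html(search_url)), where search_url is the results URL built in the question. Matching on the link pattern rather than a deep XPath is a design choice made here to reduce dependence on the page layout.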


Thanks @iec2011007! That's exactly what I was looking for. – alekscp 2017-06-06 12:21:40