2014-10-10 77 views
0

I want to read URLs from a file and print the title of each page:

import lxml.html 
file = open('ab.txt','r') 
for line in file: 
    t = lxml.html.parse(line) 
    print t.find(".//title").text 

Error:

Traceback (most recent call last): 
    File "C:\Python27\site.py", line 4, in <module> 
    t = lxml.html.parse(line) 
    File "C:\Python27\lib\site-packages\lxml\html\__init__.py", line 661, in parse 
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw) 
    File "lxml.etree.pyx", line 2706, in lxml.etree.parse (src/lxml/lxml.etree.c:49958) 
    File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71797) 
    File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:72080) 
    File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:71175) 
    File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:68173) 
    File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64257) 
    File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65178) 
    File "parser.pxi", line 563, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64493) 
IOError: Error reading file 'http://example.com/5129860 
': failed to load HTTP resource 

ab.txt contains:

example.com/123 

    example.com/234 

    example.com/456 
    .... 

What is wrong here?

What is your expected output? Do you want to download the content of each URL and print its title? – 2014-10-10 12:05:34

Answers

0
for line in file: 
    t = lxml.html.parse(line) 
    print t.find(".//title").text 

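Here you read each line of the file and pass it straight to lxml.html.parse, which means the argument given to the function is not valid HTTP content. You should modify these lines as follows: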

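import lxml.html
from urllib2 import urlopen

with open('ab.txt', 'r') as urls:
    for line in urls:
        url = line.strip()  # drop the trailing newline and any indentation
        if not url:
            continue  # skip blank lines
        content = urlopen(url)  # file-like HTTP response; the URL needs a scheme such as http://
        t = lxml.html.parse(content)
        print t.find(".//title").text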

Here the content of each page is fetched into the variable content, so lxml.html.parse is then given valid HTTP content.
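
The traceback in the question shows the URL followed by a line break before the closing quote, which suggests that each line is passed on with its trailing newline still attached. The following is a minimal sketch of cleaning the lines first. The helper name clean_urls and the default_scheme parameter are made up for illustration and are not part of lxml or urllib2, and prepending http:// is only an assumption about what the bare example.com/123 entries are meant to be. The snippet uses only features that behave the same on Python 2.7 and Python 3.

```python
def clean_urls(lines, default_scheme="http"):
    """Yield one fetchable URL per non-blank input line (hypothetical helper)."""
    for line in lines:
        url = line.strip()  # removes the trailing "\n" and any leading indentation
        if not url:
            continue  # skip blank lines
        if "://" not in url:
            # Assumption: entries without a scheme are meant to be plain HTTP URLs.
            url = default_scheme + "://" + url
        yield url


# Lines shaped like the ab.txt shown in the question.
sample = ["example.com/123 \n", "\n", "    example.com/234 \n", "http://example.com/456\n"]
for url in clean_urls(sample):
    print(url)
```

Each cleaned URL can then be handed to urlopen, or directly to lxml.html.parse, in place of the raw line.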

It still fails to load the HTTP resource. – 2014-10-10 11:33:34

Please see the edit; it should work. The file name should be given. – nu11p01n73R 2014-10-10 11:41:05

@nu11p01n73R I have answered the part you edited. – 2014-10-10 11:46:11

1

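The parse method of lxml.html parses a file name, a URL, or a file-like object into an HTML document and returns a tree. According to the documentation, the function's signature is: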

parse(filename_or_url, parser=None, base_url=None, **kw) 

So you can pass the file name directly and get your output.

t = lxml.html.parse('ab.txt') 
print t.find(".//title").text 

print t.find(".//title") returns None – 2014-10-10 11:47:18

Show your file 'ab.txt' – 2014-10-10 11:47:55

example.com/123 example.com/234 example.com/456 That is the format. – 2014-10-10 11:48:59
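
The None reported in the comments is what one would expect when ab.txt itself is parsed as HTML: the file holds a list of URLs rather than markup, so there is no title element to find. The sketch below illustrates this under the assumption of Python 3 with lxml installed. It uses io.BytesIO as a stand-in for the real file and a hand-written page instead of a network request, and the variable names are illustrative only.

```python
import io

import lxml.html

# Same shape as the ab.txt shown in the question: plain URLs, no HTML markup.
listing = b"example.com/123\nexample.com/234\nexample.com/456\n"

# parse() accepts a file-like object, so BytesIO stands in for open('ab.txt', 'rb').
tree = lxml.html.parse(io.BytesIO(listing))
title = tree.find(".//title")
print(title)  # there is no <title> element in a list of URLs

# For comparison, a hand-written document that does contain a <title>.
page = b"<html><head><title>Example Domain</title></head><body></body></html>"
page_title = lxml.html.parse(io.BytesIO(page)).find(".//title").text
print(page_title)
```

In other words, the file has to be read line by line and each URL fetched on its own; the title lives in the fetched page, not in ab.txt.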