2017-02-24 40 views

Getting data from an HTML page into a Python array

I am trying to get some data from a web page like this:

[ . . . ] 

<p class="special-large">Lorem Ipsum 01</p> 
<p class="special-large">Lorem Ipsum 02</p> 
<p class="special-large">Lorem Ipsum 03</p> 
<p class="special-large">Lorem Ipsum 04</p> 
<p class="special-large">Lorem Ipsum 05</p> 

[ . . . ] 

I would like to end up with a Python array (a list) like the following:

myArrayWebPage = ["Lorem Ipsum 01","Lorem Ipsum 02","Lorem Ipsum 03","Lorem Ipsum 04","Lorem Ipsum 05"] 

This is my Python script:

import urllib.request 

urlAddress = "http:// ... /" # my url address 
getPage = urllib.request.urlopen(urlAddress) 
outputPage = getPage.read() 
print(outputPage) 

How can I get that array from "outputPage"?

Answers


This seems to do what you want:

Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)] on win32 
Type "copyright", "credits" or "license()" for more information. 
>>> html = '''<p class="special-large">Lorem Ipsum 01</p> 
<p class="special-large">Lorem Ipsum 02</p> 
<p class="special-large">Lorem Ipsum 03</p> 
<p class="special-large">Lorem Ipsum 04</p> 
<p class="special-large">Lorem Ipsum 05</p>''' 
>>> import re 
>>> re.findall('<p class="special-large">([^<]+)</p>', html) 
['Lorem Ipsum 01', 'Lorem Ipsum 02', 'Lorem Ipsum 03', 'Lorem Ipsum 04', 'Lorem Ipsum 05'] 
>>> 
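To apply the same pattern to the `outputPage` from the question, keep in mind that `urlopen(...).read()` returns bytes, so the content has to be decoded before matching against a `str` pattern. Here is a minimal sketch, assuming the page is UTF-8 encoded. The helper name `extract_special_large` and the sample bytes are made up for illustration and stand in for the real page:

```python
import re

# Same pattern as in the interactive session above.
PATTERN = re.compile(r'<p class="special-large">([^<]+)</p>')

def extract_special_large(page_bytes, encoding="utf-8"):
    """Decode raw page bytes and return the text of each matching <p>."""
    # Assumption: the page is UTF-8; undecodable bytes are replaced, not fatal.
    text = page_bytes.decode(encoding, errors="replace")
    return PATTERN.findall(text)

# Stand-in for outputPage = getPage.read(); illustrative data only.
sample_page = b'''<p class="special-large">Lorem Ipsum 01</p>
<p class="special-large">Lorem Ipsum 02</p>'''

myArrayWebPage = extract_special_large(sample_page)
print(myArrayWebPage)
```

In the original script you would pass `outputPage` instead of `sample_page`. If the server declares a different charset, pass it as the `encoding` argument.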

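Note that regular expressions are generally not the preferred tool for this kind of task. You should use an HTML-parsing library such as Beautiful Soup instead.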
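To show what the parser-based approach might look like without installing anything, here is a rough sketch using the standard library's `html.parser` rather than Beautiful Soup itself. The class name `SpecialLargeParser` and the sample markup are assumptions made for this example:

```python
from html.parser import HTMLParser

class SpecialLargeParser(HTMLParser):
    """Collect the text of every <p class="special-large"> element."""

    def __init__(self):
        super().__init__()
        self.items = []       # finished paragraph texts
        self._buffer = None   # text chunks while inside a target <p>, else None

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "special-large" in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.items.append("".join(self._buffer).strip())
            self._buffer = None

# Illustrative markup; the nested <b> would defeat the [^<]+ regex.
sample_html = '''<p class="special-large">Lorem Ipsum 01</p>
<p class="other">skip me</p>
<p class="special-large">Lorem <b>Ipsum</b> 02</p>'''

parser = SpecialLargeParser()
parser.feed(sample_html)
parser.close()
print(parser.items)
```

Unlike the regex, this keeps working when a paragraph contains nested tags or extra class names. If Beautiful Soup is installed, roughly the same result should come from `[p.get_text() for p in BeautifulSoup(sample_html, "html.parser").find_all("p", class_="special-large")]`.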


Thanks! May I ask what you mean by "regular expressions"?


You can click on the term now and the Wikipedia article will open. Next time, try searching Google for terms you are not familiar with.


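@JoeHunter Please take this opportunity to read the insanely entertaining answer on why regular expressions are not sufficient for parsing HTML: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags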