2016-12-15 49 views
0

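How to extract specific strings in Python

I am trying to extract specific strings from within markup and save them (for more complex processing later on this line). So say, for example, I have read a line from a file and the current line is: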

<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg" WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road" ***PINIT***></center><br clear="all"><br clear="all"> 

But I want to store:

tempUrl = 'http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg' 

tempWidth = 500 

tempHeight = 375 

tempAlt = 'Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road' 

How would I go about doing this in Python?

Thanks

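Let me save you the trouble and tell you that regular expressions are the wrong tool for this purpose. Don't even think of trying it; you'll only end up banging your head later. If the data comes from a web source, look at BeautifulSoup or scrapy or any other "scraping" library. If you already have the markup, you can use a parser, walk the nodes, and collect the attribute information. –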

['HTMLParser'](https://docs.python.org/2/library/htmlparser.html) or ['html.parser'](https://docs.python.org/3.4/library/html.parser.html), depending on the Python version –
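
To illustrate the parser route suggested in the comments above, here is a minimal sketch using only the standard library's `html.parser` on Python 3. The class name `ImgAttrCollector` and the variable `line` are made up for this example, and it assumes the first `<img>` tag on the line is the one of interest.

```python
from html.parser import HTMLParser


class ImgAttrCollector(HTMLParser):
    """Collect the attributes of every <img> tag (hypothetical helper name)."""

    def __init__(self):
        super().__init__()
        self.images = []  # one dict of attributes per <img> tag seen

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names,
        # so WIDTH and HEIGHT arrive here as 'width' and 'height'.
        if tag == "img":
            self.images.append(dict(attrs))


# The line from the question, as it would be read from the file.
line = ('<center><img border="0" '
        'src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg" '
        'WIDTH="500" HEIGHT="375" '
        'alt="Looking up the Merced River Canyon towards Bridalveil Fall '
        'from the Big Oak Flat Road" ***PINIT***></center>'
        '<br clear="all"><br clear="all">')

parser = ImgAttrCollector()
parser.feed(line)

attrs = parser.images[0]          # assumes at least one <img> was found
tempUrl = attrs["src"]
tempWidth = int(attrs["width"])   # attribute values are strings; cast as needed
tempHeight = int(attrs["height"])
tempAlt = attrs["alt"]
```

If a line may contain no image at all, check that `parser.images` is non-empty before indexing into it.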

Answers

3

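While there are several ways you could get away with here, I'd recommend using an HTML parser, which is extensible and can cope with the many quirks of HTML. Here is a working example with BeautifulSoup: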

>>> from bs4 import BeautifulSoup 
>>> string = """<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg" WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road" ***PINIT***></center><br clear="all"><br clear="all">""" 
>>> soup = BeautifulSoup(string, 'html.parser') 
>>> for attr in ['width', 'height', 'alt']: 
...  print('temp{} = {}'.format(attr.title(), soup.img[attr])) 
... 
tempWidth = 500 
tempHeight = 375 
tempAlt = Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road 

Once I finally got bs4 installed, this turned out to be a beautiful solution. Thanks! – Johnny

0

And the regex approach:

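import re

string = "YOUR STRING"  # replace with the line of markup read from the file
# A raw string avoids backslash-escaping the quotes. The attributes must appear
# in exactly this order and letter case for the pattern to match.
pattern = r'src="(.*?)".*?WIDTH="(.*?)".*?HEIGHT="(.*?)".*?alt="(.*?)"'
match = re.search(pattern, string)
if match:  # re.search returns None when the line does not match
    tempUrl, tempWidth, tempHeight, tempAlt = match.groups()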

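All of the values are strings, so cast them if you need numbers (for example with int()).

Also, be aware that copying and pasting regular expressions is a bad idea. It is error-prone.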
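
If you do stay with regular expressions, one way to make the above a little less brittle is to look up each attribute separately, so the order and letter case of the attributes no longer matter, and to cast the numeric values. This is only a sketch: `get_attr` and `line` are hypothetical names, it assumes every value is double-quoted, and it is still not a substitute for a real HTML parser.

```python
import re


def get_attr(name, markup):
    """Return the double-quoted value of attribute `name`, or None if absent.

    Hypothetical helper: it matches ` name="value"` case-insensitively and
    knows nothing about HTML structure beyond that.
    """
    pattern = r'\s{}\s*=\s*"([^"]*)"'.format(re.escape(name))
    match = re.search(pattern, markup, re.IGNORECASE)
    return match.group(1) if match else None


# The line from the question, as it would be read from the file.
line = ('<center><img border="0" '
        'src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg" '
        'WIDTH="500" HEIGHT="375" '
        'alt="Looking up the Merced River Canyon towards Bridalveil Fall '
        'from the Big Oak Flat Road" ***PINIT***></center>'
        '<br clear="all"><br clear="all">')

tempUrl = get_attr("src", line)
tempWidth = int(get_attr("width", line))    # cast the string "500" to an int
tempHeight = int(get_attr("height", line))
tempAlt = get_attr("alt", line)
```

Note that `int()` raises a `TypeError` when `get_attr` returns `None`, so check for a missing attribute first if your lines are not all shaped like this one.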