Scraping part of the HTML with Python 3


I want to format this HTML snippet to my liking.

Ignore the * characters in the tag below.

I only want the "College Forum" part:

<*strong class="linkBlack">College Forum</strong*>

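I have tried many things, including regular expressions, translate, and even replace, but I can't seem to find a way to pull the name out of the HTML.

More code (Grade Grabber 2000): http://pastebin.com/DMzZpZpp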

Source

2017-02-23 PinkChicken

filter(None, re.split()) will do!

>>> #st is the input and res is the list 
>>> st="""'<strong class="linkBlack">College Forum</strong>, <strong class="linkBlack">Intro Info Tech</strong>, <strong class="linkBlack">Earth Science</strong>, <strong class="linkBlack">Sec. Math 1</strong>, <strong class="linkBlack">Astronomy</strong>, <strong class="linkBlack">Computer Tech</strong>, <strong class="linkBlack">Human Geography H</strong>, <strong class="linkBlack">English 9</strong>, <strong class="linkBlack">Sec. Math 1</strong>, <strong class="linkBlack">Chess</strong>, <strong class="linkBlack">College Forum</strong>, <strong class="linkBlack">Intro Info Tech</strong>, <strong class="linkBlack">Earth Science</strong>, <strong class="linkBlack">Sec. Math 1</strong>, <strong class="linkBlack">Astronomy</strong>, <strong class="linkBlack">Computer Tech</strong>, <strong class="linkBlack">Human Geography H</strong>, <strong class="linkBlack">English 9</strong>, <strong class="linkBlack">Sec. Math 1</strong>, <strong class="linkBlack">College Forum</strong>, <strong class="linkBlack">A+ Comp Rep/Maint</strong>, <strong class="linkBlack">Earth Science</strong>, <strong class="linkBlack">Sec. Math 1</strong>, <strong class="linkBlack">Aikido</strong>, <strong class="linkBlack">Exploring Comp Sci</strong>, <strong class="linkBlack">World History H</strong>, <strong class="linkBlack">English 9</strong>, <strong class="linkBlack">Sec. Math 1</strong>, <strong class="linkBlack">College Forum</strong>, <strong class="linkBlack">A+ Comp Rep/Maint</strong>, <strong class="linkBlack">Earth Science</strong>, <strong class="linkBlack">Sec. Math 1</strong>, <strong class="linkBlack">Aikido</strong>, <strong class="linkBlack">Exploring Comp Sci</strong>, <strong class="linkBlack">World History H</strong>, <strong class="linkBlack">English 9</strong>, <strong class="linkBlack">Sec. Math 1</strong>'""" 
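>>> import re
>>> # split the string on commas; for each piece, strip surrounding
>>> # whitespace and stray quotes, then split on the opening and closing tags.
>>> # filter(None, ...) drops the empty strings and None entries; on Python 3
>>> # it returns an iterator, so wrap it in list() before indexing.
>>> res = [list(filter(None, re.split(r'(<strong class="linkBlack">)|(</strong>)', s.strip(" '"))))[1] for s in st.split(",")]
>>> res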
['College Forum', 'Intro Info Tech', 'Earth Science', 'Sec. Math 1', 'Astronomy', 'Computer Tech', 'Human Geography H', 'English 9', 'Sec. Math 1', 'Chess', 'College Forum', 'Intro Info Tech', 'Earth Science', 'Sec. Math 1', 'Astronomy', 'Computer Tech', 'Human Geography H', 'English 9', 'Sec. Math 1', 'College Forum', 'A+ Comp Rep/Maint', 'Earth Science', 'Sec. Math 1', 'Aikido', 'Exploring Comp Sci', 'World History H', 'English 9', 'Sec. Math 1', 'College Forum', 'A+ Comp Rep/Maint', 'Earth Science', 'Sec. Math 1', 'Aikido', 'Exploring Comp Sci', 'World History H', 'English 9', 'Sec. Math 1']
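
If the markup ever varies (extra attributes, nested tags, HTML entities), a regex split can become fragile. As an alternative, here is a minimal sketch using the standard library's `html.parser` instead of `re.split`. The names `StrongTextParser` and `extract_names` are made up for this example, and it assumes the `class` attribute is exactly `linkBlack`, as in the snippet from the question.

```python
from html.parser import HTMLParser


class StrongTextParser(HTMLParser):
    """Collect the text inside every <strong class="linkBlack"> element."""

    def __init__(self):
        super().__init__()
        self.names = []        # extracted names, in document order
        self._inside = False   # True while inside a matching <strong>

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "strong" and ("class", "linkBlack") in attrs:
            self._inside = True
            self.names.append("")

    def handle_endtag(self, tag):
        if tag == "strong":
            self._inside = False

    def handle_data(self, data):
        # Text outside a matching <strong> (commas, stray quotes) is ignored.
        if self._inside:
            self.names[-1] += data


def extract_names(html):
    """Hypothetical helper: return the text of all matching <strong> tags."""
    parser = StrongTextParser()
    parser.feed(html)
    parser.close()
    return parser.names


sample = ('<strong class="linkBlack">College Forum</strong>, '
          '<strong class="linkBlack">A+ Comp Rep/Maint</strong>')
print(extract_names(sample))
```

Calling `extract_names(st)` on the string above should give the same list as `res`, without having to strip the stray quotes first.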

Hope this helps!

Source

2017-02-27 03:29:52

I figured it out, but thanks :) – PinkChicken
