Improving a Python snippet

I'm working on a Python script to do some web scraping. I want to find the base URL of a given section of a web page that looks like this:

<div class='pagination'> 
    <a href='webpage-category/page/1'>1</a> 
    <a href='webpage-category/page/2'>2</a> 
    ... 
</div>

So I just need everything from the first href except the number ('webpage-category/page/'), and I have the following working code:

pages = [l['href'] for link in soup.find_all('div', class_='pagination') 
    for l in link.find_all('a') if not re.search('pageSub', l['href'])] 

s = pages[0] 
f = ''.join([i for i in s if not i.isdigit()])

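The problem is that building this list is wasteful, since I only need the first href. I think a generator would be the answer, but I couldn't work it out. Maybe you can help me make this code more concise?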


2014-03-13 XVirtusX

How about this:

from bs4 import BeautifulSoup 

html = """ <div class='pagination'> 
    <a href='webpage-category/page/1'>1</a> 
    <a href='webpage-category/page/2'>2</a> 
</div>""" 

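# Name the parser explicitly; newer bs4 versions warn when it is omitted.
soup = BeautifulSoup(html, 'html.parser')

# find() returns only the first match, so no list is built.
link = soup.find('div', {'class': 'pagination'}).find('a')['href']

# Drop the last path segment (the page number).
print('/'.join(link.split('/')[:-1]))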

Prints:

webpage-category/page
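
If the trailing slash shown in the question ('webpage-category/page/') matters, a minimal sketch along these lines may help. It assumes the page number is always the last path segment. The helper name base_of and the second sample href are hypothetical, used only for illustration. Note that dropping every digit, as the question's snippet does, would also strip digits that appear elsewhere in the path.

```python
def base_of(href):
    """Return href without its last path segment, keeping the trailing slash."""
    # rsplit with maxsplit=1 cuts only at the last slash,
    # so digits elsewhere in the path are left alone.
    return href.rsplit('/', 1)[0] + '/'


print(base_of('webpage-category/page/1'))  # webpage-category/page/
print(base_of('mp3-category/page/12'))     # mp3-category/page/ (hypothetical href)
```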

FYI, regarding the code you provided: you can use next() instead of a list comprehension:

s = next(l['href'] for link in soup.find_all('div', class_='pagination') 
     for l in link.find_all('a') if not re.search('pageSub', l['href']))
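
To illustrate why this avoids the wasted work, here is a small sketch that runs without Beautiful Soup. The list of hrefs and the helper wanted() are assumptions standing in for the tags found on the page; the point is only that next() stops at the first match, and that a default value avoids a StopIteration when nothing matches.

```python
import re

# Hypothetical hrefs, standing in for the <a> tags found on the page.
hrefs = [
    'webpage-category/pageSub/1',
    'webpage-category/page/1',
    'webpage-category/page/2',
]

checked = []  # records which hrefs the filter actually looked at


def wanted(href):
    checked.append(href)
    return not re.search('pageSub', href)


# Generator expression: nothing is evaluated until next() asks for a value.
# The second argument is returned if no href passes the filter.
first = next((h for h in hrefs if wanted(h)), None)

print(first)         # webpage-category/page/1
print(len(checked))  # 2, the third href was never examined
```

Without the default, next() raises StopIteration when the generator is empty, so passing None (or another sentinel) is usually the safer choice when the page layout is not guaranteed.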

UPD (using the provided website link):

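from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup


url = "http://www.hdwallpapers.in/cars-desktop-wallpapers/page/2"
soup = BeautifulSoup(urlopen(url), 'html.parser')

# Use the second 'pagination' div (index 1) and collect its links.
links = soup.find_all('div', {'class': 'pagination'})[1].find_all('a')

# Take the first numbered link other than page 1 and drop its last segment.
print(next('/'.join(link['href'].split('/')[:-1]) for link in links
           if link.text.isdigit() and link.text != "1"))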


2014-03-13 17:13:36 alecxe

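Well, you're almost there. But the page actually has two 'pagination' divs, one of which has the following structure ('webpage-category/pageSub/1'). That one doesn't interest me, so I discard it with re. Can you fit all of this into a one-liner? – XVirtusX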

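@XVirtusX OK, sure. Could you show me the relevant HTML or a link to the website? I'm pretty sure this task can be done in a cleaner way than filtering links by 'href' with a regular expression. Thanks. – alecxe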

Website: http://www.hdwallpapers.in/cars-desktop-wallpapers/page/2 – XVirtusX
