Python: excluding a string with a regular expression

I am trying to build a web scraper to get the discounted prices from http://fetch.co.uk/dogs/dog-food?per-page=20

I have the following code:

import re 
from urllib.request import urlopen 
from bs4 import BeautifulSoup 

html = urlopen("http://fetch.co.uk/dogs/dog-food?per-page=20")  # the URL given above
bsObj = BeautifulSoup(html,"html.parser") 

wrapList = bsObj.findAll("",{"class": re.compile("shelf-product__self.*")}) 
for wrap in wrapList: 
    print(wrap.find("",{"itemprop": re.compile("shelf-product__price.*(?!cut).*")}).get_text()) 
    print(wrap.find("",{"class": re.compile("shelf-product__title.*")}).get_text())

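Each wrap sometimes contains two different prices. I am trying to exclude the cut (pre-discount) price and get the price shown below it (the promo price).

I can't figure out how to exclude the cut price; the expression above doesn't work.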

"shelf-product__price shelf-product__price--cut [ v2 ]" 
"shelf-product__price shelf-product__price--promo [ v2 ]"

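I am using the workaround below, but I would like to understand what I got wrong in the regular expression. Sorry if the code isn't pretty, I'm still learning.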

import re 
from urllib.request import urlopen 
from bs4 import BeautifulSoup 

html = urlopen("http://fetch.co.uk/dogs/dog-food?per-page=20")  # the URL given above
bsObj = BeautifulSoup(html,"html.parser") 

wrapList = bsObj.findAll("",{"class": re.compile("shelf-product__self.*")}) 
for wrap in wrapList: 
    print(wrap.find("",{"itemprop": re.compile("price.*")}).get_text()) 
    print(wrap.find("",{"class": re.compile("shelf-product__title.*")}).get_text())

Source

2016-01-24 Elena ZdeG

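The URL mentioned doesn't seem to have any element with 'itemprop="shelf-product__price shelf-product__price--cut [ v2 ]"'; the values used for 'itemprop' are either 'title' or 'price'. That is why the second regex, "price.*", works. – mchackam

@mchackam: It is indeed the 'class' attribute rather than the 'itemprop' attribute, but that isn't the only problem. When an attribute has several space-separated values, the condition is tested against each value separately until one succeeds *(not against the whole attribute)*. In any case the regex is wrong, and a regex isn't a good approach here; it is easier to use a function as the condition. Compiling the pattern inside the loop also slows the code down. –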
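The per-value matching described in the comment above can be seen on a small offline sample. This is only a sketch: the markup is hypothetical, modelled on the two class strings quoted in the question, and the names SAMPLE, not_cut and is_current_price are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, modelled on the class values quoted in the question.
SAMPLE = """
<div class="shelf-product__self">
  <span class="shelf-product__price shelf-product__price--cut [ v2 ]">12.00</span>
  <span class="shelf-product__price shelf-product__price--promo [ v2 ]">9.00</span>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# Compile the pattern once, outside any loop.
not_cut = re.compile(r"^shelf-product__price(?!--cut)")

# The first class value, "shelf-product__price", satisfies the pattern on
# its own, so the cut-price span is matched as well.
by_regex = [tag.get_text() for tag in soup.find_all(class_=not_cut)]

def is_current_price(tag):
    # Plain boolean logic over the list of class values.
    classes = tag.get("class", [])
    return ("shelf-product__price" in classes
            and "shelf-product__price--cut" not in classes)

by_function = [tag.get_text() for tag in soup.find_all(is_current_price)]

print(by_regex)     # both prices
print(by_function)  # only the promo price
```

Because a single class value is enough to satisfy the regex, a negative lookahead cannot rule out an element based on one of its other class values; a function sees the whole list at once.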

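There are a few problems. The first is that .*(?!cut).* is equivalent to .*. That is because the first .* consumes all the remaining characters. The (?!cut) check then passes, of course, because it is evaluated at the end of the string. Finally, the last .* consumes 0 characters. So the pattern always matches, and this regex would give you false positives. The only reason it returns nothing at all is that you are searching in itemprop while the text you are looking for is in class.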
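The behaviour described above can be checked with the re module alone. The sketch below runs the question's pattern against the two class strings from the question, then tries one possible alternative that anchors the lookahead at the start; the variable names are illustrative only.

```python
import re

cut = "shelf-product__price shelf-product__price--cut [ v2 ]"
promo = "shelf-product__price shelf-product__price--promo [ v2 ]"

# Pattern from the question: the first .* runs to the end of the string,
# where (?!cut) holds trivially, so both strings match.
original = re.compile(r"shelf-product__price.*(?!cut).*")
original_results = (bool(original.match(cut)), bool(original.match(promo)))

# Placing the lookahead at the start makes it scan the whole string for "cut".
anchored = re.compile(r"^(?!.*cut)shelf-product__price")
anchored_results = (bool(anchored.match(cut)), bool(anchored.match(promo)))

print(original_results)  # (True, True)
print(anchored_results)  # (False, True)
```

The anchored pattern only helps when it is applied to the whole attribute string. As the comment above points out, BeautifulSoup tests the values of a multi-valued attribute such as class one at a time, so this fix alone would not carry over to a class filter.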

Your workaround looks fine to me. But if you want to search on class, this is how I would do it.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://fetch.co.uk/dogs/dog-food?per-page=20')
bsObj = BeautifulSoup(html, "html.parser")

wrapList = bsObj.findAll("", {"class": "shelf-product__self"})

def is_price(tag):
    # True for a price element that is not the "cut" (pre-discount) price
    return (tag.has_attr('class')
            and 'shelf-product__price' in tag['class']
            and 'shelf-product__price--cut' not in tag['class'])

for wrap in wrapList:
    print(wrap.find(is_price).text)
    print(wrap.find("", {"class": "shelf-product__title"}).get_text())

Regular expressions are great, but I think boolean logic is easier to express with booleans.

Source

2016-01-24 15:40:04

You can also avoid the first regex. –

Sure, edited for consistency. –

Why use such complicated code? You can try the following: span[itemprop=price] means "select every span whose itemprop attribute is price".

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

# get possible list of urls
urls = ['http://fetch.co.uk/dogs/dog-food?per-page=%s' % n for n in range(1, 100)]

for url in urls:
    html = urlopen(url)
    bsObj = BeautifulSoup(html, "html.parser")
    for y in [i.text for i in bsObj.select("span[itemprop=price]")]:
        # Python 2 print statement; under Python 3 this would be print(y)
        print y.encode('utf-8')

Source

2016-01-24 16:30:25 SIslam

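Using select seems reasonable, but the code has some problems. It uses Python 2, whereas the question uses Python 3. It tries different per-page values and I don't know why (that isn't a page number). 'respons.content' should be 'html'. The ''t' in ..]' part does nothing. The prices should also be associated with the product names. That last point may prevent you from using select. –

OK, edited that typo and the copy-paste error – SIslam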
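One possible way to address the pairing concern raised in the comments while still using CSS selectors is to run the selectors inside each product wrapper rather than over the whole page. The sketch below is written for Python 3 and runs on a hypothetical offline sample; the markup, the product names and the assumption that the current price carries itemprop="price" are illustrative and not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page structure may differ.
SAMPLE = """
<div class="shelf-product__self">
  <span class="shelf-product__title">Dry food 2kg</span>
  <span class="shelf-product__price shelf-product__price--cut">12.00</span>
  <span class="shelf-product__price shelf-product__price--promo" itemprop="price">9.00</span>
</div>
<div class="shelf-product__self">
  <span class="shelf-product__title">Wet food 400g</span>
  <span class="shelf-product__price" itemprop="price">1.50</span>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

products = []
for wrap in soup.select("div.shelf-product__self"):
    # Searching inside one wrapper keeps each title with its own price.
    title = wrap.select_one(".shelf-product__title")
    price = wrap.select_one("span[itemprop=price]")
    if title is not None and price is not None:
        products.append((title.get_text(strip=True), price.get_text(strip=True)))

for name, price in products:
    print(name, price)
```

Scoping the selectors to a wrapper costs one extra loop, but it avoids relying on the titles and prices appearing in the same order across the page.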
