Web scraping: splitting HTML content

I am scraping a website, and I have managed to reduce a variable called "gender" to this:

[<span style="text-decoration: none;"> 
         Lass Christian, du Danemark, à Yverdon-les-Bains, avec 200 parts de CHF 100 
        </span>, <span style="text-decoration: none;">associé gérant </span>]

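Now I would like to have only "associé" in the variable, but I cannot find a way to split this HTML.

The reason is that I want to know whether it is "associé" (male) or "associée" (female).

Does anyone have an idea?

Cheers

-----EDIT---- Here is my code, which produces the HTML output above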

import requests
from bs4 import BeautifulSoup

url = "http://www.rc2.vd.ch/registres/hrcintapp-pub/companyReport.action?rcentId=5947621600000055031025&lang=FR&showHeader=false"

r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
# select_one returns only the first tag that matches the selector
table = soup.select_one("#adm").find_next("table")
table2 = soup.select_one("#adm").find_all_next("table")

# the attribute value is quoted because it contains a colon
output = table.select('td span[style^="text-decoration:"]', limit=2)  # .text.split(",", 1)[0].strip()

print(output)

Source

2016-09-15 jjyoh

Could you show the code that produced this output? Thanks. – alecxe

Yes, sure, I am editing now – jjyoh

Whatever the parent of these two elements is, you can use span:nth-of-type(2) to get the second span, then check its text:

from bs4 import BeautifulSoup

html = """<span style="text-decoration: none;">
         Lass Christian, du Danemark, à Yverdon-les-Bains, avec 200 parts de CHF 100
        </span>
      <span style="text-decoration: none;">associé gérant </span>"""

# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")

text = soup.select_one("span:nth-of-type(2)").text
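If BeautifulSoup is not at hand, the same "take the second span" idea can be approximated with the standard library. This is only a sketch that swaps in html.parser for the CSS selector, so it picks the second span in document order rather than the second sibling; `SpanTextCollector` is a made-up name for illustration:

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text of every outermost <span>, in document order."""

    def __init__(self):
        super().__init__()
        self.spans = []   # one string per outermost <span>
        self._depth = 0   # how many <span> tags are currently open

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._depth += 1
            if self._depth == 1:
                self.spans.append("")

    def handle_endtag(self, tag):
        if tag == "span" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        # Text inside nested spans is attributed to the outermost one.
        if self._depth > 0:
            self.spans[-1] += data


html = """<span style="text-decoration: none;">
         Lass Christian, du Danemark, à Yverdon-les-Bains, avec 200 parts de CHF 100
        </span>
      <span style="text-decoration: none;">associé gérant </span>"""

collector = SpanTextCollector()
collector.feed(html)
collector.close()

second_span_text = collector.spans[1].strip()
print(second_span_text)
```

For anything beyond a small fragment like this one, the selector-based version is likely the more robust choice.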

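Or, if it is not always the second span, you can search for the span by the partial text associé:

import re
# Python 3: use a plain r"" literal (the ur"" prefix exists only in Python 2)
text = soup.find("span", string=re.compile(r"associé")).text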
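Since the goal is to tell associé from associée, the pattern itself can capture the ending. A minimal sketch that works on the already-extracted text rather than on the soup; `ROLE_RE` and `role_form` are hypothetical names, and it assumes the page uses precomposed "é" characters:

```python
import re

# Matches "associé" or "associée" as a whole word;
# group(1) holds the optional feminine "e".
ROLE_RE = re.compile(r"\bassocié(e?)\b")


def role_form(text):
    """Return 'associée', 'associé', or None if neither form occurs in text."""
    match = ROLE_RE.search(text)
    if match is None:
        return None
    return "associée" if match.group(1) else "associé"


print(role_form("associé gérant "))
print(role_form("associée gérante"))
```

The word boundary after the optional "e" keeps the masculine form from matching as a prefix of the feminine one.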

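With your edit, all you need to do is take the text of the last element and use .split(None, 1) to pick out the word that carries the gender:

text = table.select('td span[style^="text-decoration:"]', limit=2)[-1].text
gender = text.split(None, 1)[1]  # second word, e.g. gérant; index [0] gives associé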
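To turn that text into the male/female answer the question asks for, one option is to look at the first word. A rough sketch, assuming the role text always starts with associé or associée; `gender_from_role` is a hypothetical helper, not part of BeautifulSoup:

```python
import unicodedata


def gender_from_role(text):
    """Classify a role string such as 'associé gérant ' by its first word."""
    # Normalize so that "é" is one code point however the page encoded it.
    words = unicodedata.normalize("NFC", text).split()
    if not words:
        return None
    first = words[0].lower()
    if first == "associée":
        return "female"
    if first == "associé":
        return "male"
    return None  # unexpected role text: leave it undecided


print(gender_from_role("associé gérant "))
print(gender_from_role("associée gérante"))
```

Returning None for anything unexpected makes rows with a different role easy to spot instead of silently mislabelling them.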

Source

2016-09-15 20:47:11

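It gives me an error: TypeError: expected string or buffer – jjyoh

@J.jaques, nothing in my code would do that if you use it correctly. What exactly are you passing in? –

It works perfectly! Thanks a lot! – jjyoh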
