
Scraping articles from a website with Python 3.4 and BeautifulSoup

https://xueqiu.com/yaodewang 

I want to scrape all of this user's articles. I am using BeautifulSoup and requests like this:

import requests 
from bs4 import BeautifulSoup 
url = 'https://xueqiu.com/yaodewang' 
header = {'user-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36'} 
r = requests.get(url,headers = header).content 
soup = BeautifulSoup(r,'lxml') 
artile = soup.find_all('ul',{'class':'status-list'}) 
print(artile) 

But this is what it returns:

[] 

So I tried other selectors, like these:

# art = soup.find_all('div',{'class':'allStatuses no-head'}) 
# art = soup.find_all('div',{'class':'status_bd'}) 
# art = soup.find_all('div',{'class':'status_content container active tab-pane'}) 

But these returned some incorrect text rather than the article content I want.

I need your help. Thank you very much!

Answers


The data you want is not actually inside an element with the status-list class. If you view the page source, you will find an empty container instead:

<div class="status_bd"> 
    <div id="statusLists" class="allStatuses no-head"></div> 
</div> 

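Instead, the statuses are all located inside a script element. You need to find that element, extract the desired object from it, load it from JSON into a Python dictionary, and pull out the information you need: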

import json 
import re 
import requests 
from bs4 import BeautifulSoup 

url = 'https://xueqiu.com/yaodewang' 
headers = { 
    'user-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36' 
} 
r = requests.get(url, headers=headers).content 
soup = BeautifulSoup(r, 'lxml') 

pattern = re.compile(r"SNB\.data\.statuses = ({.*?});", re.MULTILINE | re.DOTALL) 
script = soup.find("script", text=pattern) 

data = json.loads(pattern.search(script.text).group(1)) 
for item in data["statuses"]: 
    print(item["description"]) 

Prints:

The best advice: Remember common courtesy and act toward others as you want them to act toward you. 
Lighten up! It&#39;s the weekend. we&#39;re just having a little fun! Industrial Bank is expected to rise,next week... 
... 
點.點.點... 點到這個,學位、學歷、成績單翻譯一下要50塊、100塊的... 
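
To try the technique without hitting the live site, here is a minimal offline sketch. The sample HTML and its two statuses are invented for illustration; only the shape of the `SNB.data.statuses = {...};` assignment comes from the answer above. It applies the same regular expression directly to the page text (skipping BeautifulSoup to stay within the standard library) and returns the descriptions as a list.

```python
import json
import re

# Hypothetical sample page: the structure mimics the one described in the
# answer, but the statuses themselves are made up for illustration.
SAMPLE_HTML = """
<html><body>
<div class="status_bd">
    <div id="statusLists" class="allStatuses no-head"></div>
</div>
<script>
SNB.data.statuses = {"statuses": [{"description": "first post"},
                                  {"description": "second post"}]};
SNB.data.other = {"ignored": true};
</script>
</body></html>
"""

# Same pattern as in the answer: capture the object literal lazily,
# up to the first "};" that follows the assignment.
pattern = re.compile(r"SNB\.data\.statuses = ({.*?});", re.DOTALL)


def extract_descriptions(html):
    """Return the description of every status embedded in the page text."""
    match = pattern.search(html)
    if match is None:
        # The assignment is missing, e.g. because the page layout changed.
        return []
    data = json.loads(match.group(1))
    return [item["description"] for item in data["statuses"]]


descriptions = extract_descriptions(SAMPLE_HTML)
print(descriptions)
```

One caveat with the lazy `.*?` match: it stops at the first `};` it sees, so it would cut the object short if a description itself contained that character sequence.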

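Thank you very much. It is a correct method. But I would like to know: if I know the content is located in a script, how do I come up with a regular expression like this: pattern = re.compile(r"SNB\.data\.statuses = ({.*?});", re.MULTILINE | re.DOTALL)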


Another question: I want to get a list of articles, but right now I get a string. I want a result like this: [str01, str02, ...]


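@championCh Sure, just extract the script text and experiment with it, for example on [regex101](https://regex101.com/). As for your second question, I think you are asking how to put the results into a list: `articles = [item["description"] for item in data["statuses"]]`. Hope that helps. – alecxe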
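
As a rough illustration of how such a pattern can be worked out, the sketch below follows the steps suggested in the comment above: look at the script text, find the literal variable name, then turn it into a regular expression with a capture group. The `script_text` string is a made-up stand-in for the real script contents, and `marker` and `snippet` are hypothetical helper names.

```python
import json
import re

# Made-up stand-in for the text of the script element.
script_text = (
    'var a = 1;\n'
    'SNB.data.statuses = {"statuses": [{"description": "hello"}]};\n'
    'var b = 2;'
)

# Step 1: locate the literal variable name and inspect what follows it,
# to see how the assignment starts ("= {") and ends ("};").
marker = "SNB.data.statuses"
start = script_text.find(marker)
snippet = script_text[start:start + 40]
print(snippet)

# Step 2: escape the literal (so each "." matches a real dot) and capture
# everything from the opening brace up to the first "};".
pattern = re.compile(re.escape(marker) + r" = ({.*?});", re.DOTALL)

# Step 3: parse the captured JSON and collect the descriptions in a list.
data = json.loads(pattern.search(script_text).group(1))
articles = [item["description"] for item in data["statuses"]]
print(articles)
```

Pasting the same script text and pattern into regex101 shows which part of the text each piece of the pattern matches, which makes it easier to adjust when the page changes.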