2017-04-22

How to scrape a website using the BeautifulSoup library in Python

I am trying to parse the comment data hidden behind this page's "View more" option.

I need 1,000 comments, but by default the page shows only 10. I cannot figure out a way to click "View more" and then read the content it reveals. This is the code I have so far:

import urllib.request 
from bs4 import BeautifulSoup 
import sys 

# map astral-plane characters to U+FFFD so print() cannot fail on narrow consoles
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd) 

response = urllib.request.urlopen("https://www.mygov.in/group-issue/share-your-ideas-pm-narendra-modis-mann-ki-baat-26th-march-2017/") 
srcode = response.read() 

soup = BeautifulSoup(srcode, "html.parser") 

all_comments_div = soup.find_all('div', class_="comment_body") 

all_comments = [] 
for div in all_comments_div: 
    all_comments.append(div.find('p').text.translate(non_bmp_map)) 

print(all_comments) 
print(len(all_comments)) 

How are you trying to click "View more" … with Selenium, or are you grabbing the "next" href and requesting that page directly? – pbuck
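The second approach pbuck mentions (grabbing the "next" href rather than clicking) can be sketched offline. The pager fragment below is made up, but it uses the same class names as the MyGov markup relied on later in this thread:

```python
from bs4 import BeautifulSoup

# an illustrative pager fragment; class names mirror the MyGov page
html = '''
<ul class="pager">
  <li class="pager-next first last"><a href="/group-issue/some-topic?page=1">next</a></li>
</ul>
'''

soup = BeautifulSoup(html, 'html.parser')
# class_ with spaces matches the exact class attribute string
li = soup.find('li', class_='pager-next first last')
next_href = li.find('a').get('href') if li else None
print(next_href)  # /group-issue/some-topic?page=1
```

Once you have `next_href`, you request that URL with `urllib.request.urlopen` (or `requests`) exactly as in the question's code.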

Answers


You can use a while loop to fetch the following pages
(i.e. loop while there is a next page and fewer than 1,000 comments have been collected):

import urllib.request 
from bs4 import BeautifulSoup 
import sys 

non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd) 
all_comments = [] 
max_comments = 1000 
base_url = 'https://www.mygov.in' 
next_page = base_url + '/group-issue/share-your-ideas-pm-narendra-modis-mann-ki-baat-26th-march-2017/' 

while next_page and len(all_comments) < max_comments: 
    response = urllib.request.urlopen(next_page) 
    srcode = response.read() 
    soup = BeautifulSoup(srcode, "html.parser") 

    all_comments_div = soup.find_all('div', class_="comment_body") 
    for div in all_comments_div: 
        all_comments.append(div.find('p').text.translate(non_bmp_map)) 

    # the pager's "next" item holds the relative URL of the following page
    next_page = soup.find('li', class_='pager-next first last') 
    if next_page: 
        next_page = base_url + next_page.find('a').get('href') 
    print('comments: {}'.format(len(all_comments))) 

print(all_comments) 
print(len(all_comments)) 
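One detail worth noting when stitching the pager href onto the base URL: `urllib.parse.urljoin` resolves relative links and normalises stray slashes, so it avoids accidental double slashes when the base ends with `/` and the href starts with one (a small sketch; the paths shown are illustrative):

```python
from urllib.parse import urljoin

base_url = 'https://www.mygov.in/'

# an absolute path href is joined cleanly, with no double slash
print(urljoin(base_url, '/group-issue/some-page?page=1'))
# → https://www.mygov.in/group-issue/some-page?page=1

# a query-only href is resolved against the current page
print(urljoin('https://www.mygov.in/group-issue/some-page/', '?page=2'))
# → https://www.mygov.in/group-issue/some-page/?page=2
```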

The new comments are loaded via ajax, so we need to parse that JSON response first and then hand its HTML to BeautifulSoup, i.e.:

import json 
import requests 
import sys 
from bs4 import BeautifulSoup 

how_many_pages = 5  # how many comment pages do you want to parse? 
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd) 
all_comments = [] 

for x in range(how_many_pages): 
    # note: mygov.in seems very slow... 
    json_data = requests.get( 
        "https://www.mygov.in/views/ajax/?view_name=view_comments&view_display_id=block_2" 
        "&view_args=267721&view_path=node%2F267721&view_base_path=comment_pdf_export" 
        "&view_dom_id=f3a7ae636cabc2c47a14cebc954a2ff0&pager_element=1" 
        "&sort_by=created&sort_order=DESC&page=0,{}".format(x)).content 
    d = json.loads(json_data.decode())  # remove .decode() for Python < 3 
    print(len(d)) 
    if len(d) == 3:    # sometimes the json length is 3 
        comments = d[2]['data']  # 'data' is the key that contains the comments html 
    elif len(d) == 2:  # other times just 2... 
        comments = d[1]['data'] 
    else:              # unexpected shape: skip this page 
        continue 

    # from here on, we can reuse your BeautifulSoup code 
    soup = BeautifulSoup(comments, "html.parser") 
    all_comments_div = soup.find_all('div', class_="comment_body") 

    for div in all_comments_div: 
        all_comments.append(div.find('p').text.translate(non_bmp_map)) 


print(all_comments) 

Output

["Sir my humble submission is that please ask public not to man handle doctors because they work in a very delicate situation, to save a patient is not always in his hand. The incidents of manhandling doctors is increasing day by day and it's becoming very difficult to work in these situatons. Majority are not Opting for medical profession,..."]
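Since the position of the `'data'` entry shifts between responses, an alternative to branching on `len(d)` is to scan the list for the first element that carries a `'data'` key. This is a sketch, assuming the payload is a JSON list of command dicts as in the answer above; the sample payloads are made up:

```python
def extract_comments_html(commands):
    """Return the 'data' payload from an ajax command list, or None if absent."""
    for cmd in commands:
        if isinstance(cmd, dict) and 'data' in cmd:
            return cmd['data']
    return None

# illustrative payloads mirroring the two shapes seen above
print(extract_comments_html([{'command': 'settings'}, {'data': '<div>hi</div>'}]))
# → <div>hi</div>
print(extract_comments_html([{'command': 'settings'}]))
# → None
```

This way a third response shape does not leave `comments` unbound; you just get `None` and can skip that page.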