I need help with web scraping

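I want to scrape visualizations from visual.ly, but I don't understand how the "show more" button works. So far, my code gets the image link, the text next to the image, and the link to the page. I'd like to know what the "show more" button does, because I plan to loop over the pages. Right now I don't know how to step through each one. Any ideas on how I can loop and keep getting more images than the ones shown initially?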

from BeautifulSoup import BeautifulSoup 
import urllib2 
import HTMLParser 
import urllib, re 

counter = 1 
columnno = 1 
parser = HTMLParser.HTMLParser() 

soup = BeautifulSoup(urllib2.urlopen('http://visual.ly/?view=explore&type=static#v2_filter').read()) 

image = soup.findAll("div", attrs = {'class': 'view-mode-wrapper'}) 

if columnno < 4: 
    column = image[0].findAll("div", attrs = {'class': 'v2_grid_column'}) 
    columnno += 1 
else: 
    column = image[0].findAll("div", attrs = {'class': 'v2_grid_column last'}) 

visualizations = column[0].findAll("div", attrs = {'class': '0 v2_grid_item viewmode-item'}) 

getImage = visualizations[0].find("a") 

print counter 

print getImage['href'] 

soup1 = BeautifulSoup(urllib2.urlopen(getImage['href']).read()) 

theImage = soup1.findAll("div", attrs = {'class': 'ig-graphic-wrapper'}) 

text = soup1.findAll("div", attrs = {'class': 'ig-content-right'}) 

getText = text[0].findAll("div", attrs = {'class': 'ig-description right-section first'}) 

imageLink = theImage[0].find("a") 

print imageLink['href'] 

print getText 

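for row in image: 
    # Look at the current wrapper rather than always the first one
    theImage = row.find("a") 
    # URL of the first link found in this wrapper
    link = theImage['href'] 

    # Flip to True to save each linked file under its last path segment
    actually_download = False 
    if actually_download: 
        filename = link.split('/')[-1] 
        urllib.urlretrieve(link, filename) 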

counter += 1

Source

2012-07-25 user1497050

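Do you have a web developer toolbar installed in your browser? I find it very useful for visualizing (no pun intended) image data, button actions, links, and so on. – Lenna 2012-07-25 18:58:03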

Do the printed links point to the right resources? That would be the first debugging step. – 2012-07-25 19:05:36

No, I don't have a web developer toolbar, unless you mean Firebug? – user1497050 2012-07-25 19:16:13

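You can't use a urllib-plus-parser combination to click "show more" here, because the site uses JavaScript to load the additional content. For that you would need a full-fledged browser emulator with JavaScript support. I have never used Selenium, but I have heard that it does this and that it has Python bindings.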

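However, I found that the site uses a very predictable URL of the form

http://visual.ly/?page=<page_number>

for its GET requests. Perhaps an easier approach is to take the

<div class="view-mode-wrapper">...</div>

element from each of those pages and parse the data out of it (using the URL format above). After all, the AJAX request has to go somewhere.

Then you can do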

for i in xrange(<whatever>): 
    url = r'http://visual.ly/?page={pagenum}'.format(pagenum=i) 
    #do whatever you want from here
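
To make the loop above more concrete, here is a minimal Python 3 sketch of the same idea, using only the standard library instead of BeautifulSoup. The names GridLinkParser, extract_links and scrape_pages are made up for illustration, and several things are assumptions rather than facts about the site: that numbering starts at page 0 as in the loop above, that every page still wraps its items in the view-mode-wrapper div, and that a page contributing no new links means there is nothing more to fetch.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# URL pattern taken from the answer above.
BASE_URL = "http://visual.ly/?page={pagenum}"


class GridLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside the wrapper <div>."""

    def __init__(self, wrapper_class="view-mode-wrapper"):
        super().__init__()
        self.wrapper_class = wrapper_class
        self.depth = 0  # <div> nesting depth inside the wrapper; 0 means outside
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            if self.depth or self.wrapper_class in classes:
                self.depth += 1
        elif tag == "a" and self.depth and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


def extract_links(html):
    """Return every link found inside the wrapper div of one page."""
    parser = GridLinkParser()
    parser.feed(html)
    return parser.links


def scrape_pages(max_pages, fetch=None):
    """Walk ?page=0 .. max_pages-1 and stop early once a page adds nothing new."""
    if fetch is None:
        # Default fetcher performs a real GET request (assumes UTF-8 pages).
        fetch = lambda url: urlopen(url).read().decode("utf-8", "replace")
    seen = []
    for pagenum in range(max_pages):
        added = 0
        for link in extract_links(fetch(BASE_URL.format(pagenum=pagenum))):
            if link not in seen:
                seen.append(link)
                added += 1
        if not added:
            break
    return seen
```

Calling scrape_pages(20) would request up to twenty pages from the live site. The fetch parameter exists only so the loop can be tried against canned HTML first, which also makes it easy to check that the wrapper class is still the right thing to look for.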

Source

2012-08-03 18:32:53
