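Retrieve all information from a page with BeautifulSoup

I'm trying to scrape the product URLs from an Old Navy web page. However, it returns only part of the product list rather than all of it (for example, when there are more than 8 URLs, it gives only 8). I hope someone can help identify what the problem might be.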

from bs4 import BeautifulSoup 
from selenium import webdriver 
import html5lib 
import platform 
import urllib 
import urllib2 
import json 


link = http://oldnavy.gap.com/browse/category.do?cid=1035712&sop=true 
base_url = "http://www.oldnavy.com" 

driver = webdriver.PhantomJS() 
driver.get(link) 
html = driver.page_source 
soup = BeautifulSoup(html, "html5lib") 
bigDiv = soup.findAll("div", class_="sp_sm spacing_small") 
for div in bigDiv: 
    links = div.findAll("a") 
    for i in links: 
        j = j + 1 
        productUrl = base_url + i["href"] 
        print productUrl 

This code doesn't work: the URL isn't in quotes (" ") and there is an error with 'j'. Check your code before asking a question. – furas

Answer

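This page uses JavaScript to load elements, but it loads them only as you scroll down the page.

This is called "lazy loading".

So you have to scroll the page before reading its source.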

from selenium import webdriver 
from bs4 import BeautifulSoup 
import time 

link = "http://oldnavy.gap.com/browse/category.do?cid=1035712&sop=true" 
base_url = "http://www.oldnavy.com" 

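# PhantomJS support was deprecated and later removed from Selenium;
# with a current Selenium, a headless Chrome or Firefox driver can be substituted here.
driver = webdriver.PhantomJS() 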
driver.get(link) 

# --- 

# scrolling 

lastHeight = driver.execute_script("return document.body.scrollHeight") 
#print(lastHeight) 

pause = 0.5 
while True: 
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") 
    time.sleep(pause) 
    newHeight = driver.execute_script("return document.body.scrollHeight") 
    if newHeight == lastHeight: 
        break 
    lastHeight = newHeight 
    #print(lastHeight) 

# --- 

html = driver.page_source 
soup = BeautifulSoup(html, "html5lib") 

#driver.find_element_by_class_name 

divs = soup.find_all("div", class_="sp_sm spacing_small") 
for div in divs: 
    links = div.find_all("a") 
    for a in links:  # renamed so the loop does not overwrite the page URL in `link`
        print(base_url + a["href"])
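
The scroll loop above stops only when the page height stops changing, so a page that keeps growing would keep it running indefinitely. Below is a minimal sketch of the same loop wrapped in a function with an upper bound on the number of scrolls. The names `scroll_to_bottom` and `max_rounds` are made up for this illustration and are not part of Selenium; the only driver method used is `execute_script`, as in the code above.

```python
import time


def scroll_to_bottom(driver, pause=0.5, max_rounds=50):
    """Scroll until the document height stops growing or max_rounds is hit.

    Returns the number of scrolls performed.
    `max_rounds` is a hypothetical safety cap, not a Selenium setting.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the current bottom so the page loads the next batch.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        # Give the lazy-loading JavaScript time to add new elements.
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return rounds
```

It would be called as `scroll_to_bottom(driver)` right after `driver.get(link)` and before reading `driver.page_source`. If the page loads slowly, a `pause` of 0.5 seconds may be too short and the loop can stop early, so that value may need tuning.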

Idea: https://stackoverflow.com/a/28928684/1832058
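
One more note on the `base_url + href` concatenation used in the code above: it only works when every `href` is a site-relative path. Here is a small sketch, assuming Python 3, that resolves links with the standard library's `urljoin` and drops duplicates. The helper name `absolute_urls` and the sample `href` values are invented for illustration and are not taken from the real page. On Python 2 the same function lives in the `urlparse` module.

```python
from urllib.parse import urljoin


def absolute_urls(base_url, hrefs):
    """Resolve each href against base_url, keeping order and dropping repeats."""
    seen = set()
    result = []
    for href in hrefs:
        # urljoin leaves absolute URLs alone and resolves relative ones.
        url = urljoin(base_url, href)
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


if __name__ == "__main__":
    # Hypothetical href values, only to show the behaviour.
    sample = [
        "/browse/product.do?pid=1",
        "http://www.oldnavy.com/browse/product.do?pid=2",
        "/browse/product.do?pid=1",
    ]
    for url in absolute_urls("http://www.oldnavy.com", sample):
        print(url)
```

With the answer's code, the list of `href` values could be collected from `a["href"]` inside the inner loop and passed to this helper once at the end.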