2014-10-10

What do I need to do when trying to scrape an article but a certain kind of ad keeps appearing? Specifically, one that pops up in the middle of the screen asking you to log in/register, and which you have to close manually before you can read the page. The ad is breaking my article crawler.

As a result, my crawler cannot extract anything. Any suggestions on how to write code with pyquery that "closes the ad before scraping"?

Edit: I am now also trying Selenium to dismiss the popup. Any advice would be appreciated.

import mechanize 
import time 
import urllib2 
import pdb 
import lxml.html 
import re 
from pyquery import PyQuery as pq 

def open_url(url): 
    print 'open url:', url 
    try: 
        br = mechanize.Browser() 
        br.set_handle_equiv(True) 
        br.set_handle_redirect(True) 
        br.set_handle_referer(True) 
        br.set_handle_robots(False) 
        br.addheaders = [('user-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.3) Gecko/20100423 Ubuntu/10.04 (lucid) Firefox/3.6.3')] 
        response = br.open(url) 
        html = response.get_data() 
        return html 
    except Exception: 
        print u"!!!! url can not be opened by mechanize either!!! \n" 

def extract_text_pyquery(html): 
    p = pq(html) 
    article_whole = p.find(".entry-content") 
    p_tag = article_whole('p') 
    print len(p_tag) 
    print p_tag 
    for i in range(0, len(p_tag)): 
        text = p_tag.eq(i).text() 
        print text 
    entire = p.find(".grid_12") 
    author = entire.find('p') 
    print len(author) 
    print "By:", author.text() 

    images = p.find('#main_photo') 
    link = images('img') 
    print len(link) 
    for i in range(len(link)): 
        url = pq(link[i]) 
        result = url.attr('src').find('smedia') 
        if result > 0: 
            print url.attr('src') 



if __name__ == '__main__': 
    #print '----------------------------------------------------------------' 

    url_list = ['http://www.newsobserver.com/2014/10/17/4240490/obama-weighs-ebola-czar-texas.html?sp=/99/100/&ihp=1', 
      ] 
    html = open_url(url_list[0]) 
    # dissect_article(html) 
    extract_text_pyquery(html) 

Answer

If you are going to scrape this particular site, you can check for the element with id="continue_link" and pull the href from it. Then load that page and scrape it.

For example, the URL in your url_list contains this element:

<a href="http://www.bnd.com/2014/10/10/3447693_rude-high-school-football-players.html?rh=1" id="continue_link" class="wp_bold_link wp_color_link wp_goto_link">Skip this ad</a> 
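To illustrate what "pull the href from it" means, here is a minimal, dependency-free sketch that extracts the link from the snippet above using only the standard library's html.parser (Python 3 shown here, unlike the Python 2 code in the question; the class name ContinueLinkFinder is my own, not part of any library):

```python
from html.parser import HTMLParser

class ContinueLinkFinder(HTMLParser):
    """Collect the href of the first <a id="continue_link"> tag seen."""
    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Only record the first matching anchor.
        if tag == "a" and attrs.get("id") == "continue_link" and self.href is None:
            self.href = attrs.get("href")

snippet = ('<a href="http://www.bnd.com/2014/10/10/3447693_rude-high-school'
           '-football-players.html?rh=1" id="continue_link" '
           'class="wp_bold_link wp_color_link wp_goto_link">Skip this ad</a>')

finder = ContinueLinkFinder()
finder.feed(snippet)
print(finder.href)  # the ad-free article URL
```

In practice you would feed the full interstitial page's HTML to the parser (or use pyquery as below) and then pass the recovered URL back into open_url.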

You can then navigate straight to that link without hitting any kind of ad gateway. I'm more familiar with BeautifulSoup, but it looks like you can do something similar to:

p = pq(html) 
if p.find("#continue_link"): 
    continue_link = p.find("#continue_link") 
    html = open_url(continue_link.attr('href')) 
    extract_text_pyquery(html) 
    return 
# rest of code if there is no continue link