2017-06-22 85 views

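ContentDecodingError when getting text with Beautiful Soup

I have a list of URLs (from HuffPost UK) that I need to get the text from. I store them in a CSV file, but here I have just copied/pasted them into a list. There are two problems with my code (which has worked well with some other publishers in the past).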

  1. It randomly stops with a ContentDecodingError.
  2. It randomly fails to produce any text.

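I say randomly because when I run it several times, it stops at different URLs. Sometimes it prints the text, and sometimes it prints an empty string for the same URL. I don't know what is going on. Can anyone suggest what is wrong? I would really appreciate your help.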

My code:

import codecs 
import translitcodec 
import requests 
from bs4 import BeautifulSoup 

def get_text(url): 
    r = requests.get(url) 
    soup = BeautifulSoup(r.content, "lxml")  
    # delete unwanted tags: 
    for s in soup(['h2', 'figure', 'script', 'style', 'table']): 
     s.decompose() 
    # use separator to separate paragraphs and subtitles! 
    article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all('div', {'class': 'content-list-component text'})]  
    text = ' '.join(article_soup) 
    text = codecs.encode(text, 'translit/one').encode('ascii', 'replace') # transliterate, then replace any remaining non-ASCII characters
    text = u"{}".format(text) # convert back to a unicode string
    print text 
    return text 

urls = ['http://www.huffingtonpost.co.uk/2017/06/21/damian-green-tories-housing-education_n_17244280.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/the-waugh-zone-thursday-june-22-2017_n_17253136.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/argos-toys-christmas-2017_n_17248026.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/ore-oduba-strictly-come-dancing-joanne-clifton_n_17253186.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/joanne-clifton-flashdance-strictly-come-dancing_n_17253268.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/grenfell-tower-cladding-may-have-released-hydrogen-cyanide_n_17252776.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/uk-will-have-to-trawl-through-19000-eu-laws-to-decide-which-ones-to-keep-after-brexit_n_17242732.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/queens-speech-2017-jeremy-corbyn-theresa-may_n_17241446.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/piers-morgan-good-morning-britain-bbc-breakfast-dan-walker-ratings_n_17252222.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/worst-bridezilla-stories-ever-reddit_n_.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/donald-trump-uk-state-visit-shelved-after-no-mention-in-queens-speech-2017_n_17239686.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/failure-may-state_n_17242710.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/queens-speech-2017-13-things-missing-from-theresa-mays-first-one_n_17239692.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/heartbroken-best-man-gatecrashes-bride-and-grooms-wedding-photos-and-its-comedy-gold_n_17253104.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/queens-speech-2017-jeremy-corbyn-mocks-theresa-mays-imploding-minority-government_n_17242692.html?utm_hp_ref=uk', 
'http://www.huffingtonpost.co.uk/2017/06/22/asda-the-little-mermaid-swimsuit-topless_n_17253262.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/20/chaotic-brexit-theresa-may_n_17248024.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/the-waugh-zone-special-queens-speech-2017_n_17246444.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/grenfell-tower-residents-to-be-rehoused-in-luxury-kensington-row-flats_n_17242518.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/gin-does-not-help-relieve-hay-fever-experts-say_n_17243102.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/20/theresa-may-savoy_n_17227558.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/crewe-crane-collapse_n_17243884.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/rebecca-burger-french-fitness-blogger-killed-by-exploding-cream-dispenser_n_17253286.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/05/31/the-waugh-zone-may-31-201_0_n_16891450.html?ir=UK+Politics', 'http://www.huffingtonpost.co.uk/2017/06/22/theresa-may-reveals-tests-show-other-towers-combustible-following-grenfell-tower-fire_n_17253204.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/owen-jones-gleefully-brands-daily-mail-an-open-sewer_n_17253464.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/laura-kenny-interview-ambition-after-pregnancy_n_17252498.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/boris-johnson-radio-4-eddie-mair-two-ronnies_n_17245044.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/grenfell-tower-homes-theresa-may_n_17246764.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/dup-pushover-deal_n_17253218.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/khan-remain-rights_n_17243656.html?utm_hp_ref=uk', 
'http://www.huffingtonpost.co.uk/2017/06/21/love-island-zara-holland-sex-miss-great-britain_n_17242768.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/man-sent-home-from-work-wearing-shorts_n_17243276.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/courteney-cox-fillers-surgery-face_n_17252410.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/jeremy-corbyn-observed-protocol-by-not-bowing-to-the-queen_n_17240658.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/alexandra-shulman-british-vogue-good-morning-britain-the-queen_n_17253200.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/teaching-excellence-framework-results-universities-gold-ranking_n_17253426.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/prince-harry-slams-decision-to-make-him-walk-behind-princess-dianas-coffin_n_17253188.html?utm_hp_ref=uk'] 
for url in urls: 
    print url 
    text = get_text(url) 

Error:

--------------------------------------------------------------------------- 
ContentDecodingError      Traceback (most recent call last) 
<ipython-input-12-54bdf2585415> in <module>() 
    21 for url in urls: 
    22  print url 
---> 23  text = get_text(url) 

<ipython-input-12-54bdf2585415> in get_text(url) 
     5 
     6 def get_text(url): 
----> 7  r = requests.get(url) 
     8  soup = BeautifulSoup(r.content, "lxml") 
     9  # delete unwanted tags: 

/Applications/anaconda/lib/python2.7/site-packages/requests/api.pyc in get(url, params, **kwargs) 
    68 
    69  kwargs.setdefault('allow_redirects', True) 
---> 70  return request('get', url, params=params, **kwargs) 
    71 
    72 

/Applications/anaconda/lib/python2.7/site-packages/requests/api.pyc in request(method, url, **kwargs) 
    54  # cases, and look like a memory leak in others. 
    55  with sessions.Session() as session: 
---> 56   return session.request(method=method, url=url, **kwargs) 
    57 
    58 

/Applications/anaconda/lib/python2.7/site-packages/requests/sessions.pyc in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json) 
    486   } 
    487   send_kwargs.update(settings) 
--> 488   resp = self.send(prep, **send_kwargs) 
    489 
    490   return resp 

/Applications/anaconda/lib/python2.7/site-packages/requests/sessions.pyc in send(self, request, **kwargs) 
    628 
    629   # Resolve redirects if allowed. 
--> 630   history = [resp for resp in gen] if allow_redirects else [] 
    631 
    632   # Shuffle things around if there's history. 

/Applications/anaconda/lib/python2.7/site-packages/requests/sessions.pyc in resolve_redirects(self, resp, req, stream, timeout, verify, cert, proxies, **adapter_kwargs) 
    188     proxies=proxies, 
    189     allow_redirects=False, 
--> 190     **adapter_kwargs 
    191   ) 
    192 

/Applications/anaconda/lib/python2.7/site-packages/requests/sessions.pyc in send(self, request, **kwargs) 
    639 
    640   if not stream: 
--> 641    r.content 
    642 
    643   return r 

/Applications/anaconda/lib/python2.7/site-packages/requests/models.pyc in content(self) 
    795     self._content = None 
    796    else: 
--> 797     self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes() 
    798 
    799   self._content_consumed = True 

/Applications/anaconda/lib/python2.7/site-packages/requests/models.pyc in generate() 
    722      raise ChunkedEncodingError(e) 
    723     except DecodeError as e: 
--> 724      raise ContentDecodingError(e) 
    725     except ReadTimeoutError as e: 
    726      raise ConnectionError(e) 

ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing: incorrect header check',)) 

Answer


I finally managed to solve this. I needed to use Selenium and PhantomJS so that each page could load properly before I parsed it.

Adding this code before creating the soup helped solve the problem:

driver = webdriver.PhantomJS(executable_path='PATH TO phantomjs') 
driver.get(url) 
waitForLoad(driver) 
html = driver.page_source 
soup = BeautifulSoup(html, "lxml") 

I also used the function waitForLoad(driver), as explained in this O'Reilly book.
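
The idea behind waitForLoad is plain polling: check a condition every half second and give up after a fixed timeout. Below is a minimal sketch of that pattern on its own, without Selenium, so it can be run anywhere. The names `wait_until` and `ready_on_third_call` are my own and do not come from the book or from the code in this answer, and `time.time()` is used only so that the sketch also runs on Python 2.7. With Selenium, the condition would be the staleness check on the old `<html>` element.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll condition() until it is truthy or `timeout` seconds have passed.

    Returns True if the condition was met, False if the wait timed out.
    """
    deadline = time.time() + timeout
    while True:
        if condition():
            return True
        if time.time() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for "the page has finished loading": true from the third check on.
calls = []


def ready_on_third_call():
    calls.append(1)
    return len(calls) >= 3


print(wait_until(ready_on_third_call, timeout=2.0, interval=0.01))
```

Returning a boolean instead of printing a timeout message lets the caller decide what to do with a page that never finished loading, for example skip it or try again.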

This is the final working code:

import codecs 
import translitcodec 
import requests 
from bs4 import BeautifulSoup 
from selenium import webdriver 
import time 
from selenium.webdriver.remote.webelement import WebElement 
from selenium.common.exceptions import StaleElementReferenceException 

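def waitForLoad(driver):
    # Keep a reference to the <html> element of the page as first loaded.
    elem = driver.find_element_by_tag_name("html")
    count = 0
    while True:
        count += 1
        # 20 checks, half a second apart, give a 10-second timeout.
        if count > 20:
            print("Timing out after 10 seconds and returning")
            return
        time.sleep(.5)
        try:
            # If the original element has gone stale (the page was replaced),
            # StaleElementReferenceException is raised here and we return early.
            elem == driver.find_element_by_tag_name("html")
        except StaleElementReferenceException:
            return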

def get_text(url):
    driver = webdriver.PhantomJS(executable_path='PATH TO phantomjs')
    try:
        driver.get(url)
        waitForLoad(driver)
        html = driver.page_source
    finally:
        # shut down the PhantomJS process, even if loading the page fails
        driver.quit()
    soup = BeautifulSoup(html, "lxml")
    # delete unwanted tags: 
    for s in soup(['h2', 'figure', 'script', 'style', 'table']): 
     s.decompose() 
    # use separator to separate paragraphs and subtitles! 
    article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all('div', {'class': 'content-list-component text'})]  
    text = ' '.join(article_soup) 
    text = codecs.encode(text, 'translit/one').encode('ascii', 'replace') # transliterate, then replace any remaining non-ASCII characters
    text = u"{}".format(text) # convert back to a unicode string
    print text 
    return text 

urls = ['http://www.huffingtonpost.co.uk/2017/06/21/damian-green-tories-housing-education_n_17244280.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/the-waugh-zone-thursday-june-22-2017_n_17253136.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/argos-toys-christmas-2017_n_17248026.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/ore-oduba-strictly-come-dancing-joanne-clifton_n_17253186.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/joanne-clifton-flashdance-strictly-come-dancing_n_17253268.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/grenfell-tower-cladding-may-have-released-hydrogen-cyanide_n_17252776.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/uk-will-have-to-trawl-through-19000-eu-laws-to-decide-which-ones-to-keep-after-brexit_n_17242732.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/queens-speech-2017-jeremy-corbyn-theresa-may_n_17241446.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/piers-morgan-good-morning-britain-bbc-breakfast-dan-walker-ratings_n_17252222.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/worst-bridezilla-stories-ever-reddit_n_.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/donald-trump-uk-state-visit-shelved-after-no-mention-in-queens-speech-2017_n_17239686.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/failure-may-state_n_17242710.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/queens-speech-2017-13-things-missing-from-theresa-mays-first-one_n_17239692.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/heartbroken-best-man-gatecrashes-bride-and-grooms-wedding-photos-and-its-comedy-gold_n_17253104.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/queens-speech-2017-jeremy-corbyn-mocks-theresa-mays-imploding-minority-government_n_17242692.html?utm_hp_ref=uk', 
'http://www.huffingtonpost.co.uk/2017/06/22/asda-the-little-mermaid-swimsuit-topless_n_17253262.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/20/chaotic-brexit-theresa-may_n_17248024.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/the-waugh-zone-special-queens-speech-2017_n_17246444.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/grenfell-tower-residents-to-be-rehoused-in-luxury-kensington-row-flats_n_17242518.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/gin-does-not-help-relieve-hay-fever-experts-say_n_17243102.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/20/theresa-may-savoy_n_17227558.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/crewe-crane-collapse_n_17243884.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/rebecca-burger-french-fitness-blogger-killed-by-exploding-cream-dispenser_n_17253286.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/05/31/the-waugh-zone-may-31-201_0_n_16891450.html?ir=UK+Politics', 'http://www.huffingtonpost.co.uk/2017/06/22/theresa-may-reveals-tests-show-other-towers-combustible-following-grenfell-tower-fire_n_17253204.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/owen-jones-gleefully-brands-daily-mail-an-open-sewer_n_17253464.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/laura-kenny-interview-ambition-after-pregnancy_n_17252498.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/boris-johnson-radio-4-eddie-mair-two-ronnies_n_17245044.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/grenfell-tower-homes-theresa-may_n_17246764.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/dup-pushover-deal_n_17253218.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/khan-remain-rights_n_17243656.html?utm_hp_ref=uk', 
'http://www.huffingtonpost.co.uk/2017/06/21/love-island-zara-holland-sex-miss-great-britain_n_17242768.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/man-sent-home-from-work-wearing-shorts_n_17243276.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/courteney-cox-fillers-surgery-face_n_17252410.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/jeremy-corbyn-observed-protocol-by-not-bowing-to-the-queen_n_17240658.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/alexandra-shulman-british-vogue-good-morning-britain-the-queen_n_17253200.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/teaching-excellence-framework-results-universities-gold-ranking_n_17253426.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/prince-harry-slams-decision-to-make-him-walk-behind-princess-dianas-coffin_n_17253188.html?utm_hp_ref=uk'] 
for url in urls: 
    print url 
    text = get_text(url)