2017-03-31 47 views

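Scraping a forum with Beautiful Soup - how do I exclude quoted replies?

This is my first time using Beautiful Soup or doing any web scraping. So far I have been pretty happy with it, but I have hit a bit of a roadblock.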

I want to grab all of the posts in a particular thread. However, I want to exclude the text from quoted replies.

An example:

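I want to scrape the text from these posts without scraping the text inside the area marked by the red box.

In the HTML, the part I want to exclude sits inside the part I need to select for the message, which is why I am having difficulty. I have included a screenshot of the HTML: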

HTML image

<div id="post_message_39096267"><!-- google_ad_section_start --><div style="margin:20px; margin-top:5px; "> 
<div class="smallfont" style="margin-bottom:2px">Quote:</div> 
<table cellpadding="6" cellspacing="0" border="0" width="100%"> 
<tbody><tr> 
    <td class="alt2" style="border:1px inset"> 

      <div> 
       Originally Posted by <strong>SAAN</strong> 
       <a href="http://www.city-data.com/forum/economics/2056372-minimum-wage-vs-liveable-wage-post33645660.html#post33645660" rel="nofollow"><img class="inlineimg li fs-viewpost" src="http://pics3.city-data.com/trn.gif" border="0" alt="View Post" title="View Post"></a> 
      </div> 
      <div style="font-style:italic">I agree with trying to buy a 
cheap car outright, the problem is everyone I know that has done that $2- 
5000 car, always ended up with these huge repair bills that are equivalent 
to car payments. Most cars after 100K will need all sort of regulatr 
maintance that is easily a $200 repair to go along with anything that may 
break which is common with cars as they age.<br> 
<br> 
I have a 2yr old im making payments on and 14yr old car that is paid off, 
but needs $2000 in maintenance. When car shopping this summer, I saw many 
cars i could buy outright, but after adding u everything needed to make sure 
it needs nothing, your back into the price range of a car payment.</div> 

    </td> 
</tr> 
</tbody></table> 
</div>Depends on how long the car loan would be stretched. Just because you 
can get an 8 year loan and reduce payments to a level like the repairs on 
your old car doesn't make it a good idea, especially for new cars that <a 
href="/knowledge/Depreciation.html" title="View 'depreciate' definition from 
Wikipedia" class="knldlink" rel="nofollow">depreciate</a> quickly. You'd 
just be putting yourself into negative equity territory.<!-- 
google_ad_section_end --></div> 
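
To make the nesting concrete, here is a minimal sketch using a trimmed, hypothetical stand-in for the markup above (the id post_message_1 and the shortened text are assumptions made for illustration, and "html.parser" is simply the parser bundled with Python). It shows that calling get_text() on the post div returns the quoted text together with the reply:

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical stand-in for the post markup shown above.
SAMPLE_HTML = """
<div id="post_message_1">
  <div style="margin:20px">
    <div class="smallfont">Quote:</div>
    <table><tr><td class="alt2">
      <div>Originally Posted by <strong>SAAN</strong></div>
      <div style="font-style:italic">I agree with trying to buy a cheap car outright.</div>
    </td></tr></table>
  </div>Depends on how long the car loan would be stretched.
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
post = soup.find("div", id="post_message_1")

# get_text() walks every descendant of the post div,
# so the quoted reply comes along with the actual reply.
full_text = post.get_text(" ", strip=True)
print(full_text)
```

The reply text is a direct child of the post div, while the quote lives in a nested div, so the two can only be separated by removing that nested div before reading the text.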

I have included my code below. Hopefully it helps you understand what I am talking about.

from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3; the original used urllib2 (Python 2)


num_pages = 101
page_range = range(1, num_pages + 1)
clean_posts = []

for page in page_range:
    print("Reading page:", page, "...")
    # The first page has no page-number suffix in its URL.
    if page == 1:
        page_url = urlopen('http://www.city-data.com/forum/economics/2056372-minimum-wage-vs-liveable-wage.html')
    else:
        page_url = urlopen('http://www.city-data.com/forum/economics/2056372-minimum-wage-vs-liveable-wage' + '-' + str(page) + '.html')

    soup = BeautifulSoup(page_url, "html.parser")

    # Each post body lives in a div whose id starts with "post_message_".
    postData = soup.find_all("div", id=lambda value: value and value.startswith("post_message_"))

    posts = []
    for post in postData:
        # get_text() also returns the quoted reply, which is the problem described above.
        posts.append(post.get_text().strip().replace("\t", ""))

    posts_stripped = [x.replace("\n", "") for x in posts]

    clean_posts.append(posts_stripped)

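Finally, I would be hugely grateful if you could give me a code example that works and explain what it does as if I were literally 9 years old!

Cheers, Diarmaid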

Answers

1

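Check whether your post_message_ div has another div inside it (the quote div). If it does, extract it. Then append the text of the original div (post_message_) to your list. Replace your "for post in postData" loop with this: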

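posts = []
for post in postData:
    # The first nested div, if there is one, is the quote wrapper.
    hasQuote = post.find("div")
    if hasQuote is not None:
        # extract() removes the quote (and everything inside it) from the post.
        hasQuote.extract()
    posts.append(post.get_text(strip=True))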
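
For anyone who wants to try this without hitting the forum, the following is a self-contained sketch of the same idea run against a small inline sample. The sample HTML, the ids post_message_1 and post_message_2, and the helper name posts_without_quotes are all made up for illustration; only find_all, find, extract and get_text come from Beautiful Soup itself.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one reply that quotes someone, one that does not.
SAMPLE_HTML = """
<div id="post_message_1">
  <div style="margin:20px">
    <div class="smallfont">Quote:</div>
    <table><tr><td class="alt2">
      <div>Originally Posted by <strong>SAAN</strong></div>
      <div style="font-style:italic">I agree with trying to buy a cheap car outright.</div>
    </td></tr></table>
  </div>Depends on how long the car loan would be stretched.
</div>
<div id="post_message_2">No quote in this one.</div>
"""


def posts_without_quotes(html):
    """Return the text of each post with its first nested div removed."""
    soup = BeautifulSoup(html, "html.parser")
    post_divs = soup.find_all(
        "div", id=lambda value: value and value.startswith("post_message_")
    )
    posts = []
    for post in post_divs:
        quote = post.find("div")  # first nested div, assumed to be the quote
        if quote is not None:
            quote.extract()  # detach the quote and everything inside it
        posts.append(post.get_text(strip=True))
    return posts


print(posts_without_quotes(SAMPLE_HTML))
```

One caveat: post.find("div") removes the first nested div whatever it contains. That matches the markup in the question, where the quote wrapper is the only div inside a post, but it is an assumption about this forum's HTML rather than a general rule.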

Yes! Thank you!!!!