2016-12-24 60 views
-1

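Extracting text from a website using BeautifulSoup

I just want to get the text of any web page's content. I am using BeautifulSoup to do this.

I wrote a function like the one below: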

from bs4 import BeautifulSoup

def textClean(text):
    """Print the raw HTML, then the same text with the HTML tags stripped."""
    # Name the parser explicitly so the result does not depend on what is installed
    souptext = BeautifulSoup(text, "html.parser")
    print(text)
    print(souptext.get_text())

This prints the raw HTML source and then the extracted text.

But here is a sample of the output I get:

HTML output (first print statement):

<p><img style="float:right;" src="http://static4.businessinsider.com/image/56eb68e791058427008b72e5-907-680/5550538407_c22babffba_b.jpg" alt="radar" data-mce-source="US Navy" data-mce-caption="Mineman Seaman Charles Bryan watches for contacts on the SPA 256 radar while on watch in the Combat Directive Center aboard the mine countermeasures ship USS Ardent (MCM 12)." data-link="https://www.flickr.com/photos/usnavy/5550538407/in/photolist-9stXG4-e6i1uU-e6i1tE-dLSiBQ-c9jmg7-f5LbtS-r9jw69-efvjaN-duNiV6-efpeEP-eW8Dg9-q1nZiQ-en2osX-duNiTa-njkj3s-eep3Mb-kUdU5g-9d7u4E-eeoYiC-fr2CuX-axHdte-fsVD3D-drHPeJ-9rAVac-cnMSiW-9vVcbN-enB31b-f23pKF-aBjveY-9rEhwY-9u6GZy-9rDT9L-bojAAh-9uiNiU-9AJSrB-9rFxwQ-bjkanD-aefpN9-ea2WB2-ea2WyR-a1tUoa-9rAUXZ-ea8Bf9-9Wm3Z8-9rNE7o-enB1YY-9rAUHX-ea2WpF-aNR7eD-9NX2pq" /><span class="source">US Navy</span></p><p>The United States has seen Chinese activity around a reef that China seized from the Philippines nearly four years ago that could be a precursor to more land reclamation in the disputed South China Sea, the U.S. Navy chief said on Thursday.</p>

Text output (second print statement):

US NavyThe United States has seen Chinese activity around a reef that China seized from the Philippines nearly four years ago that could be a precursor to more land reclamation in the disputed South China Sea, the U.S. Navy chief said on Thursday. 

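If you look at

<span class="source">US Navy</span></p>

the text between those tags is extracted as well, which I don't want. If you look at the original article (link below), that text is not part of the article body. I know get_text() returns all the text, so I would like a simple solution that extracts the text between paragraph tags but excludes span tags, because I don't consider the text in span tags to be part of the original text.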

Here is the link to the article I am using.

enter link description here

EDIT1:

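The output I get looks like this: every column is converted to Unicode.

Here is the map function I wrote. It is applied to each record of the Spark DataFrame and cleans the HTML tags from the DataFrame's 'desc' column.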

from bs4 import BeautifulSoup

def htmlParsing(x):
    """Return (id, title, text) for the first text-only <p> in the row's 'desc' HTML."""
    row = x.asDict()
    souptext = BeautifulSoup(row['desc'], "html.parser")
    for p in souptext.find_all('p'):
        # p.string is None when the <p> holds anything besides a single text node
        if p.string:
            return (int(row['id']), row['title'], p.string)


sdf_cleaned = sdf_rss.map(htmlParsing)

sdf_cleaned.take(4)

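[(-33753621, u'Royal Bank of Scotland is testing a robot that could solve your banking problems (RBS)', u'If you hate dealing with bank tellers or customer service representatives, then'),
 (-761323061, u'Teen sexting is prompting an overhaul of child pornography laws', u'Rampant teen sexting has left politicians and law enforcement agencies around the country struggling to find some legal middle ground between prosecuting students for child pornography and letting them off the hook.'),
 (1405376555, u'After further scrutiny, China has begun building a new project in the South China Sea', u'The United States has seen Chinese activity around a reef that China seized from the Philippines nearly four years ago that could be a precursor to more land reclamation in the disputed South China Sea, the U.S. Navy chief said on Thursday.'),
 (-1882022821, u'Ingition lock laws are reducing drunk driving deaths', u'Reuters Health - States that require convicted drunk drivers to install ignition interlock devices in their cars have seen a 15 percent drop in alcohol-related crash deaths compared with states without such requirements, a study suggests.')]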

Answers

0
import requests
import bs4

r = requests.get('http://www.businessinsider.com/r-exclusive-us-sees-new-chinese-activity-around-south-china-sea-shoal-2016-3')
soup = bs4.BeautifulSoup(r.text, 'lxml')

# Print only the paragraphs whose sole content is text
p_tags = soup.find_all('p')
for p in p_tags:
    if p.string:
        print(p.string)

.string

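If a tag has only one child, and that child is a NavigableString, the child is made available as .string.

If a tag contains more than one thing, then it is not clear what .string should refer to, so .string is defined to be None.

So .string returns text only for p tags that contain nothing but text.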
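
To make the rule concrete, here is a small sketch of how .string behaves on three kinds of paragraph. The markup is a made-up sample rather than the article's HTML, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the article.
html = (
    "<p>Only text here.</p>"
    "<p><b>Bold</b> and plain text.</p>"
    "<p><span>Nested</span></p>"
)
soup = BeautifulSoup(html, "html.parser")

only_text, mixed, nested = soup.find_all("p")

only_text_string = only_text.string  # one NavigableString child: its text
mixed_string = mixed.string          # two children (tag + text): None
nested_string = nested.string        # one child tag: .string looks inside it

print(only_text_string)
print(mixed_string)
print(nested_string)
```

The caption paragraph in the question has two children (the img and the span), so its .string is None and the loop skips it.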

Out:

The United States has seen Chinese activity around a reef that 
    China seized from the Philippines nearly four years ago that 
    could be a precursor to more land reclamation in the disputed 
    South China Sea, the U.S. Navy chief said on Thursday. 


    The head of U.S. naval operations, Admiral John Richardson, 
    expressed concern that an international court ruling expected in 
    coming weeks on a case brought by the Philippines against China 
    over its South China Sea claims could be a trigger for Beijing to 
    declare an exclusion zone in the busy trade route. 


    Richardson told Reuters the United States was weighing responses 
    to such a move. 


    He said the U.S. military had seen Chinese activity around 
    Scarborough Shoal in the northern part of the Spratly 
    archipelago, about 125 miles (200 km) west of the Philippine base 
    of Subic Bay. 


    "I think we see some surface ship activity and those sorts of 
    things, survey type of activity, going on. That's an area of 
    concern ... a next possible area of reclamation," he said. 


    Richardson said it was unclear if the activity near the reef, 
    which China seized in 2012, was related to the pending 
    arbitration decision. 


    He said China's pursuit of South China Sea territory, which has 
    included massive land reclamation to create artificial islands 
    elsewhere in the Spratlys, threatened to reverse decades of open 
    access and introduce new "rules" that required countries to 
    obtain permission before transiting those waters. 


    He said that was a worry given that 30 percent of the world's 
    trade passes through the region. 


    Asked whether China could respond to the ruling by the court of 
    arbitration in The Hague by declaring an air defense 
    identification zone, or ADIZ, as it did farther north in the East 
    China Sea in 2013, Richardson said: "It's definitely a concern." 


    "We will just have to see what happens," he said. "We think about 
    contingencies and … responses." 


    Richardson said the United States planned to continue carrying 
    out freedom-of-navigation exercises within 12 nautical miles of 
    disputed South China Sea geographical features to underscore its 
    concerns about keeping sea lanes in the region open. 


    The United States responded to the East China Sea ADIZ by flying 
    B-52 bombers through the zone in a show of force in November 
    2013. 


    Richardson said he was struck by how China's increasing 
    militarization of the South China Sea had increased the 
    willingness of other countries in the region to work together, 
    not just bilaterally, but also multilaterally. 


    India and Japan joined the U.S. Navy in the Malabar naval 
    exercise since 2014, and were slated to take part again this year 
    in an even more complex exercise that will take place in an area 
    close to the East and South China Seas. 


    South Korea, Japan and the United States were also working 
    together more closely than ever before, he said. 


    Richardson said the United States would welcome the participation 
    of other countries in joint patrols with the United States in the 
    South China Sea, but those decisions needed to be made by the 
    countries in question. 


    He said the U.S. military saw good opportunities to build and 
    rebuild relationships with countries such as Vietnam, the 
    Philippines and India, which have all realized the importance of 
    safeguarding the freedom of the seas. 


    He cited India's recent hosting of an international fleet review 
    that included 75 ships from 50 navies, and said the United States 
    was exploring opportunities to increase its use of ports in the 
    Philippines and Vietnam, among others - including the former U.S. 
    naval base at Vietnam's Cam Ranh Bay. 


    But he said Washington needed to proceed judiciously rather than 
    charging in "very fast and very heavy," given the enormous 
    influence and importance of the Chinese economy in the region. 


    "We have to be sophisticated in how we approach this so that we 
    don't force any of our partners into an uncomfortable position 
    where they have to make tradeoffs that are not in their best 
    interest," he said. 


    "We would hope to have an approach that would ... include us a 
    primary partner but not necessarily to the exclusion of other 
    partners in the region," he said. 

The United States has seen Chinese activity... 
5 innovations in radiology that could impact everything from the Zika virus to dermatology 
Keep tabs on the latest from Business Insider in our new Chrome Extension 
Available on iOS or Android 
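
If the paragraphs need to be stored rather than printed, the same filter can build a list instead. This is only a sketch: `paragraph_strings` and the shortened sample markup are hypothetical names and data, and `str()` is used to turn each NavigableString into a plain string.

```python
from bs4 import BeautifulSoup

def paragraph_strings(html):
    """Return the text of every <p> whose only content is text."""
    soup = BeautifulSoup(html, "html.parser")
    # str() converts the NavigableString into an ordinary string
    return [str(p.string).strip() for p in soup.find_all("p") if p.string]

# Hypothetical, shortened sample in the shape of the article's markup.
sample = (
    '<p><img src="radar.jpg" /><span class="source">US Navy</span></p>'
    "<p>The United States has seen Chinese activity around a reef.</p>"
    "<p>Richardson told Reuters the United States was weighing responses.</p>"
)

paragraphs = paragraph_strings(sample)
print(paragraphs)
```

In Python 2, a `u'...'` prefix on such values is just how unicode strings are displayed inside a list or tuple; the stored text itself does not contain the `u`.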
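
This is a good answer. But I don't want to print the strings. I want to save them as a dataset. However, when I save the result back, the unicode 'u' prefix is added to it and it is not a plain string. How do I get rid of that? – Baktaawar

Could you add the code you use to save the data to the question? –

Check the edit. – Baktaawar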

0

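As you noticed, get_text() walks through all the tags and returns whatever text it finds in them.

You need to target your tags with something like this.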

from bs4 import BeautifulSoup 

html = ''' 
<p> 
    <img style="float:right;" src="http://static4.businessinsider.com/image/56eb68e791058427008b72e5-907-680/5550538407_c22babffba_b.jpg" alt="radar" data-mce-source="US Navy" data-mce-caption="Mineman Seaman Charles Bryan watches for contacts on the SPA 256 radar while on watch in the Combat Directive Center aboard the mine countermeasures ship USS Ardent (MCM 12)." data-link="https://www.flickr.com/photos/usnavy/5550538407/in/photolist-9stXG4-e6i1uU-e6i1tE-dLSiBQ-c9jmg7-f5LbtS-r9jw69-efvjaN-duNiV6-efpeEP-eW8Dg9-q1nZiQ-en2osX-duNiTa-njkj3s-eep3Mb-kUdU5g-9d7u4E-eeoYiC-fr2CuX-axHdte-fsVD3D-drHPeJ-9rAVac-cnMSiW-9vVcbN-enB31b-f23pKF-aBjveY-9rEhwY-9u6GZy-9rDT9L-bojAAh-9uiNiU-9AJSrB-9rFxwQ-bjkanD-aefpN9-ea2WB2-ea2WyR-a1tUoa-9rAUXZ-ea8Bf9-9Wm3Z8-9rNE7o-enB1YY-9rAUHX-ea2WpF-aNR7eD-9NX2pq" /> 
    <span class="source">US Navy</span> 
</p> 
<p> 
    The United States has seen Chinese activity around a reef that China seized from the Philippines nearly four years ago that could be a precursor to more land reclamation in the disputed South China Sea, the U.S. Navy chief said on Thursday. 
</p>''' 

soup = BeautifulSoup(html, "html.parser") 

# The soup object above is named `soup`; the second <p> holds the article text
print(soup.find_all('p')[1].get_text())
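
Indexing one paragraph does not carry over to a whole page. One possible way to keep every paragraph while dropping the caption is to remove the span tags before calling get_text(). The sketch below assumes that all unwanted text lives in span tags; `paragraphs_without_spans` and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup

def paragraphs_without_spans(html):
    """Return the text of each <p>, ignoring anything inside <span> tags."""
    soup = BeautifulSoup(html, "html.parser")
    # decompose() removes the tag and its contents from the tree
    for span in soup.find_all("span"):
        span.decompose()
    texts = (p.get_text().strip() for p in soup.find_all("p"))
    # Skip paragraphs left empty once their spans are gone
    return [text for text in texts if text]

# Hypothetical, shortened sample in the shape of the article's markup.
sample = (
    '<p><img src="radar.jpg" /><span class="source">US Navy</span></p>'
    "<p>The United States has seen <a href='#'>Chinese activity</a> around a reef.</p>"
)

result = paragraphs_without_spans(sample)
print(result)
```

Unlike the .string filter, this keeps paragraphs that contain links or other inline tags, at the cost of discarding every span on the page.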

Your code only gives p[1]. For p[0] it would also print US Navy, which is not what I want. – Baktaawar