2015-10-06

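Web scraping: remove a word if it is in the first 20 characters of a document?

I scraped a bunch of speeches from http://www.millercenter.org. The speeches are trimmed and formatted exactly the way I want, except for one small piece. Every document (all 911 of them) has the word 'transcript' at the beginning, and I don't want it in the documents because I am moving on to some NLP. I have not been able to remove it, and I have tried both the replace and remove methods. I even tried extending my find call to reach the part of the HTML at the start of each document that reads: <h2>Transcript</h2>

Here is a sample of what I am seeing in each document: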

transcript 
to the senate and house of representatives 
i lay before congress several dispatches from his 

transcript 
the period for a new election of a citizen to administer the executive government 

Here is my code:

import urllib2,sys,os 
from bs4 import BeautifulSoup,NavigableString 
from string import punctuation as p 
from multiprocessing import Pool 
import re, nltk 
import requests 
reload(sys) 

chester_url = 'http://millercenter.org/president/arthur/speeches/speech-3752' 
chester_3752 = urllib2.urlopen(chester_url).read() 
chester_3752 = BeautifulSoup(chester_3752) 

# find the speech itself within the HTML 
chester_3752 = chester_3752.find('div',{'id': 'transcript'},{'class': 'displaytext'}) 

# removes extraneous characters (e.g. '<br/>') 
chester_3752 = chester_3752.text.lower() 

# for further text analysis, remove punctuation 
punctuation = re.compile('[{}]+'.format(re.escape(p))) 

chester_3752 = punctuation.sub('', chester_3752) 
chester_3752 = chester_3752.replace('—',' ') 
chester_3752 = chester_3752.replace('transcript','') 

Like I said, that last replace call doesn't seem to be working. Thoughts?

Does the string always start with 'transcript'? – pelumi

Answers

1

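I'm not sure what your problem is, but when I ran it with Python 3.4 and bs4, it removed 'transcript' along with a bunch of punctuation. (I took out several of the imports and changed urllib2 to urllib.request.)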

import re
import urllib.request
from string import punctuation as p

from bs4 import BeautifulSoup

chester_url = 'http://millercenter.org/president/arthur/speeches/speech-3752' 
chester_3752 = urllib.request.urlopen(chester_url).read() 
chester_3752 = BeautifulSoup(chester_3752) 

# find the speech itself within the HTML 
chester_3752 = chester_3752.find('div',{'id': 'transcript'},{'class': 'displaytext'}) 

# removes extraneous characters (e.g. '<br/>') 
chester_3752 = chester_3752.text.lower() 

# for further text analysis, remove punctuation 
punctuation = re.compile('[{}]+'.format(re.escape(p))) 

chester_3752 = punctuation.sub('', chester_3752) 
chester_3752 = chester_3752.replace('—',' ') 
chester_3752 = chester_3752.replace('transcript','') 

print(chester_3752) 

Could the fact that I am running Python 2.7 make a difference? – blacksite

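They are different, so it's possible, but it is odd that 'chester_3752 = chester_3752.replace('—', ' ')' works and 'chester_3752 = chester_3752.replace('transcript', '')' doesn't. Another thing you might try is adding one more line after the last one, since it seems strange that only the last line is not being executed. – dstudeba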

1

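I have tried your code and it works fine, but there is one small tweak I would recommend. Instead of using replace, use startswith to make sure the string really does begin with transcript. replace removes every occurrence of transcript from the whole string, whereas what you actually need is to remove it only when it sits at the start of the string.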

import urllib2 
import sys 
from string import punctuation as p 
import re 
from bs4 import BeautifulSoup  # needed for the BeautifulSoup(...) call below 

reload(sys) 

chester_url = 'http://millercenter.org/president/arthur/speeches/speech-3752' 
chester_3752 = urllib2.urlopen(chester_url).read() 
chester_3752 = BeautifulSoup(chester_3752) 

# find the speech itself within the HTML 
chester_3752 = chester_3752.find('div',{'id': 'transcript'},{'class': 'displaytext'}) 

# removes extraneous characters (e.g. '<br/>') 
chester_3752 = chester_3752.text.lower() 

# for further text analysis, remove punctuation 
punctuation = re.compile('[{}]+'.format(re.escape(p))) 

chester_3752 = punctuation.sub('', chester_3752) 
chester_3752 = chester_3752.replace('-',' ') 
print(chester_3752) 

# chester_3752 = chester_3752.replace('transcript','') #avoid this as it will delete all instances of transcript in the string 

if chester_3752.startswith("transcript"): #this ensures only transcript at the beginning of the string is deleted which is what you want 
    chester_3752 = chester_3752[10:].strip() 
print chester_3752 
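
The difference between replace and a start-only removal can be sketched with the standard library alone. This is a minimal illustration, not the code from the thread: the helper name `strip_leading_word` and the sample string are hypothetical, and it assumes the word may be preceded by whitespace (a newline or a non-breaking space), in which case a plain startswith check would not match.

```python
import re


def strip_leading_word(text, word="transcript"):
    """Hypothetical helper: remove `word` only if it is the first
    non-whitespace token of `text`; later occurrences are kept."""
    # ^\s* skips leading whitespace (re.UNICODE makes \s cover u'\xa0' too),
    # \b stops the match from eating the start of a longer word.
    pattern = re.compile(r"^\s*" + re.escape(word) + r"\b\s*", re.UNICODE)
    return pattern.sub("", text, count=1)


# Made-up sample with a leading newline and non-breaking space.
sample = u"\n\xa0transcript \nto the senate and house of representatives: the transcript is attached"
cleaned = strip_leading_word(sample)
print(cleaned)
```

Unlike `replace('transcript', '')`, this leaves the second "transcript" in the sample untouched, and a document that does not begin with the word is returned unchanged.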

When I run the program, I get an error at the if statement: 'UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 61344: ordinal not in range(128)' – blacksite

The line it is complaining about is 'chester_3752 = chester_3752.replace('-', ' ')', not the one that removes 'transcript' from the text. – pelumi

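I added '.encode('utf-8')', and that fixed the problem. But it still does not remove 'transcript' for me. I don't believe there are any other characters before 'transcript', so it is not as though we are missing something in front of the word. – blacksite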
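
The error message quoted in the comments shows that the scraped text contains u'\xa0', a non-breaking space. One way to check whether an invisible character like that, or a leading newline, sits in front of 'transcript' is to print the repr of the first 20 characters. The sketch below is only an assumption about what the text might look like; the sample string and the helper `normalize_start` are hypothetical, and nothing in the thread confirms this is the cause.

```python
def normalize_start(text):
    """Hypothetical helper: turn non-breaking spaces into plain spaces,
    then trim whitespace from both ends."""
    return text.replace(u"\xa0", u" ").strip()


# Made-up sample: looks like it starts with 'transcript' when printed.
raw = u"\xa0\ntranscript \nto the senate and house of representatives"

# repr makes hidden characters such as \xa0 and \n visible.
print(repr(raw[:20]))

print(raw.startswith(u"transcript"))    # False: hidden characters come first

clean = normalize_start(raw)
print(clean.startswith(u"transcript"))  # True once the start is normalized
```

If the repr output begins directly with 'transcript', hidden characters are not the problem and the cause lies elsewhere.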
