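I scraped a bunch of speeches from http://www.millercenter.org. The speeches are all trimmed and formatted the way I want, except for one small thing. Every document (all 911 of them) has the word 'transcript' at the beginning, and I don't want it there because I'm moving on to some NLP. I have not been able to remove it, and I have tried both the replace and remove methods. I even tried targeting the piece of HTML that appears at the start of each document, <h2>Transcript</h2>, by extending my find call.
Web scraping: remove a word if it appears in the first 20 characters of a document?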
Here is a sample of what I see, document-wise:
transcript
to the senate and house of representatives
i lay before congress several dispatches from his
and
transcript
the period for a new election of a citizen to administer the executive government
Here is my code:
import urllib2,sys,os
from bs4 import BeautifulSoup,NavigableString
from string import punctuation as p
from multiprocessing import Pool
import re, nltk
import requests
reload(sys)
chester_url = 'http://millercenter.org/president/arthur/speeches/speech-3752'
chester_3752 = urllib2.urlopen(chester_url).read()
chester_3752 = BeautifulSoup(chester_3752)
# find the speech itself within the HTML
chester_3752 = chester_3752.find('div',{'id': 'transcript'},{'class': 'displaytext'})
# removes extraneous characters (e.g. '<br/>')
chester_3752 = chester_3752.text.lower()
# for further text analysis, remove punctuation
punctuation = re.compile('[{}]+'.format(re.escape(p)))
chester_3752 = punctuation.sub('', chester_3752)
chester_3752 = chester_3752.replace('—',' ')
chester_3752 = chester_3752.replace('transcript','')
Like I said, that final replace call doesn't seem to work. Thoughts?
Does the string always start with 'transcript'? – pelumi
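For what the title asks (drop the word only when it sits near the start of the document), a minimal sketch could look like the following. It works on the already-extracted text rather than on the HTML, and it does not diagnose why the replace call in the question appeared to have no effect. The function name strip_leading_word and its word and window parameters are hypothetical, chosen only for illustration.

```python
import re


def strip_leading_word(text, word="transcript", window=20):
    """Remove the first occurrence of `word`, but only if it starts within
    the first `window` characters of `text`.

    Hypothetical helper for illustration; not part of the original script.
    """
    # Whole-word, case-insensitive match so 'Transcript' is caught as well.
    pattern = re.compile(r"\b{}\b".format(re.escape(word)), re.IGNORECASE)
    match = pattern.search(text)

    # Leave the text alone if the word is absent or first appears later on.
    if match is None or match.start() >= window:
        return text

    # Cut the word out, then drop the leading whitespace left behind.
    return (text[:match.start()] + text[match.end():]).lstrip()


sample = "transcript\nto the senate and house of representatives"
cleaned = strip_leading_word(sample)
```

Unlike a plain replace, this leaves any later use of the word inside the speech itself untouched, which may matter if the text is going into NLP afterwards.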