Web scraping: how do I remove a word if it appears in the first 20 characters of a document?

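I scraped a bunch of speeches from http://www.millercenter.org. The speeches are cleaned up and formatted exactly the way I want, except for one small thing. Every document (all 911 of them) has the word 'transcript' at the very beginning, and I don't want it in the documents because I'm going on to do some NLP with them. I haven't been able to remove it, and I've tried both the replace and remove methods. I even tried extending my find call to go through the part of the HTML that says <h2>Transcript</h2> at the start of every document.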

Here is a sample of what I see, document-wise:

transcript 
to the senate and house of representatives 
i lay before congress several dispatches from his

and

transcript 
the period for a new election of a citizen to administer the executive government

Here is my code:

import urllib2,sys,os 
from bs4 import BeautifulSoup,NavigableString 
from string import punctuation as p 
from multiprocessing import Pool 
import re, nltk 
import requests 
reload(sys) 

chester_url = 'http://millercenter.org/president/arthur/speeches/speech-3752' 
chester_3752 = urllib2.urlopen(chester_url).read() 
chester_3752 = BeautifulSoup(chester_3752) 

# find the speech itself within the HTML 
chester_3752 = chester_3752.find('div',{'id': 'transcript'},{'class': 'displaytext'}) 

# removes extraneous characters (e.g. '<br/>') 
chester_3752 = chester_3752.text.lower() 

# for further text analysis, remove punctuation 
punctuation = re.compile('[{}]+'.format(re.escape(p))) 

chester_3752 = punctuation.sub('', chester_3752) 
chester_3752 = chester_3752.replace('—',' ') 
chester_3752 = chester_3752.replace('transcript','')

As I said, that last replace call doesn't seem to work. Thoughts?

2015-10-06 blacksite

Does the string always start with 'transcript'? – pelumi

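I'm not sure what your problem is, but when I ran this with Python 3.4 and bs4, it removed 'transcript' along with a bunch of punctuation. (I took out a bunch of the imports and changed urllib2 to urllib.request.)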

import urllib.request 
import re 
from string import punctuation as p 

from bs4 import BeautifulSoup 

chester_url = 'http://millercenter.org/president/arthur/speeches/speech-3752' 
chester_3752 = urllib.request.urlopen(chester_url).read() 
chester_3752 = BeautifulSoup(chester_3752) 

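# find the speech itself within the HTML 
# (the third positional argument of find() is `recursive`, so the extra 
# {'class': 'displaytext'} dict never acted as a filter; only the id is matched) 
chester_3752 = chester_3752.find('div', {'id': 'transcript'}) 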

# removes extraneous characters (e.g. '<br/>') 
chester_3752 = chester_3752.text.lower() 

# for further text analysis, remove punctuation 
punctuation = re.compile('[{}]+'.format(re.escape(p))) 

chester_3752 = punctuation.sub('', chester_3752) 
chester_3752 = chester_3752.replace('—',' ') 
chester_3752 = chester_3752.replace('transcript','') 

print(chester_3752)

2015-10-06 03:24:25 dstudeba

Could it make a difference that I'm running Python 2.7? – blacksite

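They are different, so it's possible, but it's odd that chester_3752 = chester_3752.replace('-', ' ') works and chester_3752 = chester_3752.replace('transcript', '') doesn't. Another thing you might try is adding one more line after the last one, because it seems strange that only the last line isn't being executed. – dstudeba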

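I've tried your code and it works fine, but there is one slight tweak I would recommend. Instead of using replace, use startswith to make sure the string really does begin with 'transcript'. replace removes every occurrence of 'transcript' from the whole string, whereas what you actually need is to remove it only when it sits at the start of the string.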

import urllib2 
import sys 
from string import punctuation as p 
import re 

from bs4 import BeautifulSoup  # needed for the BeautifulSoup(...) call below 

reload(sys) 

chester_url = 'http://millercenter.org/president/arthur/speeches/speech-3752' 
chester_3752 = urllib2.urlopen(chester_url).read() 
chester_3752 = BeautifulSoup(chester_3752) 

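# find the speech itself within the HTML 
# (the third positional argument of find() is `recursive`, so the extra 
# {'class': 'displaytext'} dict never acted as a filter; only the id is matched) 
chester_3752 = chester_3752.find('div', {'id': 'transcript'}) 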

# removes extraneous characters (e.g. '<br/>') 
chester_3752 = chester_3752.text.lower() 

# for further text analysis, remove punctuation 
punctuation = re.compile('[{}]+'.format(re.escape(p))) 

chester_3752 = punctuation.sub('', chester_3752) 
chester_3752 = chester_3752.replace('-',' ') 
print(chester_3752) 

# chester_3752 = chester_3752.replace('transcript','') #avoid this as it will delete all instances of transcript in the string 

if chester_3752.startswith("transcript"): #this ensures only transcript at the beginning of the string is deleted which is what you want 
    chester_3752 = chester_3752[10:].strip() 
print(chester_3752)
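
One thing the startswith check is sensitive to is leading whitespace: if the extracted text begins with a newline or a space before 'transcript', the test fails and nothing is removed. Below is a minimal sketch of a more tolerant version, not tested against the live page. The helper name drop_leading_word and the 20-character window (borrowed from the question title) are assumptions for illustration and do not appear in the original code.

```python
def drop_leading_word(text, word="transcript", window=20):
    """Remove `word` only when it is the first token of `text`
    and ends within the first `window` characters."""
    # Count leading whitespace (str.lstrip() also strips \xa0 on Unicode strings).
    offset = len(text) - len(text.lstrip())
    end = offset + len(word)

    starts_with_word = text[offset:end].lower() == word.lower()
    # Avoid cutting longer words such as "transcripts".
    at_word_boundary = end == len(text) or not text[end].isalnum()

    if starts_with_word and at_word_boundary and end <= window:
        return text[end:].lstrip()
    return text


sample = "\n transcript \nto the senate and house of representatives"
print(drop_leading_word(sample))
```

Because only the prefix is touched, any later occurrence of 'transcript' inside the speech itself is left alone.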

2015-10-06 03:43:56 pelumi

When I run that, I get an error at the if statement: UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 61344: ordinal not in range(128) – blacksite

It seems the line it's complaining about is chester_3752 = chester_3752.replace('-', ' '), not the one that removes 'transcript' from the text. – pelumi

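I added .encode('utf-8') and that solved that problem. But it still doesn't remove 'transcript' for me. I don't believe there are any other characters before 'transcript', so it's not as if we're missing something in front of the word. – blacksite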
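
The u'\xa0' in the error message above is a non-breaking space, which is common in text extracted from HTML and easy to overlook because it prints like an ordinary space. A minimal sketch, assuming the text is a Unicode string, of normalizing such whitespace before checking how the document starts; the function name normalize_whitespace is hypothetical. Printing repr() of the first 20 characters is one way to see whether anything invisible precedes 'transcript'.

```python
def normalize_whitespace(text):
    """Turn non-breaking spaces into plain spaces and collapse whitespace runs."""
    text = text.replace(u"\xa0", u" ")
    # split() with no arguments splits on any run of whitespace,
    # so joining with a single space also trims both ends.
    return u" ".join(text.split())


raw = u"\xa0transcript\xa0\n to the  senate "
print(repr(raw[:20]))             # reveals hidden characters at the start
print(normalize_whitespace(raw))
```

Under Python 2, the normalized text would still need .encode('utf-8') before being printed to an ASCII console, as mentioned in the comment above.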
