Python - "Undoing" text wrapping

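I need to take a text and remove the \n characters, which I believe I have done. The next task is to remove hyphens from words where they should not appear, while leaving them in compound words where they belong. For example, 'encyclo-\npedia' should become 'encyclopedia' and 'long-\nterm' should become 'long-term'. It was suggested that I compare the result against the original text.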

# Raw string so the backslashes in the Windows path are not read as escapes
with open(r'C:\Users\Paul\Desktop\Comp_Ling_Research_1\BROWN_A1_hypenated.txt', 'r') as myfile:
    data = myfile.read().replace('\n', '')

I have a general idea, but NLP is fairly new to me.

2016-09-26 Paul Johnson

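A first pass would be to keep a set of valid words and remove the hyphen whenever the de-hyphenated word is in that set. Ubuntu has a list of valid words at /usr/share/dict/american-english. An overly simple version might look like: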

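# valid_words_file and new_file are placeholders for your own paths
with open(valid_words_file) as f:
    valid_words = set(line.strip() for line in f)

with open(new_file) as f:
    # Join words split across lines but keep the hyphen so it can be checked below
    text = f.read().replace('-\n', '-').replace('\n', ' ')

output = []
for word in text.split():
    # Drop the hyphen only when the joined form is a known word
    if '-' in word and word.replace('-', '') in valid_words:
        output.append(word.replace('-', ''))
    else:
        output.append(word)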

You will still have to deal with punctuation, capitalization, and so on, but that is the idea.
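One possible way to handle the punctuation and capitalization mentioned above is sketched here. The helper name `dehyphenate` is hypothetical, and the sketch assumes that `valid_words` was built in lower case (for example with `line.strip().lower()`), which the original snippet does not do.

```python
import string

def dehyphenate(word, valid_words):
    """Return word without hyphens if the hyphen-free form is a known word."""
    if '-' not in word:
        return word
    # Separate leading/trailing punctuation so it does not affect the lookup
    core = word.strip(string.punctuation)
    if not core:
        return word
    start = word.find(core)
    prefix, suffix = word[:start], word[start + len(core):]
    joined = core.replace('-', '')
    # Compare case-insensitively; valid_words is assumed to be lower case
    if joined.lower() in valid_words:
        return prefix + joined + suffix
    return word

# Tiny stand-in for the dictionary file, for illustration only
valid_words = {'encyclopedia', 'long', 'term'}
print(dehyphenate('Encyclo-pedia,', valid_words))
print(dehyphenate('long-term.', valid_words))
```

The surrounding punctuation and the original capitalization are put back unchanged; only the lookup is normalized.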

2016-09-26 08:43:30 hawkjo

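Thanks. I was thinking about how to de-hyphenate. Conceptually, I wrote this down: # If you have two lists or files, # for each hyphenated item in the first list, # find the same item in the second list, with or without the hyphen. # If the item in the second list has no hyphen, remove the hyphen from the item in the first list.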

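There may be a good reference for this. When I search for removing hyphens, I find simple approaches, but nothing on how to remove hyphens based on a reference list. It feels like a reverse text-wrapping process.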
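The reference-list idea described in these comments could be sketched roughly as follows. The function name `dehyphenate_with_reference` and the sample strings are made up for illustration, and the sketch assumes both texts tokenize the same way on whitespace (punctuation stays attached to the words).

```python
def dehyphenate_with_reference(wrapped_text, reference_text):
    """Resolve line-break hyphens by checking which form occurs in the reference."""
    reference_tokens = set(reference_text.split())
    # Join words broken across lines, keeping the hyphen for now
    joined_text = wrapped_text.replace('-\n', '-').replace('\n', ' ')
    output = []
    for token in joined_text.split():
        plain = token.replace('-', '')
        # Drop the hyphen only if the reference has the plain form
        # and does not have the hyphenated form
        if '-' in token and token not in reference_tokens and plain in reference_tokens:
            output.append(plain)
        else:
            output.append(token)
    return ' '.join(output)

reference = "The encyclopedia describes a long-term plan."
wrapped = "The encyclo-\npedia describes a long-\nterm plan."
print(dehyphenate_with_reference(wrapped, reference))
```

A token is left untouched when neither form appears in the reference, so a word that contains both a real hyphen and a line-break hyphen would need extra handling.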

import re with open('C:\Users\Paul\BROWN_A1.txt', 'rU') as truefile: true_corpus = truefile.read() true_tokens = true_corpus.split(' ') with open('C:\Users\Paul\Desktop\Comp_Ling_Research_1\BROWN_A1_hypenated.txt', 'rU') as myfile: my_corpus = myfile.read() my_tokens = my_corpus.split(' ')

import re

# Raw strings keep the backslashes in the Windows paths from being read as escapes
with open(r'C:\Users\Paul\BROWN_A1.txt', 'r') as truefile:
    true_corpus = truefile.read()

true_tokens = true_corpus.split(' ')

with open(r'C:\Users\Paul\Desktop\Comp_Ling_Research_1\BROWN_A1_hypenated.txt', 'r') as myfile:
    my_corpus = myfile.read()

my_tokens = my_corpus.split(' ')

2016-09-26 11:33:43

How does this solve your problem? – alexis
