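Splitting a string into chunks in Python

I wrote code that extracts words from a corpus, then tokenizes them and compares them against each sentence. The output is a Bag of Words (1 if the word is in the sentence, 0 if it is not).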
import nltk
import numpy as np
from nltk import FreqDist
from nltk.corpus import brown

news = brown.words(categories='news')
news_sents = brown.sents(categories='news')

# Vocabulary: the 100 most frequent lowercased words in the news category.
fdist = FreqDist(w.lower() for w in news)
vocabulary = [word for word, _ in fdist.most_common(100)]

num_sents = len(news_sents)
for i in range(num_sents):
    # 1 if the vocabulary word occurs in sentence i, otherwise 0.
    features = {}
    for word in vocabulary:
        features[word] = int(word in news_sents[i])
    bow = "".join(str(n) for n in features.values())
    # Append this sentence's bit string to the output file.
    with open("D:\\test\\Vector.txt", "a") as f:
        print(bow, file=f)
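In this case the output string is 100 characters long. I want to split it into chunks of arbitrary length and assign a chunk number to each chunk. For example: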
print(i+1, chunk_id, bow, sep="\t", end="\n", file=f)
where i + 1 is the sentence number. To show what I mean, let's take two strings of length 12, "110010101111" and "011011000011". The output should look like:
1 1 1100
1 2 1010
1 3 1111
2 1 0110
2 2 1100
2 3 0011
The duplicate is about lists, but the solutions work for strings too. – timgeb
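
As the comment suggests, the slicing approach usually shown for lists should carry over to strings unchanged. The following is a minimal sketch of that idea, not a definitive solution: the helper names `split_into_chunks` and `numbered_chunk_rows` are hypothetical, the chunk size of 4 is just the value from the example, and the two input strings stand in for the `bow` values built in the loop.

```python
def split_into_chunks(bits, size):
    """Return consecutive slices of `bits`, each at most `size` characters long."""
    return [bits[start:start + size] for start in range(0, len(bits), size)]


def numbered_chunk_rows(bit_strings, size):
    """Yield (sentence_number, chunk_id, chunk) tuples; both numbers start at 1."""
    for sentence_number, bits in enumerate(bit_strings, start=1):
        chunks = split_into_chunks(bits, size)
        for chunk_id, chunk in enumerate(chunks, start=1):
            yield sentence_number, chunk_id, chunk


# Illustrative input: the two 12-character strings from the question.
rows = list(numbered_chunk_rows(["110010101111", "011011000011"], 4))
for sentence_number, chunk_id, chunk in rows:
    # Tab-separated, in the same column order as the print call in the question.
    print(sentence_number, chunk_id, chunk, sep="\t")
```

If the string length is not a multiple of the chunk size, the last chunk is simply shorter. To write to the file instead of the console, the same `print` call should accept `file=f` as in the question.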