Removing punctuation from a list

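I am preparing some usable data for semantic analysis. I have a corpus of raw text data that I iterate over. I open each file, read it as a string, split the string into a list, and later build a dataset from that list in another function. However, when I build the dataset, my most common words turn out to be punctuation marks. I need to remove all punctuation from the list before processing the data any further.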

import os
import collections
import string
import sys

import tensorflow as tf
import numpy as np
from six.moves import xrange


totalvocab = []


# Open a file, convert its contents to a string, and split it into a list.
def read_data(filepath):
    with open(filepath, 'r') as f:
        data = tf.compat.as_str(f.read()).split()
    return data


# Loop through all files in the 'Data' directory.
for subdir, dirs, files in os.walk('Data'):
    for file in files:
        filepath = subdir + os.sep + file
        print(filepath)

        # Run the function on the file and add its data to the full data set.
        filevocab = read_data(filepath)
        totalvocab.extend(filevocab)

        filevocab_size = len(filevocab)
        print('File vocabulary size: %s' % filevocab_size)
        totalvocab_size = len(totalvocab)
        print('Total vocabulary size: %s' % totalvocab_size)

If I do the following instead:

def read_data(filepath): 
     with open(filepath, 'r') as f: 
      data = tf.compat.as_str(f.read()) 
      data.translate(string.punctuation) 
      data.split() 
     return data

the text gets split into individual letters. Every other approach I have tried raised an error.


2017-04-16 Sabolis

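There are a couple of errors in the code:

str.split() and str.translate() do not modify the string in place; they return a new one.
str.translate() needs a mapping (a translation table), not a plain string of characters.

To fix: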
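Both points, and the reason the vocabulary ended up as single letters, can be seen in a short sketch. The sample string and variable names here are illustrative only and are not taken from the question's data:

```python
import string

text = "Hello, world!"

# str methods return new strings; calling them without assigning
# the result leaves the original untouched.
text.translate(str.maketrans('', '', string.punctuation))
text.split()
unchanged = text  # still "Hello, world!"

# A translation table built by str.maketrans deletes every character
# listed in its third argument.
table = str.maketrans('', '', string.punctuation)
cleaned = text.translate(table)

# Extending a list with a string adds one character at a time,
# which is how single letters end up in the vocabulary.
letters = []
letters.extend(cleaned)

# Extending with the result of split() adds whole words.
words = []
words.extend(cleaned.split())
```

Because the function in the question returns the unsplit string, totalvocab.extend() receives a string rather than a list and iterates over it character by character.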

def read_data(filepath):
    with open(filepath, 'r') as f:
        data = tf.compat.as_str(f.read())
    data = data.translate(str.maketrans('', '', string.punctuation))
    return data.split()

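Deleting punctuation may or may not do what you want; for example, hyphenated words will be joined together. Alternatively, you can identify the punctuation marks that you want to replace with a space instead.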
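One possible way to do that is sketched below. It assumes you want to keep hyphens and apostrophes and turn every other punctuation mark into a space; the KEEP set and the tokenize helper are hypothetical names chosen for illustration, so adjust them to your data:

```python
import string

# Assumption: hyphens and apostrophes stay, everything else in
# string.punctuation becomes a space.
KEEP = "-'"
to_replace = ''.join(ch for ch in string.punctuation if ch not in KEEP)

# With two arguments, str.maketrans maps each character of the first
# string to the character at the same position in the second, so both
# strings must have the same length.
table = str.maketrans(to_replace, ' ' * len(to_replace))


def tokenize(text):
    """Replace the selected punctuation with spaces, then split on whitespace."""
    return text.translate(table).split()


tokens = tokenize("A well-known example: it's punctuation-free, mostly.")
slash_tokens = tokenize("rock/paper/scissors")
```

Replacing with a space rather than deleting keeps words such as "rock/paper" apart. Note that a free-standing hyphen surrounded by spaces would still survive as its own token under this choice.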


2017-04-16 05:10:42 AChampion

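Thank you so much! This is exactly what I needed. You were also right in anticipating my future needs, since I do need to keep hyphenated words in my data. How would I go about declaring which punctuation marks to replace? – Sabolis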
