2017-05-31 40 views

How can I stem every row in a CSV file?

I have a CSV file with two columns that contain sentences. For example, Test.csv:

Col[1] 
---------------------- 
This trip was amazing. 

Col[2] 
-------------------- 
The cats are playing. 

So I ran some NLP processing on it:

import codecs
import csv
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

with codecs.open('test.csv', 'r', encoding='utf-8', errors='ignore') as myfile:
    data = csv.reader(myfile, delimiter=',')
    next(data)  # skip the header row
    stops = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    for row in data:
        word_tokens1 = word_tokenize(row[1].lower())
        word_tokens2 = word_tokenize(row[2].lower())
        # keep only tokens that consist entirely of letters
        remo1 = [w for w in word_tokens1 if w in re.sub("[^a-zA-Z]", " ", w)]
        remo2 = [w for w in word_tokens2 if w in re.sub("[^a-zA-Z]", " ", w)]
        list1 = [w for w in remo1 if w not in stops]
        list2 = [w for w in remo2 if w not in stops]
        for w in list1:
            l = stemmer.stem(w)
            print(l)  # prints one stem per line
        for w in list2:
            l2 = stemmer.stem(w)
            print(l2)

My problem is that when I stem the words and print them, I get:

trip 
amazi 
cat 
play 

It prints each word on its own line. How can I stem the words and then get them back as a sentence, like this:

Col[1]: 
------------------- 
trip amazi 

Col[2]: 
------------------- 
cat play 

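Can you show a sample of the file? I'd like to know why you are using the csv package. As far as I can tell, what you care about is rows. In a CSV file, columns are separated by commas and rows are separated by newlines. – MAZDAK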


Sorry, it is the part shown in a different color; I wrote it as code.. –


So each line looks like "This trip was amazing, The cats are playing"? – MAZDAK
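
To illustrate the point about columns and rows, here is a small sketch of how `csv.reader` splits text into rows and each row into columns. The sample contents (including the `Col1,Col2` header) are made up to match the two sentences in the question, and `io.StringIO` stands in for the real file:

```python
import csv
import io

# Hypothetical file contents: a header line, then one row with two columns.
sample = "Col1,Col2\nThis trip was amazing.,The cats are playing.\n"

reader = csv.reader(io.StringIO(sample), delimiter=",")
header = next(reader)  # consume the header row
rows = list(reader)    # each remaining row is a list of column strings

for row in rows:
    # Columns are indexed from 0, so a two-column row has row[0] and row[1].
    print(row[0])
    print(row[1])
```

With a file laid out this way, the two sentences live in `row[0]` and `row[1]`; indices 1 and 2 would only be right if the file had an extra leading column.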

Answer


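Here is a modified version of your code that produces the desired output. The most important change is to replace

for w in list1:
    l = stemmer.stem(w)
    print(l)
for w in list2:
    l2 = stemmer.stem(w)
    print(l2)

with

stemmed_first = ""
c = 0
for w in list1:
    if c < len(list1) - 1:
        stemmed_first += stemmer.stem(w) + " "
    else:
        stemmed_first += stemmer.stem(w)
    c += 1

and to do the same for list2, printing each string once after its loop has finished. I also made a few other small changes to your code: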

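import csv
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stops = set(stopwords.words("english"))

# Python 3: open in text mode with newline='' as the csv module recommends
with open('test.csv', newline='', encoding='utf-8') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')

    for row in spamreader:
        if len(row) >= 2:
            # lowercase first so tokens match the lowercase stop-word list
            word_tokens1 = nltk.tokenize.word_tokenize(row[0].lower())
            word_tokens2 = nltk.tokenize.word_tokenize(row[1].lower())
            # keep only tokens that consist entirely of letters
            remo1 = [w for w in word_tokens1 if w in re.sub("[^a-zA-Z]", " ", w)]
            remo2 = [w for w in word_tokens2 if w in re.sub("[^a-zA-Z]", " ", w)]
            list1 = [w for w in remo1 if w not in stops]
            list2 = [w for w in remo2 if w not in stops]

            # build the first stemmed sentence, with a space between words only
            stemmed_first = ""
            c = 0

            for w in list1:
                if c < len(list1) - 1:
                    stemmed_first += stemmer.stem(w) + " "
                else:
                    stemmed_first += stemmer.stem(w)
                c += 1

            # same for the second column
            stemmed_second = ""
            c = 0

            for w in list2:
                if c < len(list2) - 1:
                    stemmed_second += stemmer.stem(w) + " "
                else:
                    stemmed_second += stemmer.stem(w)
                c += 1

            print(stemmed_first)
            print(stemmed_second)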
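
The counter-based loops can probably be shortened with `str.join`, which puts the separator only between items, so no special case is needed for the last word. A minimal sketch, assuming `stem` is any callable that maps a word to its stem (for example `PorterStemmer().stem`). The helper name `join_stems` and the toy `fake_stem` function are hypothetical stand-ins so the example runs without NLTK:

```python
def join_stems(words, stem):
    """Stem each word and join the results with single spaces."""
    return " ".join(stem(w) for w in words)


def fake_stem(word):
    # Hypothetical stand-in for PorterStemmer().stem: strips a trailing "s".
    return word[:-1] if word.endswith("s") else word


print(join_stems(["cats", "dogs"], fake_stem))  # cat dog
print(join_stems([], fake_stem))                # prints an empty line
```

In the code from the answer this would amount to something like `stemmed_first = join_stems(list1, stemmer.stem)`, and likewise for `list2`.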