2017-05-07

I need to parse a CSV file and read the values of one column.

Input: a file name, plus data like the following:

Index | writer | year | words 
    0  | Philip | 1994 | this is first row 
    1  | Heinz | 2000 | python is wonderful (new line) second line 
    2  | Thomas | 1993 | i don't like this 
    3  | Heinz | 1898 | this is another row 
    .  |  .  | . |  . 
    .  |  .  | . |  . 
    N  | Fritz | 2014 | i hate man united 

Output: a list of all the words that belong to a given name

l = ['python is wonderful second line', 'this is another row'] 

What have I tried?

import csv 
import sys 

class artist:
    def __init__(self, name, file):
        self.file = file
        self.name = name
        self.list = []

    def extractText(self):
        with open(self.file, 'rb') as f:
            reader = csv.reader(f)
            temp = list(reader)
        k = len(temp)
        for i in range(1, k):
            s = temp[i]
            if s[1] == self.name:
                self.list.append(str(s[3]))


if __name__ == '__main__':
    # arguments
    inputFile = str(sys.argv[1])
    Heinz = artist('Heinz', inputFile)
    Heinz.extractText()
    print(Heinz.list)

The output is:

["python is wonderful\r\nsecond line", 'this is another row'] 

How can I get rid of the \r\n in cells whose text spans multiple lines, and can the loop be improved? It is extremely slow.

Answers

1

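This should at least be faster, because you process each row as you read the file, and it strips out the unwanted carriage return and line feed characters if they are present.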

with open(self.file) as csv_fh:
    for n in csv.reader(csv_fh):
        if n[1] == self.name:
            self.list.append(n[3].replace('\r\n', ' '))
1

You can simply use pandas to get the list:

import pandas
df = pandas.read_csv('test1.csv')
index = df[df['writer'] == "Heinz"].index.tolist()  # get the index of the rows for this writer
l = list()
for i in index:
    # get the cell in column 3 and join its lines with a single space
    l.append(' '.join(df.iloc[i, 3].splitlines()))
l

Output:

['python is wonderful second line', 'this is another row'] 
+0

This is not what I want. I need the words of one specific writer/artist, not all of the words. –

+0

@TonyTannous I have updated the answer to select a specific writer. –

1

Getting rid of the line breaks in s[3]: I suggest ' '.join(s[3].splitlines()). See the documentation for "".splitlines, and also "".translate.
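As a minimal Python 3 sketch of the two string methods just mentioned (the sample string `cell` is made up for illustration): `splitlines` treats `\r\n`, `\n`, and `\r` alike, while `translate` works one character at a time, so one of the two characters has to be deleted to avoid a double space.

```python
# Hypothetical cell value containing a Windows-style line break.
cell = "python is wonderful\r\nsecond line"

# splitlines() splits on \r\n, \n and \r; join the pieces with one space.
joined = ' '.join(cell.splitlines())

# translate() maps single characters: turn \n into a space and delete \r.
table = str.maketrans({'\n': ' ', '\r': None})
translated = cell.translate(table)

print(joined)
print(translated)
```

Both variants give the same cleaned string here; `splitlines` is probably the simpler choice when the kind of line ending is not known in advance.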

Improving the loop:

def extractText(self):
    # 'rb' is the Python 2 idiom; on Python 3 use open(self.file, newline='')
    with open(self.file, 'rb') as f:
        for s in csv.reader(f):
            if s[1] == self.name:
                self.list.append(str(s[3]))

This saves one pass over the data.
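To illustrate the single pass, here is a self-contained Python 3 sketch. The function name `words_for` and the in-memory sample data are assumptions made for the example, not part of the original code. `csv.reader` yields one row at a time, so only the matching cells are kept:

```python
import csv
import io

# Made-up sample mirroring the table in the question; the quoted cell spans two lines.
SAMPLE = (
    'Index,writer,year,words\r\n'
    '0,Philip,1994,this is first row\r\n'
    '1,Heinz,2000,"python is wonderful\r\nsecond line"\r\n'
    '2,Thomas,1993,i don\'t like this\r\n'
    '3,Heinz,1898,this is another row\r\n'
)

def words_for(name, lines):
    """Return the cleaned 'words' cells of every row whose writer is `name`."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    # Rows are consumed lazily; non-matching rows are dropped immediately.
    return [' '.join(row[3].splitlines()) for row in reader if row[1] == name]

# newline='' leaves the embedded \r\n untouched so the csv module can handle it.
result = words_for('Heinz', io.StringIO(SAMPLE, newline=''))
print(result)
```

With a real file, `open(path, newline='')` would take the place of the `io.StringIO` object.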

But do consider @Tiny.D's suggestion and give pandas a try.

+0

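But then I have to hold the entire text in each object before deleting some of the rows, don't I? I need only specific words, not all of them. –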

+0

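The original code copies the entire file contents into memory with 'temp = list(reader)'; here each row is checked against s[1] == self.name, and most rows are discarded. – tiwo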

0

To collapse runs of whitespace you can use a regular expression, and to speed things up a little, try a list comprehension:

import csv
import re

def extractText(self):
    RE_WHITESPACE = re.compile(r'[ \t\r\n]+')
    with open(self.file, 'rU') as f:
        reader = csv.reader(f)

        # skip the first line
        next(reader)

        # put all of the words into a list if the artist matches
        self.list = [RE_WHITESPACE.sub(' ', s[3])
                     for s in reader if s[1] == self.name]
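On Python 3 the 'rU' mode is deprecated and was removed in 3.11, so the same idea might be sketched as follows. The class name `Artist`, the method name `extract_text`, and the temporary sample file are assumptions for this illustration only.

```python
import csv
import os
import re
import tempfile

RE_WHITESPACE = re.compile(r'\s+')  # any run of whitespace, including \r\n

class Artist:
    def __init__(self, name, path):
        self.name = name
        self.path = path
        self.words = []

    def extract_text(self):
        # newline='' is what the csv module documentation recommends on Python 3.
        with open(self.path, newline='') as f:
            reader = csv.reader(f)
            next(reader)  # skip the header row
            self.words = [RE_WHITESPACE.sub(' ', row[3])
                          for row in reader if row[1] == self.name]

# Write a small made-up sample file, then read it back.
with tempfile.NamedTemporaryFile('w', suffix='.csv', newline='', delete=False) as tmp:
    csv.writer(tmp).writerows([
        ['Index', 'writer', 'year', 'words'],
        ['0', 'Philip', '1994', 'this is first row'],
        ['1', 'Heinz', '2000', 'python is   wonderful\r\nsecond line'],
        ['3', 'Heinz', '1898', 'this is another row'],
    ])

heinz = Artist('Heinz', tmp.name)
heinz.extract_text()
os.remove(tmp.name)
print(heinz.words)
```

Compiling the pattern once at module level avoids rebuilding it on every call, although `re` also caches recently used patterns, so the gain is likely small.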