2015-12-05
0
http://example.com/item/all-atv-quad.html,David,"Punjab",+123456789123 
http://example.com/item/70cc-2014.html,Qubee,"Capital",+987654321987 
http://example.com/item/quad-bike-zenith.html,Zenith,"UP",+123456789123 

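I have this test.csv, which I scraped from some websites, but the "number" field contains redundant values. So I need to delete every row whose number is the same as one that appeared earlier. This is only a sample file; in the real file, some numbers are repeated more than 50 times.

Reformat a CSV according to a certain field using Python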

import csv

with open('test.csv', newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',')

    for row in csvreader:
        # Some logic here

        if row[3] == "+123456789123":
            print(row[0])

            # or here

I need to reformat the CSV like this:

http://example.com/item/all-atv-quad.html,David,"Punjab",+123456789123 
http://example.com/item/70cc-2014.html,Qubee,"Capital",+987654321987 

Answers

2
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import pandas as pd


def direct():
    """Keep only the first line seen for each value in the last field."""
    seen = set()
    with open("test.csv") as infile, open("formatted.csv", 'w') as outfile:
        for line in infile:
            parts = line.rstrip().split(',')
            number = parts[-1]
            if number not in seen:
                seen.add(number)
                outfile.write(line)


def using_pandas():
    """Alternatively, use Pandas"""
    # dtype=str keeps the leading "+" instead of parsing the numbers as integers
    df = pd.read_csv("test.csv", header=None, dtype=str)
    df = df.drop_duplicates(subset=[3])
    df.to_csv("formatted_pandas.csv", index=False, header=False)


def main():
    direct()
    using_pandas()


if __name__ == "__main__":
    main()
1

This will filter out the duplicates:

seen = set()
for line in csvreader:
    if line[3] in seen:
        continue
    seen.add(line[3])
    # write line to output file

And together with the CSV read/write logic:

import csv

# newline='' is what the csv module expects for files it reads and writes
with open('test.csv', newline='') as fobj_in, open('test_clean.csv', 'w', newline='') as fobj_out:
    csv_reader = csv.reader(fobj_in, delimiter=',')
    csv_writer = csv.writer(fobj_out, delimiter=',')
    seen = set()
    for line in csv_reader:
        if line[3] in seen:
            continue
        seen.add(line[3])
        csv_writer.writerow(line)
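
To try the same filter without touching any files, here is a minimal sketch that runs it on the sample rows from the question held in memory. The helper name `unique_by_column` and the `SAMPLE` constant are made up for this illustration. Stripping whitespace from the key is also an assumption, in case the scraped values carry stray spaces.

```python
import csv
import io


def unique_by_column(rows, key_index):
    """Yield only the first row seen for each value in column key_index."""
    seen = set()
    for row in rows:
        key = row[key_index].strip()  # assumption: ignore stray whitespace around the key
        if key in seen:
            continue
        seen.add(key)
        yield row


# Sample data from the question, held in memory instead of test.csv.
SAMPLE = (
    'http://example.com/item/all-atv-quad.html,David,"Punjab",+123456789123\n'
    'http://example.com/item/70cc-2014.html,Qubee,"Capital",+987654321987\n'
    'http://example.com/item/quad-bike-zenith.html,Zenith,"UP",+123456789123\n'
)

rows = list(csv.reader(io.StringIO(SAMPLE)))
kept = list(unique_by_column(rows, key_index=3))
for row in kept:
    print(row[1], row[3])
```

Because the function is a generator, it can also be fed a `csv.reader` directly and its output passed to `csv_writer.writerows`, so a large file does not have to be loaded into a list first.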
+0

You can shorten it: 'seen = set(line[3] for line in csvreader)'

+0

This assumes that the order does not matter.

+0

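@PawełKordowski The intent is to write only the rows in which a value in the last column occurs for the first time. Just building the set 'seen' will not be useful.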
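
The difference discussed in these comments can be sketched on the sample rows. Building only the set gives the distinct numbers but no rows. A dict keyed by the number is one way to keep the first row per number, in the original order. This dict-based variant is an alternative for illustration, not something proposed in the thread, and it relies on dicts preserving insertion order (Python 3.7+).

```python
rows = [
    ["http://example.com/item/all-atv-quad.html", "David", "Punjab", "+123456789123"],
    ["http://example.com/item/70cc-2014.html", "Qubee", "Capital", "+987654321987"],
    ["http://example.com/item/quad-bike-zenith.html", "Zenith", "UP", "+123456789123"],
]

# The set alone holds the distinct numbers, but says nothing about which rows to keep.
numbers_only = {row[3] for row in rows}

# setdefault stores a row only the first time its number appears,
# so later duplicates never overwrite the earlier row.
first_rows = {}
for row in rows:
    first_rows.setdefault(row[3], row)

kept = list(first_rows.values())
for row in kept:
    print(",".join(row))
```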