Iterating over a CSV file within a given range

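I'm iterating over a very large CSV file. startDate and endDate are inputs the user gives me, and I only need to search within that range.

However, when I run the program up to that point, it takes a long time and then just prints "set()". I've marked the spot in the code where I'm having trouble.

I'm looking for suggestions and, if possible, example code. Thanks in advance, everyone!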

import csv

def compare(word1, word2, startDate, endDate):
    with open('all_words.csv') as allWords:
        readWords = csv.reader(allWords, delimiter=',')
        year = set()
        for row in readWords:
            if row[1] in range(int(startDate), int(endDate)):  #< Having trouble here
                if row[0] == word1:
                    year.add(row[1])
        print(year)


2016-11-27 Blakester

Do you know the exact lines of the file that the desired range covers? – amin

http://stackoverflow.com/a/29567902/1849366 –

I don't, amin. I prompt the user for the start and end dates, so the range changes with whatever they enter. – Blakester

The reason your test doesn't find any years is that the expression:

row[1] in range(int(startDate), int(endDate))

checks whether a string value appears in a list of integers. If you test:

"1970" in range(1960, 1980)

you'll see that it returns False. You need to write:

int(row[1]) in range(int(startDate), int(endDate))

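However, this is still rather inefficient. It checks whether the value int(row[1]) appears anywhere in the sequence [int(startDate), int(startDate)+1, ..., int(endDate)-1], and it does so by linear search. Faster would be: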

if int(startDate) <= int(row[1]) < int(endDate):

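Note that your code as written excludes endDate from the list of possible dates (because range excludes its second argument), and I've done the same in the comparison above.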
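
Putting the pieces together, a minimal sketch of the corrected function might look like the following. It assumes the layout implied by the question (word in column 0, year in column 1, no header row). The name years_for_word and the path parameter are made up for illustration, and the unused word2 argument is dropped.

```python
import csv


def years_for_word(word, start_date, end_date, path="all_words.csv"):
    """Return the set of years in [start_date, end_date) in which `word` appears.

    Assumes each row looks like: word,year,... with no header row.
    """
    start, end = int(start_date), int(end_date)  # convert the bounds once
    years = set()
    with open(path, newline="") as all_words:
        for row in csv.reader(all_words):
            if len(row) < 2 or row[0] != word:
                continue  # skip blank/short rows and other words
            year = int(row[1])  # the CSV yields strings, so convert first
            if start <= year < end:  # end is excluded, like range()
                years.add(year)
    return years
```

Returning the set instead of printing it makes the result easier to reuse; the caller can still print it.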

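Edit: actually, I should point out that expressions like 500000 in range(1, 1000000) are inefficient only in Python 2, where range builds a full list. In Python 3, a membership test on a range is fast for integers. (Python 2's xrange avoids building the list, but as far as I know its membership test still iterates over the values.)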
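
A quick way to see the Python 3 behaviour, using only built-ins. In CPython the fast path applies to int operands; other types such as str fall back to scanning the range, which is why the string test below uses a small range.

```python
big = range(1, 10**12)  # Python 3: no list is built, only start/stop/step are stored

in_range = 500000 in big        # int membership is computed arithmetically
stop_excluded = 10**12 in big   # the stop value itself is not part of the range
str_in_small = "1970" in range(1960, 1980)  # a str never equals an int

print(in_range, stop_excluded, str_in_small)
```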


2016-11-27 07:31:59

If you know the dates are always four-digit years, you can skip the conversion to int. – chthonicdaemon
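
To illustrate the comment above: for digit-only strings of equal length, lexicographic order matches numeric order, so four-digit years can be compared as strings. A small sketch with made-up sample values and a hypothetical helper name; it assumes the bounds are strings too, and it breaks as soon as the lengths differ.

```python
def in_year_range(year, start, end):
    """Compare four-digit year strings directly, without int()."""
    return start <= year < end  # end excluded, matching range()


print(in_year_range("1970", "1960", "1980"))
print(in_year_range("1980", "1960", "1980"))

# Caveat: with differing lengths, string order no longer matches numeric order.
print("999" < "1960")
```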

You can try the read_csv function from the pandas library. It lets you read a chosen number of rows at a time, which gets around the file-size problem.

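import pandas as pd

# file_name and chunk_size are placeholders to be supplied by the caller
reader = pd.read_csv(file_name, chunksize=chunk_size)

# iterating over the reader yields one DataFrame of up to chunk_size rows at a time
for df in reader:
    # select data rows which have desired dates
    pass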
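
As a rough sketch of what the row selection inside the loop could look like, the example below filters each chunk with a boolean mask. The in-memory sample data, the column names, and header=None are assumptions about the file layout, not something stated in the question.

```python
import io

import pandas as pd

# Stand-in for the large file; real code would pass a file path instead.
sample = io.StringIO("apple,1965\napple,1980\npear,1970\napple,1979\n")

start, end, word = 1960, 1980, "apple"
years = set()

# header=None and the column names are assumptions about the file layout.
for chunk in pd.read_csv(sample, header=None, names=["word", "year"], chunksize=2):
    # keep rows for the requested word whose year lies in [start, end)
    mask = (chunk["word"] == word) & (chunk["year"] >= start) & (chunk["year"] < end)
    years.update(chunk.loc[mask, "year"].tolist())

print(sorted(years))
```

Only one chunk is held in memory at a time, so the chunk size trades memory use against the number of read calls.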


2016-11-27 07:36:49 amin
