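How to speed up reading two TSV files and writing them out as JSON files in Python

I have a TSV file of tweets with 13 columns, where each row is one tweet (~300M tweets in total). I also have a second TSV file with 3 columns that contains the user info (~500 MB). I need to read both files, combine all the columns of the tweets file with the second column of the userinfo file, and save the result as one JSON file per tweet. I wrote the code below, but it is very slow. Is there a way to speed it up?

Here is my code: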

# Python 2 code, as posted by the asker
import io
import json

# Map each user id to its parsed user-info JSON (second column)
t_dic = {}
with open("~/userinfo_file.txt") as f:
    for line in f:
        data = line.split('\t')
        uid = data[0]
        user_info = data[1]
        user_info_dic = json.loads(user_info)
        t_dic[uid] = user_info_dic


with open('~/tweets_file.txt') as f:
    for data in f:
        line = data.split('\t')
        user_dic = {}
        uid = line[0].strip()
        if uid in t_dic.keys():  # check whether the user is in my userinfo list
            user_dic['id'] = line[1].strip()
            user_dic['id_str'] = str(line[1].strip())
            user_dic['text'] = line[2].decode('utf-8').strip()
            user_dic['created_at'] = line[3].strip()
            user_dic['user'] = t_dic[uid]  # here I'm using the dict built from the userinfo file above

            # one JSON file per tweet, named after the tweet id
            with io.open("~/tweet_dir/{}.json".format(user_dic['id']), 'a') as f2:
                f2.write(unicode(json.dumps(user_dic)))

Asked 2015-11-08 by msmazh

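Creating a separate file for every tweet is expensive in both time and space (each file typically occupies 4 KB). A better design would be to put everything in a single database table. Which OS/filesystem are you using? Why do you need the tweets in separate files? –

I'm using a package that requires each tweet to be in its own JSON file. I'm on Ubuntu. – msmazh

Have you tried running it without creating the files? That should tell us whether that is the slow part. Do you need these files? – memoselyk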
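One way to run that experiment is sketched below in Python 3. The helper names merge_tweets and timed, and the write_files flag, are hypothetical; the column positions (user id, tweet id, text, created_at) follow the question's code, and the sketch assumes tweet ids are unique so each output file is written only once.

```python
import json
import os
import time


def merge_tweets(tweet_lines, users, out_dir, write_files=True):
    """Join tweet rows with user info; optionally skip the file writes."""
    count = 0
    for row in tweet_lines:
        cols = row.rstrip("\n").split("\t")
        uid = cols[0].strip()
        if uid not in users:  # membership test on the dict itself
            continue
        tweet_id = cols[1].strip()
        doc = {
            "id": tweet_id,
            "id_str": tweet_id,
            "text": cols[2].strip(),
            "created_at": cols[3].strip(),
            "user": users[uid],
        }
        payload = json.dumps(doc)  # serialized even when not written
        if write_files:
            # Assumption: tweet ids are unique, so "w" mode is enough.
            path = os.path.join(out_dir, "{}.json".format(tweet_id))
            with open(path, "w", encoding="utf-8") as out:
                out.write(payload)
        count += 1
    return count


def timed(label, func, *args, **kwargs):
    """Run func, print the elapsed wall-clock time, and return its result."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    print("{}: {:.3f}s".format(label, time.perf_counter() - start))
    return result
```

Calling timed("no files", merge_tweets, lines, users, out_dir, write_files=False) and then the same call with write_files=True on a sample of the data would show how much of the run time goes to creating files rather than to parsing and joining.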

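The only inefficiency I can spot is:

if uid in t_dic.keys(): # check whether the user is in my userinfo list

Note the difference from:

if uid in t_dic: # check whether the user is in my userinfo list

In Python 2, the first builds a list of all the keys and scans it item by item, which is not an efficient way to search. The second does a hash lookup in the dict itself, which should perform much better.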
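The difference is easy to measure. The following is a minimal sketch, not the asker's code: it builds a hypothetical dict of 100,000 user ids and times a membership test against a list of the keys versus the dict itself. In Python 3, keys() returns a view whose membership test is also a hash lookup, so the list is built explicitly here to reproduce the Python 2 behaviour.

```python
import timeit

# Hypothetical data: 100,000 user ids mapped to small info dicts.
t_dic = {str(i): {"n": i} for i in range(100000)}
key_list = list(t_dic.keys())  # what t_dic.keys() returns in Python 2
uid = "99999"  # an id stored at the end of the list

# Time 100 membership tests against each container.
list_time = timeit.timeit(lambda: uid in key_list, number=100)
dict_time = timeit.timeit(lambda: uid in t_dic, number=100)

print("list scan:   {:.6f}s".format(list_time))
print("dict lookup: {:.6f}s".format(dict_time))
```

In the Python 2 original the cost is higher still, because the key list is rebuilt for every tweet before it is scanned.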

Answered 2015-11-09 02:37:47 by memoselyk

Thank you very much. That sped up the process significantly. – msmazh

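You can use the Python profilers, cProfile and profile, to analyze your program and see where the time is spent (they report time per function call, not per line). To run cProfile on your script: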

python -m cProfile [-o output_file] [-s sort_order] myscript.py
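The profiler can also be driven from inside a script, which helps when only one part of the program is of interest. Below is a minimal sketch using the standard library's cProfile and pstats modules; the function build_index and its sample input are hypothetical stand-ins for the slow code.

```python
import cProfile
import io
import json
import pstats


def build_index(lines):
    """Hypothetical stand-in: parse 'uid<TAB>json' lines into a dict."""
    index = {}
    for line in lines:
        uid, info = line.split("\t")[:2]
        index[uid] = json.loads(info)
    return index


# Made-up sample input in the same shape as the userinfo file.
lines = ['{}\t{{"n": {}}}'.format(i, i) for i in range(10000)]

profiler = cProfile.Profile()
profiler.enable()
index = build_index(lines)
profiler.disable()

# Sort by cumulative time and show the ten most expensive entries.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

Sorting by cumulative time puts the functions that account for most of the run, including the calls they make, at the top of the report.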

Answered 2015-11-08 14:01:23
