Writing data to the local disk of each data node

I want to store some values from a map task on the local disk of each data node. For example:

public void map (...) { 
    //Process 
    List<Object> cache = new ArrayList<Object>(); 
    //Add value to cache 
    //Serialize cache to local file in this data node 
}

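How can I store this cache object on the local disk of each data node? If my map function writes the cache out as shown above, performance will be terrible because of the I/O involved.

I mean, is there a way to wait until the map task on this data node has finished running and only then store the cache on local disk? Or does Hadoop have a feature that solves this problem?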

Source

2016-05-17 nd07

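Please see the answer below; I hope it helps.

Please see the example below. The file it creates will be somewhere under the directories the NodeManager uses for containers. That location is set by the configuration property yarn.nodemanager.local-dirs in yarn-site.xml, or by the default inherited from yarn-default.xml, which is under /tmp.

Please see @Chris Nauroth's answer, which says that this is only for debugging purposes and is not recommended as a permanent production configuration. It clearly describes why it is not recommended.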

public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    // do some hadoop stuff, like counting words
    // A relative path resolves against the task's working directory.
    String path = "newFile.txt";
    try {
        File f = new File(path);
        f.createNewFile(); // creates an empty file if none exists yet
    } catch (IOException e) {
        System.out.println("Message easy to look up in the logs.");
        System.err.println("Error easy to look up in the logs.");
        e.printStackTrace();
        throw e;
    }
}
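To see why a bare file name such as "newFile.txt" ends up under the container directories, it may help to check how Java resolves a relative path. The following is a minimal sketch in plain Java with no Hadoop dependency; the class name RelativePathDemo and the helper resolve are hypothetical and exist only for illustration. The assumption is that a map task runs with its container directory as the working directory, so the same resolution would apply there.

```java
import java.io.File;

// Hypothetical demo class, not part of Hadoop.
class RelativePathDemo {
    // Returns the absolute location that a relative path resolves to.
    // Java resolves relative paths against the "user.dir" system property.
    static String resolve(String relativePath) {
        return new File(relativePath).getAbsolutePath();
    }

    public static void main(String[] args) {
        System.out.println("working dir: " + System.getProperty("user.dir"));
        System.out.println("resolves to: " + resolve("newFile.txt"));
    }
}
```

Logging the resolved path from inside the task is probably the simplest way to confirm where the file was actually written on a given node.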

Source

2016-05-17 09:38:14

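Thanks for pointing out how to create a local file on the data node. But how do I serialize to this file on the data node, as I did in my question? If we serialize inside the map() function, then, for example, if the input split has 1000 records, will the program call the serialize function 1000 times? Is there any way to serialize the object once the task has finished on each node? – nd07

As I understand it, you want to serialize 1000 records, or however many records the map method processes. I think you can open the file handle in the setup method and close it in the cleanup method. In the map method, you can write all the records in append mode. Would that work for your requirement? Again, the points mentioned in Chris Nauroth's answer apply. You can try this. Thx

Thanks for your support! – nd07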
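The open-in-setup, write-in-map, close-in-cleanup idea can be sketched as follows. This is a plain-Java stand-in rather than a real Hadoop class: LocalFileRecorder and its constructor argument are hypothetical, and its three methods only mirror the lifecycle of Mapper.setup(Context), map(...) and cleanup(Context). In an actual job the same bodies would presumably go into overrides of those methods.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical stand-in for a Mapper subclass (no Hadoop dependency).
class LocalFileRecorder {
    private final Path target;     // hypothetical local output path
    private BufferedWriter writer; // opened once, reused for every record

    LocalFileRecorder(Path target) {
        this.target = target;
    }

    // Runs once before the first record: open the file in append mode.
    void setup() throws IOException {
        writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Runs once per record: a buffered write only, no open or close.
    void map(String value) throws IOException {
        writer.write(value);
        writer.newLine();
    }

    // Runs once after the last record: flush the buffer and close the file.
    void cleanup() throws IOException {
        if (writer != null) {
            writer.close();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("map-cache", ".txt");
        LocalFileRecorder recorder = new LocalFileRecorder(tmp);
        recorder.setup();
        for (int i = 0; i < 1000; i++) {
            recorder.map("record-" + i); // 1000 writes, one open, one close
        }
        recorder.cleanup();
        System.out.println("lines written: " + Files.readAllLines(tmp).size());
    }
}
```

With this layout the file is opened and closed once per task instead of once per record, and the BufferedWriter batches the small writes, which should address the I/O concern raised in the question. The caveats from Chris Nauroth's answer about relying on node-local files still apply.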
