Maintaining a persistent HDFS in Google Cloud

I have my students use bdutil to create Google Compute Engine clusters with persistent disks and with HDFS as the default file system. We want persistent disks so that the students can work on their projects over several weeks. However, after redeploying a cluster, HDFS no longer seems to be usable.
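
Roughly, the deployment is driven by bdutil like this (a sketch only; my_pd_env.sh is a placeholder name for the env file we use to enable the attached persistent disks):

$ ./bdutil -e my_pd_env.sh deploy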

My question really is: "How do I maintain a persistent HDFS file system across redeployments of the cluster?"

Here is what I have tried.

Everything works fine on the initial deployment, which creates the persistent disks. I create a directory with the commands:

$ hadoop fs -mkdir /foo 
$ hadoop fs -put foo.txt /foo/foo.txt 
$ hadoop fs -cat /foo/foo.txt 
foo 

I then delete and redeploy the cluster with DELETE_ATTACHED_PDS_ON_DELETE=false and CREATE_ATTACHED_PDS_ON_DEPLOY=false so that the persistent disks are kept across the redeployment.
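
The delete/redeploy cycle looks roughly like this (again a sketch; my_pd_env.sh is a placeholder for the env file, which sets DELETE_ATTACHED_PDS_ON_DELETE=false and CREATE_ATTACHED_PDS_ON_DEPLOY=false):

$ ./bdutil -e my_pd_env.sh delete
$ ./bdutil -e my_pd_env.sh deploy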

When I ssh into the redeployed cluster, I can see the file I created:

$ hadoop fs -ls /foo 
Found 1 items 
-rw-r--r-- 3 mpcs supergroup   4 2014-10-01 13:16 /foo/foo.txt 

However, any attempt to access the contents of the file fails:

$ hadoop fs -cat /foo/foo.txt 
cat: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Zero blocklocations for /foo/foo.txt. Name node is in safe mode 

Turning safe mode off manually does not help:

$ hadoop dfsadmin -safemode leave 
Safe mode is OFF 
$ hadoop fs -cat /foo/foo.txt 
14/10/01 13:31:20 INFO hdfs.DFSClient: No node available for: blk_2908405986797013125_1002 file=/foo/foo.txt 
14/10/01 13:31:20 INFO hdfs.DFSClient: Could not obtain blk_2908405986797013125_1002 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry... 
*etc* 

Any suggestions on how to create an HDFS store that persists across cluster redeployments would be greatly appreciated.

Thanks,

Mike

Answer

Thanks for the detailed report! Indeed, it looks like you have uncovered a bug that was introduced a few versions back, where bdutil-0.35.2/libexec/configure_hadoop.sh unfortunately clobbers the directory permissions on the datanode data directories, setting 775 where the default would have been 755 (for Hadoop 1.2.1) or 700 (for Hadoop 2.4.1). This causes the datanodes to never come back up on restart, printing:

2014-10-01 20:37:59,810 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Invalid directory in dfs.data.dir: Incorrect permission for /mnt/pd1/hadoop/dfs/data, expected: rwxr-xr-x, while actual: rwxrwxr-x 
2014-10-01 20:37:59,811 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: All directories in dfs.data.dir are invalid. 
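
A quick way to confirm the symptom is to check the actual mode of the datanode data directories on every node (a sketch; the paths are the ones that appear in the question and in the log above, and the bug leaves them at 775 instead of the expected 755 or 700):

./bdutil run_command -t all -- "stat -c '%a %n' /hadoop/dfs/data /mnt/pd1/hadoop/dfs/data"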

A short-term fix, which you can run directly on your broken redeployed cluster if you are using Hadoop 1.2.1, is to run:

./bdutil run_command -t all -- "chmod 755 /hadoop/dfs/data" 
./bdutil run_command -t master -- "sudo -u hadoop /home/hadoop/hadoop-install/bin/stop-dfs.sh" 
./bdutil run_command -t master -- "sudo -u hadoop /home/hadoop/hadoop-install/bin/start-dfs.sh" 
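
Once the datanodes have come back up and the namenode has left safe mode, the read from the question should succeed again, for example:

hadoop fs -cat /foo/foo.txt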

As it turns out, Hadoop 2 has actually already fixed this issue by having the DataNode go ahead and set the permissions it needs if they do not match:

http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-common/2.2.0/org/apache/hadoop/util/DiskChecker.java#130

public static void mkdirsWithExistsAndPermissionCheck(
    LocalFileSystem localFS, Path dir, FsPermission expected)
    throws IOException {
  File directory = localFS.pathToFile(dir);
  boolean created = false;

  if (!directory.exists())
    created = mkdirsWithExistsCheck(directory);

  if (created || !localFS.getFileStatus(dir).getPermission().equals(expected))
    localFS.setPermission(dir, expected);
}

whereas Hadoop 1 simply bails out:

https://github.com/apache/hadoop/blob/release-1.2.1/src/core/org/apache/hadoop/util/DiskChecker.java#L106

private static void checkPermission(Path dir,
                                    FsPermission expected, FsPermission actual)
    throws IOException {
  // Check for permissions
  if (!actual.equals(expected)) {
    throw new IOException("Incorrect permission for " + dir +
                          ", expected: " + expected + ", while actual: " +
                          actual);
  }
}

We will have a fix in the next bdutil release that sets dfs.datanode.data.dir.perm explicitly, but in the meantime you can also patch your bdutil-0.35.2/libexec/configure_hdfs.sh with the hardcoded workaround below, applied with patch bdutil-0.35.2/libexec/configure_hdfs.sh tmpfix.diff:

43a44,46 
> # Make sure the data dirs have the expected permissions. 
> chmod 755 ${HDFS_DATA_DIRS} 
> 
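
In full, applying the workaround might look like this (a sketch; it saves the diff above as tmpfix.diff in the directory that contains bdutil-0.35.2 and then applies it, so that subsequent deploys pick up the fixed configure_hdfs.sh):

cat > tmpfix.diff <<'EOF'
43a44,46
> # Make sure the data dirs have the expected permissions.
> chmod 755 ${HDFS_DATA_DIRS}
>
EOF
patch bdutil-0.35.2/libexec/configure_hdfs.sh tmpfix.diff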

Alternatively, if you use bdutil with -e hadoop2_env.sh, then HDFS persistence should already work without any further modification.
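
For reference, such a deploy would look like this (a sketch, assuming hadoop2_env.sh is the env file shipped with bdutil as referenced above):

./bdutil -e hadoop2_env.sh deploy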