
I have a use case where I want to copy files from a remote machine to HDFS using Flume. I also want the copied files to line up with the HDFS block size (128 MB/256 MB). The total size of the remote data is 33 GB, and when rolling by file size it takes a long time to copy the data into HDFS.

I am using an Avro source and sink to copy the remote data to HDFS, and on the sink side I roll files by size (128/256 MB). But to copy a file from the remote machine and store it in HDFS (128/256 MB file size), Flume takes about 2 minutes on average.

Flume configuration:

First agent (remote machine), spooling directory source and Avro sink:

### Agent1 - Spooling Directory Source and File Channel, Avro Sink ### 
# Name the components on this agent 
Agent1.sources = spooldir-source 
Agent1.channels = file-channel 
Agent1.sinks = avro-sink 

# Describe/configure Source 
Agent1.sources.spooldir-source.type = spooldir 
Agent1.sources.spooldir-source.spoolDir =/home/Benchmarking_Simulation/test 


# Describe the sink 
Agent1.sinks.avro-sink.type = avro 
# IP address of the destination machine
Agent1.sinks.avro-sink.hostname = xx.xx.xx.xx 
Agent1.sinks.avro-sink.port = 50000 

#Use a channel which buffers events in file 
Agent1.channels.file-channel.type = file 
Agent1.channels.file-channel.checkpointDir = /home/Flume_CheckPoint_Dir/ 
Agent1.channels.file-channel.dataDirs = /home/Flume_Data_Dir/ 
Agent1.channels.file-channel.capacity = 10000000 
Agent1.channels.file-channel.transactionCapacity=50000 

# Bind the source and sink to the channel 
Agent1.sources.spooldir-source.channels = file-channel 
Agent1.sinks.avro-sink.channel = file-channel 
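
(Neither the source nor the sink above overrides its batch size, which for the spooling directory source and the Avro sink typically defaults to 100 events, so each small batch is committed to the file channel as a separate transaction. A sketch of raising both; the values are illustrative only, not tuned for this workload:)

# Sketch only: larger batches for the spooldir source and Avro sink
# (defaults are typically 100; values below are illustrative, not benchmarked)
Agent1.sources.spooldir-source.batchSize = 10000 
Agent1.sinks.avro-sink.batch-size = 10000 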

Second agent (machine on which HDFS is running), Avro source and HDFS sink:

### Agent1 - Avro Source and File Channel, HDFS Sink ### 
# Name the components on this agent 
Agent1.sources = avro-source1 
Agent1.channels = file-channel1 
Agent1.sinks = hdfs-sink1 

# Describe/configure Source 
Agent1.sources.avro-source1.type = avro 
Agent1.sources.avro-source1.bind = xx.xx.xx.xx 
Agent1.sources.avro-source1.port = 50000 

# Describe the sink 
Agent1.sinks.hdfs-sink1.type = hdfs 
Agent1.sinks.hdfs-sink1.hdfs.path =/user/Benchmarking_data/multiple_agent_parallel_1 
Agent1.sinks.hdfs-sink1.hdfs.rollInterval = 0 
Agent1.sinks.hdfs-sink1.hdfs.rollSize = 130023424 
Agent1.sinks.hdfs-sink1.hdfs.rollCount = 0 
Agent1.sinks.hdfs-sink1.hdfs.fileType = DataStream 
Agent1.sinks.hdfs-sink1.hdfs.batchSize = 50000 
Agent1.sinks.hdfs-sink1.hdfs.txnEventMax = 40000 
Agent1.sinks.hdfs-sink1.hdfs.threadsPoolSize=1000 
Agent1.sinks.hdfs-sink1.hdfs.appendTimeout = 10000 
Agent1.sinks.hdfs-sink1.hdfs.callTimeout = 200000 


#Use a channel which buffers events in file 
Agent1.channels.file-channel1.type = file 
Agent1.channels.file-channel1.checkpointDir = /home/Flume_Check_Point_Dir 
Agent1.channels.file-channel1.dataDirs = /home/Flume_Data_Dir 
Agent1.channels.file-channel1.capacity = 100000000 
Agent1.channels.file-channel1.transactionCapacity=100000 


# Bind the source and sink to the channel 
Agent1.sources.avro-source1.channels = file-channel1 
Agent1.sinks.hdfs-sink1.channel = file-channel1 
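
(For reference, hdfs.rollSize = 130023424 bytes is exactly 124 MB, slightly under the 128 MB HDFS block size, so each rolled file should fit within a single block.)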

The network connectivity between the two machines is 686 Mbps.
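
At about 2 minutes per 128 MB file, the effective end-to-end throughput works out to roughly 128 MB / 120 s ≈ 1 MB/s, or around 8-9 Mbit/s, which is far below what the link can carry, so the network itself does not look like the bottleneck.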

Can someone help me figure out whether there is a problem with this configuration, or suggest an alternative configuration, so that the copy does not take so long?

Answer


Both agents use a file channel, so the data is written to disk twice before it reaches HDFS. You could try a memory channel on each agent and see whether performance improves.
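
A minimal sketch of that swap on the HDFS-side agent, assuming the standard Flume memory channel; the capacity value below is an illustrative guess, and transactionCapacity is kept at least as large as the HDFS sink's batchSize (50000):

# Sketch: replace the file channel with a memory channel (values are illustrative)
Agent1.channels = mem-channel1 
Agent1.channels.mem-channel1.type = memory 
Agent1.channels.mem-channel1.capacity = 1000000 
Agent1.channels.mem-channel1.transactionCapacity = 50000 

# Re-bind the source and sink to the new channel
Agent1.sources.avro-source1.channels = mem-channel1 
Agent1.sinks.hdfs-sink1.channel = mem-channel1 

The same substitution would apply on the spooling-directory agent. Note that a memory channel loses buffered events if the agent dies, which is the durability trade-off raised in the comments below.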


I will try a memory channel and see the performance. I used a file channel because it is durable and I want to run this in production. – mandar


@mandar I am afraid that with a file channel you will have to accept lower performance compared with a memory channel. –
