I am currently using Flume version 1.5.2. Flume creates an empty line at the end of output files in HDFS.

Flume creates an empty line at the end of every output file in HDFS, which causes the line count, file size, and checksum of the source and target files to mismatch.

I tried overriding the default values of the rollSize, batchSize, and appendNewline parameters, but it still does not work.

Moreover, Flume changes the EOL from CRLF (source file) to LF (output file), which also makes the file sizes differ.
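One way to pin down where the bytes diverge is to copy a finished file back from HDFS and compare it byte-by-byte with the source. A minimal sketch (the class name is mine and both file paths are passed as arguments; none of this is from the original post):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Compares two files: reports byte counts and how many line endings
// are CRLF pairs versus bare LF, to show where the size delta comes from.
public class CompareEol {
  public static void main(String[] args) throws IOException {
    byte[] src = Files.readAllBytes(Paths.get(args[0]));  // original spool file
    byte[] dst = Files.readAllBytes(Paths.get(args[1]));  // file fetched back from HDFS

    System.out.printf("source: %d bytes, %d CRLF, %d bare LF%n",
        src.length, count(src, true), count(src, false));
    System.out.printf("target: %d bytes, %d CRLF, %d bare LF%n",
        dst.length, count(dst, true), count(dst, false));
    System.out.printf("byte difference: %d%n", src.length - dst.length);
  }

  // Counts CRLF pairs (crlf == true) or LF bytes not preceded by CR (crlf == false).
  private static int count(byte[] data, boolean crlf) {
    int n = 0;
    for (int i = 0; i < data.length; i++) {
      if (data[i] == '\n') {
        boolean pair = i > 0 && data[i - 1] == '\r';
        if (pair == crlf) n++;
      }
    }
    return n;
  }
}

If the target shows only bare LFs where the source shows CRLFs, the EOL rewrite alone accounts for the size difference (one byte per line).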

Below are the relevant Flume agent configuration parameters I am using:

agent1.sources = c1 
agent1.sinks = c1s1 
agent1.channels = ch1 

agent1.sources.c1.type = spooldir 
agent1.sources.c1.spoolDir = /home/biadmin/flume-test/sourcedata1 
agent1.sources.c1.bufferMaxLineLength = 80000 
agent1.sources.c1.channels = ch1 
agent1.sources.c1.fileHeader = true 
agent1.sources.c1.fileHeaderKey = file 
#agent1.sources.c1.basenameHeader = true 
#agent1.sources.c1.fileHeaderKey = basenameHeaderKey 
#agent1.sources.c1.filePrefix = %{basename} 
agent1.sources.c1.inputCharset = UTF-8 
agent1.sources.c1.decodeErrorPolicy = IGNORE 
agent1.sources.c1.deserializer= LINE 
agent1.sources.c1.deserializer.maxLineLength = 50000 
agent1.sources.c1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder 
agent1.sources.c1.interceptors = a b 
agent1.sources.c1.interceptors.a.type =  
org.apache.flume.interceptor.TimestampInterceptor$Builder 
agent1.sources.c1.interceptors.b.type = 
org.apache.flume.interceptor.HostInterceptor$Builder 
agent1.sources.c1.interceptors.b.preserveExisting = false 
agent1.sources.c1.interceptors.b.hostHeader = host 

agent1.channels.ch1.type = memory 
agent1.channels.ch1.capacity = 1000 
agent1.channels.ch1.transactionCapacity = 1000 
agent1.channels.ch1.batchSize = 1000 
agent1.channels.ch1.maxFileSize = 2073741824 
agent1.channels.ch1.keep-alive = 5 
agent1.sinks.c1s1.type = hdfs 
agent1.sinks.c1s1.hdfs.path = hdfs://bivm.ibm.com:9000/user/biadmin/flume/%y-%m-%d/%H%M 
agent1.sinks.c1s1.hdfs.fileType = DataStream 
agent1.sinks.c1s1.hdfs.filePrefix = %{file} 
agent1.sinks.c1s1.hdfs.fileSuffix = .csv 
agent1.sinks.c1s1.hdfs.writeFormat = Text 
agent1.sinks.c1s1.hdfs.maxOpenFiles = 10 
agent1.sinks.c1s1.hdfs.rollSize = 67000000 
agent1.sinks.c1s1.hdfs.rollCount = 0 
#agent1.sinks.c1s1.hdfs.rollInterval = 0 
agent1.sinks.c1s1.hdfs.batchSize = 1000 
agent1.sinks.c1s1.channel = ch1 
#agent1.sinks.c1s1.hdfs.codeC = snappyCodec 
agent1.sinks.c1s1.hdfs.serializer = text 
agent1.sinks.c1s1.hdfs.serializer.appendNewline = false 
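One thing worth noting in the listing above: agent1.sources.c1.deserializer is assigned twice, first LINE and then BlobDeserializer. Flume agent files are loaded as Java properties, where the last occurrence of a duplicate key wins, so the BlobDeserializer line silently overrides the LINE deserializer. If line-oriented reading is intended, only one assignment should remain, for example:

agent1.sources.c1.deserializer = LINE 
agent1.sources.c1.deserializer.maxLineLength = 50000 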

hdfs.serializer.appendNewline did not fix the issue.
Can anyone take a look and advise?

Answers


Replace the following line in your Flume agent:

agent1.sinks.c1s1.serializer.appendNewline = false 

with the line below, and let me know how it goes.

agent1.sinks.c1s1.hdfs.serializer.appendNewline = false 

Thanks Rajesh for looking into this. I am still getting a file size difference, as shown below. – kasi


I am still getting a file size difference:
biadmin@bivm:~/Desktop/work/flume-test/sourcedata1> hadoop fs -copyToLocal hdfs://bivm.ibm.com:9000/user/biadmin/flume /home/biadmin/Desktop/work/flume-test/sourcedata1/TermDefinition.csv.1452750041843.csv 
biadmin@bivm:~/Desktop/work/flume-test/sourcedata1> ls -l 
total 8 
-rw-r--r-- 1 biadmin biadmin 754 Jan 14 00:42 TermDefinition.csv.1452750041843.csv 
-rwxrw-rw- 1 biadmin biadmin 767 Jan 14 00:06 TermDefinition.csv.COMPLETED 
– kasi


My observation is that the file size differs because the EOL is being changed from CRLF to LF. I cannot share my conf file here because this comment section limits the number of characters. Can you suggest how to resolve this? – kasi
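A note on the CRLF issue: the built-in "text" serializer (BodyTextEventSerializer) writes the event body followed by a bare \n when appendNewline is true, so a CR lost on the way in never comes back. Incidentally, the 13-byte gap in the ls output above (767 vs. 754 bytes) is consistent with one dropped CR per line for a 13-line file. If CRLF must be preserved end-to-end, one possible workaround (not suggested in the thread) is a custom EventSerializer that appends \r\n; a sketch, with a made-up class and package name:

package com.example.flume; // hypothetical package, not from the thread

import java.io.IOException;
import java.io.OutputStream;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.serialization.EventSerializer;

// Sketch of a serializer that writes CRLF after each event body instead of
// the bare LF appended by the built-in "text" serializer.
public class CrlfTextEventSerializer implements EventSerializer {

  private final OutputStream out;

  private CrlfTextEventSerializer(OutputStream out) {
    this.out = out;
  }

  @Override public void afterCreate() throws IOException { }  // no file header

  @Override public void afterReopen() throws IOException { }  // nothing to restore

  @Override
  public void write(Event event) throws IOException {
    out.write(event.getBody());  // event body as delivered by the channel
    out.write('\r');             // restore CR
    out.write('\n');             // then LF
  }

  @Override public void flush() throws IOException { out.flush(); }

  @Override public void beforeClose() throws IOException { }  // no trailer

  @Override public boolean supportsReopen() { return true; }

  public static class Builder implements EventSerializer.Builder {
    @Override
    public EventSerializer build(Context context, OutputStream out) {
      return new CrlfTextEventSerializer(out);
    }
  }
}

Compiled into a jar on the agent's classpath, it would be wired in with agent1.sinks.c1s1.serializer = com.example.flume.CrlfTextEventSerializer$Builder (again, class and package names are illustrative).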


Replace

agent1.sinks.c1s1.hdfs.serializer = text 
agent1.sinks.c1s1.hdfs.serializer.appendNewline = false 

with

agent1.sinks.c1s1.serializer = text 
agent1.sinks.c1s1.serializer.appendNewline = false 

The difference is that the serializer settings are not prefixed with hdfs.; they are set directly on the sink name.
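Put together, the sink stanza from the question would then look like this (only the serializer keys move; everything else is unchanged from the question's config):

agent1.sinks.c1s1.type = hdfs 
agent1.sinks.c1s1.channel = ch1 
agent1.sinks.c1s1.hdfs.path = hdfs://bivm.ibm.com:9000/user/biadmin/flume/%y-%m-%d/%H%M 
agent1.sinks.c1s1.hdfs.fileType = DataStream 
agent1.sinks.c1s1.hdfs.writeFormat = Text 
agent1.sinks.c1s1.serializer = text 
agent1.sinks.c1s1.serializer.appendNewline = false 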

The Flume documentation could use some examples of this; I ran into the same problem because I had not noticed that the serializer settings sit at a different level in the property names.

More information about the HDFS sink can be found here: https://flume.apache.org/FlumeUserGuide.html#hdfs-sink