2016-01-13 74 views

SparkCLR: processing a text file fails

I am trying to learn SparkCLR by processing a text file and running a Spark SQL query on it, using a sample like the one below:

[Sample] 
internal static void MyDataFrameSample() 
{ 
    var schemaTagValues = new StructType(new List<StructField> 
           { 
            new StructField("tagname", new StringType()), 
            new StructField("time", new LongType()), 
            new StructField("value", new DoubleType()), 
            new StructField("confidence", new IntegerType()), 
            new StructField("mode", new IntegerType()) 
           }); 

    var rddTagValues1 = SparkCLRSamples.SparkContext.TextFile(SparkCLRSamples.Configuration.GetInputDataPath(myDataFile)) 
     .Map(r => r.Split('\t') 
      .Select(s => (object)s).ToArray()); 
    var dataFrameTagValues = GetSqlContext().CreateDataFrame(rddTagValues1, schemaTagValues); 
    dataFrameTagValues.RegisterTempTable("tagvalues"); 
    //var qualityFilteredDataFrame = GetSqlContext().Sql("SELECT tagname, value, time FROM tagvalues where confidence > 85"); 
    var qualityFilteredDataFrame = GetSqlContext().Sql("SELECT * FROM tagvalues"); 
    var data = qualityFilteredDataFrame.Collect(); 

    var filteredCount = qualityFilteredDataFrame.Count(); 
    Console.WriteLine("Filter By = 'confidence', RowsCount={0}", filteredCount); 
} 
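As an aside for anyone comparing the schema against the map function above: `TextFile` yields strings, while the schema declares `LongType`, `DoubleType`, and `IntegerType` columns. Whether that mismatch is the cause of the failure here is an assumption, but a sketch of a map step that parses each field into its declared column type (reusing `myDataFile` and the SparkCLR calls from the sample) would look like this:

```csharp
// Hypothetical variant of the Map lambda above: instead of passing raw
// strings for every column, parse each tab-separated field into the CLR
// type matching the declared schema column before boxing it.
var rddTagValues1 = SparkCLRSamples.SparkContext
    .TextFile(SparkCLRSamples.Configuration.GetInputDataPath(myDataFile))
    .Map(r =>
    {
        var f = r.Split('\t');
        return new object[]
        {
            f[0],                 // tagname    (StringType)
            long.Parse(f[1]),     // time       (LongType)
            double.Parse(f[2]),   // value      (DoubleType)
            int.Parse(f[3]),      // confidence (IntegerType)
            int.Parse(f[4])       // mode       (IntegerType)
        };
    });
```
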

But this keeps failing with the following error:

[2016-01-13 08:56:28,593] [8] [ERROR] [Microsoft.Spark.CSharp.Interop.Ipc.JvmBridge] - JVM method execution failed: Static method collectAndServe failed for class org.apache.spark.api.python.PythonRDD when called with 1 parameters ([Index=1, Type=JvmObjectReference, Value=19],) 
    [2016-01-13 08:56:28,593] [8] [ERROR] [Microsoft.Spark.CSharp.Interop.Ipc.JvmBridge] - 
    ******************************************************************************************************************************* 
     at Microsoft.Spark.CSharp.Interop.Ipc.JvmBridge.CallJavaMethod(Boolean isStatic, Object classNameOrJvmObjectReference, String methodName, Object[] parameters) in d:\SparkCLR\csharp\Adapter\Microsoft.Spark.CSharp\Interop\Ipc\JvmBridge.cs:line 91 
    ******************************************************************************************************************************* 

My text file looks like this:

10PC1008.AA 130908762000000000   7.059829 100 0 
10PC1008.AA 130908762050000000   7.060376 100 0 
10PC1008.AA 130908762100000000   7.059613 100 0 
10PC1008.BB 130908762150000000   7.059134 100 0 
10PC1008.BB 130908762200000000   7.060124 100 0 

Is there something wrong with the way I am using this?

Edit 1

I have set up the properties of my sample project as follows:

(screenshot: sample project properties)

My user environment variables are as follows (not sure whether this matters):

(screenshot: user environment variables)

Also, the SparkCLRWorker log shows that it failed to load an assembly:

[2016-01-14 08:37:01,865] [1] [ERROR] [Microsoft.Spark.CSharp.Worker] - System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. 
---> System.IO.FileNotFoundException: Could not load file or assembly 'SparkCLRSamples, Version=1.5.2.0, Culture=neutral, PublicKeyToken=null' or one of its dependencies. The system cannot find the file specified. 
     at System.Reflection.RuntimeAssembly._nLoad(AssemblyName fileName, String codeBase, Evidence assemblySecurity, RuntimeAssembly locationHint, StackCrawlMark& stackMark, IntPtr pPrivHostBinder, Boolean throwOnFileNotFound, Boolean forIntrospection, Boolean suppressSecurityChecks) 
     at System.Reflection.RuntimeAssembly.InternalLoadAssemblyName(AssemblyName assemblyRef, Evidence assemblySecurity, RuntimeAssembly reqAssembly, StackCrawlMark& stackMark, IntPtr pPrivHostBinder, Boolean throwOnFileNotFound, Boolean forIntrospection, Boolean suppressSecurityChecks) 
     at System.Reflection.RuntimeAssembly.InternalLoad(String assemblyString, Evidence assemblySecurity, StackCrawlMark& stackMark, IntPtr pPrivHostBinder, Boolean forIntrospection) 
     at System.Reflection.RuntimeAssembly.InternalLoad(String assemblyString, Evidence assemblySecurity, StackCrawlMark& stackMark, Boolean forIntrospection) 
     at System.Reflection.Assembly.Load(String assemblyString) 
     at System.Runtime.Serialization.FormatterServices.LoadAssemblyFromString(String assemblyName) 
     at System.Reflection.MemberInfoSerializationHolder..ctor(SerializationInfo info, StreamingContext context) 
     --- End of inner exception stack trace --- 
     at System.RuntimeMethodHandle.SerializationInvoke(IRuntimeMethodInfo method, Object target, SerializationInfo info, StreamingContext& context) 
     at System.Runtime.Serialization.ObjectManager.CompleteISerializableObject(Object obj, SerializationInfo info, StreamingContext context) 
     at System.Runtime.Serialization.ObjectManager.FixupSpecialObject(ObjectHolder holder) 
     at System.Runtime.Serialization.ObjectManager.DoFixups() 
     at System.Runtime.Serialization.Formatters.Binary.ObjectReader.Deserialize(HeaderHandler handler, __BinaryParser serParser, Boolean fCheck, Boolean isCrossAppDomain, IMethodCallMessage methodCallMessage) 
     at System.Runtime.Serialization.Formatters.Binary.BinaryFormatter.Deserialize(Stream serializationStream, HeaderHandler handler, Boolean fCheck, Boolean isCrossAppDomain, IMethodCallMessage methodCallMessage) 
     at System.Runtime.Serialization.Formatters.Binary.BinaryFormatter.Deserialize(Stream serializationStream) 
     at Microsoft.Spark.CSharp.Worker.Main(String[] args) in d:\SparkCLR\csharp\Worker\Microsoft.Spark.CSharp\Worker.cs:line 149 

Answers


Did you specify the location of the sample data and copy your source text file to that location? If not, you can refer to

https://github.com/Microsoft/SparkCLR/blob/master/csharp/Samples/Microsoft.Spark.CSharp/samplesusage.md

which explains how to set the sample data location with the [--data | sparkclr.sampledata.loc] parameter.


Yes, my command-line arguments are like this: `--torun "MyDataFrameSample" --data D:\SparkCLR\build\run\data`, and the file exists there. The log shows this: `16/01/13 12:14:14 INFO HadoopRDD: Input split: file:/D:/SparkCLR/build/run/data/data_small.txt:0+75981` – Kiran


Try explicitly setting the [--temp | spark.local.dir] option (refer to samplesusage.md for more information on the supported parameters). The SparkCLR worker executable gets downloaded to this directory during execution. If you use the default temp directory, the worker executable may get quarantined by your antivirus software, which can mistake it for some malware downloaded by your browser. Overriding the default with something like c:\temp\SparkCLRTemp will help avoid that problem.

If setting the temp directory does not help, please share the entire list of command-line arguments you are using to launch your SparkCLR driver code.
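For example, the full invocation might look like the sketch below. This is an assumption pieced together from the thread: `--torun` and `--data` come from the comment above, `--temp` from this answer, the executable name from the worker log, and the paths are placeholders:

```shell
:: Sketch of a samples invocation with an explicit temp directory.
:: --torun : the sample method to run
:: --data  : the sample data location
:: --temp  : directory the worker executable is downloaded to
SparkCLRSamples.exe --torun "MyDataFrameSample" ^
  --data D:\SparkCLR\build\run\data ^
  --temp C:\temp\SparkCLRTemp
```
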


I have updated my initial post with more details under 'Edit 1' and tried setting --temp as you suggested, but I still cannot get it to work. Is there anything else I can look at? Also, any idea why I am seeing `FileNotFoundException: Could not load file or assembly 'SparkCLRSamples'`? – Regards – Kiran


It looks like you are trying to run SparkCLR in debug mode. Refer to https://github.com/Microsoft/SparkCLR/blob/master/notes/windows-instructions.md#debugging-tips for instructions. As explained there, you need to set the CSharpBackendPortNumber and CSharpWorkerPath configuration values. – skaarthik


I followed those debugging instructions and it now works without issues. – Thanks – Kiran


This is how you change the port number; I hope it helps. Add the following to your app.config. For completeness, you must also add the key that specifies the CSharpWorker path:

<appSettings> 
    <add key="CSharpBackendPortNumber" value="num"/> 
    <add key="CSharpWorkerPath" value="C:\MobiusRelease\samples\CSharpWorker.exe"/> 
</appSettings> 

Notice the path in the CSharpWorkerPath entry. To make this work in debug mode, you should first run the following command from the %SPARKCLR_HOME%\scripts directory (under the Mobius home):

sparkclr-submit.cmd debug 

This will give you a message like the following, which contains the port number:

[CSharpRunner.main] Port number used by CSharpBackend is 5567
*[CSharpRunner.main] Backend running in debug mode. Press enter to exit.*