AWS Glue takes a long time to finish

I am just running a very simple job, shown below:

from __future__ import print_function  # keeps print() working on Python 2 Glue jobs

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Load the table from the Glue Data Catalog
l_table = glueContext.create_dynamic_frame.from_catalog(
    database="gluecatalog",
    table_name="fctable")

# Drop the unused fields and rename tbl_code to table_code
l_table = l_table.drop_fields(
    ['seq', 'partition_0', 'partition_1', 'partition_2', 'partition_3']
).rename_field('tbl_code', 'table_code')

print("Count: ", l_table.count())
l_table.printSchema()
l_table.select_fields(['trans_time']).toDF().distinct().show()

# Flatten the nested array into relational tables, staged under the S3 path
dfc = l_table.relationalize("table_root", "s3://my-bucket/temp/")
print("Before keys() call ")
dfc.keys()
print("After keys() call ")
l_table.select_fields(['table']).printSchema()
dfc.select('table_root_table').toDF().where("id = 1 or id = 2").orderBy(['id', 'index']).show()
dfc.select('table_root').toDF().where("table = 1 or table = 2").show()

The data structure is also very simple:

root 
|-- table: array 
| |-- element: struct 
| | |-- trans_time: string 
| | |-- seq: null 
| | |-- operation: string 
| | |-- order_date: string 
| | |-- order_code: string 
| | |-- tbl_code: string 
| | |-- ship_plant_code: string 
|-- partition_0 
|-- partition_1 
|-- partition_2 
|-- partition_3

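When I run the job as a test, it takes anywhere from 12 to 16 minutes to finish. But the CloudWatch logs show that the job needed only 2 seconds to display all my data.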

So my question is: where does the AWS Glue job spend the time that the logs do not account for, and what work is it doing outside the logged period?
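
One way to check how much time the script itself uses is to time each step and print the result, so it shows up in the job's log output. The following is only a minimal sketch: the `timed` helper and its `sink` parameter are hypothetical names introduced for illustration, and it is written for Python 3.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, sink=print):
    """Report how many seconds the wrapped block took (hypothetical helper)."""
    start = time.time()
    try:
        yield
    finally:
        # Runs even if the wrapped block raises, so the timing is always reported
        sink("{}: {:.2f}s".format(label, time.time() - start))


# Hypothetical usage inside the job script:
# with timed("count"):
#     print("Count: ", l_table.count())
```

If the timings printed this way add up to only a few seconds, the remaining minutes are being spent outside the script.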

2017-08-29 Shawn

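It takes time to set up the environment that allows the code to run. I had the same problem and contacted the AWS Glue team, who were helpful. The reason it takes so long is that Glue builds an environment when you run the first job, and that environment stays alive for 1 hour. If you run the same script again, or any other script, within that hour, the next job takes significantly less time. They call the first run a cold start. My first job took 17 minutes; I ran the same job again right after it finished, and it took only 3 minutes.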
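
To see how much of a run falls outside the script, you can compare the run's wall-clock duration with the `ExecutionTime` that Glue reports for it. This is a rough sketch, not a definitive measurement: it assumes that the gap between the two roughly reflects environment setup, which the thread does not confirm. `unaccounted_seconds` and `fetch_job_run` are hypothetical helper names; `get_job_run` and the `StartedOn`, `CompletedOn`, and `ExecutionTime` fields come from the boto3 Glue client.

```python
def unaccounted_seconds(job_run):
    """Wall-clock duration minus the reported ExecutionTime, in seconds.

    `job_run` is a dict shaped like the "JobRun" entry that
    boto3's glue.get_job_run() returns.
    """
    wall = (job_run["CompletedOn"] - job_run["StartedOn"]).total_seconds()
    return wall - job_run["ExecutionTime"]


def fetch_job_run(job_name, run_id):
    """Fetch one job run; needs boto3 and AWS credentials (hypothetical helper)."""
    import boto3  # imported lazily so the function above works without it

    glue = boto3.client("glue")
    return glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
```

Running this for a cold run and for a second run started shortly afterwards should make the difference described above visible, if the assumption holds.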

2017-10-23 20:03:43

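When you edit the job, you can add more DPUs under the "Script libraries and job parameters (optional)" section. It helps somewhat, but in my experience you should not expect any major improvement.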
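
The same setting can also be passed when starting a run from code instead of the console. A minimal sketch, assuming boto3 and a job that accepts the `MaxCapacity` parameter of `start_job_run`; `job_run_args` and `start_with_capacity` are hypothetical helper names, and the job name is up to you.

```python
def job_run_args(job_name, dpus):
    """Build the keyword arguments for glue.start_job_run() with a DPU setting."""
    return {"JobName": job_name, "MaxCapacity": float(dpus)}


def start_with_capacity(job_name, dpus):
    """Start the job with the requested DPUs; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so job_run_args() works without it

    glue = boto3.client("glue")
    return glue.start_job_run(**job_run_args(job_name, dpus))["JobRunId"]
```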

2017-12-05 23:35:33 Jie
