2017-06-16
0

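google-cloud-dataflow vs apache-beam

Confusingly, every piece of Google documentation for Dataflow says it is now based on Apache Beam and directs me to the Beam website. Likewise, if I look for the GitHub project, the Google Dataflow project is empty and everything points to the Apache Beam repo. Now say I need to create a pipeline. From what I read on the Apache Beam site, I would write: from apache_beam.options.pipeline_options. However, if I use google-cloud-dataflow, I get the error: no module named 'options'. It turns out I should use from apache_beam.utils.pipeline_options instead. So it looks like google-cloud-dataflow ships an older Beam version and is going to be deprecated?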

Which one should I choose for developing my Dataflow pipelines?
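For code that has to run under either package, the import difference described above can be bridged with a fallback. The following is only a minimal sketch: it assumes the class being imported is PipelineOptions (the question's import line is cut off, so this is a guess), and the helper name import_first_available is made up for illustration.

```python
import importlib


def import_first_available(module_names):
    """Return the first module in module_names that can be imported, else None."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Module (or one of its parent packages) is not available; try the next one.
            continue
    return None


# Newer Beam layout first, then the older layout mentioned in the question.
CANDIDATES = [
    "apache_beam.options.pipeline_options",
    "apache_beam.utils.pipeline_options",
]

module = import_first_available(CANDIDATES)
if module is None:
    print("No Beam SDK found; install apache-beam or google-cloud-dataflow.")
else:
    # Assumption: PipelineOptions is the class the question was importing.
    PipelineOptions = module.PipelineOptions
    print("Using PipelineOptions from", module.__name__)
```

Trying the newer module path first means the fallback only kicks in on an older SDK, so the shim can be deleted once the older package is no longer in use.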

Answers

1

I ended up finding the answer in the Google Dataflow Release Notes:

The Cloud Dataflow SDK distribution contains a subset of the Apache Beam ecosystem. This subset includes the necessary components to define your pipeline and execute it locally and on the Cloud Dataflow service, such as:

  • The core SDK
  • DirectRunner and DataflowRunner
  • I/O components for other Google Cloud Platform services

The Cloud Dataflow SDK distribution does not include other Beam components, such as:

  • Runners for other distributed processing engines

  • I/O components for non-Cloud Platform services

Version 2.0.0 is based on a subset of Apache Beam 2.0.0
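Since the two distributions track different Beam versions, it can help to see which one is actually installed in the current environment. This is a rough sketch using the standard library's importlib.metadata (Python 3.8 or later); the helper name installed_version is hypothetical, and the distribution names are the ones used on PyPI in this thread.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Report on both distributions discussed in this thread.
for name in ("apache-beam", "google-cloud-dataflow"):
    print(name, installed_version(name) or "not installed")
```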

0

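Yes, I ran into this issue recently when testing outside of GCP. This link helps determine what you need when it comes to apache-beam. If you run the command below, you will have no GCP components.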

$ pip install apache-beam

If you run this one instead, however, you will have all the cloud components.

$ pip install apache-beam[gcp]
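To check from Python which of the two installs you ended up with, one option is to probe for a module that only the GCP extra is expected to bring in. This is a sketch under a stated assumption: that the [gcp] extra installs the apitools package, which was not confirmed in this thread. The helper name missing_modules and the probe list are illustrative only.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be located for import."""
    missing = []
    for name in module_names:
        try:
            spec = importlib.util.find_spec(name)
        except ModuleNotFoundError:
            # Raised for dotted names whose parent package is absent.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


# Assumption: "apitools" is installed by apache-beam[gcp] but not by plain apache-beam.
GCP_PROBE_MODULES = ["apache_beam", "apitools"]

missing = missing_modules(GCP_PROBE_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("Beam and the probed GCP dependency are importable.")
```

If the probe module turns out to be wrong for your Beam version, swap in any module you know the GCP extra provides.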

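As an aside, I use the Anaconda distribution for almost all of my Python coding and package management. As of 7/20/17, you cannot use the Anaconda repos to install the necessary GCP components. I am hoping to work with the Continuum folks to get this resolved, not just for Apache Beam but also for TensorFlow.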
