GAE Python LXML - Exceeded soft memory limit

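I am reading a gzipped XML file with lxml and trying to write the product entries to a Datastore model. I previously had local memory problems, which were solved with help here on SO (question). Now everything works locally and I have deployed it, but on the server I get the following error: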

Exceeded soft private memory limit with 158.164 MB after servicing 0 requests total

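I have tried everything I know to reduce memory usage, and I am currently using the code below. The gzipped file is about 7 MB, and 80 MB uncompressed. Locally the code works fine. I tried running it as an HTTP request as well as a cron job, but it made no difference. Now I am wondering whether there is a way to make it more efficient.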

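Some similar questions on SO refer to frontend and backend specifications, which I am not familiar with. I am running the free version of GAE, and this task has to run once a week. Any suggestions on the best way forward would be much appreciated.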

from google.appengine.api.urlfetch import fetch 
import gzip, base64, StringIO, datetime, webapp2 
from lxml import etree 
from google.appengine.ext import db 

class GetProductCatalog(webapp2.RequestHandler):
    def get(self):
        user = XXX
        password = YYY
        url = 'URL'

        # fetch the gzipped file
        catalogResponse = fetch(url, headers={
            "Authorization": "Basic %s" % base64.b64encode(user + ':' + password)
        }, deadline=10000000)

        # the response content is in catalogResponse.content
        # un-gzip the file
        f = StringIO.StringIO(catalogResponse.content)
        c = gzip.GzipFile(fileobj=f)
        content = c.read()

        # create something readable by lxml
        xml = StringIO.StringIO(content)

        # delete unnecessary variables
        del f
        del c
        del content

        # parse the file, one <product> element at a time
        tree = etree.iterparse(xml, tag='product')

        for event, element in tree:
            if element.findtext('manufacturer') == 'New York':
                sku = element.findtext('sku')
                last_updated = datetime.datetime.strptime(
                    element.findtext('lastupdated'), "%d/%m/%Y")

                # look the product up once; create it if it does not exist yet
                coupon = Product.get_by_key_name(sku)
                is_new = coupon is None
                if is_new:
                    coupon = Product(key_name=sku)

                # write only new products or ones whose provider date changed
                if is_new or coupon.last_update_prov != last_updated:
                    coupon.restaurant_name = element.findtext('name')
                    coupon.restaurant_id = ''
                    coupon.address_street = element.findtext('keywords').split(',')[0]
                    coupon.address_city = element.findtext('manufacturer')
                    coupon.address_state = element.findtext('publisher')
                    coupon.address_zip = element.findtext('manufacturerid')
                    coupon.value = '$' + element.findtext('price') + ' for $' + element.findtext('retailprice')
                    coupon.restrictions = element.findtext('warranty')
                    coupon.url = element.findtext('buyurl')
                    coupon.active = element.findtext('instock') == 'YES'
                    coupon.last_update_prov = last_updated
                    coupon.put()

            # release the parsed element's contents
            element.clear()

UPDATE

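Following Paul's suggestion, I implemented the backend. After some trouble it works like a charm. The configuration I used is below.

My backends.yaml looks like this: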

backends:
- name: mybackend
  instances: 10
  start: mybackend.app
  options: dynamic

And my app.yaml like this:

handlers:
- url: /update/mybackend
  script: mybackend.app
  login: admin

2013-02-22 Vincent

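How often do you need to import data from the XML file? If it is only occasional, you may find it easier to use remote_api, process the file locally, and write directly to the datastore. Then the file can be as large as your local machine can handle. – 2013-02-23 11:36:05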

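Also note that 'del c' will probably do nothing unless you explicitly call gc.collect(), as things probably won't be collected for quite a while. Also, looking at your code, you hold the file that was read (the first StringIO), xml (a StringIO-wrapped copy of what came out of c), and then the parsed tree. You say the file is 80 MB uncompressed, so there is at least one full copy in memory before you have even added the tree. You could consider a pull-parsing strategy, which would mean you do not hold a fully parsed tree in memory as well as the string. – 2013-02-23 11:41:44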
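As a rough illustration of the pull-parsing idea in the comment above, the sketch below hands the GzipFile object straight to iterparse, so the fully decompressed string and its StringIO copy are never built. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml (lxml.etree.iterparse also accepts a file-like object). The function name iter_products and the assumption that every <product> is a direct child of the root element are mine, not from the question.

```python
import gzip
import io
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree


def iter_products(gzipped_bytes, tag='product'):
    """Yield one {child_tag: text} dict per <product> element.

    Hypothetical helper: decompression and parsing both happen
    incrementally, so only the compressed bytes and the element
    currently being processed are held in memory.
    """
    compressed = io.BytesIO(gzipped_bytes)
    with gzip.GzipFile(fileobj=compressed) as stream:
        context = ET.iterparse(stream, events=('start', 'end'))
        # The first event is the start of the root element.
        _, root = next(context)
        for event, element in context:
            if event == 'end' and element.tag == tag:
                yield dict((child.tag, child.text) for child in element)
                # Assumes <product> is a direct child of the root:
                # dropping the root's children releases finished products.
                root.clear()
```

In the handler this could be driven with iter_products(catalogResponse.content). The compressed payload (about 7 MB here) still has to sit in memory, since urlfetch returns the whole response body at once.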

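Thanks for your comments, Tim. Yes, I did consider the remote_api option, but at some point this script will run daily, which is why I chose the current setup. I will look into your suggestion of a pull-parsing strategy and see whether it improves performance. Thanks again! – Vincent 2013-02-23 16:14:59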

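Backends are like frontend instances, but they do not scale, and you have to stop and start them as you need them (or set them to be dynamic, which is probably your best bet here).

You can have up to 1024 MB of memory in a backend, so it should work fine for your task.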

https://developers.google.com/appengine/docs/python/backends/overview

App Engine Backends are instances of your application that are exempt from request deadlines and have access to more memory (up to 1GB) and CPU (up to 4.8GHz) than normal instances. They are designed for applications that need faster performance, large amounts of addressable memory, and continuous or long-running background processes. Backends come in several sizes and configurations, and are billed for uptime rather than CPU usage.

A backend may be configured as either resident or dynamic. Resident backends run continuously, allowing you to rely on the state of their memory over time and perform complex initialization. Dynamic backends come into existence when they receive a request, and are turned down when idle; they are ideal for work that is intermittent or driven by user activity. For more information about the differences between resident and dynamic backends, see Types of Backends and also the discussion of Startup and Shutdown.

This sounds like what you need. The free usage tier should also be enough for your task.

2013-02-22 12:24:58

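Thanks a lot, Paul. I will give it a try and post feedback once I have it working. – Vincent 2013-02-22 12:31:28

So I tried to implement it, but it does not recognize backends.yaml. I have updated the question description. Am I missing something obvious? I will do more testing later today. Thanks for your help! – Vincent 2013-02-22 13:04:02

That is a whole other question. I suggest you start a separate question and/or check some of the similar questions. For example, have you started the backend? – 2013-02-22 13:59:55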

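Regarding backends: looking at the example you provided, it seems your request is simply being handled by a frontend instance.

To have it processed by the backend, try calling the task like this instead: http://mybackend.my_app_app_id.appspot.com/update/mybackend

Also, I think you can remove: from your backends.yaml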
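The addressing scheme in the URL above (backend name, then application ID, then appspot.com) can be sketched as a small helper. The function name backend_url is hypothetical, and the pattern is taken only from the example URL in this answer.

```python
def backend_url(backend, app_id, path):
    """Build the URL that routes a request to a named backend.

    Hypothetical helper; the host pattern follows the example URL
    in the answer: http://<backend>.<app_id>.appspot.com<path>
    """
    return 'http://%s.%s.appspot.com%s' % (backend, app_id, path)


# The handler from app.yaml, addressed to the backend rather than a frontend
update_url = backend_url('mybackend', 'my_app_app_id', '/update/mybackend')
```

A request sent to the plain application host would, per the answer, be served by a frontend instance instead.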

2013-02-22 13:33:17 stachern

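Thanks for the suggestion, stachern. It turns out I had to start the backend via "appcfg backends update [backend]". – Vincent 2013-02-22 16:38:17

You are right, a regular deployment does not affect backends, which are basically treated as a separate version of the application. – stachern 2013-02-22 20:47:17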
