2016-12-24

Scrapy: how do I clean up the response?

This is my code snippet. I am trying to scrape a website with Scrapy and then store the data in Elasticsearch for indexing.

def parse(self, response):
    for news in response.xpath('head'):
        yield {
            'pagetype': news.xpath('//meta[@name="pagetype"]/@content').extract(),
            'description': news.xpath('//div[@class="module__content"]/*/node()/text()').extract(),
        }

My problem is with the value that ends up in the 'description' field:

[u'\n    \n    ', u'"For\n    many of us what we eat on Christmas day isn\'t what we would usually consume and\n    that\u2019s perfectly ok," Dr said.', u'"However\n    it is not uncommon for festive season celebrations to begin in November and\n    continue well in to the New Year.', u'"So\n    if health is on the agenda, being mindful about what we put into our bodies\n    with a balanced approach, throughout the whole festive season, is important."', u"Dr\n    , a lecturer at School\n    Sciences, said balancing fresh, healthy food with being physically active was a\n    good start.", u'"Whatever\n    the celebration, try to limit processed foods, often high in fat, sugar and\n    salt," she said.', u'"Taking\n    time during holidays to prepare food and make the most of fresh ingredients is\n    often a much healthier option than relying on convenience foods and take away.', u'"Being\n    mindful about going back for seconds is important too.\xa0 We don\u2019t need to eat until we feel\n    uncomfortable and eating the foods we enjoy doesn\'t necessarily mean we need to\n    eat copious amounts."', u"Dr\n    own healthy tips and substitutes for the Christmas season\n    include:", u'But\n    just because Dr is a dietitian, doesn\u2019t mean she doesn\u2019t enjoy a\n    Christmas treat or two.', u'"I\n    would have to say my sister in law\'s homemade rocky road is my favourite\n    festive treat. She makes it every Christmas day and it gets better each year," she\n    said.', u'"I\n    also enjoy a summer cocktail every so often during the festive season and a\n    mojito would be one of my favourites on Christmas day. We make it with extra\n    mint from the garden which is a nice, fresh addition.', u'"Rather\n    than focusing on food avoidance, moderation is the best approach.', u'"There\n    are definitely some more healthy choices and some less healthy options when it\n    comes to the typical Christmas day menu, but it\'s more important to be mindful\n    of a healthy, balanced diet throughout the festive period, rather than avoiding\n    specific foods on one day of the year."', u'\n    ', u'\n    \n    ', u'\n    ', u'\n    \n    ', u'\n    ', u'\n    ', u'\n      ', u'\n      ', u'\n      ', u'\n     ', u'\n   ', u'Related News', u'\n   ', u'\n  ', u'\n   ', u'\n  ', u'\n   ', u'\n  ', u'Search for related news']

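There are a lot of spaces, newline codes, and 'u' letters in it.

How can I process this further so that it contains only plain text, free of the extra spaces, newline (\n) codes, and 'u' letters?

I read that BeautifulSoup works well with Scrapy, but I could not find any example of how to integrate the two. I am also open to any other approach. Any help is much appreciated.

Thanks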

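Related: http://stackoverflow.com/q/21839877/4063051 – glS

The 'u' only tells you that the list holds unicode text. If you print a single element from the list, you will see the text without the 'u'. – furas

So, to be clear, you just want to remove the newlines and spaces from these strings? – glS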

Answers

For example, you can use the method shown in this answer to strip the whitespace and newlines from the strings in the list:

[' '.join(item.split()) for item in list_of_strings] 

where list_of_strings is the list of strings from your example.
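As a minimal sketch of how that one-liner behaves, the snippet below applies it to a shortened sample of the list from the question. The helper name `clean_strings` and the extra step of dropping entries that end up empty are assumptions added for illustration; they are not part of the answer above.

```python
def clean_strings(list_of_strings):
    """Collapse runs of whitespace in each string and drop empty results."""
    # str.split() with no arguments splits on any run of whitespace,
    # so newlines and repeated spaces disappear when the words are rejoined.
    cleaned = [' '.join(item.split()) for item in list_of_strings]
    # Hypothetical extra step: discard entries that were whitespace only.
    return [item for item in cleaned if item]


# Shortened sample taken from the output shown in the question.
sample = [
    u'\n    \n    ',
    u'"Whatever\n    the celebration, try to limit processed foods, often high in fat, sugar and\n    salt," she said.',
    u'\n   ',
    u'Related News',
]

result = clean_strings(sample)
for line in result:
    print(line)
```

Without the final filter, the whitespace-only entries would stay in the list as empty strings, which may or may not be what you want to index.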

As for the 'u' letters, you should not worry about them. They only indicate that the string is a unicode string. See, for example, this question on the matter.
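To see why the prefix is harmless, compare the representation of a string with what printing it produces. This is a small sketch using one element from the question; on Python 2 the representation carries the u prefix, while on Python 3 every string is unicode and no prefix is shown.

```python
fragment = u'Related News'

# repr() is the notation used when a whole list is displayed: the quotes
# (and, on Python 2, the u prefix) belong to this notation, not to the text.
shown_in_list = repr(fragment)
print(shown_in_list)

# Printing the element itself shows only the text.
print(fragment)

# The prefix does not change the value: both literals compare equal.
same = (u'Related News' == 'Related News')
print(same)
```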

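Thanks, how do I use this? I ran this in the Scrapy shell: ''.join(myString.split()) and got this error: _AttributeError: 'list' object has no attribute 'split'_ – Slyper

If you saved the list of strings from your question in a variable 'list_of_strings', you only need to run the code above to get the same list with the whitespace and newlines removed from its elements – glS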
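The error in the comment above comes from calling split() on the list itself; split() is a string method, so it has to be applied to each element. The sketch below reproduces the error and then merges all fragments into a single clean string. The name `clean_description` and the choice to merge everything into one string are assumptions for illustration, and `my_strings` is a hypothetical stand-in for the list returned by .extract().

```python
def clean_description(fragments):
    """Join a list of extracted text fragments into one clean string."""
    words = []
    for fragment in fragments:
        # split() is called on each string, never on the list itself.
        words.extend(fragment.split())
    # A single space as separator keeps the words apart; an empty
    # separator ('') would glue them together.
    return ' '.join(words)


# Hypothetical stand-in for the list returned by .extract().
my_strings = [
    u'\n    ',
    u'"Being\n    mindful about going back for seconds is important too.',
    u'\n  ',
    u'Related News',
]

# Reproduce the error from the comment: lists have no split() method.
try:
    my_strings.split()
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)

description = clean_description(my_strings)
print(description)
```

In the spider from the question, a helper like this could be wrapped around the extracted list before the item is yielded, for example 'description': clean_description(news.xpath(...).extract()).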