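Scraping a forum thread: how to compute follow-up relationships from the CSS margin property?

I have been struggling with this for a while.

The site I am scraping is an old-fashioned forum. On its index page, each thread sits in a <div> tag and each post in a <p> tag. A follow-up post is indented by a further 20px of left margin to show which post it replies to.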

<div> 
    <p style="margin:2px 0 17px 0px; width:705px"><a href="./6368972.html" class="post">original post</a>other stuff</p> 
    <p style="margin:2px 0 2px 20px; width:683px"><a href="./6368973.html" class="post">reply post</a>other stuff</p> 
    <p style="margin:2px 0 2px 40px; width:661px"><a href="./6368974.html" class="post">reply post</a>other stuff</p> 
    ... 
</div>

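I can already extract a lot of information from here, including the title, datetime, nickname and so on, but for the follow-up relationship I need your help with an efficient algorithm. Basically, I need to know which earlier post each post is a reply to.

My item has a field for the follow-up relationship, namely: reply_to = scrapy.Field(), which should store the URL of the post being replied to.

I can extract the left margin of every post in a thread with: margins = [int(m) for m in div.css('p::attr(style)').re(r'.* (\d+)px;.*')]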
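The margin extraction can be tried outside Scrapy on the raw style strings. The following is a minimal sketch using only the standard `re` module; the helper name `left_margin` and its pattern are illustrative, not part of the original post, and it assumes the four-value `margin: top right bottom left` shorthand shown in the HTML above.

```python
import re

# Last "<number>px" of the margin declaration, i.e. the left margin in the
# four-value shorthand "margin: top right bottom left" (an assumption here).
LEFT_MARGIN_RE = re.compile(r"margin:[^;]*?(\d+)px\s*(?:;|$)")


def left_margin(style):
    """Return the left margin in px from an inline style string, or None."""
    match = LEFT_MARGIN_RE.search(style)
    return int(match.group(1)) if match else None


styles = [
    "margin:2px 0 17px 0px; width:705px",
    "margin:2px 0 2px 20px; width:683px",
    "margin:2px 0 2px 40px; width:661px",
]
margins = [left_margin(s) for s in styles]
```

A style string with no margin declaration yields `None`, so a failed extraction is visible instead of raising inside `int()`.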

I can also count the posts in the div (i.e., how many messages there are in a thread).

But I really don't know where to go from here...

Thanks, everyone!

1---------------------  # left margin = 0px; original post 
2 -------------------  # left margin = 20px; reply to post 1 
3 -----------------  # left margin = 40px; reply to post 2 
4 -------------------  # left margin = 20px; reply to post 1, not 3 
5 -----------------  # left margin = 40px; reply to post 4, not 2 
6  ---------------  # left margin = 60px; reply to post 5

Source

2017-08-05 eN_Joy

It is not clear what exactly you want to do. Can you share the desired output for the HTML source you provided? – Andersson

This is untested, but might work:

parent = []  # stack of URLs of ancestor posts
for p in div.xpath('./p'):
    post = {}
    # do whatever extraction from post here -- title, datetime etc.
    # post['title'] = p.xpath(...)
    # ...
    post['url'] = p.xpath('./a/@href').extract_first()

    # the top of the stack is the post this one replies to
    post['reply_to'] = parent.pop() if parent else None
    margin = int(p.xpath('./@style').re_first(r'.* (\d+)px;.*'))

    next_p = p.xpath('./following-sibling::p[1]')
    if next_p:
        next_margin = int(next_p.xpath('./@style').re_first(r'.* (\d+)px;.*'))
        if next_margin > margin:
            # next post is a reply to this post
            if post['reply_to']:
                parent.append(post['reply_to'])
            parent.append(post['url'])
        elif next_margin == margin:
            # next post is a reply to direct parent post
            parent.append(post['reply_to'])
        else:
            # next post is a reply to some distant parent post
            # (integer division: range() needs an int in Python 3)
            for _ in range((margin - next_margin) // 20 - 1):
                parent.pop()

    yield post

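Basically, as you walk the thread tree, it uses a stack to store references to the parent posts. That way you don't have to search back and forth through the tree to find the post the current one replies to; you visit each node only once (well, twice, since you always look at the next sibling).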
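The same stack idea can be checked without Scrapy on plain (url, margin) pairs. The sketch below is a variant, not the code above: instead of peeking at the next sibling, it truncates the stack to the current depth. The function name `assign_reply_to` and the `indent` parameter are hypothetical, and it assumes margins are multiples of 20px and that no indentation level is skipped.

```python
def assign_reply_to(posts, indent=20):
    """Map each post to the URL of the post it replies to.

    posts: list of (url, left_margin_px) tuples in page order.
    Returns a list of (url, reply_to_url_or_None) tuples.
    """
    ancestors = []  # ancestors[d] is the most recent post seen at depth d
    result = []
    for url, margin in posts:
        depth = margin // indent
        # Discard every stacked post at this depth or deeper; what remains
        # is the chain of ancestors of the current post.
        del ancestors[depth:]
        reply_to = ancestors[-1] if ancestors else None
        result.append((url, reply_to))
        ancestors.append(url)
    return result


# Hypothetical URLs, with the margins of the six-post example in the question.
thread = [("1.html", 0), ("2.html", 20), ("3.html", 40),
          ("4.html", 20), ("5.html", 40), ("6.html", 60)]
relations = assign_reply_to(thread)
```

With the six-post example from the question, post 4 resolves to post 1 and post 5 to post 4, matching the diagram. Each post is pushed once and removed at most once, so the pass stays linear in the number of posts.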

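It could probably be simpler using XPath with regular expressions, but I think Scrapy selectors only use XPath 1.0, which does not support them. Correct me if I'm wrong.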

2017-08-05 19:54:11

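Honestly, I don't understand the underscore in 'for _ in range((margin - next_margin)/20 - 1):'... –

Works like a charm, exactly what I was looking for! For those who need this code: convert '((margin - next_margin)/20 - 1)' to an integer; you may also need to change yield post to something else, at least I had to when I tried this in the scrapy shell, before the code was added to my spider file. Thanks @TomášLinhart! –

@JindanZhou Regarding the underscore: I just need to loop a few times there, but I am not interested in the particular values produced by 'range'. The underscore is a valid variable identifier, and in this use case it serves as a dummy variable. See [here](https://hackernoon.com/understanding-the-underscore-of-python-309d1a029edc) for a better explanation. –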
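The two points from the comments above, the dummy variable `_` and the integer conversion, can be shown in isolation. This is a small sketch; the helper name `pop_levels` and the sample stack contents are hypothetical. In Python 3, `/` always returns a float and `range()` rejects floats, so `//` is used.

```python
def pop_levels(stack, margin, next_margin, indent=20):
    """Pop one stack entry per extra level the next post moves back up."""
    # Floor division keeps the count an int; (margin - next_margin) / indent
    # would be a float and range() would raise TypeError in Python 3.
    levels = (margin - next_margin) // indent - 1
    for _ in range(levels):  # "_" signals that the loop value is unused
        stack.pop()
    return stack


# Hypothetical stack contents: going from a 60px post to a 20px post
# removes one entry.
remaining = pop_levels(["post1", "post4", "post5"], margin=60, next_margin=20)
```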

You can use the re:test XPath expression to match the style attribute against a regular expression:

>[1]: sel.xpath('//p[re:test(@style,"margin[^;]+20px")]').extract() 
<[1]: ['<p style="margin:2px 0 2px 20px; width:683px"><a href="./6368973.html" class="post">reply post</a>other stuff</p>']

Breakdown of '//p[re:test(@style,"margin[^;]+20px")]':

//p - selects any <p> node
[re:test(@style,"margin[^;]+20px")] - tests whether the @style attribute matches the regular expression margin[^;]+20px.
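To see what the pattern accepts, it can be run against style strings with Python's `re` module. This is a sketch of the regular expression only, not of the XPath call, and it assumes `re:test` behaves like a search rather than a full match. The third style string and the `exact` pattern are hypothetical additions for illustration.

```python
import re

styles = [
    "margin:2px 0 17px 0px; width:705px",
    "margin:2px 0 2px 20px; width:683px",
    "margin:2px 0 2px 40px; width:620px",  # hypothetical width ending in 20px
]

strict = re.compile(r"margin[^;]+20px")  # stays inside the margin declaration
loose = re.compile(r"margin.+20px")      # may run on into later declarations

strict_hits = [s for s in styles if strict.search(s)]
loose_hits = [s for s in styles if loose.search(s)]

# Caveat: "20px" is also a suffix of "120px". A word boundary before the
# number is one way to rule that out in Python's re syntax.
exact = re.compile(r"margin[^;]*\b20px")
wide = "margin:2px 0 2px 120px; width:500px"
```

The `[^;]+` form matches only the 20px post, while `.+` also picks up the post whose width happens to end in 20px, which is why the semicolon-bounded pattern is the safer of the two.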

2017-08-05 19:28:06 Granitosaurus

This looks very promising; I'll play with it as soon as I'm at a computer. In the meantime, –
