Is there anything for Python that is like readability.js?

def link_list_discriminator(html, min_links=2, ratio=0.5):
    """Remove blocks with a high link-to-text ratio.

    These are typically navigation elements.

    Based on an algorithm described in:
        http://www.psl.cs.columbia.edu/crunch/WWWJ.pdf

    :param html: lxml.html tree (the code relies on xpath() and drop_tree()).
    :param min_links: Number of anchor text nodes an element must exceed
        before it is considered for deletion.
    :param ratio: Ratio of link text to all text above which an element is
        considered for deletion.
    """
    def collapse(strings):
        # Strip each text node and join the non-empty ones.
        return u''.join(filter(None, (text.strip() for text in strings)))

    # FIXME: This doesn't account for top-level text...
    for el in html.xpath('//*'):
        anchor_text = el.xpath('.//a//text()')
        anchor_count = len(anchor_text)  # text nodes inside <a>, not <a> tags
        anchor_len = float(len(collapse(anchor_text)))
        total_len = float(len(collapse(el.xpath('.//text()'))))
        # drop_tree() needs a parent, so never try to drop the root element.
        if (el.getparent() is not None and anchor_count > min_links
                and total_len and anchor_len / total_len > ratio):
            el.drop_tree()

It actually works quite well on the test corpus I'm using, but achieving high reliability will take a lot of tweaking.
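To make the link-to-text measure concrete, here is a minimal stdlib-only sketch of the same idea. It is not the function above: it assumes well-formed markup so that xml.etree.ElementTree can parse it, it counts <a> elements rather than anchor text nodes, and the name link_density and the sample markup are made up for illustration.

```python
import xml.etree.ElementTree as ET


def link_density(el):
    """Return (number of <a> elements, link-text length / all-text length)."""
    def collapse(strings):
        # Strip each text node and concatenate what is left.
        return "".join(s.strip() for s in strings)

    anchors = list(el.iter("a"))
    link_text = collapse(t for a in anchors for t in a.itertext())
    all_text = collapse(el.itertext())
    ratio = len(link_text) / len(all_text) if all_text else 0.0
    return len(anchors), ratio


# Hypothetical sample page: a navigation list and an article paragraph.
doc = ET.fromstring(
    "<body>"
    "<ul><li><a href='/'>Home</a></li><li><a href='/a'>About</a></li>"
    "<li><a href='/c'>Contact</a></li></ul>"
    "<p>A long paragraph of article text with a single "
    "<a href='/x'>link</a>.</p>"
    "</body>"
)

nav_count, nav_ratio = link_density(doc.find("ul"))
p_count, p_ratio = link_density(doc.find("p"))
print(nav_count, nav_ratio)  # the list is all link text
print(p_count, p_ratio)      # the paragraph is mostly plain text
```

With the defaults used above (min_links=2, ratio=0.5), the list would be flagged for removal and the paragraph would be kept.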


2010-05-29 07:20:03

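We just launched a new natural language processing API over at repustate.com. Using a REST API, you can clean any HTML or PDF and get back just the text parts. Our API is free, so feel free to use it to your heart's content. It's implemented in Python. Check it out and compare the results to readability.js - I think you'll find they're almost 100% the same.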


2010-05-31 19:47:57 Martin

Hmm, looks promising! ;-) I'll give it a try. Are there any hard limits? How many pages can I process per day? – 2010-06-01 07:47:50

Wow, I just entered some URLs on your site and it extracted the articles perfectly. – 2010-08-03 17:37:43

hn.py, via Readability's blog. Readable Feeds, an App Engine application, makes use of it.

I have bundled it as a pip-installable module here: http://github.com/srid/readability


2010-09-07 01:11:09

This looks like a very old version compared to what is available now: 0.4 vs. 1.7.1. Any chance of an update? – 2010-12-30 13:46:07
