2012-03-28 87 views

Answers

0

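I have been working with the Nutch codebase for the past two years, and as far as I know this is not possible. Once content has gone into a Nutch segment, you cannot strip out parts such as drop-down menus and navigation and keep only what you need.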

If you or anyone else knows how to do this (without modifying the code), please share.

1

Not sure whether you still need this, but if you do, you could try the blacklist_whitelist plugin, which is available at https://issues.apache.org/jira/browse/NUTCH-585.

The plugin lets you keep a list of elements to either block or allow, but not both. For example:

<property> 
    <name>parser.html.blacklist</name> 
    <value>noscript,div#footer</value>
    <description> 
    A comma-delimited list of CSS-like selectors identifying the elements that should
    NOT be parsed. Use this to tell the HTML parser to ignore the given elements, e.g. site navigation.
    Each entry must specify the element type (required) and may optionally add its class name ('.')
    or ID ('#'). More complex expressions will not be parsed.
    Valid examples: div.header,span,p#test,div#main,ul,div.footercol
    Invalid expressions: div#head#part1,#footer,.inner#post
    Note that the elements and their children will be silently ignored by the parser,
    so verify the indexed content with Luke to confirm results.
    Use either 'parser.html.blacklist' or 'parser.html.whitelist', but not both at once. If both are set,
    only the whitelist is used.
    </description> 
</property>
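
Since the description says that more complex expressions are simply not parsed, it may help to check a list before putting it into the config. The following is a minimal Python sketch of that check, not the plugin's own parser: the grammar (an element name, optionally followed by exactly one `.class` or `#id`) is inferred from the description above, and the names `is_simple_selector` and `split_selectors` are hypothetical helpers made up for illustration.

```python
import re

# Assumed grammar, inferred from the property description (not taken from the
# plugin source): an element name, optionally followed by exactly one
# ".class" or "#id" suffix.
_SELECTOR = re.compile(r"[A-Za-z][A-Za-z0-9]*(?:[.#][A-Za-z_][A-Za-z0-9_-]*)?")


def is_simple_selector(expr: str) -> bool:
    """Return True if expr looks like 'tag', 'tag.class' or 'tag#id'."""
    return _SELECTOR.fullmatch(expr.strip()) is not None


def split_selectors(value: str):
    """Split a comma-delimited list into (valid, invalid) expressions."""
    valid, invalid = [], []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # skip empty entries such as a trailing comma
        (valid if is_simple_selector(item) else invalid).append(item)
    return valid, invalid


if __name__ == "__main__":
    ok, bad = split_selectors("div.header,span,#footer")
    print("valid:", ok)
    print("invalid:", bad)
```

Anything that lands in the invalid list (for example an entry with no element type, such as `#footer`) would need an element name in front of it before the plugin could be expected to honor it.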