2014-09-25 128 views

How to parse HTML using HTML::TreeBuilder?

This is the HTML I want to parse:

[...] 
<div class="item" style="clear:left;"> 
<div class="icon" style="background-image:url(http://nwn2db.com/assets/builder/icons/40x40/is_acidsplash.png);"> 
</div> 
    <h2>Acid Splash</h2> 
    <p>Caster Level(s): Wizard/Sorcerer 0 
    <br />Innate Level: 0 
    <br />School: Conjuration 
    <br />Descriptor(s): Acid 
    <br />Component(s): Verbal, Somatic 
    <br />Range: Medium 
    <br />Area of Effect/Target: Single 
    <br />Duration: Instant 
    <br />Save: None 
    <br />Spell Resistance: Yes 
    <p> 
    You fire a small orb of acid at the target for 1d3 points of acid damage. 
</div> 
[...] 

This is my algorithm:

my $text = ''; 

scan_child($spells); 

print $text, "\n"; 

sub scan_child {
    my $element = $_[0];
    return if ($element->tag eq 'script' or
               $element->tag eq 'a');    # prune!
    foreach my $child ($element->content_list) {
        if (ref $child) {    # it's an element
            scan_child($child);    # recurse!
        } else {             # it's a text node!
            $child =~ s/(.*)\:/\\item \[$1\]/;    # itemize
            $text .= $child;
            $text .= "\n";
        }
    }
    return;
}

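It picks up the pattern <key> : <value> and prunes junk such as <script> and <a>...</a>. I would like to improve it so that it also captures the <h2>...</h2> headers and all the <p>...<p> blocks, so that I can add some LaTeX markup.

Any clue?

Thanks in advance.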

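Maybe you should take a step back and work out what information you want to extract from the page you are scraping, and how you want to store it. If you have a particular schema or data structure in mind, adding it to the question would be helpful. If you just want to extract all the text, then you are already well on your way. – 2014-09-25 20:58:14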

Maybe. It is still not clear to me what HTML::TreeBuilder stores in the nodes. – Daniele 2014-09-25 21:39:22

Answers

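Since this may be an XY problem...

Mojo::DOM is a somewhat more modern framework for parsing HTML using CSS selectors. The following pulls the P element you want from the document: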

use strict; 
use warnings; 

use Mojo::DOM; 

my $dom = Mojo::DOM->new(do {local $/; <DATA>}); 

for my $h2 ($dom->find('h2')->each) { 
    next unless $h2->all_text eq 'Acid Splash'; 

    # Get the following P (current Mojo::DOM method names; older
    # releases used next_sibling, node and type here instead)
    my $next_p = $h2;
    while ($next_p = $next_p->next_node) {
        last if $next_p->type eq 'tag' and $next_p->tag eq 'p';
    }

    print $next_p; 
} 

__DATA__ 
<html> 
<body> 
<div class="item" style="clear:left;"> 
<div class="icon" style="background-image:url(http://nwn2db.com/assets/builder/icons/40x40/is_acidsplash.png);"> 
</div> 
    <h2>Acid Splash</h2> 
    <p>Caster Level(s): Wizard/Sorcerer 0 
    <br />Innate Level: 0 
    <br />School: Conjuration 
    <br />Descriptor(s): Acid 
    <br />Component(s): Verbal, Somatic 
    <br />Range: Medium 
    <br />Area of Effect/Target: Single 
    <br />Duration: Instant 
    <br />Save: None 
    <br />Spell Resistance: Yes 
    <p> 
    You fire a small orb of acid at the target for 1d3 points of acid damage. 
</div> 
</body> 
</html> 

Output:

<p>Caster Level(s): Wizard/Sorcerer 0 
    <br>Innate Level: 0 
    <br>School: Conjuration 
    <br>Descriptor(s): Acid 
    <br>Component(s): Verbal, Somatic 
    <br>Range: Medium 
    <br>Area of Effect/Target: Single 
    <br>Duration: Instant 
    <br>Save: None 
    <br>Spell Resistance: Yes 
    </p> 

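I use the look_down() method to scan the HTML. With look_down() I can first get back a list of all the divs with class="item".

Then I can iterate over them, and find and process the h2 and the p elements, which I then split using // as the separator.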
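
The approach described above might look roughly like the following. This is only a sketch, not the answerer's actual code: the shortened sample markup, the choice of \section and a description environment for the LaTeX output, and the splitting of each line at its first colon are all assumptions made for illustration. It relies on HTML::TreeBuilder closing the first <p> when the second one opens, so that each item yields two separate p elements.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup shortened from the question; a real run would parse the full page.
my $html = <<'HTML';
<div class="item" style="clear:left;">
    <h2>Acid Splash</h2>
    <p>Caster Level(s): Wizard/Sorcerer 0
    <br />Innate Level: 0
    <br />School: Conjuration
    <p>
    You fire a small orb of acid at the target for 1d3 points of acid damage.
</div>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my $latex = '';

# In list context look_down() returns every matching element.
for my $item ($tree->look_down(_tag => 'div', class => 'item')) {

    # In scalar context look_down() returns only the first match.
    my $h2 = $item->look_down(_tag => 'h2') or next;
    $latex .= '\section{' . $h2->as_trimmed_text . "}\n";

    for my $p ($item->look_down(_tag => 'p')) {
        my (@items, @prose);

        # Plain strings in content_list are text nodes; the <br /> elements
        # between them are references and are skipped by the grep.
        for my $chunk (grep { !ref } $p->content_list) {
            $chunk =~ s/^\s+|\s+$//g;
            next unless length $chunk;

            # Assumption: "key: value" lines are split at the first colon.
            my ($key, $value) = split /\s*:\s*/, $chunk, 2;
            if (defined $value) { push @items, "\\item[$key] $value" }
            else                { push @prose, $chunk }
        }

        $latex .= join("\n", '\begin{description}', @items, '\end{description}') . "\n"
            if @items;
        $latex .= join("\n", @prose) . "\n" if @prose;
    }
}

$tree->delete;    # release the parse tree

print $latex;
```

Because each item is handled on its own, the header, the key/value lines and the description text stay grouped per spell, which the recursive text scan in the question does not give you. Text containing LaTeX special characters would still need escaping before it is usable.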