2011-04-18

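Web::Scraper and Perl

I have the script below, which fetches the list of all courses from my school's CS department. I want to extract the CRN (course reference number) and other important information so that I can put it into a database that users browse through a web application.

Here is an example URL: http://courses.illinois.edu/cis/2011/spring/schedule/CS/411.html

I want to extract information from pages like this one. The first-level scraper just builds the individual page URLs from the list of all courses. Once I am on a course-specific catalog page, I use a second scraper to try to get all the information I want. For some reason, even though the CRN and the course instructor are both 'td' elements, my scraper returns nothing when I scrape for them. When I scrape specifically for 'div' instead, I get a bunch of information for each relevant page. So somehow I cannot get the 'td' elements, even though I am scraping the correct pages.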

my $tweets = scraper {
    # Parse all LIs with the class "status", store them into a resulting
    # array 'tweets'. We embed another scraper for each tweet.
    # process "h4.ws-ds-name.detail-title", "array[]" => 'TEXT';
    process "div.ws-row", "array[]" => 'TEXT';
};

my $res = $tweets->scrape(URI->new("http://courses.illinois.edu/cis/2011/spring/schedule/CS/index.html?skinId=2169"));

foreach my $elem (@{$res->{array}}) {
    my $coursenum = substr($elem,2,4);

    my $secondLevel = scraper {
        process "td.ws-row", "array2[]" => 'TEXT';
    };

    my $res2 = $secondLevel->scrape(URI->new("http://courses.illinois.edu/cis/2011/spring/schedule/CS/$coursenum.html"));
    my $num = @{$res2->{array2}};
    print $num;

    print "---------------------", "\n";
    my @curr = @{$res2->{array2}};
    foreach my $elem2 (@curr) {
        $num++;
        print $elem2, " ", "\n";
    }
    print "---------------------", "\n";
}

Any ideas?

Thanks

This is how I am using Web::Scraper. – 2011-04-18 02:38:51

Answers

1

It looks to me like

my $coursenum = substr($elem,2,4)

should be

my $coursenum = substr($elem,3,3) 
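
Fixed offsets depend on the exact whitespace in the scraped text, so a pattern match may be more forgiving. The following is only a sketch: it assumes the row text looks roughly like `CS 411 Database Systems`, and both the sample string and the `course_number` helper are made up for illustration rather than taken from the original script.

```perl
use strict;
use warnings;

# Hypothetical row text; the real page may add leading or trailing whitespace.
my $elem = "CS 411 Database Systems";

# With this exact string, substr($elem, 2, 4) is " 411" (leading space),
# while substr($elem, 3, 3) is "411".

# Return the three-digit course number that follows a subject code,
# or undef when the text does not contain one.
sub course_number {
    my ($text) = @_;
    return $text =~ /\b[A-Z]{2,4}\s+(\d{3})\b/ ? $1 : undef;
}

my $coursenum = course_number($elem);
print defined $coursenum ? "$coursenum\n" : "no course number found\n";
```

Because the pattern skips any surrounding whitespace, the same helper keeps working if the scraped text gains or loses a leading space.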
1

In this case, the simplest way to go is to use

HTML::TableExtract

if you are only looking for data from tables.
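
As a rough illustration of that suggestion, the sketch below parses an inline HTML table with HTML::TableExtract and keeps only two columns selected by header text. The header names `CRN` and `Instructor`, the sample rows, and the `extract_sections` helper are assumptions made up for this example; the real course page may label or structure its table differently.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Made-up sample markup standing in for a course page.
my $html = <<'HTML';
<table>
  <tr><th>CRN</th><th>Type</th><th>Instructor</th></tr>
  <tr><td>30109</td><td>LEC</td><td>Doe, J</td></tr>
  <tr><td>30110</td><td>DIS</td><td>Roe, R</td></tr>
</table>
HTML

# Collect one hashref per data row, keeping only the requested columns.
sub extract_sections {
    my ($content) = @_;
    my $te = HTML::TableExtract->new(headers => ['CRN', 'Instructor']);
    $te->parse($content);

    my @sections;
    for my $table ($te->tables) {
        for my $row ($table->rows) {
            my ($crn, $instructor) = @$row;
            push @sections, { crn => $crn, instructor => $instructor };
        }
    }
    return \@sections;
}

for my $section (@{ extract_sections($html) }) {
    print "$section->{crn}: $section->{instructor}\n";
}
```

To run this against a live page, the HTML would first have to be fetched (for example with LWP::UserAgent) and passed to `extract_sections` as a string.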

1

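I played around with your problem a bit. You can get the course ID, the title, and the link to each individual course page within the initial scraper: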

my $courses = scraper { 
    process 'div.ws-row', 
     'course[]' => scraper { 
      process 'div.ws-course-number', 'id' => 'TEXT'; 
      process 'div.ws-course-title', 'title' => 'TEXT'; 
      process 'div.ws-course-title a', 'link' => '@href'; 
     }; 
    result 'course'; 
}; 

The result of the scrape is an array reference of hashrefs like this:

{ id => "CS 103", 
    title => "Introduction to Programming", 
    link => bless(do{\(my $o = "http://courses.illinois.edu/cis/2011/spring/schedule/CS/103.html?skinId=2169")}, "URI::http"), 
}, 
.... 

Then, for each course, you can do additional scraping from its individual page and add that information to the original structure:

for my $course (@$res) { 
    my $crs_scraper = scraper { 
     process 'div.ws-description', 'desc' => 'TEXT'; 
     # ... add more items here 
    }; 
    my $additional_data = $crs_scraper->scrape(URI->new($course->{link})); 

    # slice assignment to add them into course definition 
    @{$course}{ keys %$additional_data } = values %$additional_data; 
} 
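
The slice assignment in the loop above may look cryptic, so here it is in isolation. This is a small standalone sketch with made-up data: the `desc` and `credit` values are placeholders, not scraped results.

```perl
use strict;
use warnings;

# A course record as produced by the first scraper (values are illustrative).
my $course = { id => 'CS 103', title => 'Introduction to Programming' };

# Extra fields as a second scraper might return them (hypothetical values).
my $extra = { desc => 'Hypothetical description', credit => '3 hours' };

# Hash slice assignment: keys and values of an unmodified hash come back
# in the same order, so each key is paired with its own value.
@{$course}{ keys %$extra } = values %$extra;

print join(', ', map { "$_=$course->{$_}" } sort keys %$course), "\n";
```

An equivalent, more verbose form is `%$course = (%$course, %$extra);`. In both cases a key that already exists in `$course` is overwritten by the value from `$extra`.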

The source put together looks like this:

use strict; use warnings; 
use URI; 
use Web::Scraper; 
use Data::Dump qw(dump); 

my $url = 'http://courses.illinois.edu/cis/2011/spring/schedule/CS/index.html?skinId=2169'; 

my $courses = scraper { 
    process 'div.ws-row', 
     'course[]' => scraper { 
      process 'div.ws-course-number', 'id' => 'TEXT'; 
      process 'div.ws-course-title', 'title' => 'TEXT'; 
      process 'div.ws-course-title a', 'link' => '@href'; 
     }; 
    result 'course'; 
}; 

my $res = $courses->scrape(URI->new($url)); 

for my $course (@$res) { 
    my $crs_scraper = scraper { 
     process 'div.ws-description', 'desc' => 'TEXT'; 
     # ... add more items here 
    }; 
    my $additional_data = $crs_scraper->scrape(URI->new($course->{link})); 

    # slice assignment to add them into course definition 
    @{$course}{ keys %$additional_data } = values %$additional_data; 
} 

dump $res;
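
The `# ... add more items here` placeholder is where the CRN and instructor from the question would go. One possible reason the question's `td.ws-row` selector finds nothing is that it only matches `td` elements that themselves carry the `ws-row` class. The sketch below instead scrapes every table row and collects its cells. It runs on inline sample HTML, and the table layout (CRN in the first column, instructor in the third) is an assumption made up for illustration, not the real page structure.

```perl
use strict;
use warnings;
use Web::Scraper;

# Made-up sample markup standing in for a course page.
my $html = <<'HTML';
<div class="ws-description">Hypothetical course description.</div>
<table>
  <tr><th>CRN</th><th>Type</th><th>Instructor</th></tr>
  <tr><td>30109</td><td>LEC</td><td>Doe, J</td></tr>
</table>
HTML

my $crs_scraper = scraper {
    process 'div.ws-description', 'desc' => 'TEXT';
    # One nested result per table row, holding the text of each td in it.
    process 'tr', 'rows[]' => scraper {
        process 'td', 'cells[]' => 'TEXT';
    };
};

# scrape() also accepts HTML content as a string, which avoids network access.
my $data = $crs_scraper->scrape($html);

# Skip rows without td cells (such as the header row) and pick columns by
# position; the column positions are part of the assumed layout.
my @sections = map  { +{ crn => $_->{cells}[0], instructor => $_->{cells}[2] } }
               grep { $_->{cells} } @{ $data->{rows} || [] };

print "$_->{crn}: $_->{instructor}\n" for @sections;
```

In the combined script, the same `process 'tr', ...` rule could be placed inside `$crs_scraper`, with `scrape` called on `URI->new($course->{link})` as before.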