Parsing an HTML log file and producing a text file in a specific format

I want to parse a text file using Perl. The text file contains log entries taken from some HTML files, as shown below:

Details from /projects/git/Changelog.html file: 
NEW_FEATURES: <a href="http://jira.xyz.com/browse/JIRA-4208">JIRA-4208</a><span style='mso-spacerun:yes'>   </span>Add New Config C support in code 
BUG_FIX: <a href="http://jira.xyz.com/browse/BUGJIRA-31">BUGJIRA-31</a><span style='mso-spacerun:yes'>   </span>Bugfix of some old bug 
NEW_FEATURES: <a href="http://jira.xyz.com/browse/ZEERA-273">ZEERA-273</a><span style='mso-spacerun:yes'>   </span>Add support for some other feature. 

Details from /projects/git/Changelog2.html file: 
BUG_FIX: <a href="http://jira.xyz.com/browse/BUGJIRA-33">BUGJIRA-33</a><span style='mso-spacerun:yes'>   </span>Bugfix of an issue 
NEW_FEATURES: <a href="http://jira.xyz.com/browse/JIRA-4209">JIRA-4209</a><span style='mso-spacerun:yes'>   </span>Add New Config D support in code

Each line contains a bug number and its description.

After parsing, the expected output is:

JIRA-4208, BUGJIRA-31, ZEERA-273, BUGJIRA-33, JIRA-4209 : Add New Config C support in code, Bugfix of some old bug, Add support for some other feature, Bugfix of an issue, Add New Config D support in code

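That is, all the bug numbers, followed by their descriptions.

If possible, I would also like to write the output to another file, output.txt.

Edit 1:

My code is below: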

#!/usr/bin/perl 
open (FILE, 'input_file1.txt') or die "Could not read from file, exit..."; 
while(<FILE>) 
{ 
    chomp; 
    ($junk0,$junk1,$junk2,$junk3,$junk4,$BUG_NUMBR) = split /[:<="">]+/,$_; 
    print "$BUG_NUMBR \n"; 
} 
close FILE; 
exit;

The output is:

JIRA-4208 
BUGJIRA-31 
ZEERA-273 
BUGJIRA-33 
JIRA-4209

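This is quite different from the expected output shown above. I cannot come up with a suitable regular expression for the second part of the expected output, the short description of each bug.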

2017-05-09 Yash

So what exactly have you tried? What does not work in your code? What is the problem here?

@ChrisDoyle: I have added sample code and explained its limitations. Please suggest a solution. – Yash

Do you really want a list of all the bug numbers, followed by a list of all the descriptions?

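You don't need a regular expression for that. Your split pattern is interesting, but it gets the job done.

Just take the rest of the result. I replaced your $junk variables with an array. Perl lets you index an array from the right, so -1 gives you the last element. That makes pulling out the text simple, because it comes after the final >.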

use strict; 
use warnings; 

my (@numbers, @text); 
while (my $line = <DATA>) { 
    chomp $line; 
    my @stuff = split /[:<="">]+/, $line; 
    push @numbers, $stuff[5]; 
    push @text, $stuff[-1]; 
} 

print join ', ', @numbers; 
print ' : '; 
print join ', ', @text; 

__DATA__ 
NEW_FEATURES: <a href="http://jira.xyz.com/browse/JIRA-4208">JIRA-4208</a><span style='mso-spacerun:yes'> </span>Add New Config C support in code 
BUG_FIX: <a href="http://jira.xyz.com/browse/BUGJIRA-31">BUGJIRA-31</a><span style='mso-spacerun:yes'> </span>Bugfix of some old bug 
NEW_FEATURES: <a href="http://jira.xyz.com/browse/ZEERA-273">ZEERA-273</a><span style='mso-spacerun:yes'> </span>Add support for some other feature. 
BUG_FIX: <a href="http://jira.xyz.com/browse/BUGJIRA-33">BUGJIRA-33</a><span style='mso-spacerun:yes'> </span>Bugfix of an issue 
NEW_FEATURES: <a href="http://jira.xyz.com/browse/JIRA-4209">JIRA-4209</a><span style='mso-spacerun:yes'> </span>Add New Config D support in code

I also added strict and warnings, and made your variables lexical.
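
To see why index 5 and index -1 pick out the right pieces, it can help to print every field the split produces for one line. The following is only a sketch using the first sample line from the question; the variable names are illustrative.

```perl
use strict;
use warnings;

# First sample line from the question's input.
my $line = q{NEW_FEATURES: <a href="http://jira.xyz.com/browse/JIRA-4208">JIRA-4208</a><span style='mso-spacerun:yes'>   </span>Add New Config C support in code};

# Same split pattern as in the code above.
my @stuff = split /[:<="">]+/, $line;

# Print every field with its index to see where each piece ends up.
printf "%2d: [%s]\n", $_, $stuff[$_] for 0 .. $#stuff;

print "bug number:  $stuff[5]\n";    # sixth field from the left
print "description: $stuff[-1]\n";   # last field, counted from the right
```

The listing shows that several fields are leftovers such as a lone space or "/a", which is why the approach depends on the exact shape of the line.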

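Also keep in mind that your code will break if the text contains a literal > or <, a quote, or something similar. It is a strange format, and an HTML parser won't really help you.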
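
One way to reduce that fragility is to capture the link text and the trailing description with a single pattern instead of relying on field positions. This is a minimal sketch, assuming every entry has exactly the shape shown in the sample data (one link followed by one spacer span). The helper name parse_entry and the test line with a ">" in its description are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (bug number, description) for one log line,
# or an empty list if the line does not look like an entry.
sub parse_entry {
    my ($line) = @_;
    return unless $line =~ m{
        <a \s [^>]* >  ( [^<]+ )  </a>   # link text, e.g. JIRA-4208
        \s* <span [^>]* > .*? </span>    # spacer span
        \s* ( .*? ) \s* \z               # rest of the line is the description
    }x;
    return ($1, $2);
}

# Hypothetical line whose description contains a literal ">".
my $line = q{BUG_FIX: <a href="http://jira.xyz.com/browse/BUGJIRA-31">BUGJIRA-31</a><span style='mso-spacerun:yes'>   </span>Fix a > b check};
my ($number, $text) = parse_entry($line);
print "$number : $text\n";   # prints "BUGJIRA-31 : Fix a > b check"
```

Lines that do not match, such as the "Details from ..." headers or blank lines, yield an empty list and can simply be skipped. The pattern is still tied to this particular log layout and is not a general HTML parser.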

2017-05-09 13:25:55 simbabque

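Thank you for sharing the code. It works for me, with a few exceptions. The data set you used in your example differs from the one I gave in the problem statement above. With my data set I get some warnings and extra commas (,). Still, it is a good starting point for me. I will share the final code once I have solved the problem. Thanks again! – Yash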

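Hi @simbabque, you mentioned that my 'split' pattern is interesting, even though it gets the job done. I agree, because I arrived at it by trial and error. Could you suggest a better split pattern? – Yash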

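@Yash I didn't realize it was a file. Sorry. You did the right thing by checking whether '^Details' is in the current line. By _interesting_ I meant that the approach is unconventional. I would probably write a pattern that captures what I want, but your approach works too. Keep in mind that it will break if the input changes. – simbabque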

The code for the problem statement above is as follows:

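#!/usr/bin/perl

use strict;
use warnings;

# Read the log file line by line and collect bug numbers and descriptions.
open(my $fh, '<', 'perl_input_file1.txt') or die "Cannot open perl_input_file1.txt: $!";
my (@numbers, @text);
while (my $line = <$fh>) {
    chomp $line;
    next if $line =~ /^Details/;    # skip the "Details from ... file:" headers
    next unless $line =~ /\S/;      # skip blank lines between sections
    my @stuff = split /[:<="">]+/, $line;
    push @numbers, $stuff[5];       # sixth field is the bug number
    push @text, $stuff[-1];         # last field is the description
}
close $fh;
print join ', ', @numbers;
print ': ';
print join ', ', @text;
print "\n";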

The output of this code is:

JIRA-4208, BUGJIRA-31, ZEERA-273, BUGJIRA-33, JIRA-4209: Add New Config C support in code, Bugfix of some old bug, Add support for some other feature, Bugfix of an issue, Add New Config D support in code

As mentioned in the question, this is the same as the expected output I wanted.
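
The question also asked for the result in output.txt. A minimal sketch of that step follows. It assumes @numbers and @text have been filled by the parsing loop; here they are hard-coded with two entries so that the snippet runs on its own.

```perl
use strict;
use warnings;

# Stand-ins for the arrays built by the parsing loop.
my @numbers = ('JIRA-4208', 'BUGJIRA-31');
my @text    = ('Add New Config C support in code', 'Bugfix of some old bug');

# Open output.txt for writing: three-argument open with a lexical filehandle.
open(my $out, '>', 'output.txt') or die "Cannot write output.txt: $!";
print {$out} join(', ', @numbers), ': ', join(', ', @text), "\n";
close($out) or die "Cannot close output.txt: $!";
```

Redirecting the script's standard output in the shell (for example perl script.pl > output.txt, where script.pl stands for whatever the script is called) should give the same file without changing the code.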

I would like to thank @simbabque again for the guidance and the approach.

Regards,

2017-05-10 06:42:38 Yash
