2013-03-22 46 views

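PHP: extract unique emails from a huge HTML file and put them into an array

I am trying to get all the unique emails from an HTML page into an array. The file is huge, and there is no real pattern to rely on for picking out the emails.

Below is a sample HTML file named GetEmails.html. The actual file will contain CSS and much more code to sift through. In this example, note the different ways the emails appear: not all of them are separated by spaces; some are separated by commas, semicolons, and so on.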

<html> 
<body> 
<p>This is some text and here is an email [email protected] and in this text we will see lots of emails like [email protected]; [email protected], [email protected] or even dot orgs too like [email protected] and all types such as [email protected],[email protected] and even [email protected] some might be bold [email protected] and some will look like this Email:<strong>[email protected]</strong> 
</p> 
<p><u>There will be pages and pages and pages of text to sift thru so get the emails into an array.</u></p> 
<p>This is some text and here is an email [email protected] and in this text we will see lots of emails like [email protected]; [email protected], [email protected] or even dot orgs too like [email protected] and all types such as [email protected],[email protected] and even [email protected] some might be bold [email protected] and some will look like this Email:<strong>[email protected]</strong> and repeat This is some text and here is an email [email protected] and in this text we will see lots of emails like [email protected]; [email protected], [email protected] or even dot orgs too like [email protected] and all types such as [email protected],[email protected] and even [email protected] some might be bold [email protected] and some will look like this Email:<strong>[email protected]</strong></p> 
<p>&nbsp;</p> 
</body> 
</html> 

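I thought about using explode with spaces, but that probably won't work and might use too many resources. I'm just wondering whether PHP has a simple function that would help me get all the emails into an array. Here is what I tried.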

<?php

$lines = file('GetEmails.html');
$TheEmails = array();

foreach ($lines as $line_num => $line) {

    // Check whether the line contains an email.
    if (preg_match('/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/si', $line)) {

        // Strip the tags and split the line on spaces.
        $words = explode(" ", strip_tags($line));

        // Keep only the items that contain an @ sign.
        $fl_array = preg_grep("/@/", $words);

        // Trim each item and add it to the running list of emails.
        $TheEmails = array_merge($TheEmails, array_map('trim', $fl_array));
    }
}

// Keep only the unique emails.
$UniqueEmails = array_unique($TheEmails);

?>

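The code above works, but with the huge file I'll be using, I'm afraid it uses resources unnecessarily. It also doesn't handle emails separated by commas, such as ed@ed.com,mike@mike.com.

Any ideas on the best way to do this? At the very least it would be really helpful to learn the best approach, even if I can only get the emails that are separated by spaces.

Hope this makes sense. Thanks a lot!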

'preg_match_all'? – Tchoupi 2013-03-22 03:15:35

It's not a duplicate, because I don't believe that question handles emails with characters right next to them, such as a comma, a < or a >, etc. – 2013-03-22 03:29:32

Actually, I was wrong. The code at that link works. Should I delete this post or defer to that one? – 2013-03-22 03:33:44
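
For reference, the `preg_match_all` suggestion from the comments could be sketched roughly as follows. This is a minimal sketch, not the code from the linked question: the helper names `extractEmails` and `extractEmailsFromFile`, the `EMAIL_PATTERN` constant, and the sample string are hypothetical, the pattern is a deliberately simple one rather than a full address validator, and PHP 7 or later is assumed for the type declarations. Because the pattern is applied to the raw text, neighbouring commas, semicolons, and tags do not need to be split off first, and reading with `fgets` keeps only one line of the file in memory at a time.

```php
<?php

// Simple, permissive email pattern (an assumption; it does not validate addresses).
const EMAIL_PATTERN = '/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b/i';

// Hypothetical helper: returns the unique email-like strings found in $text.
function extractEmails(string $text): array
{
    preg_match_all(EMAIL_PATTERN, $text, $matches);

    // Lowercasing is an assumption, so that Ed@Ed.com and ed@ed.com count once.
    $emails = array_map('strtolower', $matches[0]);

    return array_values(array_unique($emails));
}

// Hypothetical helper: scans a file line by line instead of loading it whole.
function extractEmailsFromFile(string $path): array
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return array();
    }

    // Using the emails as array keys de-duplicates them as they are found.
    $seen = array();
    while (($line = fgets($handle)) !== false) {
        foreach (extractEmails($line) as $email) {
            $seen[$email] = true;
        }
    }
    fclose($handle);

    return array_keys($seen);
}

// Made-up sample text with comma, semicolon, and tag neighbours.
$sample = 'Write to ed@ed.com,mike@mike.com; Email:<strong>joe@example.org</strong> or ed@ed.com';
print_r(extractEmails($sample));
```

For the file in the question, the call would be `extractEmailsFromFile('GetEmails.html')`. Keeping the matches as array keys means memory use grows with the number of distinct addresses, not with the size of the file.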
