2012-03-03 49 views
3

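Preg-replace - replace all URLs except a domain and its subdomains

I have a Glype proxy and I don't want it to parse (rewrite) external URLs. All URLs on a page are automatically converted to: http://proxy.com/browse.php?u=[URL HERE]. For example, if I visit The Pirate Bay through my proxy, I don't want the following URLs to be rewritten: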

ByteLove.com (Not to: http://proxy.com/browse.php?u=http://bytelove.com&b=0) 
BayFiles.com (Not to: http://proxy.com/browse.php?u=http://bayfiles.com&b=0) 
BayIMG.com (Not to: http://proxy.com/browse.php?u=http://bayimg.com&b=0) 
PasteBay.com (Not to: http://proxy.com/browse.php?u=http://pastebay.com&b=0) 
Ipredator.com (Not to: http://proxy.com/browse.php?u=https://ipredator.se&b=0) 
etc. 

Of course I still want internal URLs to go through the proxy, so:

thepiratebay.se/browse (To: http://proxy.com/browse.php?u=http://thepiratebay.se/browse&b=0) 
thepiratebay.se/top (To: http://proxy.com/browse.php?u=http://thepiratebay.se/top&b=0) 
thepiratebay.se/recent (To: http://proxy.com/browse.php?u=http://thepiratebay.se/recent&b=0) 
etc. 

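Is there a preg_replace that replaces all URLs except those on thepiratebay.se and its subdomains (as in the examples)? Another function is welcome too, such as DOMDocument, QueryPath, substr or strpos (but not str_replace, because then I would have to list every URL).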

I found something, but I'm not familiar with preg_replace:

$exclude = '.thepiratebay.se'; 
$pattern = '(https?\:\/\/.*?\..*?)(?=\s|$)'; 
$message= preg_replace("~(($exclude)?($pattern))~i", '$2<a href="$4" target="_blank">$5</a>$6', $message); 

Answers

1

I guess you will need to provide a whitelist that determines which domains should be proxied:

$whitelist = array(); 
$whitelist[] = "internal1.se"; 
$whitelist[] = "internal2.no"; 
$whitelist[] = "internal3.com"; 
// and so on... 

$string = '<a href="http://proxy.org/browse.php?u=http%3A%2F%2Fexternal1.com&b=0">External link 1</a><br>'; 
$string .= '<a href="http://proxy.org/browse.php?u=http%3A%2F%2Finternal1.se&b=0">Internal link 1</a><br>'; 
$string .= '<a href="http://proxy.org/browse.php?u=http%3A%2F%2Finternal3.com&b=0">Internal link 2</a><br>'; 
$string .= '<a href="http://proxy.org/browse.php?u=http%3A%2F%2Fexternal2.no&b=0">External link 2</a><br>'; 

//Assuming the URL always is inside '' or "" you can use this pattern: 
$pattern = '#(https?://proxy\.org/browse\.php\?u=(https?[^&|\"|\']*)(&?[^&|\"|\']*))#i'; 

$string = preg_replace_callback($pattern, "my_callback", $string); 

//I had only PHP 5.2 on my server, so I decided to use a callback function. 
function my_callback($match) { 
    global $whitelist; 
    // set return bypass proxy URL 
    $returnstring = urldecode($match[2]); 

    foreach ($whitelist as $white) {
        // keep the proxied URL if it contains a whitelisted domain
        if (stripos($match[2], $white) !== false) {
            $returnstring = $match[0];
            break;
        }
    }
    return $returnstring; 
} 

echo "NEW STRING[:\n" . $string . "\n]\n"; 
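
Note that the substring check in the callback above also accepts hosts that merely contain a whitelisted name (for example notinternal1.se). A stricter variant, sketched below, compares the host name of the decoded URL against each whitelist entry and accepts only exact matches or subdomains. The helper name `is_whitelisted_host` is hypothetical and not part of Glype; this is one possible approach, not the answerer's method.

```php
<?php
// Hypothetical helper (not part of Glype): returns true when the URL's host
// is a whitelisted domain or one of its subdomains.
function is_whitelisted_host($url, array $whitelist) {
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false; // malformed URL or no host component
    }
    $host = strtolower($host);
    foreach ($whitelist as $domain) {
        $domain = strtolower($domain);
        // exact match, or the host ends with ".domain"
        if ($host === $domain || substr($host, -strlen($domain) - 1) === '.' . $domain) {
            return true;
        }
    }
    return false;
}

var_dump(is_whitelisted_host('http://www.internal1.se/top', array('internal1.se'))); // bool(true)
var_dump(is_whitelisted_host('http://notinternal1.se/', array('internal1.se')));     // bool(false)
```

Inside `my_callback` this could be called as `is_whitelisted_host(urldecode($match[2]), $whitelist)` in place of the `stripos` loop.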

It doesn't work. This is my code: http://pastebin.com/6ML8q7JN The URLs are in: $document – 2012-03-03 18:03:09


I need to see the contents of the $document variable to evaluate whether the code can work. – 2012-03-03 18:11:42


It is working now, but _&b=0_ is left behind after the URL. How can I fix this? – 2012-03-04 15:55:41
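
The thread does not say why the parameter is left over. One possible cause is that the proxied link carries more trailing parameters than the pattern expects. A minimal sketch, assuming the proxied URL always sits inside a quoted attribute: let the pattern consume everything after the `u` value up to the closing quote, so the whole proxy URL is replaced. The sample markup, including the extra `f=norefer` parameter, is hypothetical.

```php
<?php
// Illustrative input: a proxied link with HTML-encoded ampersands and a
// hypothetical extra trailing parameter.
$html = '<a href="http://proxy.org/browse.php?u=http%3A%2F%2Fexternal1.com&amp;b=0&amp;f=norefer">x</a>';

// Group 1 is the encoded target URL; the final [^"\']* swallows every
// trailing parameter up to the closing quote, so nothing is left behind.
$pattern = '#https?://proxy\.org/browse\.php\?u=(https?[^&"\']*)[^"\']*#i';

$html = preg_replace_callback($pattern, function ($match) {
    return urldecode($match[1]); // bypass the proxy: emit the original URL
}, $html);

echo $html, "\n";
```

The closure requires PHP 5.3 or later; on PHP 5.2 a named callback function would be needed instead.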

1

You can use preg_replace_callback() to run a callback function for each match. Inside that function you can decide whether the matched string should be converted.

<?php 
$string = 'http://foobar.com/baz and http://example.org/bumm'; 
$pattern = '#(https?\:\/\/.*?\..*?)(?=\s|$)#i'; 
$string = preg_replace_callback($pattern, function($match) { 
    if (stripos($match[0], 'example.org/') !== false) {
        // exclude all URLs containing example.org
        return $match[0];
    } else {
        return 'http://proxy.com/?u=' . urlencode($match[0]);
    }
}, $string); 

echo $string, "\n"; 

(The example uses PHP 5.3 closure syntax.)
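
Applied to the question, the same idea could be turned around so that only thepiratebay.se and its subdomains are sent through the proxy while every other URL is left alone. The following is a rough sketch under a few assumptions: the URLs appear as whitespace-separated plain text, the proxy expects a URL-encoded `u` value followed by `&b=0`, and `static.thepiratebay.se` is a made-up subdomain used only for illustration.

```php
<?php
$domain = 'thepiratebay.se'; // the proxied site, taken from the question
$text   = 'http://thepiratebay.se/top and http://static.thepiratebay.se/x.css and http://bayimg.com/';

// Same URL pattern as in the answer above.
$pattern = '#(https?\:\/\/.*?\..*?)(?=\s|$)#i';

$text = preg_replace_callback($pattern, function ($match) use ($domain) {
    $host = (string) parse_url($match[0], PHP_URL_HOST);
    // Internal: the host is the domain itself or ends in ".domain".
    $internal = preg_match('~(^|\.)' . preg_quote($domain, '~') . '$~i', $host) === 1;
    if ($internal) {
        return 'http://proxy.com/browse.php?u=' . urlencode($match[0]) . '&b=0';
    }
    return $match[0]; // external URL: leave untouched
}, $text);

echo $text, "\n";
```

Passing `$domain` in with `use` avoids a global variable, and anchoring the host check at a dot boundary keeps look-alike hosts such as notthepiratebay.se out of the proxy.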