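Efficiently parsing Apache logs in PHP

OK, here's the scenario: I need to parse my logs to find out how many times image thumbnails were downloaded without the "big image" page actually being viewed... This is basically a hotlink protection system based on the ratio of "thumb" to "full" image views.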

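Given that the server is constantly bombarded with thumbnail requests, the most efficient solution seems to be using buffered Apache logs, written to disk roughly once per 1 MB, and then parsing the logs periodically.

My question is this: how do I parse the Apache logs in PHP to save the data, given that the following holds:

The log will be in use and updated in real time, and I need my PHP script to be able to read it while that is happening
The PHP script has to "remember" which parts of the log it has already read, so that it doesn't read the same part twice and skew the data
Memory consumption should be kept to a minimum, because the log can easily reach 10GB of data within a few hours

The PHP logger script will be called every 60 seconds and will process however many log lines have accumulated in that time.

I've tried hacking some code together, but I'm having trouble keeping memory usage to a minimum and finding a way to keep track of the pointer against a "moving" file size.

Here's part of the log: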
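The three requirements above (read while the log is being written, never count a line twice, keep memory flat) can be sketched roughly as follows. This is only a minimal illustration, not a drop-in replacement: the function name process_new_lines() and the one-number state file are hypothetical, and it assumes the log is only ever appended to or rotated.

```php
<?php
// Sketch only: process_new_lines() and the state-file format are hypothetical.

/**
 * Pass every complete line appended to $logFile since the previous run to
 * $handler, and remember the byte offset in $stateFile for the next run.
 * Returns the number of lines handled.
 */
function process_new_lines(string $logFile, string $stateFile, callable $handler): int
{
    // Restore the offset saved by the previous run (0 on the first run).
    $offset = is_file($stateFile) ? (int) file_get_contents($stateFile) : 0;

    clearstatcache(true, $logFile);
    $size = filesize($logFile);
    if ($size === false) {
        return 0;
    }
    // A log smaller than the saved offset was rotated or truncated: start over.
    if ($offset > $size) {
        $offset = 0;
    }

    $fp = fopen($logFile, 'r'); // read-only is enough
    if ($fp === false) {
        return 0;
    }
    fseek($fp, $offset);

    $count = 0;
    // fgets() returns one line at a time, so memory does not grow with file size.
    while (($line = fgets($fp)) !== false) {
        // No trailing newline means the line is still being written;
        // stop here and pick it up on the next run.
        if (substr($line, -1) !== "\n") {
            break;
        }
        $handler(rtrim($line, "\r\n"));
        $offset = ftell($fp); // end of the last complete line
        $count++;
    }
    fclose($fp);

    file_put_contents($stateFile, (string) $offset, LOCK_EX);
    return $count;
}
```

Only advancing the offset after a complete line matters with buffered logging, because a flush can end in the middle of a line. Note that on a 32-bit PHP build, offsets beyond 2GB may not be representable, so a log of this size probably needs a 64-bit build.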

212.180.168.244 - - [18/Jan/2012:20:06:57 +0100] "GET /t/0/11/11441/11441268.jpg HTTP/1.1" 200 3072 "-" "Opera/9.80 (Windows NT 6.1; U; pl) Presto/2.10.229 Version/11.60" "-" 
122.53.168.123 - - [18/Jan/2012:20:06:57 +0100] "GET /t/0/11/11441/11441276.jpg HTTP/1.1" 200 3007 "-" "Opera/9.80 (Windows NT 6.1; U; pl) Presto/2.10.229 Version/11.60" "-" 
143.22.203.211 - - [18/Jan/2012:20:06:57 +0100] "GET /t/0/11/11441/11441282.jpg HTTP/1.1" 200 4670 "-" "Opera/9.80 (Windows NT 6.1; U; pl) Presto/2.10.229 Version/11.60" "-"
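For lines shaped like the sample above, pulling out the image ID might look something like this. It is a sketch under assumptions: classify_request() is a hypothetical helper, and the /img/ URL layout (ID at the start of the second path segment) is inferred from the script below, since the sample only shows /t/ requests.

```php
<?php
// Sketch only: classify_request() is a hypothetical helper. The /img/ layout
// is inferred from the question's script, not from the sample log lines.

/**
 * Return ['type' => 'thumb'|'raw', 'id' => int] for an image request,
 * or null for any other line.
 */
function classify_request(string $line): ?array
{
    // Capture the path from the quoted "GET <path> HTTP/x.y" field.
    if (!preg_match('/"GET (\S+) HTTP\/[\d.]+"/', $line, $m)) {
        return null;
    }
    $path = $m[1];

    if (strpos($path, '/t/') === 0) {
        // Thumbnail: the ID is the leading digits of the file name.
        $type = 'thumb';
        $name = basename($path);
    } elseif (strpos($path, '/img/') === 0) {
        // Full view: the ID is assumed to start the second path segment.
        $type = 'raw';
        $segments = explode('/', $path);
        $name = $segments[2] ?? '';
    } else {
        return null;
    }

    // An (int) cast keeps only the leading digits, e.g. "11441268.jpg" becomes 11441268.
    $id = (int) $name;
    return $id > 0 ? ['type' => $type, 'id' => $id] : null;
}
```

Matching only the request field keeps the pattern short, and returning null for everything else means a non-image line can never inherit values from the line before it.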

Here is the code, attached for your comments:

<?php 
//limit for running it every minute 
error_reporting(E_ALL); 
ini_set('display_errors',1); 
set_time_limit(0); 
include(dirname(__FILE__).'/../kframework/kcore.class.php'); 
$aj = new kajaxpage; 
$aj->use_db=1; 
$aj->init(); 
$db=kdbhandler::getInstance(); 
$d=kdebug::getInstance(); 
$d->debug=TRUE; 
$d->verbose=TRUE; 

$log_file = "/var/log/nginx/access.log"; //full path to log file when run by cron 
$pid_file = dirname(__FILE__)."/../kframework/cron/cron_log.pid"; 
//$images_id = array("8308086", "7485151", "6666231", "8343336"); 

if (file_exists($pid_file)) { 
    $pid = file_get_contents($pid_file); 
    $temp = explode(" ", $pid); 
    $pid_timestamp = $temp[0]; 
    $now_timestamp = strtotime("now"); 
    //if (($now_timestamp - $pid_timestamp) < 90) return; 
    $pointer = $temp[1]; 
    if ($pointer > filesize($log_file)) $pointer = 0; 
} 
else $pointer = 0; 

$pattern = "/([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})[^\[]*\[([^\]]*)\][^\"]*\"([^\"]*)\"\s([0-9]*)\s([0-9]*)(.*)/"; 
$last_time = 0; 
$lines_processed=0; 

if ($fp = fopen($log_file, "r")) { // read-only: the script never writes to the log 
    fseek($fp, $pointer); 
    while (!feof($fp)) { 
     //if ($lines_processed>100) exit; 
     $lines_processed++; 
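     $log_line = trim(fgets($fp)); 
     if (!empty($log_line)) { 
      // reset per-line state so a non-image line cannot reuse the previous line's values 
      $type = null; 
      $imgid = 0; 
      if (!preg_match_all($pattern, $log_line, $matches)) continue; 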
      //print_r($matches); 
      $size = $matches[5][0]; 
      $matches[3][0] = str_replace("GET ", "", $matches[3][0]); 
      $matches[3][0] = str_replace("HTTP/1.1", "", $matches[3][0]); 
      $matches[3][0] = str_replace(".jpg/", ".jpg", $matches[3][0]); 
      if (substr($matches[3][0],0,3) == "/t/") { 
       $parts = explode("/", $matches[3][0]); 
       $get = explode("-", end($parts)); // end() takes a variable by reference, not a function result 
       $imgid = $get[0]; 
       $type='thumb'; 
      } 
      elseif (substr($matches[3][0], 0, 5) == "/img/") { 
       $get1 = explode("/", $matches[3][0]); 
       $get2 = explode("-", $get1[2]); 
       $imgid = $get2[0]; 
       $type='raw'; 
      } 
      echo $matches[3][0]; 
      // put here your sql insert or update 
      $imgid=(int) $imgid; 
      if (isset($type) && $imgid!=1) { 
       switch ($type) { 
        case 'thumb': 
         //use the second slave in the registry 
         $sql=$db->slave_query("INSERT INTO hotlink SET thumbviews=1, imageid=".$imgid." ON DUPLICATE KEY UPDATE thumbviews=thumbviews+1 ",2); 
         echo "INSERT INTO hotlink SET thumbviews=1, imageid=".$imgid." ON DUPLICATE KEY UPDATE thumbviews=thumbviews+1"; 
        break; 
        case 'raw': 
         //use the second slave in the registry 
         $sql=$db->slave_query("INSERT INTO hotlink SET rawviews=1, imageid=".$imgid." ON DUPLICATE KEY UPDATE rawviews=rawviews+1",2); 
         echo "INSERT INTO hotlink SET rawviews=1, imageid=".$imgid." ON DUPLICATE KEY UPDATE rawviews=rawviews+1"; 
        break; 
       } 
      } 

      // $imgid - image ID 
      // $size - image size 

      $timestamp = strtotime("now"); 
      if (($timestamp - $last_time) > 30) { 
       file_put_contents($pid_file, $timestamp . " " . ftell($fp)); 
       $last_time = $timestamp; 
      } 
     } 
    } 
    file_put_contents($pid_file, (strtotime("now") - 95) . " " . ftell($fp)); 
    fclose($fp); 
} 

?>

Source

2012-01-18 Igor

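One solution would be to store the logs in a MySQL database. Perhaps you could write a C program to parse the log file and store it in MySQL. That would be faster, and it isn't very difficult. Another option is to use Python, but I think using a database is necessary. You can use a full-text index to match your strings. Python can also be compiled to a binary, which makes it more efficient. Update, as requested: the log file builds up incrementally. It's not as if you are handed 10GB all at once.

Source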

2012-01-18 19:13:26 Bytemain

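He's talking about 10GB of data within a few hours. MySQL is definitely not something I'd want before the data has been summarized down to what is really needed. A full-text index (which implies MyISAM) on data like this would be a disaster. – Evert 2012-01-18 19:16:14

@Evert: But it starts from a 0-byte log file? See my answer. – Bytemain 2012-01-18 19:25:18

It doesn't start with an empty log... it starts with tens of GB of data :/ The script I posted times out with a memory allocation error, so I think there must be a leak somewhere that I can't seem to find... I was under the impression that using fgets would only keep the current line in memory. Is the "pid" file idea for keeping track of the pointer any good? – Igor 2012-01-18 19:36:25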
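Evert's point about summarizing before touching MySQL could be sketched like this: count hits per image in memory while reading, then issue one statement per image instead of one per log line. The names tally_hits() and flush_tallies() are hypothetical, the use of PDO is an assumption (the question uses its own database wrapper), and the hotlink table and columns are taken from the script in the question.

```php
<?php
// Sketch only: tally_hits() and flush_tallies() are hypothetical names, and
// PDO is an assumption. Table and column names come from the question.

/** Add one hit of the given type ('thumb' or 'raw') for an image. */
function tally_hits(array &$tallies, string $type, int $id): void
{
    if ($type !== 'thumb' && $type !== 'raw') {
        return; // ignore anything that is not an image view
    }
    if (!isset($tallies[$id])) {
        $tallies[$id] = ['thumb' => 0, 'raw' => 0];
    }
    $tallies[$id][$type]++;
}

/** Write the accumulated counts with one upsert per image. */
function flush_tallies(PDO $pdo, array $tallies): void
{
    $stmt = $pdo->prepare(
        'INSERT INTO hotlink (imageid, thumbviews, rawviews) VALUES (?, ?, ?)
         ON DUPLICATE KEY UPDATE
             thumbviews = thumbviews + VALUES(thumbviews),
             rawviews   = rawviews + VALUES(rawviews)'
    );
    foreach ($tallies as $id => $n) {
        $stmt->execute([$id, $n['thumb'], $n['raw']]);
    }
}
```

The tally array grows with the number of distinct images seen, not with the number of lines, so flushing and emptying it every so many lines should keep memory bounded even on a large backlog.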