2011-04-26 112 views
0

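MySQL: matching text fields

I have a MySQL database with about 1.5 million company records (name, country, and a few other small text fields). I want to flag identical records with a shared marker: for example, if two companies with the same name are both in the USA, I have to set a field (match_id) to the same integer, say 10, on both of them, and likewise for every other match. At the moment this takes a very long time (days), and I feel I am not using MySQL correctly. I am posting my code below. Is there a faster way to do this?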

<?php 

//Create the table if does not already exist 
mysql_query("CREATE TABLE IF NOT EXISTS proj ( 
    id INT(11) NOT NULL AUTO_INCREMENT PRIMARY KEY , 
    company_id text NOT NULL , 
    company_name varchar(40) NOT NULL , 
    company_name_text varchar(33) NOT NULL, 
    company_name_metaphone varchar(19) NOT NULL, 
    country varchar(20) NOT NULL , 
    file_id int(2) NOT NULL , 
    thompson_id varchar(11) NOT NULL , 
    match_no int(7) NOT NULL , 
    INDEX(company_name_text))") 
    or die ("Couldn't create the table: " . mysql_error()); 


//********Real script starts******** 
$countries_searched = array(); //To save record ids already flagged (save time) 
$counter = 1; //Flag 

//Since the company_names which are same are going to be from the same country so I get all the countries first in the below query and then in the next get all the companies in that country 
$sql = "SELECT DISTINCT country FROM proj WHERE country='Canada'"; 
$result = mysql_query($sql) or die(mysql_error()); 

while($resultrow = mysql_fetch_assoc($result)) { 
    $country = $resultrow['country']; 
    $res = mysql_query("SELECT company_name_metaphone, id, company_name_text 
    FROM proj 
    WHERE country='$country' 
    ORDER BY id") or die (mysql_error()); 


    //Loop through the company records 
    while ($row = mysql_fetch_array($res, MYSQL_NUM)) { 

        //If record id is already flagged (matched and saved in the countries searched array) don't waste time doing anything 
        if (in_array($row[1], $countries_searched)) { 
            continue; 
        } 

        if (strlen($row[0]) > 9) { 
            $row[0] = substr($row[0], 0, 9); 
            $query = mysql_query("SELECT id FROM proj 
                WHERE country='$country' 
                AND company_name_metaphone LIKE '$row[0]%' 
                AND id<>'$row[1]'") or die (mysql_error()); 

            while ($id = mysql_fetch_array($query, MYSQL_NUM)) { 
                if (!in_array($id[0], $countries_searched)) $countries_searched[] = $id[0]; 
            } 
            if (mysql_num_rows($query) > 0) { 
                mysql_query("UPDATE proj SET match_no='$counter' 
                    WHERE country='$country' 
                    AND company_name_metaphone LIKE '$row[0]%'") 
                    or die (mysql_error()." ".mysql_errno()); 
                $counter++; 
            } 
        } 
        else if (strlen($row[0]) > 3) { 
            $query = mysql_query("SELECT id FROM proj WHERE country='$country' 
                AND company_name_text='$row[2]' AND id<>'$row[1]'") 
                or die (mysql_error()); 
            while ($id = mysql_fetch_array($query, MYSQL_NUM)) { 
                if (!in_array($id[0], $countries_searched)) $countries_searched[] = $id[0]; 
            } 
            if (mysql_num_rows($query) > 0) { 
                mysql_query("UPDATE proj SET match_no='$counter' 
                    WHERE country='$country' 
                    AND company_name_text='$row[2]'") or die (mysql_error()); 
                $counter++; 
            } 
        } 
    } 
} 
?> 
+0

Please fix your code formatting. It shows up broken for me. – jsw 2011-04-26 02:40:35

+1

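What are you actually trying to accomplish? What are the requirements? I can see a lot of problems in the code, but without knowing the requirements I am not sure which direction to point you in. For example, your first while loop is pointless. Are you just trying to de-duplicate your records? Or do you only need to flag all matching records with the same INT? What is your end goal? – 2011-04-26 03:04:58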

+0

Yes, flag the matching records with the same int. – nikhil 2011-04-26 06:04:56

Answers

1

I would go for a pure SQL solution, something like:

-- Metaphone of 4 to 9 characters: match on the name itself 
SELECT 
    GROUP_CONCAT(id SEPARATOR ' '), 'name' 
FROM proj 
WHERE 
    LENGTH(company_name_metaphone) <= 9 AND 
    LENGTH(company_name_metaphone) > 3 
GROUP BY country, UPPER(company_name_text) 
HAVING COUNT(*) > 1 
UNION 
-- Metaphone longer than 9 characters: match on its first 9 characters 
SELECT 
    GROUP_CONCAT(id SEPARATOR ' '), 'metaphone' 
FROM proj 
WHERE 
    LENGTH(company_name_metaphone) > 9 
GROUP BY country, LEFT(company_name_metaphone, 9) 
HAVING COUNT(*) > 1 

Then loop over this result to update the IDs.
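The loop over the result could look roughly like the sketch below. It only builds the UPDATE statements, one per group of ids; actually running them (with mysql_query, as in the question's code) is left out. The function name buildMatchUpdates and the sample id lists are made up for illustration, and the sketch assumes the first column of each result row is the space-separated id list from GROUP_CONCAT. One caveat: MySQL truncates GROUP_CONCAT output at group_concat_max_len (1024 bytes by default), so very large groups may need that setting raised.

```php
<?php
// Hypothetical helper: turns the space-separated id lists produced by
// GROUP_CONCAT(id SEPARATOR ' ') into one UPDATE statement per group.
function buildMatchUpdates(array $idLists, int $firstMatchNo = 1): array
{
    $statements = [];
    $matchNo = $firstMatchNo;
    foreach ($idLists as $idList) {
        // Split on whitespace and keep only purely numeric ids, cast to int.
        $ids = [];
        foreach (preg_split('/\s+/', trim($idList), -1, PREG_SPLIT_NO_EMPTY) as $id) {
            if (preg_match('/^\d+$/', $id)) {
                $ids[] = (int) $id;
            }
        }
        if (count($ids) < 2) {
            continue; // a group needs at least two records to be a match
        }
        $statements[] = sprintf(
            'UPDATE proj SET match_no = %d WHERE id IN (%s)',
            $matchNo,
            implode(', ', $ids)
        );
        $matchNo++; // next group gets the next integer
    }
    return $statements;
}

// Example input, shaped like the first column of the query result.
foreach (buildMatchUpdates(['12 57 903', '44 45']) as $sql) {
    echo $sql, PHP_EOL;
}
```

Because each UPDATE filters on the primary key, it should stay cheap even on 1.5 million rows, unlike the LIKE-based updates in the question.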

+0

Thanks for your help! – nikhil 2011-04-26 22:57:28

0

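I am not sure what you are trying to do, but what I can see in your code is that you run a lot of searches over an array that holds a lot of data. I think your problem is the PHP code rather than the SQL statements.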
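To make the point about the array concrete: in_array() scans the array element by element, so every check gets slower as $countries_searched grows. A minimal sketch of one alternative, assuming the ids are plain integers, is to store each flagged id as an array key and test it with isset(). The helper names markFlagged and isFlagged are invented for this example.

```php
<?php
// Sketch: remembering already-flagged ids as array keys instead of values.
$flagged = []; // id => true

function markFlagged(array &$flagged, int $id): void
{
    $flagged[$id] = true; // a repeated id just overwrites the same key
}

function isFlagged(array $flagged, int $id): bool
{
    return isset($flagged[$id]); // hash lookup, no scan of the whole array
}

foreach ([5, 9, 5, 120000] as $id) {
    markFlagged($flagged, $id);
}

var_dump(isFlagged($flagged, 9));  // bool(true)
var_dump(isFlagged($flagged, 10)); // bool(false)
var_dump(count($flagged));         // int(3)
```

Storing ids as keys also removes the need for the `!in_array(...)` check before appending, since a duplicate id cannot create a second entry.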

+0

Yes, but doesn't that save me time? – nikhil 2011-04-26 06:05:32

0

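You will need to adjust the GROUP BY fields to suit your matching requirements.

If your script times out (quite likely, given the amount of data), call set_time_limit(0). Alternatively, you could add a LIMIT of 1000 or so to $sql and run the script several times, because the WHERE clause will exclude any matching rows that have already been processed (it will not keep track of $match_no between runs, though, so you would need to handle that yourself).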

// find all companies that have multiple rows grouped by identifying fields 

$sql = "select company_name, country, COUNT(*) as num_matches from proj 
where match_no = 0 
group by company_name, country 
having num_matches > 1"; 

$res = mysql_query($sql); 

$match_no = 1; 

// loop through all duplicate companies, and set match_id 
while ($row = mysql_fetch_assoc($res)) { 

    $company_name = mysql_escape_string($row['company_name']); 
    $country = mysql_escape_string($row['country']); 

    $sql = "update proj set match_no = $match_no where 
     company_name = '$company_name' and country = '$country'"; 

    mysql_query($sql); 

    $match_no++; 
}