2011-01-27 78 views

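Partial match in Oracle database

I have a very large table (more than 1 million rows); the rows hold product names and prices from different sources.

Many products have the same name but different prices.

Here is the problem:

The same product appears in several rows, but its name is not identical from row to row, for example: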

Row    Product name             Price
-----  -----------------------  -----
Row 1  XYZ - size information   $a
Row 2  XYZ -Brand information   $b
Row 3  xyz                      $c

I want all products whose prices differ. If the names were identical across rows, I could simply self-join on Table1.Product_Name = Table2.Product_Name and Table1.Price != Table2.Price.
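A minimal sketch of that exact-match self-join, assuming a hypothetical table `products` with columns `product_name` and `price` (these names are placeholders, not taken from the real schema):

```sql
-- Hypothetical table and column names, for illustration only.
select t1.product_name,
       t1.price as price_1,
       t2.price as price_2
from   products t1
join   products t2
on     t1.product_name = t2.product_name   -- names must match exactly
and    t1.price < t2.price;                -- '<' rather than '!=' lists each pair once
```

This only pairs rows whose names are character-for-character equal, which is why it misses variants such as "XYZ - size information" and "xyz".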

But that won't work in this case :(

Can anyone suggest a solution for this?

Answers

3

You could try regexp_replace; it should take you in the right direction:

create table tq84_products (
    name varchar2(50), 
    price varchar2(5) 
); 

Three products:

  • XYZ
  • ABCD
  • efghi

ABCD has two records with the same price; all the others have different prices.

insert into tq84_products values (' XYZ - size information', '$a');
insert into tq84_products values ('XYZ - brand information', '$b');
insert into tq84_products values ('xyz'                    , '$c');

insert into tq84_products values ('Product ABCD'           , '$d');
insert into tq84_products values ('Abcd is the best'       , '$d');

insert into tq84_products values ('efghi is cheap'         , '$f');
insert into tq84_products values ('no, efghi is expensive' , '$g');
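Before the full query, it may help to see the word extraction on its own. The sketch below is a simplified illustration rather than the query itself: it runs against a literal string, uses only three capture groups, and separates words with `\W*` (a simplifying assumption). The replacement string `\2` keeps only what the second capture group matched.

```sql
-- Simplified illustration: three capture groups instead of nine.
-- Each (\w+)? captures one word; \W* skips the separators between words;
-- the trailing .* swallows whatever is left of the string.
select regexp_replace('XYZ - size information',
                      '\W*(\w+)?\W*(\w+)?\W*(\w+)?.*',
                      '\2') as second_word
from   dual;
```

Changing `\2` to `\1` or `\3` picks a different word, which is what concatenating a counter onto the backslash achieves in the full query.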

The select statement uses a list of stop words to drop words that are commonly found in product names.

with split_into_words as (
    -- One row per (product, word position 1..9): the pattern captures up to
    -- nine words and the back-reference '\N' keeps only the N-th one.
    select
        name,
        price,
        upper(
            regexp_replace(
                name,
                '\W*' ||
                '(\w+)?\W?+' ||
                '(\w+)?\W?+' ||
                '(\w+)?\W?+' ||
                '(\w+)?\W?+' ||
                '(\w+)?\W?+' ||
                '(\w+)?\W?+' ||
                '(\w+)?\W?+' ||
                '(\w+)?\W?+' ||
                '(\w+)?' ||
                '.*',
                '\' || submatch.counter
            )
        ) word
    from
        tq84_products,
        (select
             rownum counter
         from
             dual
         connect by
             level < 10
        ) submatch
),
stop_words as (
    -- Words too common in product names to count as a match.
    select 'IS'          word from dual union all
    select 'BRAND'       word from dual union all
    select 'INFORMATION' word from dual
)
select
    w1.price,
    w2.price,
    w1.name,
    w2.name
--  substr(w1.word, 1, 30)               common_word,
--  count(*) over (partition by w1.name) cnt
from
    split_into_words w1,
    split_into_words w2
where
    w1.word = w2.word and          -- the two names share a word
    w1.name < w2.name and          -- report each pair only once
    w1.word is not null and
    w2.word is not null and
    w1.word not in (select word from stop_words) and
    w2.word not in (select word from stop_words) and
    w1.price != w2.price;

This then selects:

$a  $b   XYZ - size information   XYZ - brand information
$b  $c  XYZ - brand information   xyz
$a  $c   XYZ - size information   xyz
$f  $g  efghi is cheap            no, efghi is expensive

So ABCD is not returned, while the others are.
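If the nine-group pattern is awkward to maintain, the word-splitting step could probably also be written with regexp_substr, whose fourth argument selects the N-th occurrence of a pattern. This is a sketch of an alternative, not the query used above, and it keeps the same assumption of at most nine words per name:

```sql
-- Alternative sketch: one row per word of each product name.
-- regexp_substr(source, pattern, start_position, occurrence) returns the
-- N-th run of word characters, or NULL when the name has fewer words.
select p.name,
       p.price,
       upper(regexp_substr(p.name, '\w+', 1, n.counter)) as word
from   tq84_products p,
       (select level as counter
        from   dual
        connect by level < 10) n            -- counters 1..9
where  regexp_substr(p.name, '\w+', 1, n.counter) is not null;
```

The result has the same shape as split_into_words (name, price, word), so it could stand in for that subquery; with more than a million rows, either variant multiplies the row count by the number of word positions, so it is worth checking the execution plan first.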


I'll try this. – onsy 2011-01-27 08:55:11