2012-04-06 75 views
3

As part of a project, a few of us are working on URL classification. What we are trying to achieve is actually quite simple: just look at a URL, find the relevant keywords in it, and classify the page accordingly.

For example: if the URL is http://cnnworld/sports/abcd, we would classify it under the category "Sports".

To do this, we have a database of mappings in the format: keyword -> category.

What we currently do is: for each URL, we read through every entry in the database and use String.find() to check whether the keyword appears in the URL. Once a match is found, we stop.

But this approach has a few problems, mainly:

(i) Our database is very large, so running these repeated queries is extremely slow.

(ii) A page can belong to more than one category, and our approach does not handle that. Of course, a simple way to handle it is to keep querying the database even after a category match is found, but that only makes things slower.

I was thinking about alternatives and wondering whether the work could be done in reverse - parse the URL, find the words that occur in it, and then query the database only for those words.

A naive algorithm would run in O(n^2) - querying the database for every substring that occurs in the URL.

I was wondering whether there is a better way to achieve this. Any ideas?? Thanks in advance :)

Answers

0

If you have (many) fewer categories than keywords, you could create one regex per category that matches any of that category's keywords. Then you would run your URL against each category's regex. This would also handle the case where a URL matches multiple categories.
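A minimal sketch of that idea in Java, assuming a small in-memory category-to-keyword map; the class name, sample categories and keywords below are made up for illustration:

import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class CategoryRegexDemo {
    public static void main(String[] args) {
        // Hypothetical category -> keywords map; in practice this would be loaded from the database
        Map<String, List<String>> categoryKeywords = Map.of(
                "Sports", List.of("sports", "football", "cricket"),
                "News", List.of("cnn", "news", "headlines"));

        String url = "http://cnnworld/sports/abcd";

        for (Map.Entry<String, List<String>> entry : categoryKeywords.entrySet()) {
            // One regex per category: an alternation of all of that category's keywords
            String alternation = entry.getValue().stream()
                    .map(Pattern::quote)
                    .collect(Collectors.joining("|"));
            Pattern pattern = Pattern.compile(alternation, Pattern.CASE_INSENSITIVE);

            if (pattern.matcher(url).find()) {
                System.out.println("URL matches category: " + entry.getKey());
            }
        }
    }
}

For the example URL this prints both "Sports" (from "sports") and "News" (from "cnn"), so multi-category matches come out naturally.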

0

I think your suggestion of splitting up the URL to find useful bits and then querying for just those items sounds like a decent way to go.

I threw together some Java that might help illustrate what I think this would entail code-wise. The most valuable parts are probably the regexes, but I hope the general algorithm helps some as well:

import java.io.UnsupportedEncodingException; 
import java.net.URLDecoder; 
import java.util.List; 

public class CategoryParser 
{ 
    /** The db field that keywords should be checked against */ 
    private static final String DB_KEYWORD_FIELD_NAME = "keyword"; 

    /** The db field that categories should be pulled from */ 
    private static final String DB_CATEGORY_FIELD_NAME = "category"; 

    /** The name of the table to query */ 
    private static final String DB_TABLE_NAME = "KeywordCategoryMap"; 

    /** 
    * This method takes a URL and from that text alone determines what categories that URL belongs in. 
    * @param url - String URL to categorize 
    * @return categories - A List<String> of categories the URL seemingly belongs in 
    */ 
    public static List<String> getCategoriesFromUrl(String url) { 

     // Clean the URL to remove useless bits and encoding artifacts 
     String normalizedUrl = normalizeURL(url); 

     // Break the url apart and get the good stuff 
     String[] keywords = tokenizeURL(normalizedUrl); 

     // Construct the query we can query the database with 
     String query = constructKeywordCategoryQuery(keywords); 

     System.out.println("Generated Query: " + query); 

     // At this point, you'd need to fire this query off to your database, 
     // and the results you'd get back should each be a valid category 
     // for your URL. This code is not provided because it's very implementation specific, 
     // and you already know how to deal with databases. 


     // Returning null to make this compile, even though you'd obviously want to return the 
     // actual List of Strings 
     return null; 
    } 

    /** 
    * Removes the protocol, if it exists, from the front and 
    * removes any random encoding characters 
    * Extend this to do other url cleaning/pre-processing 
    * @param url - The String URL to normalize 
    * @return normalizedUrl - The String URL that has no junk or surprises 
    */ 
    private static String normalizeURL(String url) 
    { 
     // Decode URL to remove any %20 type stuff 
     String normalizedUrl = url; 
     try { 
      // I've used a URLDecoder that's part of Java here, 
      // but this functionality exists in most modern languages 
      // and is universally called url decoding 
      normalizedUrl = URLDecoder.decode(url, "UTF-8"); 
     } 
     catch(UnsupportedEncodingException uee) 
     { 
      System.err.println("Unable to Decode URL. Decoding skipped."); 
      uee.printStackTrace(); 
     } 

     // Remove the protocol, http:// ftp:// or similar from the front 
     if (normalizedUrl.contains("://")) 
     { 
      normalizedUrl = normalizedUrl.split(":\\/\\/")[1]; 
     } 

     // Room here to do more pre-processing 

     return normalizedUrl; 
    } 

    /** 
    * Takes apart the url into the pieces that make at least some sense 
    * This doesn't guarantee that each token is a potentially valid keyword, however 
    * because that would require actually iterating over them again, which might be 
    * seen as a waste. 
    * @param url - Url to be tokenized 
    * @return tokens - A String array of all the tokens 
    */ 
    private static String[] tokenizeURL(String url) 
    { 
     // I assume that we're going to use the whole URL to find tokens in 
     // If you want to just look in the GET parameters, or you want to ignore the domain 
     // or you want to use the domain as a token itself, that would have to be 
     // processed above the next line, and only the remaining parts split 
     String[] tokens = url.split("\\b|_"); 

     // One could alternatively use a more complex regex to remove more invalid matches 
     // but this is subject to your (?:in)?ability to actually write the regex you want 

     // These next two get rid of tokens that are too short, also. 

     // Destroys anything that's not alphanumeric and things that are 
     // alphanumeric but only 1 character long 
     //String[] tokens = url.split("(?:[\\W_]+\\w)*[\\W_]+"); 

     // Destroys anything that's not alphanumeric and things that are 
     // alphanumeric but only 1 or 2 characters long 
     //String[] tokens = url.split("(?:[\\W_]+\\w{1,2})*[\\W_]+"); 

     return tokens; 
    } 

    private static String constructKeywordCategoryQuery(String[] keywords) 
    { 
     // This will hold our WHERE body, keyword OR keyword2 OR keyword3 
     StringBuilder whereItems = new StringBuilder(); 

     // Potential query, if we find anything valid 
     String query = null; 

     // Iterate over every found token 
     for (String keyword : keywords) 
     { 
      // Reject invalid keywords 
      if (isKeywordValid(keyword)) 
      { 
       // If we need an OR 
       if (whereItems.length() > 0) 
       { 
        whereItems.append(" OR "); 
       } 

       // Simply append this item to the query 
       // Yields something like "keyword='thisKeyword'" 
       whereItems.append(DB_KEYWORD_FIELD_NAME); 
       whereItems.append("='"); 
       whereItems.append(keyword); 
       whereItems.append("'"); 
      } 
     } 

     // If a valid keyword actually made it into the query 
     if (whereItems.length() > 0) 
     { 
      query = "SELECT DISTINCT(" + DB_CATEGORY_FIELD_NAME + ") FROM " + DB_TABLE_NAME 
        + " WHERE " + whereItems.toString() + ";"; 
     } 

     return query; 
    } 

    private static boolean isKeywordValid(String keyword) 
    { 
     // Keywords better be at least 2 characters long 
     return keyword.length() > 1 
       // And they better be only composed of letters and numbers 
       && keyword.matches("\\w+") 
       // And they better not be *just* numbers 
       // && !keyword.matches("\\d+") // If you want this 
       ; 
    } 

    // How this would be used 
    public static void main(String[] args) 
    { 
     List<String> soQuestionUrlClassifications = getCategoriesFromUrl("http://stackoverflow.com/questions/10046178/pattern-matching-for-url-classification"); 
     List<String> googleQueryURLClassifications = getCategoriesFromUrl("https://www.google.com/search?sugexp=chrome,mod=18&sourceid=chrome&ie=UTF-8&q=spring+is+a+new+service+instance+created#hl=en&sugexp=ciatsh&gs_nf=1&gs_mss=spring%20is%20a%20new%20bean%20instance%20created&tok=lnAt2g0iy8CWkY65Te75sg&pq=spring%20is%20a%20new%20bean%20instance%20created&cp=6&gs_id=1l&xhr=t&q=urlencode&pf=p&safe=off&sclient=psy-ab&oq=url+en&gs_l=&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=2176d1af1be1f17d&biw=1680&bih=965"); 
    } 
} 

The generated query for the SO link would look like this:

SELECT DISTINCT(category) FROM KeywordCategoryMap WHERE keyword='stackoverflow' OR keyword='com' OR keyword='questions' OR keyword='10046178' OR keyword='pattern' OR keyword='matching' OR keyword='for' OR keyword='url' OR keyword='classification' 

There is plenty of room for optimization here, but I imagine it is still much faster than checking the URL against every possible keyword string.

0

The Aho-Corasick algorithm is best suited for scanning a string for many keywords in a single pass. You build a tree (an Aho-Corasick trie) from your keywords, where each terminal node stores a number that maps to a particular keyword.

Now you simply run the URL string through the trie. Whenever you reach one of those numbers (which acts as a flag in our scenario), it means you have a matching keyword. Look that number up in a hash map to find the corresponding category for further use.

I think this will help you. See this link: good animation of aho-corasick by ivan
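For illustration, here is a minimal, self-contained Java sketch of an Aho-Corasick matcher along the lines described above; the class and method names (AhoCorasickSketch, addKeyword, build, findMatches) and the sample keyword-to-category map are all hypothetical:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class AhoCorasickSketch {

    private static class Node {
        Map<Character, Node> next = new HashMap<>();
        Node fail;
        List<String> outputs = new ArrayList<>(); // keywords ending here (or at a fail-link ancestor)
    }

    private final Node root = new Node();

    /** Insert one keyword into the trie. */
    public void addKeyword(String keyword) {
        Node cur = root;
        for (char c : keyword.toCharArray()) {
            cur = cur.next.computeIfAbsent(c, k -> new Node());
        }
        cur.outputs.add(keyword);
    }

    /** Build the failure links with a breadth-first pass, after all keywords have been added. */
    public void build() {
        Deque<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node cur = queue.poll();
            for (Map.Entry<Character, Node> e : cur.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = cur.fail;
                while (f != null && !f.next.containsKey(c)) {
                    f = f.fail;
                }
                child.fail = (f == null) ? root : f.next.get(c);
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    /** Scan the text once and return every keyword that occurs anywhere in it. */
    public Set<String> findMatches(String text) {
        Set<String> matches = new HashSet<>();
        Node cur = root;
        for (char c : text.toCharArray()) {
            while (cur != root && !cur.next.containsKey(c)) {
                cur = cur.fail;
            }
            cur = cur.next.getOrDefault(c, root);
            matches.addAll(cur.outputs);
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical keyword -> category map standing in for the database
        Map<String, String> keywordToCategory = Map.of("sports", "Sports", "cnn", "News");

        AhoCorasickSketch matcher = new AhoCorasickSketch();
        keywordToCategory.keySet().forEach(matcher::addKeyword);
        matcher.build();

        for (String hit : matcher.findMatches("http://cnnworld/sports/abcd")) {
            System.out.println(hit + " -> " + keywordToCategory.get(hit));
        }
    }
}

The whole URL is scanned once, so the cost is roughly proportional to the URL length plus the number of matches, regardless of how many keywords are in the trie.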

2

In our commercial classifier we have a database of 4M keywords :) and we also search the body of the HTML. There are a number of ways to tackle this:

  1. Use Aho-Corasick. We used a modified algorithm specialized for working with web content; for example, it treats tab, space, \r and \n as a single space (so two consecutive spaces count as one), and it also ignores upper/lower case.
  2. Another option is to put all your keywords into a tree (e.g. std::map), so the lookup becomes very fast. The downside is that this takes memory, and a lot of it, but if it runs on a server you won't feel it (see the sketch right after this list).
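For illustration, a minimal Java sketch of option 2, using a TreeMap as the in-memory analogue of the std::map idea; the sample keywords and categories are made up:

import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class KeywordMapLookup {
    public static void main(String[] args) {
        // Hypothetical in-memory keyword -> category map kept entirely in memory
        TreeMap<String, String> keywordToCategory = new TreeMap<>();
        keywordToCategory.put("sports", "Sports");
        keywordToCategory.put("cnn", "News");
        keywordToCategory.put("cricket", "Sports");

        String url = "http://cnnworld/sports/abcd";

        // Tokenize the URL on non-word characters, then do one O(log n) lookup per token
        Set<String> categories = new TreeSet<>();
        for (String token : url.toLowerCase().split("\\W+")) {
            String category = keywordToCategory.get(token);
            if (category != null) {
                categories.add(category);
            }
        }
        System.out.println("Categories: " + categories);
    }
}

Note that exact token lookup will not find keywords embedded inside longer tokens (e.g. "cnn" inside "cnnworld"); the Aho-Corasick approach above handles that case.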
+0

Please do not paste signatures or append anything to posts or questions. You can use it in your profile if you like, but it just creates noise in SO threads. – ForceMagic 2012-10-14 01:00:57