2016-02-12

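Elasticsearch path_hierarchy tokenizes only half of the path

I want to index paths using the path_hierarchy tokenizer, but it seems to tokenize only half of the path I provide. I have tried different paths and the result is the same.

My settings: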

{ 
    "settings" : { 
     "number_of_shards" : 5, 
     "number_of_replicas" : 0, 
     "analysis":{ 
      "analyzer":{ 
       "keylower":{ 
        "type": "custom", 
        "tokenizer":"keyword", 
        "filter":"lowercase" 
       }, 
       "path_analyzer": { 
        "type": "custom", 
        "tokenizer": "path_tokenizer", 
        "filter": [ "lowercase", "asciifolding", "path_ngrams" ] 
       }, 
       "code_analyzer": { 
        "type": "custom", 
        "tokenizer": "standard", 
        "filter": [ "lowercase", "asciifolding", "code_stemmer" ] 
       }, 
       "not_analyzed": { 
        "type": "custom", 
        "tokenizer": "keyword", 
        "filter": [ "lowercase", "asciifolding", "code_stemmer" ] 
       } 
      }, 
      "tokenizer": { 
       "path_tokenizer": { 
        "type": "path_hierarchy" 
       } 
      }, 
      "filter": { 
       "path_ngrams": { 
        "type": "edgeNGram", 
        "min_gram": 3, 
        "max_gram": 15 
       }, 
       "code_stemmer": { 
        "type": "stemmer", 
        "name": "minimal_english" 
       } 
      } 
     } 
    } 
} 

My mapping is as follows:

{
    "dynamic": "strict",
    "properties": {
        "depot_path": {
            "type": "string",
            "analyzer": "path_analyzer"
        }
    },
    "_all": {
        "store": "yes",
        "analyzer": "english"
    }
}

When I analyzed the field, I found that the following tokens were generated for the depot_path "//cm/mirror/v1.2/Kolkata/ixin-packages/builds/":

       "key": "//c", 
       "key": "//cm", 
       "key": "//cm/", 
       "key": "//cm/m", 
       "key": "//cm/mi", 
       "key": "//cm/mir", 
       "key": "//cm/mirr", 
       "key": "//cm/mirro", 
       "key": "//cm/mirror", 
       "key": "//cm/mirror/", 
       "key": "//cm/mirror/v", 
       "key": "//cm/mirror/v1", 
       "key": "//cm/mirror/v1.", 

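Why isn't the entire path tokenized?

My expected result is tokens generated all the way up to //cm/mirror/v1.2/Kolkata/ixin-packages/builds/

I have tried increasing the buffer size, but with no luck. Does anyone know what I am doing wrong?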

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pathhierarchy-tokenizer.html

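Answers

"max_gram": 15 limits the token size to 15 characters: the path_ngrams (edgeNGram) filter keeps only the first 3 to 15 characters of each token produced by the tokenizer. If you increase "max_gram", you will see more of the path being tokenized.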

Here is an example from my environment.

"max_gram": 15
    input path: /var/log/www/html/web/
    path_analyzer tokenized this path up to /var/log/www/ht, i.e. 15 characters

"max_gram": 100
    input path: /var/log/www/html/web/WANTED
    path_analyzer tokenized this path up to /var/log/www/html/web/WANTED, i.e. 28 characters (< 100)
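
To make the effect of "max_gram" concrete, here is a minimal Python sketch of what an edge n-gram filter does to a single token. It is not Elasticsearch's implementation: the helper name edge_ngrams is made up for illustration, the full path is treated as one token (a simplification of what path_hierarchy emits), and the lowercase and asciifolding filters are ignored.

```python
def edge_ngrams(token, min_gram, max_gram):
    """Return the prefixes of `token` whose length is between
    min_gram and max_gram (hypothetical helper, for illustration only)."""
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


path = "//cm/mirror/v1.2/Kolkata/ixin-packages/builds/"

# With max_gram = 15, no gram can be longer than 15 characters.
short = edge_ngrams(path, 3, 15)
print(short[0], short[-1])

# With a max_gram larger than the path, the last gram is the full path.
full = edge_ngrams(path, 3, 100)
print(full[-1] == path)
```

With min_gram 3 and max_gram 15 the sketch yields 13 prefixes, from "//c" to "//cm/mirror/v1.", which matches the token list in the question.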

Thanks :) I decided to just get rid of the 'path_ngrams' filter.
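
For reference, a rough Python sketch of the tokens that remain once the n-gram filter is dropped and only the path_hierarchy tokenizer runs. The function hierarchy_tokens is a hypothetical, simplified emulation written for this thread, not Lucene's code. It reproduces the "/one/two/three" example from the Elasticsearch documentation, but does not claim to model edge cases such as doubled or trailing delimiters.

```python
def hierarchy_tokens(path, delimiter="/"):
    """Simplified emulation of path_hierarchy: emit one token per
    path level, each containing all of its parent levels."""
    tokens = []
    # Start at index 1 so a leading delimiter stays attached to the first level.
    for i in range(1, len(path)):
        if path[i] == delimiter:
            tokens.append(path[:i])
    tokens.append(path)  # the complete path is the last token
    return tokens


print(hierarchy_tokens("/one/two/three"))
```

Each level of the path becomes its own token, so no token is cut short by a length limit.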

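This is because your "max_gram" value is set to 15. As a result, you will notice that the longest generated token ("//cm/mirror/v1.") has a length of 15. Change it to a very large number and you will get the tokens you want.

Thanks :) Accepting Shubhangi's answer because she beat you by 16 seconds. :)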