2017-03-06

Failing to get a pattern_capture token filter working

I am trying to implement an Elasticsearch pattern_capture filter that turns EDR-00004 into the tokens [EDR-00004, 00004, 4]. I am (still) using Elasticsearch 2.4, but the documentation does not differ from the current ES version.

I followed the example from the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/2.4/analysis-pattern-capture-tokenfilter.html

Here are my tests and their results:

curl -XPUT 'localhost:9200/test_index' -d '{ 
    "settings": { 
     "analysis": { 
      "filter": { 
       "process_number_filter": { 
        "type": "pattern_capture", 
        "preserve_original": 1, 
        "patterns": [ 
         "([A-Za-z]+-([0]+([0-9]+)))" 
        ] 
       } 
      }, 
      "analyzer": { 
       "process_number_analyzer": { 
        "type": "custom", 
        "tokenizer": "pattern", 
        "filter": ["process_number_filter"] 
       } 
      } 
     } 
    } 
}' 

curl -XGET 'localhost:9200/test_index/_analyze' -d ' 
{ 
    "analyzer": "process_number_analyzer", 
    "text": "EDR-00002" 
}' 

curl -XGET 'localhost:9200/test_index/_analyze' -d ' 
{ 
    "analyzer": "standard", 
    "tokenizer": "standard", 
    "filter": ["process_number_filter"], 
    "text": "EDR-00002" 
}' 

This returns:

{"acknowledged":true} 

{ 
    "tokens": [{ 
     "token": "EDR", 
     "start_offset": 0, 
     "end_offset": 3, 
     "type": "word", 
     "position": 0 
    }, { 
     "token": "00002", 
     "start_offset": 4, 
     "end_offset": 9, 
     "type": "word", 
     "position": 1 
    }] 
} 

{ 
    "tokens": [{ 
     "token": "edr", 
     "start_offset": 0, 
     "end_offset": 3, 
     "type": "<ALPHANUM>", 
     "position": 0 
    }, { 
     "token": "00002", 
     "start_offset": 4, 
     "end_offset": 9, 
     "type": "<NUM>", 
     "position": 1 
    }] 
} 

I understand that:

  1. I don't need to group the entire regex, since I have preserve_original set.
  2. I could replace parts with \d and/or \w, but this way I don't have to think about escaping.

I also made sure my regex is correct:

>>> m = re.match(r"([A-Za-z]+-([0]+([0-9]+)))", "EDR-00004")                                             
>>> m.groups() 
('EDR-00004', '00004', '4') 
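To make the expected behaviour concrete, here is a rough Python sketch (a hypothetical helper, not part of Elasticsearch) of the tokens the filter should emit for one input token, assuming preserve_original keeps the input token and each capture group becomes its own token:

```python
import re

def simulate_pattern_capture(token, pattern, preserve_original=True):
    """Rough simulation of the pattern_capture token filter for one token."""
    tokens = [token] if preserve_original else []
    m = re.search(pattern, token)
    if m:
        # Each capture group becomes a token; skip duplicates of tokens already emitted.
        tokens.extend(g for g in m.groups() if g is not None and g not in tokens)
    return tokens

print(simulate_pattern_capture("EDR-00004", r"([A-Za-z]+-([0]+([0-9]+)))"))
# ['EDR-00004', '00004', '4']
```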

Answer

I hate answering my own question, but I found the solution, and maybe it will help someone in the future.

My problem was the default pattern tokenizer, which split the text before it ever reached my filter. By adding my own tokenizer, which overrides the default pattern "\W+" with "[^\\w-]+", my filter receives the whole word and produces the right tokens.
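The difference between the two split patterns can be illustrated with Python's re module (Elasticsearch uses Java regexes, but \w and \W cover the same characters for this input):

```python
import re

# The default pattern tokenizer splits on "\W+", which breaks on the hyphen,
# so the filter never sees the full "EDR-00002" token.
print(re.split(r"\W+", "EDR-00002"))     # ['EDR', '00002']

# Excluding the hyphen from the split class keeps the hyphenated word intact.
print(re.split(r"[^\w-]+", "EDR-00002")) # ['EDR-00002']
```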

These are my custom settings now:

curl -XPUT 'localhost:9200/test_index' -d '{ 
    "settings": { 
     "analysis": { 
      "filter": { 
       "process_number_filter": { 
        "type": "pattern_capture", 
        "preserve_original": 1, 
        "patterns": [ 
         "([A-Za-z]+-([0]+([0-9]+)))" 
        ] 
       } 
      }, 
      "tokenizer": { 
       "process_number_tokenizer": { 
        "type": "pattern", 
        "pattern": "[^\\w-]+" 
       } 
      }, 
      "analyzer": { 
       "process_number_analyzer": { 
        "type": "custom", 
        "tokenizer": "process_number_tokenizer", 
        "filter": ["process_number_filter"] 
       } 
      } 
     } 
    } 
}' 

Which leads to the following result:

{ 
    "tokens": [ 
     { 
      "token": "EDR-00002", 
      "start_offset": 0, 
      "end_offset": 9, 
      "type": "word", 
      "position": 0 
     }, 
     { 
      "token": "00002", 
      "start_offset": 0, 
      "end_offset": 9, 
      "type": "word", 
      "position": 0 
     }, 
     { 
      "token": "2", 
      "start_offset": 0, 
      "end_offset": 9, 
      "type": "word", 
      "position": 0 
     } 
    ] 
}