2017-03-06

Failing to get a pattern_capture token filter working

I am trying to implement an Elasticsearch pattern_capture filter that turns EDR-00004 into the tokens [EDR-00004, 00004, 4]. I am (still) using Elasticsearch 2.4, but the documentation does not differ from the current ES version.

I followed the example from the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/2.4/analysis-pattern-capture-tokenfilter.html

Here are my tests and their results:

curl -XPUT 'localhost:9200/test_index' -d '{ 
    "settings": { 
     "analysis": { 
      "filter": { 
       "process_number_filter": { 
        "type": "pattern_capture", 
        "preserve_original": 1, 
        "patterns": [ 
         "([A-Za-z]+-([0]+([0-9]+)))" 
        ] 
       } 
      }, 
      "analyzer": { 
       "process_number_analyzer": { 
        "type": "custom", 
        "tokenizer": "pattern", 
        "filter": ["process_number_filter"] 
       } 
      } 
     } 
    } 
}' 

curl -XGET 'localhost:9200/test_index/_analyze' -d ' 
{ 
    "analyzer": "process_number_analyzer", 
    "text": "EDR-00002" 
}' 

curl -XGET 'localhost:9200/test_index/_analyze' -d ' 
{ 
    "analyzer": "standard", 
    "tokenizer": "standard", 
    "filter": ["process_number_filter"], 
    "text": "EDR-00002" 
}' 

This returns:

{"acknowledged":true} 

{ 
    "tokens": [{ 
     "token": "EDR", 
     "start_offset": 0, 
     "end_offset": 3, 
     "type": "word", 
     "position": 0 
    }, { 
     "token": "00002", 
     "start_offset": 4, 
     "end_offset": 9, 
     "type": "word", 
     "position": 1 
    }] 
} 

{ 
    "tokens": [{ 
     "token": "edr", 
     "start_offset": 0, 
     "end_offset": 3, 
     "type": "<ALPHANUM>", 
     "position": 0 
    }, { 
     "token": "00002", 
     "start_offset": 4, 
     "end_offset": 9, 
     "type": "<NUM>", 
     "position": 1 
    }] 
} 

I understand that:

  1. I don't need to group the entire regex, since I have preserve_original set.
  2. I could replace parts with \d and/or \w, but this way I don't have to think about escaping.

I also made sure my regex is correct:

>>> m = re.match(r"([A-Za-z]+-([0]+([0-9]+)))", "EDR-00004")                                             
>>> m.groups() 
('EDR-00004', '00004', '4') 
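To make the expected behaviour concrete, here is a rough Python sketch (a hypothetical helper, not part of Elasticsearch) of the tokens the filter should emit for one input token, assuming preserve_original keeps the input token and each capture group becomes its own token:

```python
import re

def simulate_pattern_capture(token, pattern, preserve_original=True):
    """Rough simulation of the pattern_capture token filter for one token."""
    tokens = [token] if preserve_original else []
    m = re.search(pattern, token)
    if m:
        # Each capture group becomes a token; skip duplicates of tokens already emitted.
        tokens.extend(g for g in m.groups() if g is not None and g not in tokens)
    return tokens

print(simulate_pattern_capture("EDR-00004", r"([A-Za-z]+-([0]+([0-9]+)))"))
# ['EDR-00004', '00004', '4']
```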

Answer

I hate answering my own question, but I found the solution, and maybe it will help someone in the future.

My problem was the default pattern tokenizer, which split the text before it ever reached my filter. By adding my own tokenizer, which overrides the default pattern "\W+" with "[^\\w-]+", my filter receives the whole word and produces the right tokens.
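The difference between the two split patterns can be illustrated with Python's re module (Elasticsearch uses Java regexes, but \w and \W cover the same characters for this input):

```python
import re

# The default pattern tokenizer splits on "\W+", which breaks on the hyphen,
# so the filter never sees the full "EDR-00002" token.
print(re.split(r"\W+", "EDR-00002"))     # ['EDR', '00002']

# Excluding the hyphen from the split class keeps the hyphenated word intact.
print(re.split(r"[^\w-]+", "EDR-00002")) # ['EDR-00002']
```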

These are my custom settings now:

curl -XPUT 'localhost:9200/test_index' -d '{ 
    "settings": { 
     "analysis": { 
      "filter": { 
       "process_number_filter": { 
        "type": "pattern_capture", 
        "preserve_original": 1, 
        "patterns": [ 
         "([A-Za-z]+-([0]+([0-9]+)))" 
        ] 
       } 
      }, 
      "tokenizer": { 
       "process_number_tokenizer": { 
        "type": "pattern", 
        "pattern": "[^\\w-]+" 
       } 
      }, 
      "analyzer": { 
       "process_number_analyzer": { 
        "type": "custom", 
        "tokenizer": "process_number_tokenizer", 
        "filter": ["process_number_filter"] 
       } 
      } 
     } 
    } 
}' 

Which leads to the following result:

{ 
    "tokens": [ 
     { 
      "token": "EDR-00002", 
      "start_offset": 0, 
      "end_offset": 9, 
      "type": "word", 
      "position": 0 
     }, 
     { 
      "token": "00002", 
      "start_offset": 0, 
      "end_offset": 9, 
      "type": "word", 
      "position": 0 
     }, 
     { 
      "token": "2", 
      "start_offset": 0, 
      "end_offset": 9, 
      "type": "word", 
      "position": 0 
     } 
    ] 
}