2017-06-03 44 views

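Getting data from Elasticsearch JSON documents into Python

How can I get the query results into a data frame whose columns preserve the hierarchical structure? Columns like these: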

type|postDate|discussionTitle|courses|subjectKeywords|SentiStrength|SentiWordNet|universities|universityKeywords| 

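I have an Elasticsearch index with about 1,000,000 JSON documents, and I want to use this dataset for natural language processing (NLP) in Python. Could someone please show me how to get data from Elasticsearch into Python and how to write data back from Python to Elasticsearch? Any help is much appreciated: I cannot run any NLP on the dataset until I can connect to it from Python.

I also want to add a new "process information" field to the hierarchy. Like "universityKeywords", it will be based on a set of keywords I supply, and each JSON document should store the keywords that were used to tag it. The process-information tags fall into four labels or categories (Applications, Offers, Enrollment, Requirements), assigned from keywords found in the thread title and post text of each JSON document.

This is the index structure in Elasticsearch:

{
"educationforumsenriched2": { 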
      "mappings": { 
      "whirlpool": { 
       "properties": { 
        "CourseInfo": { 
         "properties": { 
         "courses": { 
          "type": "string", 
          "index": "not_analyzed" 
         }, 
         "subjectKeywords": { 
          "type": "string", 
          "index": "not_analyzed" 
         } 
         } 
        }, 
        "SentimentInfo": { 
         "properties": { 
         "SentiStrength": { 
          "type": "float" 
         }, 
         "SentiWordNet": { 
          "type": "float" 
         } 
         } 
        }, 
        "UniversityInfo": { 
         "properties": { 
         "universities": { 
          "type": "string", 
          "index": "not_analyzed" 
         }, 
         "universityKeywords": { 
          "type": "string", 
          "index": "not_analyzed" 
         } 
         } 
        }, 
        "postDate": { 
         "type": "date", 
         "format": "strict_date_optional_time||epoch_millis" 
        }, 
        "postID": { 
         "type": "integer" 
        }, 
        "postText": { 
         "type": "string" 
        }, 
        "references": { 
         "type": "string" 
        }, 
        "threadID": { 
         "type": "integer" 
        }, 
        "threadTitle": { 
         "type": "string" 
        } 
       } 
      }, 
      "atarnotes": { 
       "properties": { 
        "CourseInfo": { 
         "properties": { 
         "courses": { 
          "type": "string", 
          "index": "not_analyzed" 
         }, 
         "subjectKeywords": { 
          "type": "string", 
          "index": "not_analyzed" 
         } 
         } 
        }, 
        "SentimentInfo": { 
         "properties": { 
         "SentiStrength": { 
          "type": "float" 
         }, 
         "SentiWordNet": { 
          "type": "float" 
         } 
         } 
        }, 
        "UniversityInfo": { 
         "properties": { 
         "universities": { 
          "type": "string", 
          "index": "not_analyzed" 
         }, 
         "universityKeywords": { 
          "type": "string", 
          "index": "not_analyzed" 
         } 
         } 
        }, 
        "discussionTitle": { 
         "type": "string" 
        }, 
        "postDate": { 
         "type": "date", 
         "format": "strict_date_optional_time||epoch_millis" 
        }, 
        "postID": { 
         "type": "integer" 
        }, 
        "postText": { 
         "type": "string" 
        }, 
        "query": { 
         "properties": { 
         "match_all": { 
          "type": "object" 
         } 
         } 
        }, 
        "threadID": { 
         "type": "integer" 
        }, 
        "threadTitle": { 
         "type": "string" 
        } 
       } 
      } 
      } 
     } 
    } 

This is the Java code I used to build the process-information tags. I want to do the same in Python:

processMap.put("Applications", new ArrayList<>(Arrays.asList("apply", "applied", "applicant", "applying", "application", "applications")));
processMap.put("Offers", new ArrayList<>(Arrays.asList("offers", "offer", "offered", "offering")));
processMap.put("Enrollment", new ArrayList<>(Arrays.asList("enrolling", "enroled", "enroll", "enrolment", "enrollment", "enrol", "enrolled")));
processMap.put("Requirements", new ArrayList<>(Arrays.asList("requirement", "requirements", "require")));
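
For comparison, here is a minimal Python sketch of the same keyword map, plus one possible way to tag a piece of text with it. The names `PROCESS_KEYWORDS` and `tag_process_info` are hypothetical. The Java snippet does not show how keywords are matched, so the case-insensitive whole-word matching below is an assumption.

```python
import re

# Python counterpart of the Java processMap: label -> keywords that trigger it
PROCESS_KEYWORDS = {
    "Applications": ["apply", "applied", "applicant", "applying", "application", "applications"],
    "Offers": ["offers", "offer", "offered", "offering"],
    "Enrollment": ["enrolling", "enroled", "enroll", "enrolment", "enrollment", "enrol", "enrolled"],
    "Requirements": ["requirement", "requirements", "require"],
}


def tag_process_info(text, keyword_map=PROCESS_KEYWORDS):
    """Return {label: sorted matched keywords} for every label whose keywords occur in text.

    Matching is case-insensitive and on whole words (an assumption, see above).
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    tags = {}
    for label, keywords in keyword_map.items():
        matched = sorted(words.intersection(keywords))
        if matched:
            tags[label] = matched
    return tags
```

Calling `tag_process_info` on the concatenated thread title and post text of a document would give the labels and the keywords that triggered them, which matches the requirement that each document store the keywords used for tagging.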

[Python Elasticsearch client](https://elasticsearch-py.readthedocs.io/en/master/)?


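pyelasticsearch? I have installed the package, but I can't figure out how to get this dataset into Python. A small example would be very helpful. This is the mapping structure of my Elasticsearch index: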


"educationforumsenriched2": { "mappings": { "whirlpool": { "properties": { "CourseInfo": {..

Answers


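With the elasticsearch python client, once your code has established a connection, you only need to supply the DSL query and the index you want to search in order to retrieve the information you need. For example, if you have a query: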

GET educationforumsenriched2/_search
{
    "query": {
        "match": {
            "CourseInfo.subjectKeywords": "foo"
        }
    }
}

The equivalent in Python is:

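from elasticsearch import Elasticsearch

# Pass the node as a URL; many other settings are available for HTTPS, authentication and so on
es = Elasticsearch("http://localhost:9200")

query = {
    "query": {
        "match": {
            "CourseInfo.subjectKeywords": "foo"
        }
    }
}
res = es.search(index="educationforumsenriched2", body=query)
hits = res["hits"]["hits"]  # matching documents; each original document sits under "_source"

# Do some processing here that builds new_doc_after_processing (a dict)

# Index the new document in ES; es.index lets Elasticsearch generate the document ID.
# doc_type is required by client versions that match an index with mapping types,
# like the one above; "whirlpool" is one of the two types in that mapping.
es.index(index="educationforumsenriched2", doc_type="whirlpool", body=new_doc_after_processing)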
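
If the goal is to add fields to existing documents rather than create new ones, one option is to send partial updates in bulk. The sketch below only builds the action dicts, in the format that `elasticsearch.helpers.bulk` reads; it does not talk to a cluster. `build_update_actions` and `compute_fields` are hypothetical names, and `compute_fields` stands for whatever processing produces the new fields.

```python
def build_update_actions(hits, compute_fields):
    """Yield one partial-update action per hit, in the dict format read by elasticsearch.helpers.bulk.

    hits: search hits, e.g. res["hits"]["hits"]
    compute_fields: callable taking a hit's "_source" dict and returning the new fields to store
    """
    for hit in hits:
        new_fields = compute_fields(hit["_source"])
        if not new_fields:
            continue  # nothing to add to this document
        action = {
            "_op_type": "update",
            "_index": hit["_index"],
            "_id": hit["_id"],
            "doc": new_fields,  # merged into the stored document by the update
        }
        if "_type" in hit:  # only present on versions that still use mapping types
            action["_type"] = hit["_type"]
        yield action
```

Passing the generator to `elasticsearch.helpers.bulk(es, actions)` would then send the updates. Note that `es.search` returns only the first page of hits by default; for an index of this size, `elasticsearch.helpers.scan(es, index=..., query=...)` yields hits in the same shape across the whole result set.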

Edit: on second thought, if your processing is not too complex, you could also consider setting up an ingest pipeline.


Thanks. But how can I turn the results into a data-frame-like structure whose fields can be edited like lists?


@BAstu What kind of data frame are we talking about: a pandas DataFrame? A Spark DataFrame? Maybe this question can help: https://stackoverflow.com/questions/25186148/creating-dataframe-from-elasticsearch-results – Adonis


It is a pandas DataFrame. Thanks
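
For the pandas case, a rough sketch of one way to get there: flatten each hit's nested `_source` into a single-level row, then build the DataFrame from the rows. The helper names (`flatten`, `hit_to_row`, `hits_to_dataframe`) are hypothetical, and the dotted column names such as `CourseInfo.courses` are a design choice meant to keep the hierarchy visible; they are not something the thread specifies.

```python
def flatten(nested, prefix=""):
    """Flatten nested dicts into one level, joining key paths with dots."""
    flat = {}
    for key, value in nested.items():
        path = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value  # lists (e.g. keyword lists) are kept as lists
    return flat


def hit_to_row(hit):
    """Turn one search hit into one flat row; the mapping type becomes the 'type' column."""
    row = {"type": hit.get("_type")}
    row.update(flatten(hit["_source"]))
    return row


def hits_to_dataframe(hits):
    """Build a pandas DataFrame with one row per hit."""
    import pandas as pd  # imported here so the helpers above work without pandas installed
    return pd.DataFrame([hit_to_row(hit) for hit in hits])
```

With `res` from a search, `hits_to_dataframe(res["hits"]["hits"])` would give columns like `type`, `postDate`, `CourseInfo.courses` and `SentimentInfo.SentiStrength`. Because list values stay as Python lists inside the cells, they can be edited like lists. `pandas.json_normalize` performs a similar flattening if you prefer a built-in.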