Getting data from Elasticsearch JSON documents into Python

How do I get the query results into a data frame whose columns preserve the hierarchical structure? Columns like these:

type|postDate|discussionTitle|courses|subjectKeywords|SentiStrength|SentiWordNet|universities|universityKeywords|

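I have an Elasticsearch instance with about 1,000,000 JSON documents. I want to use this dataset for natural language processing (NLP) in Python. Could someone please show me how to get the data from Elasticsearch into Python and how to write the data back from Python to Elasticsearch? I would really appreciate it, because I cannot run any NLP on the dataset until I can connect to it from Python.

I want to add a new field to the hierarchy, "process information", and populate it from a set of keywords that I provide. Just as with "universityKeywords", each JSON document should store the set of keywords that were used to tag it. The "process information" tags fall into four categories (Applications, Offers, Enrollment, and Requirements), assigned by matching keywords against the thread title and the post text of each JSON document.

This is the structure of the Elasticsearch index: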

"educationforumsenriched2": { 
      "mappings": { 
      "whirlpool": { 
       "properties": { 
        "CourseInfo": { 
         "properties": { 
         "courses": { 
          "type": "string", 
          "index": "not_analyzed" 
         }, 
         "subjectKeywords": { 
          "type": "string", 
          "index": "not_analyzed" 
         } 
         } 
        }, 
        "SentimentInfo": { 
         "properties": { 
         "SentiStrength": { 
          "type": "float" 
         }, 
         "SentiWordNet": { 
          "type": "float" 
         } 
         } 
        }, 
        "UniversityInfo": { 
         "properties": { 
         "universities": { 
          "type": "string", 
          "index": "not_analyzed" 
         }, 
         "universityKeywords": { 
          "type": "string", 
          "index": "not_analyzed" 
         } 
         } 
        }, 
        "postDate": { 
         "type": "date", 
         "format": "strict_date_optional_time||epoch_millis" 
        }, 
        "postID": { 
         "type": "integer" 
        }, 
        "postText": { 
         "type": "string" 
        }, 
        "references": { 
         "type": "string" 
        }, 
        "threadID": { 
         "type": "integer" 
        }, 
        "threadTitle": { 
         "type": "string" 
        } 
       } 
      }, 
      "atarnotes": { 
       "properties": { 
        "CourseInfo": { 
         "properties": { 
         "courses": { 
          "type": "string", 
          "index": "not_analyzed" 
         }, 
         "subjectKeywords": { 
          "type": "string", 
          "index": "not_analyzed" 
         } 
         } 
        }, 
        "SentimentInfo": { 
         "properties": { 
         "SentiStrength": { 
          "type": "float" 
         }, 
         "SentiWordNet": { 
          "type": "float" 
         } 
         } 
        }, 
        "UniversityInfo": { 
         "properties": { 
         "universities": { 
          "type": "string", 
          "index": "not_analyzed" 
         }, 
         "universityKeywords": { 
          "type": "string", 
          "index": "not_analyzed" 
         } 
         } 
        }, 
        "discussionTitle": { 
         "type": "string" 
        }, 
        "postDate": { 
         "type": "date", 
         "format": "strict_date_optional_time||epoch_millis" 
        }, 
        "postID": { 
         "type": "integer" 
        }, 
        "postText": { 
         "type": "string" 
        }, 
        "query": { 
         "properties": { 
         "match_all": { 
          "type": "object" 
         } 
         } 
        }, 
        "threadID": { 
         "type": "integer" 
        }, 
        "threadTitle": { 
         "type": "string" 
        } 
       } 
      } 
      } 
     } 
    }

This is the Java code I use to create the process-information tags. I want to do the same thing in Python:

processMap.put("Applications", new ArrayList<>(Arrays.asList("apply", "applied", "applicant", "applying", "application", "applications")));
processMap.put("Offers", new ArrayList<>(Arrays.asList("offers", "offer", "offered", "offering")));
processMap.put("Enrollment", new ArrayList<>(Arrays.asList("enrolling", "enroled", "enroll", "enrolment", "enrollment", "enrol", "enrolled")));
processMap.put("Requirements", new ArrayList<>(Arrays.asList("requirement", "requirements", "require")));
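A rough Python counterpart of the Java map above could look like the following. This is only a sketch: the names `PROCESS_KEYWORDS` and `tag_process_info` are made up for illustration, and the matching rule (case-insensitive, whole words only) is an assumption, since the question does not say how the Java code matches keywords.

```python
import re

# Same keyword sets as the Java processMap above.
PROCESS_KEYWORDS = {
    "Applications": ["apply", "applied", "applicant", "applying", "application", "applications"],
    "Offers": ["offers", "offer", "offered", "offering"],
    "Enrollment": ["enrolling", "enroled", "enroll", "enrolment", "enrollment", "enrol", "enrolled"],
    "Requirements": ["requirement", "requirements", "require"],
}


def tag_process_info(text, keyword_map=PROCESS_KEYWORDS):
    """Return {label: sorted matched keywords} for every label found in text."""
    # Assumption: lowercase the text and split it into alphabetic tokens,
    # so that keywords only match whole words.
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    tags = {}
    for label, keywords in keyword_map.items():
        matched = sorted(tokens.intersection(keywords))
        if matched:
            tags[label] = matched
    return tags


# Made-up example post, for illustration only.
post = "I applied last week and got an offer, but what are the requirements to enrol?"
print(tag_process_info(post))
# {'Applications': ['applied'], 'Offers': ['offer'], 'Enrollment': ['enrol'], 'Requirements': ['requirements']}
```

Storing the matched keywords per label mirrors the way "universityKeywords" sits next to "universities" in the mapping.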

Source

2017-06-03 BA stu

[Python Elasticsearch client](https://elasticsearch-py.readthedocs.io/en/master/)?

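pyelasticsearch? I have installed the package, but I can't figure out how to get this dataset into Python. A small example would be very helpful. This is the mapping structure of my Elasticsearch index:

"educationforumsenriched2": { "mappings": { "whirlpool": { "properties": { "CourseInfo": {..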

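With the Elasticsearch Python client, once your code has established a connection, you only need to provide the DSL query and the index you want to search in order to retrieve the information you need. For example, if you have this query: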

GET educationforumsenriched2/_search 
{ 
    "query": { 
     "match" : { 
      "CourseInfo.subjectKeywords" : "foo" 
     } 
    } 
}

The equivalent in Python is:

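from elasticsearch import Elasticsearch

# Many other settings are available (HTTPS, authentication, and so on)
es = Elasticsearch("http://localhost:9200")

query = {
    "query": {
        "match": {
            "CourseInfo.subjectKeywords": "foo"
        }
    }
}
res = es.search(index="educationforumsenriched2", body=query)

# Do some processing on res["hits"]["hits"] to build new_doc_after_processing

# Create a new document in ES. es.index generates the ID automatically,
# whereas es.create requires an explicit id argument.
# doc_type is needed on clusters that use mapping types, like this one;
# newer client versions no longer take it.
es.index(index="educationforumsenriched2", doc_type="whirlpool", body=new_doc_after_processing)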

Edit: come to think of it, if your processing is not too complex, you could also consider setting up an ingest pipeline.
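A single search call returns only the first page of hits (10 by default), so for an index of roughly 1,000,000 documents the client's `helpers.scan` and `helpers.bulk` functions are probably a better fit. The sketch below scrolls through every document and writes tags back as partial updates. The function names, the `tagger` callable, and the `ProcessInfo` field name are assumptions made for illustration, not something the question or the client prescribes.

```python
def build_update_actions(hits, tagger):
    """Yield one partial-update bulk action per search hit."""
    for hit in hits:
        source = hit["_source"]
        # Tag on the thread title plus the post text, skipping missing fields.
        text = " ".join(filter(None, [source.get("threadTitle"), source.get("postText")]))
        action = {
            "_op_type": "update",
            "_index": hit["_index"],
            "_id": hit["_id"],
            # "ProcessInfo" is a hypothetical name for the new field.
            "doc": {"ProcessInfo": tagger(text)},
        }
        if "_type" in hit:  # only older clusters report a mapping type
            action["_type"] = hit["_type"]
        yield action


def tag_whole_index(es, index, tagger):
    """Scroll through every document in index and write the tags back."""
    # Imported here so build_update_actions can be tried without the client installed.
    from elasticsearch import helpers

    hits = helpers.scan(es, index=index, query={"query": {"match_all": {}}})
    return helpers.bulk(es, build_update_actions(hits, tagger))


# Offline check with a made-up hit and a toy tagger (no cluster needed).
def simple_tagger(text):
    return sorted(set(text.lower().split()) & {"offer", "offers"})


sample_hits = [{
    "_index": "educationforumsenriched2",
    "_type": "whirlpool",
    "_id": "1",
    "_source": {"threadTitle": "Offers", "postText": "I got an offer"},
}]
actions = list(build_update_actions(sample_hits, simple_tagger))
print(actions[0]["doc"])
# {'ProcessInfo': ['offer', 'offers']}
```

Against a live cluster this would be called as `tag_whole_index(es, "educationforumsenriched2", simple_tagger)`, with `es` being the client object created in the answer above.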

Source

2017-06-03 14:09:06 Adonis

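Thanks. But how can I turn the results into a data-frame-like structure in which the fields can be edited like lists?

@BAstu What kind of data frame are we talking about: a pandas DataFrame? A Spark DataFrame? Maybe this question can help: https://stackoverflow.com/questions/25186148/creating-dataframe-from-elasticsearch-results – Adonis

It is a pandas data frame. Thanks.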
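For a pandas data frame with the columns listed in the question, one possible approach is to flatten the `_source` of each hit with `pandas.json_normalize` and keep only the last segment of each dotted column name. This is a sketch under assumptions: `hits_to_dataframe` is a made-up helper name, the sample hit contains placeholder values rather than real data, and the `type` column is taken from each hit's `_type`.

```python
import pandas as pd


def hits_to_dataframe(hits):
    """Flatten Elasticsearch hits into a DataFrame with one column per leaf field."""
    df = pd.json_normalize([hit["_source"] for hit in hits])
    # "CourseInfo.courses" -> "courses": keep only the last path segment.
    df.columns = [name.split(".")[-1] for name in df.columns]
    # Put the mapping type (e.g. "whirlpool" or "atarnotes") in a leading column.
    df.insert(0, "type", [hit.get("_type") for hit in hits])
    return df


# Placeholder hit shaped like the mapping in the question (values are made up).
sample_hits = [{
    "_type": "atarnotes",
    "_source": {
        "postDate": "2017-06-03",
        "discussionTitle": "example title",
        "CourseInfo": {"courses": ["example course"], "subjectKeywords": ["example subject"]},
        "SentimentInfo": {"SentiStrength": 1.0, "SentiWordNet": 0.5},
        "UniversityInfo": {"universities": ["example university"], "universityKeywords": ["example"]},
    },
}]
df = hits_to_dataframe(sample_hits)
print(sorted(df.columns))
```

List-valued fields such as `courses` stay as ordinary Python lists inside the cells, so they can be edited like lists. Dropping the parent name only works while leaf names are unique; if two groups ever shared a leaf name, keeping the dotted names would be safer.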
