2016-03-28 366 views

Extracting city names from text using Python

I have a dataset in which one column is headed "What is your location and time zone?"

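This means we have entries like

  1. Denmark, CET
  2. Location is Devon, England, GMT time zone
  3. Australia. Australian Eastern Standard Time. +10h UTC.

and even

  • My location is Eugene, Oregon for most of the year or in Seoul, South Korea depending on school holidays. My primary time zone is the Pacific time zone.
  • For the entire May I will be in London, United Kingdom (GMT+1). For the entire June I will be in either Norway (GMT+2) or Israel (GMT+3) with limited internet access. For the entire July and August I will be in London, United Kingdom (GMT+1). And then from September, 2015, I will be in Boston, United States (EDT)

Is there a way to extract the city, country and time zone from this?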

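I was thinking of creating an array (built from an open source dataset) of all country names, including short forms, together with city names and time zones. If any word in an entry matched a city, country, time zone or short form, it would be filled into a new column of the same dataset and counted.

Is this practical?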
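The lookup idea described above could be sketched roughly as follows. This is only an illustration: the three sets are tiny, hypothetical stand-ins for a real open dataset, and the function names `find_known_names` and `tag_entry` are made up for this sketch.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny lookup tables. A real run would load these
# from an open dataset of countries, cities and time zone abbreviations.
COUNTRIES = {"denmark", "england", "australia", "united kingdom", "norway", "israel"}
CITIES = {"london", "boston", "seoul", "eugene"}
TIMEZONES = {"cet", "gmt", "edt", "utc"}


def find_known_names(text, names):
    """Return the entries of `names` that occur in `text` as whole words."""
    lowered = text.lower()
    found = []
    for name in sorted(names):
        # \b keeps 'cet' from matching inside a longer word.
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            found.append(name)
    return found


def tag_entry(text):
    """Build the values for the new columns of one free-text entry."""
    return {
        "countries": find_known_names(text, COUNTRIES),
        "cities": find_known_names(text, CITIES),
        "timezones": find_known_names(text, TIMEZONES),
    }


entries = [
    "Denmark, CET",
    "Location is Devon, England, GMT time zone",
    "For the entire May I will be in London, United Kingdom (GMT+1).",
]
tagged = [tag_entry(entry) for entry in entries]

# Count how often each country was matched across all entries.
country_counts = Counter(c for row in tagged for c in row["countries"])
print(tagged[1])
print(country_counts)
```

A plain lookup like this only finds names that are in the tables and spelled the same way, so places missing from the dataset (Devon here) or ambiguous names would still need manual review.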

=========== REPLY BASED ON NLTK ANSWER ============

Running the same code as alecxe, I get:

    Traceback (most recent call last): 
        File "E:\SBTF\ntlk_test.py", line 19, in <module> 
        tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences] 
        File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\tag\__init__.py", line 110, in pos_tag 
        tagger = PerceptronTagger() 
        File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\tag\perceptron.py", line 141, in __init__ 
        self.load(AP_MODEL_LOC) 
        File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\tag\perceptron.py", line 209, in load 
        self.model.weights, self.tagdict, self.classes = load(loc) 
        File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\data.py", line 801, in load 
        opened_resource = _open(resource_url) 
        File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\data.py", line 924, in _open 
        return urlopen(resource_url) 
        File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 154, in urlopen 
        return opener.open(url, data, timeout) 
        File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 431, in open 
        response = self._open(req, data) 
        File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 454, in _open 
        'unknown_open', req) 
        File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 409, in _call_chain 
        result = func(*args) 
        File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 1265, in unknown_open 
        raise URLError('unknown url type: %s' % type) 
    URLError: <urlopen error unknown url type: c> 
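A possible reading of the last frames, offered as a guess rather than a confirmed diagnosis: the tagger's model location seems to have reached `urlopen` as a bare Windows path, so the drive letter before the colon was taken for a URL scheme. The sketch below uses Python 3's `urllib.parse` (the traceback itself is from Python 2.7) and a hypothetical path to show that parsing behaviour.

```python
from urllib.parse import urlparse

# Hypothetical path, shaped like the ones in the traceback.
model_path = r"C:\Python27\ArcGIS10.4\lib\site-packages\nltk_data\model.pickle"

# Everything before the first colon is read as the URL scheme.
print(urlparse(model_path).scheme)            # c
print(urlparse("file:" + model_path).scheme)  # file
```

If that reading is right, the problem sits inside the installed NLTK rather than in the script, so trying a newer NLTK release would be a reasonable first step.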
    

Answer


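I would use what Natural Language Processing and nltk have to offer to extract the entities.

The example (heavily based on this gist) splits each line of a file into sentences and word tokens, tags the tokens with parts of speech, chunks them, and recursively collects every chunk that carries the NE (named entity) label. More explanation here.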

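    import nltk

    # Assumes the NLTK data packages for sentence tokenizing, POS tagging and
    # NE chunking have already been installed with nltk.download().

    def extract_entity_names(t):
        """Recursively collect the text of every subtree labelled 'NE'."""
        entity_names = []

        # Only nltk Tree nodes have a label; leaves are (word, tag) tuples.
        if hasattr(t, 'label') and t.label:
            if t.label() == 'NE':
                entity_names.append(' '.join([child[0] for child in t]))
            else:
                for child in t:
                    entity_names.extend(extract_entity_names(child))

        return entity_names

    with open('sample.txt', 'r') as f:
        for line in f:
            # Line -> sentences -> word tokens -> POS tags -> NE chunks.
            sentences = nltk.sent_tokenize(line)
            tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
            tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
            # binary=True labels every entity 'NE' instead of PERSON, GPE, etc.
            chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

            entities = []
            for tree in chunked_sentences:
                entities.extend(extract_entity_names(tree))

            print(entities)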
    

For a sample.txt containing:

    Denmark, CET 
    Location is Devon, England, GMT time zone 
    Australia. Australian Eastern Standard Time. +10h UTC. 
    My location is Eugene, Oregon for most of the year or in Seoul, South Korea depending on school holidays. My primary time zone is the Pacific time zone. 
    For the entire May I will be in London, United Kingdom (GMT+1). For the entire June I will be in either Norway (GMT+2) or Israel (GMT+3) with limited internet access. For the entire July and August I will be in London, United Kingdom (GMT+1). And then from September, 2015, I will be in Boston, United States (EDT) 
    

it prints:

    ['Denmark', 'CET'] 
    ['Location', 'Devon', 'England', 'GMT'] 
    ['Australia', 'Australian Eastern Standard Time'] 
    ['Eugene', 'Oregon', 'Seoul', 'South Korea', 'Pacific'] 
    ['London', 'United Kingdom', 'Norway', 'Israel', 'London', 'United Kingdom', 'Boston', 'United States', 'EDT'] 
    

The output is not ideal, but it might be a good start for you.
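The lists printed above contain none of the numeric offsets such as GMT+1 or +10h UTC. One way to pick those up separately is a pair of regular expressions, sketched below. The two patterns and the helper name `extract_offsets` are assumptions made for this illustration, cover only whole-hour offsets written next to GMT or UTC, and are not part of nltk.

```python
import re

# Assumed pattern 1: 'GMT' or 'UTC' followed by a signed hour, e.g. 'GMT+1'.
OFFSET_AFTER = re.compile(r"\b(GMT|UTC)\s*([+-])\s*(\d{1,2})\b")
# Assumed pattern 2: a signed hour followed by 'GMT' or 'UTC', e.g. '+10h UTC'.
OFFSET_BEFORE = re.compile(r"([+-])\s*(\d{1,2})\s*h?\s*(GMT|UTC)\b")


def extract_offsets(text):
    """Return signed hour offsets mentioned next to GMT or UTC, in text order."""
    hits = []
    for m in OFFSET_AFTER.finditer(text):
        hits.append((m.start(), int(m.group(2) + m.group(3))))
    for m in OFFSET_BEFORE.finditer(text):
        hits.append((m.start(), int(m.group(1) + m.group(2))))
    # Sort by position so the offsets line up with the order of the places.
    return [offset for _, offset in sorted(hits)]


print(extract_offsets("Australia. Australian Eastern Standard Time. +10h UTC."))
print(extract_offsets("London, United Kingdom (GMT+1), then Norway (GMT+2) or Israel (GMT+3)"))
```

Named zones such as CET or EDT would still have to come from the entity list or from a lookup table of abbreviations.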

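How does this work? It seems like witchcraft – Keatinge

@Racialz nltk is often amazing! I am far from an expert in NLP, but I have tried to add more explanation and links for further reading. Thanks for asking for details! – alecxe

Brilliant. I didn't know about NLTK - I will experiment with this and then (hopefully) accept the answer :-) – GeorgeC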
