Fixed-width text file to Python dictionary


I would like to import into Python a text file like the one shown below.

+ CATEGORY_1 first_part of long attribute <NAME_a> 
|  ...second part of long attribute 
| + CATEGORY_2: a sequence of attributes that extend over 
| |  ... possibly many <NAME_b> 
| |  ... lines 
| | + SOURCE_1 => source_code 
| + CATEGORY_2: another sequence of attributes that extend over <NAME_c> 
| |  ... possibly many lines 
| | + CATEGORY_1: yet another sequence of <NAME_d> attributes that extend over 
| | |  ...many lines 
| | | + CATEGORY_2: I really think <NAME_e> that 
| | | |  ... you got the point 
| | | |  ... now 
| | | | + SOURCE_1 => source_code 
| + SOURCE_2 => path_to_file

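I think I can easily identify the name of each object, since it is delimited by <...>.

My ideal output would be a Python dictionary that reflects the hierarchy of the txt file, for example: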

{NAME_a : {'category' : CATEGORY_1,
           'dependencies' : {NAME_b : {'category' : CATEGORY_2,
                                       'source_type' : SOURCE_1,
                                       'source_code' : source_code},
                             NAME_c : {'category' : CATEGORY_2,
                                       'dependencies' : {NAME_d : {'category' : CATEGORY_1,
                                                                   'dependencies' : {NAME_e : {'category' : CATEGORY_2,
                                                                                               'source_type' : SOURCE_1,
                                                                                               'source_code' : source_code}
                                                                                    }
                                                                  }
                                                        }
                                      }
                            },
           'source_type' : SOURCE_2,
           'source_code' : path_to_file
          }
}

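I think the main idea here is to count the number of tabs before the start of each line, which determines the hierarchy. I tried looking at pandas read_fwf and numpy loadfromtxt, but without any success. Can you point me to relevant modules or strategies for solving this problem?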

Source

2016-11-18 FLab

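Any hint on how to approach this problem would be greatly appreciated. I am not just looking for an "out of the box" solution. – FLab

Strategy: since your data is stored flat (it is a text file), you need to write your own parser to work out the level of each line, identify the names, and so on. To build the dictionary structure, you need a stack.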

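Not a complete answer, but you can follow an approach that uses a stack.

Every time you enter a category, you push the category key onto the stack. Then you read the next line, check its number of tabs, and store it. If that level is not deeper than the previous one (the same level, or one closer to the root), you pop an item from the stack. After that you only need basic regular expressions to extract the items.

Some Python/pseudocode, so you can get the idea: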

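def parse(file):
    levels = []       # stack of the names that are currently open
    items = {}
    last_level = -1   # so that the first line counts as entering a new level

    for line in file:
        current_level = count_tabs(line)   # placeholder: depth of this line
        if current_level > last_level:
            name = extract_name(line)      # placeholder: the text between < and >
            levels.append(name)
            items = fill_dictionary_in_level(name, line)   # placeholder
        else:
            levels.pop()
        last_level = current_level

    return items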
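
To make the stack idea more concrete, here is a minimal runnable sketch, not a definitive implementation. It rests on several assumptions that go beyond the question: the depth of a record is the number of "|" characters before its "+" (the sample shows bars rather than tabs), "..." lines continue the previous "+" line, every non-source record contains exactly one <NAME>, and source lines have the form "TYPE => value". The names SAMPLE, logical_records and build_tree are made up for this illustration.

```python
import json
import re

SAMPLE = """\
+ CATEGORY_1 first_part of long attribute <NAME_a>
|  ...second part of long attribute
| + CATEGORY_2: a sequence of attributes that extend over
| |  ... possibly many <NAME_b>
| |  ... lines
| | + SOURCE_1 => source_code
| + CATEGORY_2: another sequence of attributes that extend over <NAME_c>
| |  ... possibly many lines
| | + CATEGORY_1: yet another sequence of <NAME_d> attributes that extend over
| | |  ...many lines
| | | + CATEGORY_2: I really think <NAME_e> that
| | | |  ... you got the point
| | | |  ... now
| | | | + SOURCE_1 => source_code
| + SOURCE_2 => path_to_file
"""


def logical_records(lines):
    """Yield (depth, text) for each '+' line, merged with its '...' continuations."""
    depth, text = None, ""
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        prefix, plus, rest = line.rstrip().partition("+")
        if plus and set(prefix) <= {"|", " "}:
            # Only bars and spaces precede the '+': a new record starts here.
            if depth is not None:
                yield depth, text
            depth, text = prefix.count("|"), rest.strip()
        elif depth is not None:
            # Continuation line: drop the bars and the leading dots, keep the text.
            text += " " + line.strip().lstrip("| ").lstrip(".").strip()
    if depth is not None:
        yield depth, text


def build_tree(lines):
    """Build the nested dictionary; stack[d] is the dict that owns records at depth d."""
    root = {}
    stack = [root]
    for depth, text in logical_records(lines):
        del stack[depth + 1:]          # pop back to the parent of this record
        parent = stack[depth]          # IndexError here means a level was skipped
        source = re.match(r"(\S+)\s*=>\s*(.*)", text)
        if source:
            # A source line describes the parent itself, it opens no new level.
            parent["source_type"] = source.group(1)
            parent["source_code"] = source.group(2).strip()
            continue
        name = re.search(r"<([^>]+)>", text).group(1)
        node = {"category": re.match(r"[^\s:]+", text).group(0)}
        children = parent if depth == 0 else parent.setdefault("dependencies", {})
        children[name] = node
        stack.append(node)             # push: deeper records attach to this node
    return root


tree = build_tree(SAMPLE.splitlines())
print(json.dumps(tree, indent=2))
```

Merging the continuation lines first matters because a name can sit on a "..." line (NAME_b in the sample), so the dictionary key is only known once the whole record has been read. Indexing the stack by depth also handles a jump back over several levels at once, which a single pop per line would miss.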

Source

2016-11-18 11:25:20 danielfranca

Here is a strategy:

For each line, use a regular expression to parse the line and extract the data.

Here is a draft:

import re 

line = "| + CATEGORY_2: another sequence of attributes that extend over <NAME_c>" 

level = line.count("|") + 1 
mo = re.match(r".*\+\s+(?P<category>[^:]+):.*<(?P<name>[^>]+)>", line) 
category = mo.group("category") 
name = mo.group("name") 

print("level: {0}".format(level)) 
print("category: {0}".format(category)) 
print("name: {0}".format(name))

You get:

level: 2 
category: CATEGORY_2 
name: NAME_c
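
The draft above expects a colon after the category and the name on the same line, but the sample in the question also has a category line without a colon (the first one), names that appear on a "..." continuation line, and "=>" source lines. A possible extension, only a sketch under those observations, is to classify every line with a few patterns. ParsedLine, parse_line and the three pattern names are invented for this example, and the level keeps the draft's convention of counting the leading "|" characters plus one.

```python
import re
from typing import NamedTuple, Optional


class ParsedLine(NamedTuple):
    kind: str                          # "entry", "source" or "continuation"
    level: int
    category: Optional[str] = None
    name: Optional[str] = None
    source_type: Optional[str] = None
    source_code: Optional[str] = None


# Leading bars, '+', then "TYPE => value".
SOURCE_RE = re.compile(
    r"^(?P<bars>[| ]*)\+\s+(?P<source_type>\S+)\s*=>\s*(?P<source_code>.*\S)\s*$"
)
# Leading bars, '+', then the category; the colon is optional.
ENTRY_RE = re.compile(r"^(?P<bars>[| ]*)\+\s+(?P<category>[^\s:]+):?")
# A name in angle brackets, anywhere in the line.
NAME_RE = re.compile(r"<(?P<name>[^>]+)>")
# Only the bars at the start of the line, used for continuation lines.
BARS_RE = re.compile(r"^[| ]*")


def parse_line(line):
    """Classify one physical line of the report and extract what it carries."""
    mo = SOURCE_RE.match(line)
    if mo:
        return ParsedLine(
            "source",
            mo.group("bars").count("|") + 1,
            source_type=mo.group("source_type"),
            source_code=mo.group("source_code"),
        )
    name_mo = NAME_RE.search(line)
    name = name_mo.group("name") if name_mo else None
    mo = ENTRY_RE.match(line)
    if mo:
        return ParsedLine(
            "entry",
            mo.group("bars").count("|") + 1,
            category=mo.group("category"),
            name=name,
        )
    # Anything else continues the previous entry, which has one bar fewer,
    # so the bar count alone already equals that entry's level.
    return ParsedLine("continuation", BARS_RE.match(line).group(0).count("|"), name=name)


for sample_line in (
    "+ CATEGORY_1 first_part of long attribute <NAME_a>",
    "| |  ... possibly many <NAME_b>",
    "| | + SOURCE_1 => source_code",
):
    print(parse_line(sample_line))
```

With this convention a continuation line reports the same level as the entry it belongs to, so a name found there can be attached to the most recent entry at that level.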

Source

2016-11-18 11:25:23
