2017-06-17 82 views
1

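Python - Extract text between multiple strings from multiple files

Python gurus, I need to extract all the text from each <List> line through the URL line that follows it; an example of the pattern is below. I would also like the script to loop over all the files in a folder.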

..... 
..... 
<List>Product Line</List> 
<URL>http://teamspace.abb.com/sites/Product</URL> 
... 
... 
<List>Contact Number</List> 
<URL>https://teamspace.abb.com/sites/Contact</URL> 
.... 
.... 

Expected output

<List>Product Line</List> 
<URL>http://teamspace.abb.com/sites/Product</URL> 
<List>Contact Number</List> 
<URL>https://teamspace.abb.com/sites/Contact</URL> 

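I have written a script that loops over all the files in the folder and extracts every line containing the <List> keyword, but I cannot get it to include the URL. Any help is much appreciated.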

import os

# Location of the parent folder (raw strings keep the backslashes literal)
BASE_DIRECTORY = r'C:\D_Drive\Projects\Test'
output_file = open(r'C:\D_Drive\Projects\Test\Output.txt', 'w')
output = {}
file_list = []

# Collect every XML file under the parent folder, including sub folders
for (dirpath, dirnames, filenames) in os.walk(BASE_DIRECTORY):
    for f in filenames:
        if 'xml' in str(f):
            e = os.path.join(str(dirpath), str(f))
            file_list.append(e)

# Keep only the lines that contain <List> (the <URL> lines are still missing)
for f in file_list:
    print f
    txtfile = open(f, 'r')
    output[f] = []
    for line in txtfile:
        if '<List>' in line:
            output[f].append(line)

tabs = []
for tab in output:
    tabs.append(tab)

tabs.sort()
# Write one block per file: the file name, then its matching lines
for tab in tabs:
    output_file.write(tab + '\n')
    output_file.write('\n')
    for row in output[tab]:
        output_file.write(row + '')
    output_file.write('\n')
    output_file.write('----------------------------------------------------------\n')

output_file.close()  # flush the results to disk before waiting for input
raw_input()

Sample file

+0

The input and the expected output look the same. Try to improve your question. – fferri

+0

Why reinvent the wheel? Just use an XML parser, such as [ElementTree](https://docs.python.org/2/library/xml.etree.elementtree.html) – dawg

+0

Please fix the indentation. –

Answers

1

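Your code is basically correct; the only change it needs is an explicit iterator over the file, so that the line following each <List> match can be pulled in with next(). You could use ElementTree or Beautiful Soup instead, but iterating like this also works when the file is not XML or HTML.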

txtfile = iter(open(f, 'r'))  # change here
output[f] = []
for line in txtfile:
    if '<List>' in line:
        output[f].append(line)
        output[f].append(next(txtfile))  # and here: take the line after <List>
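
For anyone who wants to try this outside the full script, here is a minimal self-contained sketch of the same idea. It is written in Python 3 syntax (an assumption; the code in this thread is Python 2), and the helper name `extract_list_and_next` is made up for illustration. It assumes the `<URL>` line always comes directly after the `<List>` line, and it passes a default to `next()` so a `<List>` on the very last line does not raise `StopIteration`.

```python
def extract_list_and_next(lines, marker='<List>'):
    """Return each line containing `marker` plus the line right after it."""
    lines = iter(lines)  # one shared iterator, so next() also advances the loop
    kept = []
    for line in lines:
        if marker in line:
            kept.append(line)
            following = next(lines, None)  # None when marker is on the last line
            if following is not None:
                kept.append(following)
    return kept


# Sample lines modelled on the pattern shown in the question
sample = [
    '.....\n',
    '<List>Product Line</List>\n',
    '<URL>http://teamspace.abb.com/sites/Product</URL>\n',
    '...\n',
    '<List>Contact Number</List>\n',
    '<URL>https://teamspace.abb.com/sites/Contact</URL>\n',
    '....\n',
]
print(''.join(extract_list_and_next(sample)), end='')
```

Because the function accepts any iterable of lines, an open file object can be passed in directly in place of `sample`.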
+0

Excellent! Thank you very much – user1902849

2

Try xml.etree.ElementTree:

import xml.etree.ElementTree as ET

tree = ET.parse('Product_Workflow.xml')
root = tree.getroot()
with open('Output.txt', 'w') as opfile:
    # Pair the n-th <List> element with the n-th <URL> element
    for l, u in zip(root.iter('List'), root.iter('URL')):
        opfile.write(ET.tostring(l).strip())
        opfile.write('\n')
        opfile.write(ET.tostring(u).strip())
        opfile.write('\n')

Output.txt will contain:

<List>Emove</List> 
<URL>http://teamspace.abb.com/sites/Product</URL> 
<List>Asset_KWT</List> 
<URL>https://teamspace.slb.com/sites/Contact</URL> 
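
The question also asked for a loop over every file in a folder, so here is a rough sketch of how the ElementTree approach might be combined with `os.walk`. It uses Python 3 syntax (an assumption; the thread's code is Python 2), where `ET.tostring` needs `encoding='unicode'` to return text instead of bytes. The function names and the `<Root>`/`<Item>` wrapper elements in the demo are hypothetical. Note that `zip` pairs elements purely by position, so this only lines up correctly if every `<List>` has a matching `<URL>` in document order.

```python
import os
import xml.etree.ElementTree as ET


def list_url_pairs(root):
    """Pair each <List> element with the <URL> element at the same position."""
    pairs = []
    for l, u in zip(root.iter('List'), root.iter('URL')):
        # encoding='unicode' makes tostring() return str rather than bytes
        pairs.append((ET.tostring(l, encoding='unicode').strip(),
                      ET.tostring(u, encoding='unicode').strip()))
    return pairs


def pairs_for_folder(base_directory):
    """Map each .xml path under base_directory to its (List, URL) pairs."""
    results = {}
    for dirpath, dirnames, filenames in os.walk(base_directory):
        for name in filenames:
            if name.lower().endswith('.xml'):
                path = os.path.join(dirpath, name)
                results[path] = list_url_pairs(ET.parse(path).getroot())
    return results


# Hypothetical document; the <Root>/<Item> wrappers are made up for the demo
demo = ET.fromstring(
    '<Root>'
    '<Item><List>Product Line</List>'
    '<URL>http://teamspace.abb.com/sites/Product</URL></Item>'
    '<Item><List>Contact Number</List>'
    '<URL>https://teamspace.abb.com/sites/Contact</URL></Item>'
    '</Root>'
)
for list_text, url_text in list_url_pairs(demo):
    print(list_text)
    print(url_text)
```

Calling `pairs_for_folder` with the parent folder from the question would then give one list of pairs per XML file, which can be written out in the same per-file layout as the original script.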
+0

Thanks for the information. I will look into the XML element approach. – user1902849

1

You can use filter or a list comprehension, like this:

tgt = ('URL', 'List')
with open('file') as f:
    print filter(lambda line: any(e in line for e in tgt), (line for line in f))

Or:

with open('/tmp/file') as f:
    print [line for line in f if any(e in line for e in tgt)]

Either one prints:

[' <List>Product Line</List>\n', ' <URL>http://teamspace.abb.com/sites/Product</URL>\n', ' <List>Contact Number</List>\n', ' <URL>https://teamspace.abb.com/sites/Contact</URL>\n'] 
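
As a small sketch of the same filtering idea in Python 3 syntax (an assumption; the snippets above are Python 2): there `filter()` returns a lazy iterator, so it has to be wrapped in `list()` before printing. The function name `matching_lines` is hypothetical, and matching on `'<List>'` and `'<URL>'` with the angle brackets is my own choice, to avoid picking up lines that merely mention the words.

```python
def matching_lines(lines, targets=('<List>', '<URL>')):
    """Keep only the lines that contain at least one of the target strings."""
    return [line for line in lines if any(t in line for t in targets)]


# Sample lines modelled on the pattern shown in the question
sample = [
    '.....\n',
    ' <List>Product Line</List>\n',
    ' <URL>http://teamspace.abb.com/sites/Product</URL>\n',
    '...\n',
    ' <List>Contact Number</List>\n',
    ' <URL>https://teamspace.abb.com/sites/Contact</URL>\n',
]

# Python 3: filter() is lazy, so list() is needed to materialise the result
filtered = list(filter(lambda line: any(t in line for t in ('<List>', '<URL>')),
                       sample))

print(matching_lines(sample))
```

Unlike the `next()` approach, this does not depend on the `<URL>` line sitting directly after the `<List>` line, but it also keeps any `<URL>` line that has no `<List>` before it.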
+0

Thanks for your comment, I will take a look at it. – user1902849
