2017-05-08 106 views
import glob

from bs4 import BeautifulSoup

paths = glob.glob("/home/anastasiya/PycharmProjects/bachelor/rutexts/*.xhtml")
for text in paths:
    print(text)
    with open(text, "r", encoding="windows-1251") as file:
        with open("ruscorpus.txt", "a") as file2:
            # Each line is parsed on its own, so soup.w is only the first <w> in that line
            for line in file:
                soup = BeautifulSoup(line, "lxml")
                if soup.w is not None:
                    # test() is a helper from the asker's code that is not shown here
                    file2.write("{wort}\t{gr}\t{lex}\n".format(
                        lex=soup.w.ana.get('lex'),
                        gr=test(soup.w.ana.get('gr')),
                        wort=soup.w.contents[-1]))

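I am trying to extract some information from XML. The format looks like this: [image of the XML sample not preserved]. The program runs, but if there are two words in one w tag, it takes the first one and outputs the whole tag.

xml with BeautifulSoup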


Why are you reading your `xml` data line by line?

Answers


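1. First, your code reads the file `text` line by line and passes each line to bs4 separately for parsing. I suggest passing the opened file object directly to bs4 instead.

2. In bs4, you can use find_all to find every occurrence of a particular tag, such as the w tag, and then read its contents.

Change your code like this: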

with open(text, "r", encoding="windows-1251") as file1, open("ruscorpus.txt", "a") as file2:
    xml_soup = BeautifulSoup(file1, 'lxml')
    for w in xml_soup.find_all('w'):  # get every w tag and parse it
        file2.write("{wort}\t{gr}\t{lex}\n".format(
            lex=w.ana.get('lex'), gr=w.ana.get('gr'), wort=w.contents[-1]))
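
To try the find_all approach without the corpus files, here is a minimal self-contained sketch. The sample markup, the `se` wrapper tag, the attribute values, and the helper name `extract_rows` are all made up for illustration. It also uses the built-in "html.parser" instead of "lxml" only so that nothing beyond bs4 has to be installed; the real files may need a different parser.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one line that holds two <w> tags.
SAMPLE = (
    '<se>'
    '<w><ana lex="мама" gr="S,f,anim=sg,nom"></ana>Мама</w> '
    '<w><ana lex="мыть" gr="V,ipf,tran=sg,praet"></ana>мыла</w>.'
    '</se>'
)


def extract_rows(markup):
    """Return one (word, gr, lex) tuple per <w> tag in the markup."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for w in soup.find_all("w"):
        ana = w.find("ana")
        if ana is None:
            # Skip <w> tags that carry no analysis.
            continue
        # The word form is the last child of <w>, after the <ana> tag.
        rows.append((str(w.contents[-1]), ana.get("gr"), ana.get("lex")))
    return rows


rows = extract_rows(SAMPLE)
for wort, gr, lex in rows:
    print("{}\t{}\t{}".format(wort, gr, lex))
```

Unlike soup.w, which only ever returns the first w tag it finds, find_all yields both words of the sample line.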

Thanks. Do you know how I can get the punctuation that sits outside the w tags? Закр'ой - закр'oй
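
One possible way to pick up punctuation outside the w tags, offered only as a sketch: walk every text node in the document and check whether it has a w ancestor. The sample markup, the `se` wrapper, the helper name `tokens_with_punctuation`, and the "word"/"punct" labels are assumptions for illustration, and "html.parser" again stands in for "lxml".

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical sample with punctuation between and after the <w> tags.
SAMPLE = (
    '<se>'
    '<w><ana lex="да" gr="PART"></ana>Да</w>, '
    '<w><ana lex="конечно" gr="ADV"></ana>конечно</w>!'
    '</se>'
)


def tokens_with_punctuation(markup):
    """Return (kind, text) pairs in document order.

    kind is "word" for text inside a <w> tag and "punct" for any
    other non-whitespace text.
    """
    soup = BeautifulSoup(markup, "html.parser")
    tokens = []
    for node in soup.descendants:
        if not isinstance(node, NavigableString):
            continue
        text = str(node).strip()
        if not text:
            # Ignore whitespace-only text between tags.
            continue
        kind = "word" if node.find_parent("w") is not None else "punct"
        tokens.append((kind, text))
    return tokens


tokens = tokens_with_punctuation(SAMPLE)
for kind, text in tokens:
    print(kind, text)
```

Anything that is not whitespace and not inside a w tag is labelled "punct" here, so stray non-punctuation text outside w would need an extra filter.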