xml with BeautifulSoup

import glob

from bs4 import BeautifulSoup

# test() is the asker's own helper; its definition is not shown in the post.
paths = glob.glob("/home/anastasiya/PycharmProjects/bachelor/rutexts/*.xhtml")
for text in paths:
    print(text)
    with open(text, "r", encoding="windows-1251") as file:
        with open("ruscorpus.txt", "a") as file2:
            for line in file:
                # Each line is parsed separately; soup.w is the first w tag only.
                soup = BeautifulSoup(line, "lxml")
                if soup.w is not None:
                    file2.write("{wort}\t{gr}\t{lex}\n".format(
                        lex=soup.w.ana.get('lex'),
                        gr=test(soup.w.ana.get('gr')),
                        wort=soup.w.contents[-1]))

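I am trying to extract some information from XML. The format is like this. The program runs, but if there are 2 words in 1 w tag, it takes the first one and writes out the whole tag.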

Source

2017-05-08 Nastja Kryvoscheya

Why are you reading your 'xml' data line by line?

Check online demo

Use soup.find_all('w'); it returns all w tags as a list.

soup.w gives only the first w tag.
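The difference can be seen on a single line that holds two w tags. This is a minimal sketch: the sample line and its attribute values are made up, and it uses the built-in "html.parser" instead of "lxml" only so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical input line with two <w> tags; the attribute values are invented.
line = ('<w><ana lex="a" gr="CONJ"></ana>а</w> '
        '<w><ana lex="b" gr="S"></ana>б</w>')

soup = BeautifulSoup(line, "html.parser")

first = soup.w               # a single Tag: the first <w> only
all_w = soup.find_all("w")   # a list-like ResultSet holding every <w>

first_lex = first.ana.get("lex")
all_lex = [w.ana.get("lex") for w in all_w]

print(first_lex)
print(all_lex)
```

With soup.w, the second word on the line is never visited, which matches the behaviour described in the question.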

Source

2017-05-08 08:12:27

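1. In the first place, your code reads the file text line by line and then passes each line to bs4 to parse. I suggest passing the opened file object directly to bs4 instead.

2. In bs4 you can use find_all to find every occurrence of a specific tag, such as the w tags.

Change your code like this: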

with open(text, "r", encoding="windows-1251") as file1, open("ruscorpus.txt", "a") as file2:
    xml_soup = BeautifulSoup(file1, 'lxml')
    for w in xml_soup.find_all('w'):  # get every w tag and parse it
        file2.write("{wort}\t{gr}\t{lex}\n".format(
            lex=w.ana.get('lex'), gr=w.ana.get('gr'), wort=w.contents[-1]))
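To try the same idea without the corpus files, the sketch below parses an in-memory file object in one pass and collects one row per w tag. The function name extract_rows, the sample markup, and the guard for w tags without an ana child are assumptions for illustration, and "html.parser" stands in for "lxml" so that nothing beyond bs4 is needed.

```python
import io

from bs4 import BeautifulSoup


def extract_rows(markup):
    """Return (word, gr, lex) tuples for every <w> tag in the markup.

    `markup` may be a string or an open file object.
    """
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for w in soup.find_all("w"):
        ana = w.ana
        if ana is None:
            # Assumption: skip <w> tags that carry no <ana> child.
            continue
        # The word form is the last child of <w>, after the <ana> tag.
        rows.append((str(w.contents[-1]), ana.get("gr"), ana.get("lex")))
    return rows


# Hypothetical sample standing in for one .xhtml file.
sample = io.StringIO(
    '<se>\n'
    '<w><ana lex="lemma1" gr="S"></ana>word1</w> '
    '<w><ana lex="lemma2" gr="V"></ana>word2</w>.\n'
    '</se>\n'
)

rows = extract_rows(sample)
for word, gr, lex in rows:
    print("{}\t{}\t{}".format(word, gr, lex))
```

Because the whole file is parsed at once, it no longer matters how many w tags share a line.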

Source

2017-05-08 09:24:40

Thanks. Do you know how I can get the punctuation that is outside w? Закр'ой - закр'oй.
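One possible approach to the punctuation question, offered as a sketch rather than as the answerer's method: text that sits between w tags is a plain string node, so it can be read from each tag's next_sibling. The sample markup is hypothetical, and "html.parser" is used so the snippet runs with bs4 alone.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical sentence: two <w> tags with punctuation between and after them.
sample = ('<se><w><ana lex="l1" gr="V"></ana>word1</w> - '
          '<w><ana lex="l2" gr="V"></ana>word2</w>.</se>')

soup = BeautifulSoup(sample, "html.parser")

tokens = []
for w in soup.find_all("w"):
    tokens.append(("word", str(w.contents[-1])))
    following = w.next_sibling
    # Keep the text after </w> only if it is a string with non-space content.
    if isinstance(following, NavigableString) and following.strip():
        tokens.append(("punct", following.strip()))

for kind, value in tokens:
    print(kind, value)
```

This only sees text that directly follows a w tag; punctuation before the first w tag of a sentence would need w.previous_sibling or a walk over the parent's children.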
