How to extract content from an HTML file in Python

I have a test report file in HTML format generated by nose. I would like to extract part of the text from it in Python. I will then send that text in the body of an email.

I have the following sample:

<!DOCTYPE html> 
<html> 
<head> 
    <title>Unit Test Report</title> 
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/> 

<style> 
body { 
    font-family: Calibri, "Trebuchet MS", sans-serif; 
} 
* { 
    word-break: break-all; 
} 
table, td, th, .dataid { 
    border: 1px solid #aaa; 
    border-collapse: collapse; 
    background: #fff; 
} 
section { 
    background: rgba(0, 0, 0, 0.05); 
    margin: 2ex; 
    padding: 1ex; 
    border: 1px solid #999; 
    border-radius: 5px; 
} 
h1 { 
    font-size: 130%; 
} 
h2 { 
    font-size: 120%; 
} 
h3 { 
    font-size: 100%; 
} 
h4 { 
    font-size: 85%; 
} 
h1, h2, h3, h4, a[href] { 
    cursor: pointer; 
    color: #0074d9; 
    text-decoration: none; 
} 
h3 strong, a.failed { 
    color: #ff4136; 
} 
.failed { 
    color: #ff4136; 
} 
a.success { 
    color: #3d9970; 
} 
pre { 
    font-family: 'Consolas', 'Deja Vu Sans Mono', 
       'Bitstream Vera Sans Mono', 'Monaco', 
       'Courier New', monospace; 
} 

.test-details, 
.traceback { 
    display: none; 
} 
section:target .test-details { 
    display: block; 
} 

</style> 
</head> 
<body> 
    <h1>Overview</h1> 
    <section> 
     <table> 
      <tr> 
       <th>Class</th> 
       <th class="failed">Fail</th> 
       <th class="failed">Error</th> 
       <th>Skip</th> 
       <th>Success</th> 
       <th>Total</th> 
      </tr> 
       <tr> 
        <td>Regression_TestCase.RegressionProject_TestCase2.RegressionProject_TestCase2</td> 
        <td class="failed">1</td> 
        <td class="failed">9</td> 
        <td>0</td> 
        <td>219</td> 
        <td>229</td> 
       </tr> 
      <tr> 
       <td><strong>Total</strong></td> 
       <td class="failed">1</td> 
       <td class="failed">9</td> 
       <td>0</td> 
       <td>219</td> 
       <td>229</td> 
      </tr> 
     </table> 
    </section> 
    <h1>Failure details</h1> 
      <section> 
       <h2>Regression_TestCase.RegressionProject_TestCase2.RegressionProject_TestCase2 (1 failures, 9 errors)</h2> 
       <div> 
         <section id="Regression_TestCase.RegressionProject_TestCase2.RegressionProject_TestCase2:test_00010_import_user_invalid_credentials"> 
          <h3>test_00010_import_user_invalid_credentials: <strong>selenium.common.exceptions.NoSuchElementException</strong></h3> 
          <div class="test-details"> 
           <h4>Traceback</h4> 
           <pre class="traceback">Traceback (most recent call last): 
    File "C:\Python27\lib\unittest\case.py", line 329, in run 
    testMethod() 
    File "C:\test_runners\selenium_regression_test_5_1_1\ClearCore - Regression Test\Regression_TestCase\RegressionProject_TestCase2.py", line 221, in test_00010_import_user_invalid_credentials 
    Globals.login_password_invalid) 
    File "C:\test_runners\selenium_regression_test_5_1_1\ClearCore - Regression Test\Pages\security.py", line 51, in enter_invalid_userid_and_password 
    self.enter_user_id(userid) 
    File "C:\test_runners\selenium_regression_test_5_1_1\ClearCore - Regression Test\Pages\security.py", line 32, in enter_user_id 
    user_id_element = self.get_element(*MainPageLocators.security_user_id_textfield_xpath) 
    File "C:\test_runners\selenium_regression_test_5_1_1\ClearCore - Regression Test\Pages\base.py", line 40, in get_element 
    element = self.driver.find_element(by=how, value=what) 
    File "C:\Python27\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 712, in find_element 
    {'using': by, 'value': value})['value'] 
    File "C:\Python27\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 201, in execute 
    self.error_handler.check_response(response) 
    File "C:\Python27\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 194, in check_response 
    raise exception_class(message, screen, stacktrace) 
NoSuchElementException: Message: Message: Unable to find element with xpath == //span[@class="gwt-InlineLabel marginbelow myinlineblock" and contains(text(), "User ID (including domain)")]/following-sibling::input 

-------------------- >> begin captured stdout << --------------------- 
*** Test import_invalid_user_credentials *** 
05_12_1616_49_42 
//span[@class="gwt-InlineLabel marginbelow myinlineblock" and contains(text(), "User ID (including domain)")]/following-sibling::input 
Element not found 
Message: Unable to find element with xpath == //span[@class="gwt-InlineLabel marginbelow myinlineblock" and contains(text(), "User ID (including domain)")]/following-sibling::input 

05_12_1616_51_54 

--------------------- >> end captured stdout << ---------------------- 
---- 
# There is more html below. I have not included everything. It will be too long otherwise.

If I open the file in a browser, it looks like the following. This is the text I would like to extract from the HTML file.

Class                Fail  Error  Skip  Success  Total
Regression_TestCase  1     9      0     219      229

How can I do this? It would be nice to keep it in a table format. Thanks, Riaz

Source

2016-05-13 Riaz Ladhani

Have you tried anything with an XML parsing library? (e.g. https://docs.python.org/2.7/library/xml.etree.elementtree.html#module-xml.etree.ElementTree) – zezollo

I am looking into Beautiful Soup: http://stackoverflow.com/questions/16835449/python-beautifulsoup-extract-text-between-element –

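What format do you want the output in? Do you want it to look like a table in Excel (e.g. CSV), or do you want a text file with those rows and columns and the spacing? – kaisquared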

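Your sample HTML code contains unclosed tags, and closing tags without opening tags. I assume you are only showing an excerpt, and that you will be extracting from a well-formed file like the following: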

<body> 
    <h1>Overview</h1> 
    <section> 
     <table> 
      <tr> 
       <th>Class</th> 
       <th class="failed">Fail</th> 
       <th class="failed">Error</th> 
       <th>Skip</th> 
       <th>Success</th> 
       <th>Total</th> 
      </tr> 
       <tr> 
        <td>Regression_TestCase</td> 
        <td class="failed">1</td> 
        <td class="failed">9</td> 
        <td>0</td> 
        <td>219</td> 
        <td>229</td> 
       </tr> 
      <tr> 
       <td><strong>Total</strong></td> 
       <td class="failed">1</td> 
       <td class="failed">9</td> 
       <td>0</td> 
       <td>219</td> 
       <td>229</td> 
      </tr> 
     </table> 
    </section> 
</body>

You can use the ElementTree module to parse the code as XML. Edit: changed the method to use XPath to find the table, and made it skip the "Total" row.

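Edit 2: I now use regular expressions to extract all tables in the code. Use this with care, because it is a very fragile solution. If there is an opening table tag without a closing table tag, it will extract all the text after the opening tag and then crash, because the resulting string will not be well-formed XML.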

import csv
import re
import xml.etree.ElementTree as ET

# Extract well-formed tables line by line. Fragile: assumes a bare <table>
# tag and that every <table> has a matching </table>.
start = re.compile(r"<table>", re.IGNORECASE)
end = re.compile(r"</table>", re.IGNORECASE)
html_code = ""
in_table = False
with open('sample2.xml') as xmlfile:
    for line in xmlfile:
        if not in_table:
            if start.search(line):
                in_table = True
                html_code += line
        else:
            match = end.search(line)
            if match:
                # Keep everything up to and including the closing tag
                html_code += line[:match.end()]
                in_table = False
            else:
                html_code += line
print(html_code)

# Parse the extracted markup into an Element. The wrapper element keeps the
# string well-formed when more than one table was extracted.
root = ET.fromstring("<root>" + html_code + "</root>")
elements = root.findall(".//tr")
# Python 3: open in text mode with newline='' (Python 2 used mode 'wb')
with open('output.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=',', quotechar='"')
    for tablerow in elements:
        # Only write the row if its first cell has direct text; this skips
        # the "Total" row, whose first cell contains <strong> instead
        if tablerow[0].text:
            row = [col.text for col in tablerow]
            csvwriter.writerow(row)
            print(row)

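If you open "output.csv" in Excel, you will have your table. If you use this approach, please note the security warning in the documentation (the link in zezollo's comment).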
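Since the question asks for a table that can go into an email body, the CSV rows could also be rendered as aligned plain text. The following is only a minimal sketch: `format_table` is a hypothetical helper, not part of the answer above, and the in-memory `csv_text` stands in for the contents of "output.csv" using values from the sample report.

```python
import csv
import io


def format_table(rows):
    """Return rows as left-aligned, space-padded text columns."""
    # Each column is as wide as its longest cell
    widths = [max(len(cell) for cell in column) for column in zip(*rows)]
    lines = []
    for row in rows:
        padded = [cell.ljust(width) for cell, width in zip(row, widths)]
        # Two spaces between columns; drop trailing padding on the last one
        lines.append("  ".join(padded).rstrip())
    return "\n".join(lines)


# Stand-in for the contents of output.csv (assumption: same header and row)
csv_text = (
    "Class,Fail,Error,Skip,Success,Total\n"
    "Regression_TestCase,1,9,0,219,229\n"
)
rows = list(csv.reader(io.StringIO(csv_text)))
table_text = format_table(rows)
print(table_text)
```

The columns only line up when the email is displayed in a fixed-width font, so this suits plain-text mail rather than HTML mail.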

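Alternatively, you could use regular expressions alone, but I am too tired to write another solution. Maybe tomorrow, or someone else may provide an alternative.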
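For illustration only, a regex-only variant might look like the sketch below. It is not the answerer's code: `extract_rows` and the trimmed `sample_html` are hypothetical, and the patterns assume simple, non-nested tables like the one in the report. Regular expressions remain a fragile way to read HTML.

```python
import re

# Trimmed stand-in for the report's overview table (assumption)
sample_html = """
<table>
 <tr><th>Class</th><th class="failed">Fail</th><th>Total</th></tr>
 <tr><td>Regression_TestCase</td><td class="failed">1</td><td>229</td></tr>
 <tr><td><strong>Total</strong></td><td class="failed">1</td><td>229</td></tr>
</table>
"""

# Non-greedy matches so each pattern stops at the first closing tag
ROW = re.compile(r"<tr\b[^>]*>(.*?)</tr>", re.IGNORECASE | re.DOTALL)
CELL = re.compile(r"<t[dh]\b[^>]*>(.*?)</t[dh]>", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def extract_rows(markup):
    """Return a list of rows, each a list of cell texts with tags removed."""
    rows = []
    for row_html in ROW.findall(markup):
        cells = [TAG.sub("", cell).strip() for cell in CELL.findall(row_html)]
        rows.append(cells)
    return rows


regex_rows = extract_rows(sample_html)
for row in regex_rows:
    print(row)
```

Unlike the ElementTree version, this keeps the "Total" row, because the inner `<strong>` tag is stripped rather than used as a filter.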

Source

2016-05-13 14:46:21 kaisquared

When I parse my HTML file I get the error xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 125, column 47 –

The code that parses the HTML file is: tree = ET.parse(r"E:\SeleniumTestReport.html") –

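Do all the opening tags in your HTML code have closing tags (and vice versa)? As for your second comment, my code assumes that the HTML file is in the same directory as the Python script. If you have put your HTML file in a different directory, then of course the path you use to parse it needs to be different. – kaisquared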
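The ParseError discussed in these comments is what an XML parser does with markup that is not well-formed. As a hedged sketch of one way around it, the standard library's `html.parser.HTMLParser` tolerates unclosed tags and stray `<` characters. `TableRowParser` and `broken_html` below are hypothetical names, and the markup is a deliberately truncated stand-in for the real report; whether this is what caused the error at line 125 of the actual file is an assumption.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._cell is not None and self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr":
            if self._row is not None:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


# Truncated on purpose: <body>, <section> and <pre> are never closed,
# and the stdout banner contains raw "<<" characters
broken_html = """<body>
<section>
 <table>
  <tr><th>Class</th><th class="failed">Fail</th><th>Total</th></tr>
  <tr><td>Regression_TestCase</td><td class="failed">1</td><td>229</td></tr>
 </table>
 <pre class="traceback">Traceback (most recent call last):
-------------------- >> begin captured stdout << ---------------------
"""

# The XML parser rejects this input
try:
    ET.fromstring(broken_html)
    xml_parsed = True
except ET.ParseError:
    xml_parsed = False

# The HTML parser still recovers the table rows
parser = TableRowParser()
parser.feed(broken_html)
parser.close()
print(xml_parsed)
print(parser.rows)
```

The design choice is to track only rows and cells and ignore everything else, so the rest of the report (tracebacks, captured stdout) cannot break the extraction.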
