2011-09-20 19 views

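Is there a master list of tags and their meanings for MHTML files?

The XLS file I want to read and extract data from is really a Single File Web Page, as the notice below shows: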

This document is a Single File Web Page, also known as a Web Archive file. 

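I am trying to find out what all the tags mean, so that I can be sure I am parsing their data correctly with lxml.

For example, here is one of the tags: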

<th class=3Dtl colspan=3D1 rowspan=3D2 

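Although I have had success with the one file I am experimenting with, I want to figure out whether the assumptions I have made will come back to haunt me once I work on several files. So a list of these tags and their meanings would be nice.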
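
The `=3D` sequences in the tag above are not part of the markup itself: an MHTML file is a MIME multipart message, and its HTML parts are commonly stored with quoted-printable encoding, where `=3D` stands for a literal `=`. The following is a minimal sketch, assuming the file is laid out that way, of decoding the HTML parts before parsing them. The sample message, the `BOUNDARY` string, and the helper names `html_parts` and `CellCollector` are made up for illustration; only the Python standard library is used.

```python
import email
from html.parser import HTMLParser

# Hypothetical, hand-written sample standing in for a real exported file.
SAMPLE_MHTML = (
    "MIME-Version: 1.0\r\n"
    'Content-Type: multipart/related; boundary="BOUNDARY"\r\n'
    "\r\n"
    "--BOUNDARY\r\n"
    "Content-Location: file:///C:/sheet001.htm\r\n"
    "Content-Transfer-Encoding: quoted-printable\r\n"
    'Content-Type: text/html; charset="us-ascii"\r\n'
    "\r\n"
    "<table><tr><th class=3Dtl colspan=3D1 rowspan=3D2>Name</th></tr></table>\r\n"
    "--BOUNDARY--\r\n"
)


def html_parts(mhtml_text):
    """Yield each text/html part of an MHTML message as decoded text."""
    msg = email.message_from_string(mhtml_text)
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            # decode=True undoes the Content-Transfer-Encoding (here quoted-printable).
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            yield payload.decode(charset, errors="replace")


class CellCollector(HTMLParser):
    """Collect the attributes of every <th> and <td> start tag."""

    def __init__(self):
        super().__init__()
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self.cells.append((tag, dict(attrs)))


def collect_cells(mhtml_text):
    """Decode all HTML parts and return the table cell tags found in them."""
    collector = CellCollector()
    for html in html_parts(mhtml_text):
        collector.feed(html)
    return collector.cells


if __name__ == "__main__":
    for tag, attrs in collect_cells(SAMPLE_MHTML):
        print(tag, attrs)
```

Once decoded, the string is ordinary HTML, so it could equally be handed to lxml (for example `lxml.html.fromstring`) instead of the standard-library parser used in this sketch.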

Answer


If the MHTML was generated from Microsoft Word, it is probably a combination of WordprocessingML and HTML4 markup.

The top-level elements in a WordprocessingML document are:

SmartTagType element describes a Smart Tag type used in the document. 
DocumentProperties element contains Office Document Properties. 
CustomDocumentProperties element contains Custom Office Document Properties. 
schemaLibrary element defines a collection of schemas that comprise a document's schema library. 
fonts element (wordDocumentElt complexType) contains font information. 
frameset element (wordDocumentElt complexType) contains HTML Frameset definitions. 
styles element (wordDocumentElt complexType) contains style definitions. 
divs element contains HTML DIV information. 
shapeDefaults element contains drawing defaults. 
docOleData element contains supplemental data containing storages for OLE objects. 
docSuppData element contains supplemental data containing toolbar customizations, envelope data, and the Microsoft Visual Basic project. 
docPr element contains document options. 
shapeDefaults element contains the wrapper representing the shape defaults. 
bgPict element contains background picture information. 
body element contains the document body. 

However, the simplest WordprocessingML document consists of just five elements (and a single namespace). The five elements are:

wordDocument element: The root element for a WordprocessingML document. 
body element: The container for the displayable text. 
p element: A paragraph. 
r element: A contiguous set of WordprocessingML components with a consistent set of properties. 
t element: A piece of text. 
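
To make the five-element structure concrete, here is a minimal sketch of such a document and of reading its text back with Python's standard-library `xml.etree.ElementTree`. It assumes the Word 2003 XML namespace `http://schemas.microsoft.com/office/word/2003/wordml` bound to the prefix `w`; the sample text and the helper name `paragraph_texts` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for Word 2003 WordprocessingML documents.
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"

# Hypothetical minimal document: wordDocument > body > p > r > t.
MINIMAL_DOC = """<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p>
      <w:r>
        <w:t>Hello, World.</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:wordDocument>
"""


def paragraph_texts(xml_text):
    """Return the text of each paragraph, joining the t elements of its runs."""
    ns = {"w": W_NS}
    root = ET.fromstring(xml_text)
    texts = []
    for p in root.findall("./w:body/w:p", ns):
        runs = p.findall("./w:r/w:t", ns)
        texts.append("".join(t.text or "" for t in runs))
    return texts


if __name__ == "__main__":
    print(paragraph_texts(MINIMAL_DOC))
```

The same XPath-style lookups work in lxml, which accepts a `namespaces` mapping in its `findall` and `xpath` calls.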