2017-04-13 103 views
1

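Python parsing odd XML?

I have this odd XML that I am trying to parse, and after reading this, I am still having problems.

I am trying to parse the NIST CVE database, which is only available as XML. Here is a sample of it.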

<?xml version='1.0' encoding='UTF-8'?> 
<nvd xmlns:scap-core="http://scap.nist.gov/schema/scap-core/0.1" xmlns:cvss="http://scap.nist.gov/schema/cvss-v2/0.2" xmlns:vuln="http://scap.nist.gov/schema/vulnerability/0.4" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:patch="http://scap.nist.gov/schema/patch/0.1" xmlns="http://scap.nist.gov/schema/feed/vulnerability/2.0" xmlns:cpe-lang="http://cpe.mitre.org/language/2.0" nvd_xml_version="2.0" pub_date="2017-04-12T18:00:08" xsi:schemaLocation="http://scap.nist.gov/schema/patch/0.1 https://scap.nist.gov/schema/nvd/patch_0.1.xsd http://scap.nist.gov/schema/feed/vulnerability/2.0 https://scap.nist.gov/schema/nvd/nvd-cve-feed_2.0.xsd http://scap.nist.gov/schema/scap-core/0.1 https://scap.nist.gov/schema/nvd/scap-core_0.1.xsd"> 
    <entry id="CVE-2013-7450"> 
    <vuln:vulnerable-configuration id="http://nvd.nist.gov/"> 
     <cpe-lang:logical-test operator="OR" negate="false"> 
     <cpe-lang:fact-ref name="cpe:/a:pulp_project:pulp:2.2.1-1"/> 
     </cpe-lang:logical-test> 
    </vuln:vulnerable-configuration> 
    <vuln:vulnerable-software-list> 
     <vuln:product>cpe:/a:pulp_project:pulp:2.2.1-1</vuln:product> 
    </vuln:vulnerable-software-list> 
    <vuln:cve-id>CVE-2013-7450</vuln:cve-id> 
    <vuln:published-datetime>2017-04-03T11:59:00.143-04:00</vuln:published-datetime> 
    <vuln:last-modified-datetime>2017-04-11T10:01:04.323-04:00</vuln:last-modified-datetime> 
    <vuln:cvss> 
     <cvss:base_metrics> 
     <cvss:score>5.0</cvss:score> 
     <cvss:access-vector>NETWORK</cvss:access-vector> 
     <cvss:access-complexity>LOW</cvss:access-complexity> 
     <cvss:authentication>NONE</cvss:authentication> 
     <cvss:confidentiality-impact>NONE</cvss:confidentiality-impact> 
     <cvss:integrity-impact>PARTIAL</cvss:integrity-impact> 
     <cvss:availability-impact>NONE</cvss:availability-impact> 
     <cvss:source>http://nvd.nist.gov</cvss:source> 
     <cvss:generated-on-datetime>2017-04-11T09:43:13.623-04:00</cvss:generated-on-datetime> 
     </cvss:base_metrics> 
    </vuln:cvss> 
    <vuln:cwe id="CWE-295"/> 
    <vuln:references xml:lang="en" reference_type="VENDOR_ADVISORY"> 
     <vuln:source>MLIST</vuln:source> 
     <vuln:reference href="http://www.openwall.com/lists/oss-security/2016/04/18/11" xml:lang="en">[oss-security] 20160418 CVE-2013-7450: Pulp &lt; 2.3.0 distributed the same CA key to all users</vuln:reference> 
    </vuln:references> 
    <vuln:references xml:lang="en" reference_type="VENDOR_ADVISORY"> 
     <vuln:source>MLIST</vuln:source> 
     <vuln:reference href="http://www.openwall.com/lists/oss-security/2016/04/18/5" xml:lang="en">[oss-security] 20160418 Re: CVE request - Pulp &lt; 2.3.0 shipped the same authentication CA key/cert to all users</vuln:reference> 
    </vuln:references> 
    <vuln:references xml:lang="en" reference_type="VENDOR_ADVISORY"> 
     <vuln:source>MLIST</vuln:source> 
     <vuln:reference href="http://www.openwall.com/lists/oss-security/2016/05/20/1" xml:lang="en">[oss-security] 20160519 Pulp 2.8.3 Released to address multiple CVEs</vuln:reference> 
    </vuln:references> 
    <vuln:references xml:lang="en" reference_type="PATCH"> 
     <vuln:source>CONFIRM</vuln:source> 
     <vuln:reference href="https://bugzilla.redhat.com/show_bug.cgi?id=1003326" xml:lang="en">https://bugzilla.redhat.com/show_bug.cgi?id=1003326</vuln:reference> 
    </vuln:references> 
    <vuln:references xml:lang="en" reference_type="PATCH"> 
     <vuln:source>CONFIRM</vuln:source> 
     <vuln:reference href="https://bugzilla.redhat.com/show_bug.cgi?id=1328345" xml:lang="en">https://bugzilla.redhat.com/show_bug.cgi?id=1328345</vuln:reference> 
    </vuln:references> 
    <vuln:references xml:lang="en" reference_type="VENDOR_ADVISORY"> 
     <vuln:source>CONFIRM</vuln:source> 
     <vuln:reference href="https://github.com/pulp/pulp/pull/627" xml:lang="en">https://github.com/pulp/pulp/pull/627</vuln:reference> 
    </vuln:references> 
    <vuln:summary>Pulp before 2.3.0 uses the same the same certificate authority key and certificate for all installations.</vuln:summary> 
    </entry> 
</nvd>

I tried to parse it with ET, but I get some odd output...

For example, when I use this,

from xml.etree import ElementTree

with open('/tmp/nvdcve-2.0-modified 2.xml', 'rb') as f:
    tree = ElementTree.parse(f)
root = tree.getroot()
for child in root:
    print(child.tag, child.attrib)

my output looks like this...

{http://scap.nist.gov/schema/feed/vulnerability/2.0}entry {'id': 'CVE-2007-6759'} 

What makes it confusing is that if I want to iterate over it, I seem to need to do..

for child in root.iter('{http://scap.nist.gov/schema/feed/vulnerability/2.0}entry'): 

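But if I do that, I don't know what the child's children are, or anything else about them.

For example, I am trying to pull out vuln:cve-id, each individual value under cvss:base_metrics (score, access-vector, and so on), vuln:summary, and vuln:product.

Basically, I am trying to download the "XML feed" from the NIST website every hour and load it into a local MySQL database, so that I can also query it locally when I perform vulnerability assessments in my environment. Figuring out how to iterate over this XML is confusing as hell. I thought about converting it to JSON, but since there is no 1:1 XML/JSON conversion, that seems like an unnecessary extra step that could cause problems of its own.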

Answers

1

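Yes, XML with namespaces has to be handled a little differently. Here is another solution that keeps using the ElementTree API.

With this library's handling of namespaces, wherever you see vuln:summary you need to look up the vuln namespace in the attributes of the root element, and then refer to the element as {http://scap.nist.gov/schema/vulnerability/0.4}summary for it to work.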

import xml.etree.ElementTree as ET

tree = ET.parse('nvdcve-2.0-Modified.xml')
root = tree.getroot()
# default namespace is given by xmlns attribute of root element, still must be provided
for entry in root.findall('{http://scap.nist.gov/schema/feed/vulnerability/2.0}entry'):
    product_list = []
    metric_list = []
    # just use the element's id attribute
    cve_id = entry.get('id')

    # findtext returns None instead of raising when an entry has no summary
    summary = entry.findtext('{http://scap.nist.gov/schema/vulnerability/0.4}summary')

    software = entry.find('{http://scap.nist.gov/schema/vulnerability/0.4}vulnerable-software-list')
    if software is not None:
        for sw in software.findall('{http://scap.nist.gov/schema/vulnerability/0.4}product'):
            product_list.append(sw.text)

    metrics = entry.find('{http://scap.nist.gov/schema/vulnerability/0.4}cvss')
    if metrics is not None:
        for metric in metrics.find('{http://scap.nist.gov/schema/cvss-v2/0.2}base_metrics').findall('*'):
            # we don't know the element name, but can get it with the tag property
            metric_list.append(metric.tag.replace('{http://scap.nist.gov/schema/cvss-v2/0.2}', '') + ': ' + metric.text)

    print(cve_id, summary, product_list, metric_list)
    # save to database!
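
If spelling out the full {uri}tag form everywhere gets tiresome, ElementTree's find, findall and findtext also accept a dictionary that maps prefixes to namespace URIs. The following is only a sketch of that variant, not part of the code above: the NS dictionary, the 'feed' prefix and the inline SAMPLE string (a cut-down copy of the entry from the question) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Prefixes are local to this script; only the URIs have to match the feed.
# 'feed' is an arbitrary name chosen here for the document's default namespace.
NS = {
    'feed': 'http://scap.nist.gov/schema/feed/vulnerability/2.0',
    'vuln': 'http://scap.nist.gov/schema/vulnerability/0.4',
}

# Cut-down sample standing in for the real feed file
SAMPLE = """<nvd xmlns="http://scap.nist.gov/schema/feed/vulnerability/2.0"
     xmlns:vuln="http://scap.nist.gov/schema/vulnerability/0.4">
  <entry id="CVE-2013-7450">
    <vuln:vulnerable-software-list>
      <vuln:product>cpe:/a:pulp_project:pulp:2.2.1-1</vuln:product>
    </vuln:vulnerable-software-list>
    <vuln:summary>Pulp before 2.3.0 uses the same the same certificate authority key and certificate for all installations.</vuln:summary>
  </entry>
</nvd>"""

root = ET.fromstring(SAMPLE)
results = []
for entry in root.findall('feed:entry', NS):
    # default='' avoids a None when an entry carries no summary
    summary = entry.findtext('vuln:summary', default='', namespaces=NS)
    products = [p.text for p in
                entry.findall('vuln:vulnerable-software-list/vuln:product', NS)]
    results.append((entry.get('id'), summary, products))

print(results)
```

The prefixes in the dictionary do not have to match the ones declared in the document, which is why the unprefixed entry element can be reached through the made-up 'feed' prefix.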

Great, thank you. I am not familiar with namespaces; this is my first time using XML, and it is super confusing. Normally I only use JSON. – Mallachar

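One last question, if I may: how would I go about getting cvss:score specifically? I know I can do metric_list[0], but what if, instead of pulling all of the base metrics, I only want to pull that one? Would I write another nested for loop? – Mallachar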

Just look at the existing code, but replace findall('*') with the specific element you are looking for. – miken32
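
As a rough illustration of that suggestion, a single path can go straight to the score without a nested loop. This is a sketch under assumptions: the VULN and CVSS constants and the inline SAMPLE string are introduced here for brevity, and only the namespace URIs come from the feed itself.

```python
import xml.etree.ElementTree as ET

# Namespace URIs from the feed, wrapped in braces for ElementTree
VULN = '{http://scap.nist.gov/schema/vulnerability/0.4}'
CVSS = '{http://scap.nist.gov/schema/cvss-v2/0.2}'

# Cut-down sample standing in for the real feed file
SAMPLE = """<nvd xmlns="http://scap.nist.gov/schema/feed/vulnerability/2.0"
     xmlns:vuln="http://scap.nist.gov/schema/vulnerability/0.4"
     xmlns:cvss="http://scap.nist.gov/schema/cvss-v2/0.2">
  <entry id="CVE-2013-7450">
    <vuln:cvss>
      <cvss:base_metrics>
        <cvss:score>5.0</cvss:score>
        <cvss:access-vector>NETWORK</cvss:access-vector>
      </cvss:base_metrics>
    </vuln:cvss>
  </entry>
</nvd>"""

root = ET.fromstring(SAMPLE)
entry = root.find('{http://scap.nist.gov/schema/feed/vulnerability/2.0}entry')

# One path from the entry down to the score element;
# findtext returns None if any step along the path is missing
score_text = entry.findtext(VULN + 'cvss/' + CVSS + 'base_metrics/' + CVSS + 'score')
score = float(score_text) if score_text is not None else None

print(score)
```

Checking for None matters if some entries in the feed turn out to carry no cvss block at all.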

2

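This is a namespaced XML document. Therefore, you need to address the nodes using their respective namespaces.

The namespaces used in the document are defined at the top of the document, where each one is mapped to a so-called namespace prefix: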

xmlns="http://scap.nist.gov/schema/feed/vulnerability/2.0" 
xmlns:cvss="http://scap.nist.gov/schema/cvss-v2/0.2" 
xmlns:vuln="http://scap.nist.gov/schema/vulnerability/0.4" 
... 

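So the prefix vuln is mapped to "http://scap.nist.gov/schema/vulnerability/0.4", for example.

The one without a prefix is called the default namespace. It applies to all nodes that do not use an explicit namespace prefix (such as the root node nvd and the entry nodes).

So to address these elements, you need to use either the fully qualified namespace or an appropriate namespace prefix (in your code, you can map prefixes differently from how they are mapped in the document being parsed).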
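
To make the two options concrete, here is a small sketch using the standard library's ElementTree rather than lxml, so it runs without extra packages. The prefixes 'f' and 'v', the my_prefixes dictionary and the inline SAMPLE string are made up for the illustration; only the URIs are taken from the document.

```python
import xml.etree.ElementTree as ET

# Cut-down sample standing in for the real feed file
SAMPLE = """<nvd xmlns="http://scap.nist.gov/schema/feed/vulnerability/2.0"
     xmlns:vuln="http://scap.nist.gov/schema/vulnerability/0.4">
  <entry id="CVE-2013-7450">
    <vuln:cve-id>CVE-2013-7450</vuln:cve-id>
  </entry>
</nvd>"""

root = ET.fromstring(SAMPLE)

# Option 1: fully qualified names, no prefix mapping needed
qualified = root.findtext(
    '{http://scap.nist.gov/schema/feed/vulnerability/2.0}entry/'
    '{http://scap.nist.gov/schema/vulnerability/0.4}cve-id')

# Option 2: our own prefixes ('f' and 'v' are arbitrary), mapped to the same URIs
my_prefixes = {
    'f': 'http://scap.nist.gov/schema/feed/vulnerability/2.0',
    'v': 'http://scap.nist.gov/schema/vulnerability/0.4',
}
prefixed = root.findtext('f:entry/v:cve-id', namespaces=my_prefixes)

# A bare name finds nothing, because entry lives in the default namespace
unqualified = root.find('entry')

print(qualified, prefixed, unqualified)
```

Both lookups resolve to the same element because matching is done on the namespace URI; the prefix is only shorthand.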

Here is an example of doing so, using lxml (and XPath expressions):

from lxml import etree

# Prefix -> namespace URI; 'n' stands in for the document's default namespace
NSMAP = {
    'n': 'http://scap.nist.gov/schema/feed/vulnerability/2.0',
    'cpe-lang': 'http://cpe.mitre.org/language/2.0',
    'cvss': 'http://scap.nist.gov/schema/cvss-v2/0.2',
    'patch': 'http://scap.nist.gov/schema/patch/0.1',
    'scap-core': 'http://scap.nist.gov/schema/scap-core/0.1',
    'vuln': 'http://scap.nist.gov/schema/vulnerability/0.4',
    'xsi': 'http://www.w3.org/2001/XMLSchema-instance',
}


def normalized_tag(node):
    # Drop the '{namespace-uri}' part and keep only the local tag name
    return etree.QName(node).localname


root = etree.parse('nvdcve.xml').getroot()


entries = root.xpath('//n:nvd/n:entry', namespaces=NSMAP)
for entry in entries:
    print("Entry: %r" % entry.attrib['id'])

    # CVE ID
    cve_id = entry.xpath('./vuln:cve-id/text()', namespaces=NSMAP)[0]
    print(" CVE ID: %r" % cve_id)

    # Base Metrics
    metrics = entry.xpath('./vuln:cvss/cvss:base_metrics/*', namespaces=NSMAP)
    print(" Base Metrics:")
    for metric in metrics:
        metric_name = normalized_tag(metric)
        metric_value = metric.text
        print(" %s: %s" % (metric_name, metric_value))

    # Summary
    summary = entry.xpath('./vuln:summary/text()', namespaces=NSMAP)[0]
    print(" Summary: %s" % summary)

    # Products
    products = entry.xpath('./vuln:vulnerable-software-list/vuln:product',
                           namespaces=NSMAP)
    for product in products:
        print(" Product: %s" % product.text)

Output:

Entry: 'CVE-2013-7450' 
    CVE ID: 'CVE-2013-7450' 
    Base Metrics: 
    score: 5.0 
    access-vector: NETWORK 
    access-complexity: LOW 
    authentication: NONE 
    confidentiality-impact: NONE 
    integrity-impact: PARTIAL 
    availability-impact: NONE 
    source: http://nvd.nist.gov 
    generated-on-datetime: 2017-04-11T09:43:13.623-04:00 
    Summary: Pulp before 2.3.0 uses the same the same certificate authority key and certificate for all installations. 
    Product: cpe:/a:pulp_project:pulp:2.2.1-1 

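For more information on XML namespaces, see the Namespaces section in the lxml tutorial and the Wikipedia article on XML Namespaces.

For more information on XPath syntax, see for example the XPath Syntax page in the W3Schools XPath Tutorial.

To get started with XPath, it can also be very helpful to fiddle around with your document in one of the many XPath testers. In addition, the Firebug plugin for Firefox or the Google Chrome inspector lets you show the (or rather, one of many) XPath for a selected element.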

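Ah, good to know, thank you. I was not aware of namespaces and had never worked with anything like this. Trying to follow the ET tutorial was confusing by comparison. Thanks! – Mallachar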