XPath returns the wrong number of results

I'm trying to extract data from this example HTML:

<li itemprop="itemListElement"> 
    <h4> 
     <a href="/one" title="page one">one</a> 
    </h4> 
</li> 

<li itemprop="itemListElement"> 
    <h4> 
     <a href="/two" title="page two">two</a> 
    </h4> 
</li> 

<li itemprop="itemListElement"> 
    <h4> 
     <a href="/three" title="page three">three</a> 
    </h4> 
</li> 

<li itemprop="itemListElement"> 
    <h4> 
     <a href="/four" title="page four">four</a> 
    </h4> 
</li>

I'm using Python 3 with urllib and lxml. For some reason, the code below doesn't work as expected (please read the comments):

import urllib.request

from lxml import html

scan = []

example_url = "path/to/html" 
page = html.fromstring(urllib.request.urlopen(example_url).read()) 

# Extracting the li elements from the html 
for item in page.xpath("//li[@itemprop='itemListElement']"): 
    scan.append(item) 

# At this point, the list 'scan' length is 4 (Nothing wrong) 

for list_item in scan: 
    # This is supposed to print '1' since there's only one match 
    # Yet, this actually prints '4' (This is wrong) 
    print(len(list_item.xpath("//h4/a")))

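So, as you can see, the first step extracts the four li elements and appends them to a list, and the second step scans each li element for a elements. The problem is that every li element in scan appears to contain all four a elements.

...or so I thought.

After some quick debugging, I found that the scan list does contain the correct four li elements, so I came to one possible conclusion: something is wrong with the for loop above.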

for list_item in scan: 
    # This is supposed to print '1' since there's only one match 
    # Yet, this actually prints '4' (This is wrong) 
    print(len(list_item.xpath("//h4/a"))) 

    # Something is wrong here...

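The only real problem is that I can't pinpoint the bug. What is causing this?

PS: I know there's an easier way to get the a elements from the list, but this is only example HTML; the real page contains a lot more... stuff.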

Source

2017-02-13 Eekan

print(len(list_item.xpath(".//h4/a")))

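// is short for /descendant-or-self::node()/. Because the expression starts with /, the search begins at the root node of the document.

Use . to point at the current context node, which is list_item, rather than the whole document.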
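A minimal, self-contained sketch of the difference, assuming lxml is installed. The markup is an inline copy of the question's HTML; the html/body/ul wrapper and the variable names are assumptions added for illustration, so that nothing has to be fetched from a URL:

```python
from lxml import html

# Inline copy of the question's markup. The <html>/<body>/<ul> wrapper is an
# assumption added so the snippet runs without fetching anything.
SAMPLE = """
<html><body><ul>
  <li itemprop="itemListElement"><h4><a href="/one" title="page one">one</a></h4></li>
  <li itemprop="itemListElement"><h4><a href="/two" title="page two">two</a></h4></li>
  <li itemprop="itemListElement"><h4><a href="/three" title="page three">three</a></h4></li>
  <li itemprop="itemListElement"><h4><a href="/four" title="page four">four</a></h4></li>
</ul></body></html>
"""

page = html.fromstring(SAMPLE)
items = page.xpath("//li[@itemprop='itemListElement']")

# "//h4/a" is absolute: it is evaluated from the document root for every li.
absolute_counts = [len(li.xpath("//h4/a")) for li in items]

# ".//h4/a" is relative: it only looks below the current li.
relative_counts = [len(li.xpath(".//h4/a")) for li in items]

print(absolute_counts)  # [4, 4, 4, 4]
print(relative_counts)  # [1, 1, 1, 1]
```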

Source

2017-02-13 16:59:10

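In your example, the XPath starts with //, so it searches from the root of the document (which is why it matches all four anchor elements). If you want the search to be relative to the li element, omit the leading slashes: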

for item in page.xpath("//li[@itemprop='itemListElement']"): 
    scan.append(item) 

for list_item in scan: 
    print(len(list_item.xpath("h4/a")))

Of course, you can also replace // with .// so that the search is relative as well:

for item in page.xpath("//li[@itemprop='itemListElement']"): 
    scan.append(item) 

for list_item in scan: 
    print(len(list_item.xpath(".//h4/a")))
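The two variants are not interchangeable in general: h4/a only matches an h4 that is a direct child of the li, while .//h4/a matches at any depth below it. A small sketch, assuming lxml is available; the nested markup is a made-up variation of the question's HTML, not the real page:

```python
from lxml import html

# Hypothetical markup: the second li wraps its h4 in an extra <div>.
SAMPLE = """
<html><body><ul>
  <li itemprop="itemListElement"><h4><a href="/one">one</a></h4></li>
  <li itemprop="itemListElement"><div><h4><a href="/two">two</a></h4></div></li>
</ul></body></html>
"""

page = html.fromstring(SAMPLE)
items = page.xpath("//li[@itemprop='itemListElement']")

# "h4/a" means child::h4/child::a, so the h4 must sit directly under the li.
child_counts = [len(li.xpath("h4/a")) for li in items]

# ".//h4/a" searches all descendants of the li.
descendant_counts = [len(li.xpath(".//h4/a")) for li in items]

print(child_counts)       # [1, 0]
print(descendant_counts)  # [1, 1]
```

If the real page nests its headings inside extra wrappers, .//h4/a is probably the safer of the two.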

Here is the relevant quote from the specification:

2.5 Abbreviated Syntax

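// is short for /descendant-or-self::node()/. For example, //para is short for /descendant-or-self::node()/child::para and so will select any para element in the document (even a para element that is a document element will be selected by //para since the document element node is a child of the root node); div//para is short for div/descendant-or-self::node()/child::para and so will select all para descendants of div children.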
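The expansion described in the quote can be checked directly. The following is only a sketch, assuming lxml; the markup uses p in place of the spec's para, and every name in it is invented for illustration:

```python
from lxml import html

# Hypothetical markup: two <p> elements under a <div> (one nested deeper)
# and one <p> outside it.
SAMPLE = """
<html><body>
  <div><p>direct child</p><blockquote><p>nested</p></blockquote></div>
  <p>outside the div</p>
</body></html>
"""

page = html.fromstring(SAMPLE)
body = page.find("body")

# "//p" and its long form select the same nodes, from the document root.
all_short = [p.text for p in page.xpath("//p")]
all_long = [p.text for p in page.xpath("/descendant-or-self::node()/child::p")]

# "div//p" and its long form are relative to the context node (here: body).
div_short = [p.text for p in body.xpath("div//p")]
div_long = [p.text for p in body.xpath("div/descendant-or-self::node()/child::p")]

print(all_short == all_long, all_short)
print(div_short == div_long, div_short)
```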

Source

2017-02-13 17:00:15

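'.//' solved the problem, thanks for the answer. But why? First we load a page and get its HTML, then we extract the 'li' tags and put each one into a list. Why would using '//' make any difference? In the second 'for' loop we iterate over each 'li' tag, so there should be only one 'h4' and therefore only one 'a' tag. Edit: could it be that, even after extracting the 'li' tags, we still have the whole HTML? That might be the real culprit. – Eekan

@Eekan - Correct: even after the 'li' tags are extracted, the XPath query can still access the whole HTML. In your example, 'list_item' is a reference to the 'li' element. I believe the reason is that XPath lets you traverse the tree and select a parent element. That means 'li' has to be a reference, so that the other elements in the tree remain available for more complex queries. –

Thanks, mate. I think I understand XPath better now. – Eekan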
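A quick way to see that the extracted li is still attached to the full tree (a sketch assuming lxml; the two-item markup and the variable names are made up for illustration):

```python
from lxml import html

# Hypothetical two-item version of the question's markup.
SAMPLE = """
<html><body><ul>
  <li itemprop="itemListElement"><h4><a href="/one">one</a></h4></li>
  <li itemprop="itemListElement"><h4><a href="/two">two</a></h4></li>
</ul></body></html>
"""

page = html.fromstring(SAMPLE)
scan = list(page.xpath("//li[@itemprop='itemListElement']"))
first = scan[0]

# Putting an element in a Python list neither copies nor detaches it.
parent_tag = first.getparent().tag
root_tag = first.getroottree().getroot().tag

# Because the tree is intact, the li can still reach its parent and siblings.
xpath_parent_tag = first.xpath("..")[0].tag
sibling_links = first.xpath("following-sibling::li/h4/a/@href")

print(parent_tag, root_tag)  # ul html
print(xpath_parent_tag)      # ul
print(sibling_links)         # ['/two']
```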
