2016-08-12 77 views
1

我再次遇到一個非常特殊的問題。我對VBA非常陌生,尤其是HTML,所以請耐心等待。我已經在VBA中構建了一個功能正常的網頁抓取工具,但是我想完成一些我無法弄清楚的具體任務。如何從VBA中的特定父元素HTML元素的子元素中獲取數據

Here is the HTML sample that my question refers to.

我已經更換了所有不與橢圓重要的東西。我想刮掉的重要部分是「a」標籤,「數據簡寫」(或只是innerText)。這是一個網站名稱。有多達五個,但並不總是五個。這也只是列出最多五個網站的兩個部分之一。發佈的部分有<div class="referralsSites referring">的子女,另一個有子女<div class="referralsSites destination">

我想分配給「Up」變量的第一個網站分配給「Up1」,第二個分配給「Up2」等,但僅取決於有多少個網站在「轉介」部分。我想在「目標」部分執行相同的操作,但是根據目標站點的數量,分配給「下變量」(Down1,Down2等)。

例如,如果我只是使用getElementsByClassName("websitePage-listItemLink js-tooltipTarget"),我將無法區分推介網站和目標網站。

這是到目前爲止我的代碼:

Sub GetSimilarWebData() 
    Dim appIE As InternetExplorer 
    Dim HTML As HTMLDocument 
    Dim ieWindow As SHDocVw.InternetExplorer 
    Dim URL As String 
    Dim Rankings As IHTMLElementCollection, Traffic As IHTMLElementCollection, ReferSites As IHTMLElementCollection, DestSites As IHTMLElementCollection, _ 
     rSite As IHTMLElement, rSiteNo As Long, dSite As IHTMLElement, dSiteNo As Long, GlobalRank As String, CountryName As String, CountryRank As String, _ 
     Visits As String, Direct As String, Refer As String, Search As String, Social As String, Display As String, _ 
     Up1 As String, Up2 As String, Up3 As String, Up4 As String, Up5 As String, _ 
     D1 As String, D2 As String, D3 As String, D4 As String, D5 As String 
    Dim FraudLast As Long 
    CheckLast = Worksheets("Sheet1").Range("I1").End(xlDown).Offset(1).Row 

    webStr = Worksheets("Sheet1").Range("A" & CheckLast).Value 

    Set appIE = New InternetExplorer 
     appIE.Visible = False 
     appIE.navigate "https://www.similarweb.com/website/" & webStr 

    Do While appIE.readyState <> READYSTATE_COMPLETE 
     Application.StatusBar = "Connecting to SimilarWeb..." 
     DoEvents 
    Loop 

    Set HTML = appIE.document 

    Set appIE = Nothing 
     Application.StatusBar = "" 

    Set Rankings = HTML.getElementsByClassName("rankingItem-value") 
     GlobalRank = Rankings(0).innerText 
      If GlobalRank = "N/A" Then 
       GlobalRank = "null" 
       CountryName = "null" 
       CountryRank = "null" 
      Else 
       CountryName = HTML.getElementsByClassName("rankingItem-subTitle")(1).innerText 
       CountryRank = Rankings(1).innerText 
      End If 

    Visits = HTML.getElementsByClassName("engagementInfo-value engagementInfo-value--large u-text-ellipsis")(0).innerText 
     If InStr(Visits, "M") <> 0 Then 
      Visits = Replace(Visits, ".", "") 
      Visits = Replace(Visits, "M", "00000") 
     ElseIf InStr(Visits, "K") <> 0 Then 
      Visits = Replace(Visits, ".", "") 
      Visits = Replace(Visits, "K", "00") 
     ElseIf InStr(Visits, "B") <> 0 Then 
      Visits = Replace(Visits, ".", "") 
      Visits = Replace(Visits, "B", "00000000") 
     End If 

    Set Traffic = HTML.getElementsByClassName("trafficSourcesChart-value") 
     Direct = Traffic(0).innerText 
     Refer = Traffic(1).innerText 
     Search = Traffic(2).innerText 
     Social = Traffic(3).innerText 
     Display = Traffic(4).innerText 

'Here's what I've started off with:  
    Set ReferSite = HTML.getElementsByClassName("referralsSites referring") 
     rSiteNo = ReferSite.Length 
    Set DestSite = HTML.getElementsByClassName("referralsSites destination") 
     dSiteNo = DestSite.Length 
     'For Each rSite In ReferSite 
End Sub 

我真的不知道如何來解決這個問題。我的代碼中的其他所有代碼都可以正常工作,但是當然如果有什麼我可以做的來提高速度也是值得歡迎的。

所有這些都是指similarweb.com上的數據。

回答

1

getElementsByClassName方法可用於IHTMLElement對象以及HTMLDocument對象。這意味着您可以通過兩個「跳躍」獲得單獨的推介鏈接和目標網站列表。

首先得到<div> s,其類別名稱爲referralsSites referringreferralsSites destinationgetElementsByClassName方法返回IHTMLElementCollection,它是IHTMLElement的集合。所以你得到集合中的第0個元素(假設只有一個<div>),然後通過的<div>再次調用getElementsByClassName方法得到<div>中的那個<div>websitePage-listItemLink的類別中的<a>

這裏是一個stackoverflow.com的例子 - 我只是在做Debug.Print的輸出,但你可能想分配一個數組,或Collection什麼的網站名稱。

Option Explicit 

Sub Test() 

    'references required: 
    'Microsoft HTML Object Library 
    'Microsoft Internet Controls 

    Dim strUrl As String 
    Dim objIe As InternetExplorer 
    Dim objHtml As HTMLDocument 
    Dim strHtml As String 
    Dim objDivs As IHTMLElementCollection 
    Dim objAnchors As IHTMLElementCollection 
    Dim intCounter As Integer 

    'set target to scrape 
    strUrl = "https://www.similarweb.com/website/stackoverflow.com" 

    'get html from page 
    Set objIe = New InternetExplorer 
    objIe.Visible = False 
    objIe.navigate strUrl 
    While objIe.readyState <> READYSTATE_COMPLETE 
     DoEvents 
    Wend 

    'assign html to DOM document 
    Set objHtml = New HTMLDocument 
    Set objHtml = objIe.document 

    'get referrals 
    Set objDivs = objHtml.getElementsByClassName("referralsSites referring") 
    If objDivs.Length > 0 Then 
     Set objAnchors = objDivs(0).getElementsByClassName("websitePage-listItemLink") 
     Debug.Print "Referrers:" 
     If objAnchors.Length > 0 Then 
      For intCounter = 0 To objAnchors.Length - 1 
       Debug.Print objAnchors(intCounter).innerText 
      Next intCounter 
     End If 
    End If 

    'get destinations 
    Set objDivs = objHtml.getElementsByClassName("referralsSites destination") 
    If objDivs.Length > 0 Then 
     Set objAnchors = objDivs(0).getElementsByClassName("websitePage-listItemLink") 
     Debug.Print "Destinations:" 
     If objAnchors.Length > 0 Then 
      For intCounter = 0 To objAnchors.Length - 1 
       Debug.Print objAnchors(intCounter).innerText 
      Next intCounter 
     End If 
    End If 

    'clean up 
    Set objHtml = Nothing 
    objIe.Quit 
    Set objIe = Nothing 

End Sub 

這給出了一個輸出:

Referrers: 
news.ycombinator.com 
qwant.com 
github.com 
remoteok.io 
serverfault.com 
Destinations: 
jsfiddle.net 
youtube.com 
github.com 
i.stack.imgur.com 
w3schools.com 
+0

謝謝!!這正是我所期待的。我沒有意識到'getElementsByClassName()'也可以訪問子元素。 – TradorDave

相關問題