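Waiting for AJAX with HtmlAgilityPack in Xamarin

2017-11-25 241 views
3

I have a problem that seems to have been asked before, but mine is slightly different. I am trying to scrape data from this website, but the content appears to be loaded via AJAX, because my application cannot find the ids and classes I am looking for in the HTML.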

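You can reproduce this by comparing Inspect Element with View Source. View Source shows far less markup than Inspect Element does.

I thought I could track down the file that loads this HTML via AJAX by pressing F12, opening the Network tab, and selecting XHR, but I could not find it.

My question is: how can I retrieve this data, or find out which file is used to fetch it?

My code sample (it cannot find Timetable_toolbar_elementSelect_popup0):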

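private async Task GetHtmlDocument(string url)
{
    var request = (HttpWebRequest)WebRequest.Create(url);
    //request.Credentials = new LoginCredentials().Credentials;

    try
    {
        // Dispose the response and its stream once the document is loaded.
        using (var myResponse = await request.GetResponseAsync())
        using (var responseStream = myResponse.GetResponseStream())
        {
            var htmlDoc = new HtmlDocument();
            htmlDoc.OptionFixNestedTags = true;
            htmlDoc.Load(responseStream);
            // This comes back null: the element is not in the HTML the server returns.
            var test = htmlDoc.GetElementbyId("Timetable_toolbar_elementSelect_popup0");
        }
    }
    catch (Exception e)
    {
        // Log the failure instead of silently swallowing it.
        Console.WriteLine(e.Message);
    }
}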
+0

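What exactly are you trying to scrape? I visited the site and did not see any Timetable_toolbar_elementSelect_popup0. – derloopkat

+0

@derloopkat Sorry. If you click "Lesrooster" and then "Klassen" in the menu, you will land on the right page. Apparently you also have to click the dropdown under "Klas" first before the container with that id appears. – user3478148

+0

I have not had a chance to look at the comments yet, Kent... I will do that when I continue with my project. – user3478148

Answers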

0

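Solution that calls the ajax method using WebRequest.

So I got bored and figured it out. What is missing below is how to identify a Klas by its id. The example below fetches the Klas '1GLD'. We need the cookies so that the request knows which school we are fetching the Klas from. Also, the code below returns only JSON, not HTML, because it is the ajax method that we are calling.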

CookieContainer cookies = new CookieContainer();
try
{
    // First request: visit the site so that the server sets the session cookies.
    string webAddr = "https://roosters.windesheim.nl/";
    var httpWebRequest = (HttpWebRequest)WebRequest.Create(webAddr);
    httpWebRequest.ContentType = "application/json; charset=utf-8";
    httpWebRequest.Method = "POST";
    httpWebRequest.CookieContainer = cookies;
    httpWebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
    httpWebRequest.Headers.Add("X-Requested-With", "XMLHttpRequest");

    // Cookies from the response are stored in the assigned container automatically.
    // The body is not needed, so the response is simply disposed.
    using (var httpResponse = (HttpWebResponse)httpWebRequest.GetResponse())
    {
    }
}
catch (WebException ex)
{
    Console.WriteLine(ex.Message);
}

// According to my web debugger the cookie lasts until the 10th of December, so a new cookie has to be fetched after that.
// I noticed the URL ends with a Unix timestamp (in milliseconds), so we append one to each request.
long unixTimeStamp = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds() - 100;

// We are now ready to call the ajax method and get the JSON.
try
{
    string webAddr = "https://roosters.windesheim.nl/WebUntis/Timetable.do?request.preventCache=" + unixTimeStamp;
    var httpWebRequest = (HttpWebRequest)WebRequest.Create(webAddr);
    httpWebRequest.ContentType = "application/x-www-form-urlencoded; charset=utf-8";
    httpWebRequest.Method = "POST";
    httpWebRequest.CookieContainer = cookies;
    httpWebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
    httpWebRequest.Headers.Add("X-Requested-With", "XMLHttpRequest");

    using (var streamWriter = new StreamWriter(httpWebRequest.GetRequestStream()))
    {
        // The request body is form-encoded; only the response is JSON.
        string formBody = "ajaxCommand=getWeeklyTimetable&elementType=1&elementId=13090&date=20171126&formatId=7&departmentId=0&filterId=-2";

        // The command below returns a JSON data structure containing every Klas and its ID.
        //string otherFormBody = "ajaxCommand=getPageConfig&type=1&filter=-2";

        streamWriter.Write(formBody);
        streamWriter.Flush();
    }

    using (var httpResponse = (HttpWebResponse)httpWebRequest.GetResponse())
    using (var streamReader = new StreamReader(httpResponse.GetResponseStream()))
    {
        var responseText = streamReader.ReadToEnd();
        // The result is printed here.
        Console.Write(responseText);
    }
}
catch (WebException ex)
{
    Console.WriteLine(ex.Message);
}

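Alternative solution using the Selenium Firefox driver.

This one is easier to write, but it takes longer to run. It gives you HTML to work with instead, just as you asked for. Not all of the sleeps are necessary, but I found them to be required in the final foreach loop.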

public static void Main(string[] args) 
{ 
    HtmlDocument doc = new HtmlDocument(); 
    // According to my web debugger the cookie lasts until the 10th of December, so a new cookie has to be fetched after that.
    // I noticed the URL ends with a Unix timestamp (in milliseconds), so we append one to the request.
    long unixTimeStamp = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds() - 100;
    string webAddr = "https://roosters.windesheim.nl/WebUntis/Timetable.do?request.preventCache="+unixTimeStamp.ToString(); 
    var ffOptions = new FirefoxOptions(); 
    ffOptions.BrowserExecutableLocation = @"C:\Program Files (x86)\Mozilla Firefox\firefox.exe"; 
    ffOptions.LogLevel = FirefoxDriverLogLevel.Default; 
    ffOptions.Profile = new FirefoxProfile { AcceptUntrustedCertificates = true }; 
    var service = FirefoxDriverService.CreateDefaultService(); 

    var driver = new FirefoxDriver(service, ffOptions, TimeSpan.FromSeconds(120)); 


    driver.Navigate().GoToUrl(webAddr); 


    driver.FindElement(By.XPath("//input[@id='school']")).SendKeys("Windesheim"+Keys.Enter); 
    Thread.Sleep(2000); 
    driver.FindElement(By.XPath("//span[@id='dijit_PopupMenuBarItem_0_text' and text() ='Lesrooster']")).Click(); 

    driver.FindElement(By.XPath("//td[@id='dijit_MenuItem_0_text' and text() ='Klassen']")).Click(); 
    Thread.Sleep(2000); 

    driver.FindElement(By.XPath("//div[@id='widget_Timetable_toolbar_elementSelect']//input[@class='dijitReset dijitInputField dijitArrowButtonInner']")).Click(); 

    // Get all the options for Klas.
    doc.LoadHtml(driver.PageSource);
    HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[@id='Timetable_toolbar_elementSelect_popup']/div[@item]");
    List<string> options = new List<string>();
    if (nodes != null) // SelectNodes returns null when nothing matches
    {
        foreach (HtmlNode n in nodes)
        {
            options.Add(n.InnerText);
        }
    }

    foreach (string s in options)
    {
        driver.FindElement(By.XPath("//input[@id='Timetable_toolbar_elementSelect']")).Clear();
        driver.FindElement(By.XPath("//input[@id='Timetable_toolbar_elementSelect']")).SendKeys(s);
        Thread.Sleep(2000);
        driver.FindElement(By.XPath("//body")).SendKeys(Keys.Enter);
        Thread.Sleep(2000);
        // doc now holds the HTML for Klas s; process it here, because the next iteration overwrites it.
        doc.LoadHtml(driver.PageSource);
        //Console.WriteLine(driver.Url); // Now we can see the id of the current Klas
    }

    // Prints only the HTML of the last Klas that was loaded.
    Console.WriteLine(doc.DocumentNode.InnerHtml);

    Console.ReadKey();
    driver.Quit(); // close the browser and stop the driver process
}
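The fixed two-second sleeps above could probably be replaced by polling for the element you are waiting on. The following is only a minimal sketch of that idea in plain C#; the `Poll` class and its `Until` method are hypothetical names, not part of Selenium or HtmlAgilityPack.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class Poll
{
    // Hypothetical helper: calls 'condition' repeatedly until it returns true
    // or 'timeout' has elapsed. Returns true if the condition was met in time.
    public static bool Until(Func<bool> condition, TimeSpan timeout, TimeSpan interval)
    {
        var clock = Stopwatch.StartNew();
        while (true)
        {
            if (condition())
            {
                return true;
            }
            if (clock.Elapsed >= timeout)
            {
                return false;
            }
            // Wait a little before checking again.
            Thread.Sleep(interval);
        }
    }
}
```

Assuming the `driver` from the code above, a call such as `Poll.Until(() => driver.FindElements(By.Id("Timetable_toolbar_elementSelect_popup")).Count > 0, TimeSpan.FromSeconds(10), TimeSpan.FromMilliseconds(250))` would return as soon as the popup exists instead of always waiting the full two seconds. Selenium's support package also ships a `WebDriverWait` class that serves the same purpose.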

Final update

Using the Selenium solution, I was able to get the IDs of all the classes. I have included the file here so that you can use it with your ajax web requests.

1

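I was going to leave this as a comment, but it is too long and would be formatted too poorly. So, here we go.

First of all, the site is updated dynamically by JavaScript that is invoked through an ajaxCommand.

If you can open a session and store the cookies containing the SESSIONID and the now "encrypted" school name, then you can call the ajax commands like this.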

https://roosters.windesheim.nl/ajaxCommand=getWeeklyTimetable&elementType=1&elementId=13090&date=20171126&formatId=7&departmentId=0&filterId=-2 

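However, this does require you to know what elementType and elementId are.

In this case, elementId identifies the Klas (here 1GLD), and formatId (7) identifies the Roosterformaat (here "Beknopt"). You will have to work out what the remaining variables do. More importantly, if you send the server a valid ajax command, you will not get HTML back in the response; you will receive the data as JSON.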
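To make the parameters easier to vary, the command string could be assembled from its parts. This is a rough sketch only: the `TimetableRequest` class and `BuildWeeklyTimetableBody` method are hypothetical names, the parameter names and default values are copied from the example request above, and the `yyyyMMdd` date format is inferred from `date=20171126`.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class TimetableRequest
{
    // Hypothetical helper: builds the form-encoded getWeeklyTimetable command.
    // Defaults mirror the example request; their meaning beyond elementId and
    // formatId is not known here.
    public static string BuildWeeklyTimetableBody(
        int elementId, DateTime date,
        int elementType = 1, int formatId = 7, int departmentId = 0, int filterId = -2)
    {
        var inv = CultureInfo.InvariantCulture;
        var fields = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("ajaxCommand", "getWeeklyTimetable"),
            new KeyValuePair<string, string>("elementType", elementType.ToString(inv)),
            new KeyValuePair<string, string>("elementId", elementId.ToString(inv)),
            // Assumed format, inferred from date=20171126.
            new KeyValuePair<string, string>("date", date.ToString("yyyyMMdd", inv)),
            new KeyValuePair<string, string>("formatId", formatId.ToString(inv)),
            new KeyValuePair<string, string>("departmentId", departmentId.ToString(inv)),
            new KeyValuePair<string, string>("filterId", filterId.ToString(inv)),
        };

        // Escape each key and value, then join the pairs with '&'.
        return string.Join("&", fields.Select(
            f => Uri.EscapeDataString(f.Key) + "=" + Uri.EscapeDataString(f.Value)));
    }
}
```

Calling `TimetableRequest.BuildWeeklyTimetableBody(13090, new DateTime(2017, 11, 26))` reproduces the command string from the example request, so only the elementId and the date need to change per Klas and week.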

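The easiest way to do what you want is to keep all the classes in a separate file and use it as a reference point. The same goes for the other options.

Then use a headless browser such as phantomjs.org or Selenium. That way you can find and click the class you want to scrape, load the HTML into an HtmlAgilityPack.HtmlDocument, and do whatever you need to do. Selenium/PhantomJS will keep track of your cookies. This approach is slower, but much easier.

Edit: storing cookies from a WebRequest, the simple way.

I am not keen on this approach, but the OP asked. If anyone has a better way, please edit.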

HtmlDocument doc = new HtmlDocument();
CookieContainer cookies = new CookieContainer();
try
{
    string webAddr = "https://roosters.windesheim.nl/WebUntis/";

    var httpWebRequest = (HttpWebRequest)WebRequest.Create(webAddr);
    httpWebRequest.ContentType = "application/json; charset=utf-8";
    httpWebRequest.Method = "POST";
    // Cookies set by the response end up in this container.
    httpWebRequest.CookieContainer = cookies;

    httpWebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
    httpWebRequest.Headers.Add("X-Requested-With", "XMLHttpRequest");
    using (var streamWriter = new StreamWriter(httpWebRequest.GetRequestStream()))
    {
        // The ajax command is a form-encoded string, not JSON.
        string formBody = "ajaxCommand=getWeeklyTimetable&elementType=1&elementId=13092&date=20171126&formatId=7&departmentId=0&filterId=-2";

        streamWriter.Write(formBody);
        streamWriter.Flush();
    }

    using (var httpResponse = (HttpWebResponse)httpWebRequest.GetResponse())
    using (var streamReader = new StreamReader(httpResponse.GetResponseStream()))
    {
        cookies.Add(httpWebRequest.CookieContainer.GetCookies(httpWebRequest.RequestUri));
        //cookies.Add(httpResponse.Cookies);
        var responseText = streamReader.ReadToEnd();
        doc.LoadHtml(responseText);
        // Print every cookie the server sent back.
        foreach (Cookie c in httpResponse.Cookies)
        {
            Console.WriteLine(c.ToString());
        }
    }
}
catch (WebException ex)
{
    Console.WriteLine(ex.Message);
}

Console.WriteLine(doc.DocumentNode.InnerHtml);

Console.ReadKey();
+0

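Regarding the last paragraph of your answer: if you use Selenium, there is no point in loading the document with HtmlAgilityPack. Selenium supports XPath, CSS, and id selectors. HtmlAgilityPack is just a library for parsing HTML; it also supports XPath, but it has no browser running in the background. – derloopkat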

+0

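Thanks. This seems a lot more complicated than I had hoped. One question about "If you can open a session and store the cookies containing the SESSIONID and the now "encrypted" school name": I do not know how to do that. Could you point me in the right direction? I will look into Selenium/PhantomJS. – user3478148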