2012-07-29 52 views
1

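Debugging htmlParse in R's XML library

This isn't the first time I've run into a problem using htmlParse from the XML library, but in the past I just gave up and used regular expressions to parse out what I needed. I would rather do this by parsing the XML/XHTML, because, as we all know, regular expressions are not a parser.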

That said, I find the error messages from the parse command unhelpful at best, and I don't know how to proceed. For example:

> htmlParse(getForm("http://www.takecarehealth.com/LocationSearchResults.aspx", location_query="Deer Park",location_distance=50)) 
Error in htmlParse(getForm("http://www.takecarehealth.com/LocationSearchResults.aspx", : 
    File 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 
<html xmlns="http://www.w3.org/1999/xhtml"> 
<head id="ctl00_Head1"> 
     <title></title> 
     <script language="JavaScript" type="text/javascript"> 
      var s_pageName = document.title; 
      var s_channel = "Take Care"; 
      var s_campaign = ""; 
      var s_eVar1 = "" 
      var s_eVar2 = "" 
      var s_eVar22 = "" 
      var s_eVar23 = "" 
     </script> 
     <meta name="keywords" content="take care clinic, walgreens clinic, walgreens take care clinic, take care health, urgent care clinic, walk in clinic" /> 
     <meta name="description" content="Information about simple, quality healthcare for the whole family from Take Care Clinics at select Walgreens, including Take Care Clinic hours, providers, offers, insurance and quality of care." /> 
     <link rel="shortcut icon" hre 

I'm glad it sees that something is there, but where do I drill down on "Error: File"?

Note that, as far as I can tell, this is well-formed XHTML. When I visit the link manually, I can run XPath queries against it and Firebug doesn't complain.

How do I debug an error like this from htmlParse?

+0

@ttmaccer Interesting. So it was a malformed-code problem after all. – 2012-07-29 23:12:00

+0

That makes sense. Thanks. – 2012-07-29 23:25:13

Answers

3

Downloading the page first and then passing the result to the XML package seems to work:

test <- getForm("http://www.takecarehealth.com/LocationSearchResults.aspx", location_query = "Deer Park", location_distance = 50)
htmlParse(test, asText = TRUE)

Or, directly:

htmlParse(getForm("http://www.takecarehealth.com/LocationSearchResults.aspx", location_query = "Deer Park", location_distance = 50), asText = TRUE)

This also seems to work fine.
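To see why asText matters without hitting the network, here is a minimal sketch. It assumes that htmlParse falls back to treating its input as a file name when the string does not begin with "<", which would explain the truncated "File ..." error in the question. The `page` string is a made-up stand-in for the getForm() response, not the real page.

```r
library(XML)

# Hypothetical stand-in for the character string returned by getForm();
# the leading newline mimics a response that does not start with "<".
page <- "\n<html><head><title>Locations</title></head><body><p>Deer Park</p></body></html>"

# Tell the parser explicitly that the input is markup, not a file name.
doc <- htmlParse(page, asText = TRUE)
title <- xpathSApply(doc, "//title", xmlValue)

# With asText = FALSE the string may be treated as a file path.
# tryCatch() captures the complete error message (if any) so it can be
# inspected in pieces instead of the truncated console output.
msg <- tryCatch(
  {
    htmlParse(page, asText = FALSE)
    NA_character_  # reached only if no error is raised
  },
  error = function(e) conditionMessage(e)
)
```

Inspecting something like substr(msg, 1, 80) and the tail of msg is one way to read an error whose middle is an entire HTML page.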

+0

So I guess it was the asText argument, then? – 2012-07-29 23:13:36