Sequentially retrieving data from web pages in R

I ran an advanced search on a website and got a set of results. For each result I want to extract two fields, "Referencia:" and "CIF".

# URL of the search results (split across lines for readability)
url <- paste0(
    "http://www.boe.es/buscar/boe.php?campo%5B1%5D=DOC&dato%5B1%5D=edicto+auto+declaracion+concurso+CIF",
    "&campo%5B6%5D=FPU&dato%5B6%5D%5B0%5D=25%2F04%2F2013&dato%5B6%5D%5B1%5D=30%2F04%2F2013",
    "&sort_field%5B0%5D=fpu&sort_order%5B0%5D=desc&sort_field%5B1%5D=ref&sort_order%5B1%5D=asc&accion=Buscar"
)

# URL of one of the results
example <- "http://www.boe.es/buscar/doc.php?id=BOE-B-2013-15895"

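The CIF field usually has the form X00000000, with X = c("A","B") and 0 = 0:9; it may also appear with a hyphen after the letter. In the example above, the Referencia field is BOE-B-2013-15895 and the CIF is B-32210196.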
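The format described above can be turned into a regular expression. The following is a minimal sketch in base R, assuming the letter is only A or B (as stated in the question) and that the hyphen is optional; `cif_pattern` and `extract_cif` are hypothetical names, not part of any package.

```r
# Pattern assumed from the description: one letter A or B,
# an optional hyphen, then exactly eight digits.
cif_pattern <- "\\b[AB]-?[0-9]{8}\\b"

# Hypothetical helper: return the first CIF-like token in each string,
# or NA when a string contains none. Input is assumed to have no NA values.
extract_cif <- function(text) {
  m <- regexpr(cif_pattern, text, perl = TRUE)
  out <- rep(NA_character_, length(text))
  out[m > 0] <- regmatches(text, m)
  out
}

extract_cif(c("con CIF B-32210196, domicilio en ...", "sin identificador"))
```

Real pages may use other letters or spacing, so the character class would need widening if the search returns such cases.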

Could you help me do this from R?

Source

2013-04-25 nopeva

Check out the XML library in R. – 2013-04-25 20:22:59

Could you also add more information, such as an example table as you want it to appear in R? – 2013-04-25 20:24:05

@Green Demon thanks for the package. The example table is pasted above and is also at the link above. It is just the box labelled "Datos generales del concurso". – nopeva 2013-04-26 06:11:21

To get the content, have a look at the httr package. You can use something like

content(GET(url))
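To go from the raw page to one piece of data, the response can be parsed and queried with XPath. This is only a sketch, assuming the httr and XML packages are installed and that the text of interest sits in elements whose class contains "parrafo"; `fetch_html` and `get_paragraphs` are hypothetical helper names, and the HTML fragment is made up for the offline demonstration.

```r
# Sketch only: requires the httr and XML packages to be installed.

# Download a page and return its HTML as a single string.
fetch_html <- function(url) {
  resp <- httr::GET(url)
  httr::stop_for_status(resp)  # fail early on HTTP errors
  httr::content(resp, as = "text", encoding = "UTF-8")
}

# Return the text of every element whose class attribute contains "parrafo".
get_paragraphs <- function(html_text) {
  doc <- XML::htmlParse(html_text, asText = TRUE, encoding = "UTF-8")
  XML::xpathSApply(doc, "//*[contains(@class, 'parrafo')]", XML::xmlValue)
}

# Offline demonstration on a made-up HTML fragment.
sample_html <- paste0(
  "<html><body>",
  "<p class='parrafo'>Deudor con CIF B-32210196.</p>",
  "<p>otro</p>",
  "</body></html>"
)
get_paragraphs(sample_html)

# Against the live site (not run here):
# get_paragraphs(fetch_html("http://www.boe.es/buscar/doc.php?id=BOE-B-2013-15895"))
```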

Source

2013-04-25 16:51:01

@Jeff Allen thanks. I ran this command and got back a lot of code. Maybe you could provide a simple example that retrieves one piece of data. – nopeva 2013-04-26 07:36:44

1) Getting the Referencia is a piece of cake:

# Return the last n characters of each string in x
substrRight <- function(x, n){
    sapply(x, function(xx)
    substr(xx, (nchar(xx)-n+1), nchar(xx)))
}

library(XML)
u <- "http://www.boe.es/buscar/boe.php?campo%5B1%5D=DOC&dato%5B1%5D=edicto+auto+declaracion+concurso+CIF%20&campo%5B6%5D=FPU&dato%5B6%5D%5B0%5D=25%2F04%2F2013&dato%5B6%5D%5B1%5D=30%2F04%2F2013%20&sort_field%5B0%5D=fpu&sort_order%5B0%5D=desc&sort_field%5B1%5D=ref&sort_order%5B1%5D=asc&accion=Buscar" # search results URL
doc1 <- htmlParse(u) # download and parse the HTML
kbbRoot <- xmlRoot(doc1) # root node of the parsed document
els <- getNodeSet(kbbRoot, "//*[contains(concat(' ', @class, ' '), concat(' ', 'resultado-busqueda-link-defecto', ' '))]") # all result links, selected by class via XPath
links <- sapply(els, function(el) xmlGetAttr(el, "href")) # relative hrefs (they start with ../)
links <- sapply(links, function(x) substr(x, start=3, stop=nchar(x))) # drop the leading ".."
links <- sapply(links, function(x) paste("http://www.boe.es", x, sep="")) # build the absolute URLs
Referencia <- sapply(links, function(x) substrRight(x, 16)) # the Referencia is the last 16 characters of each link
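Taking the last 16 characters works as long as every Referencia has the same length. As an alternative sketch that does not depend on the length, the value of the `id` query parameter can be read from the link with a regular expression; `get_referencia` is a hypothetical helper written in base R, not part of the answer above.

```r
# Hypothetical helper: pull the value of the "id" query parameter
# from each result URL, or NA when the parameter is absent.
get_referencia <- function(link) {
  # Lookbehind for "?id=" or "&id=", then everything up to the next "&" or "#".
  m <- regexpr("(?<=[?&]id=)[^&#]+", link, perl = TRUE)
  out <- rep(NA_character_, length(link))
  out[m > 0] <- regmatches(link, m)
  out
}

get_referencia("http://www.boe.es/buscar/doc.php?id=BOE-B-2013-15895")
```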

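2) The CIF is much more complicated. You have to use a regular expression, and unfortunately I am not good at them. So please ask the others on the forum: "Which regular expression should be used to extract the CIF value from a string?"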

CIFRA <- function(u){
    doc1 <- htmlParse(u) # download and parse the HTML
    kbbRoot <- xmlRoot(doc1) # root node of the parsed document
    els <- getNodeSet(kbbRoot, "//*[contains(concat('', @class,''), concat('', 'parrafo', ''))]") # nodes whose class contains 'parrafo'
    l <- sapply(els, xmlValue) # text of each paragraph
    x <- regexpr(pattern="[A-Z][0-9]+", text=l) # rough attempt to locate the CIF
    # regexpr returns the match position in each string (-1 if there is no match)
    ind <- which.max(x) # index of the paragraph assumed to hold the CIF
    st <- x[ind]-3 # start position (heuristic offset of 3 characters before the match)
    en <- st+attr(x, "match.length")[ind]-1 # end position
    res <- substring(l[ind], st, en) # text between start and end
    res # return the extracted text
}

CIF <- sapply(links, function(x) CIFRA(x))
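Since the goal is to visit the result pages one after another, the two fields can be gathered into a single table. The sketch below uses base R only and takes the page fetcher and the CIF extractor as arguments, so it makes no claim about how either should be implemented; `collect_results`, `fetch_text`, `extract_cif`, `pause`, and the fake page text are all hypothetical. It assumes `fetch_text` returns a single string and `extract_cif` returns a single value.

```r
# Hypothetical driver: visit each link in turn and collect one row per result.
# `fetch_text` takes a URL and returns the page text as one string;
# `extract_cif` takes that text and returns one CIF string.
collect_results <- function(links, fetch_text, extract_cif, pause = 1) {
  rows <- lapply(links, function(link) {
    # A failed download yields NA instead of stopping the whole loop.
    text <- tryCatch(fetch_text(link), error = function(e) NA_character_)
    Sys.sleep(pause)  # wait between requests to go easy on the server
    data.frame(
      Referencia = sub(".*[?&]id=", "", link),  # text after "id="
      CIF = if (is.na(text)) NA_character_ else extract_cif(text),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Offline demonstration with a made-up page instead of a real download.
fake_pages <- c(
  "http://www.boe.es/buscar/doc.php?id=BOE-B-2013-15895" = "Deudor con CIF B-32210196."
)
demo <- collect_results(
  names(fake_pages),
  fetch_text = function(u) fake_pages[[u]],
  extract_cif = function(txt) regmatches(txt, regexpr("[AB]-?[0-9]{8}", txt)),
  pause = 0
)
demo
```

Passing the functions in keeps the loop independent of the parsing details, so the regular expression can be refined without touching the driver.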

Source

2013-10-07 11:35:28 egonomist
