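Subsetting on a factor column parsed by readHTMLTable returns no results
I've run into a strange subsetting problem: I can subset on one column but not on another, even though both columns appear to be parsed the same way by readHTMLTable.
Code to reproduce: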
require(XML)
theurl <- "http://en.wikipedia.org/wiki/List_of_stock_exchanges"
html <- htmlParse(theurl)
seData <- readHTMLTable(html)[[2]]
names(seData) = c("Rank","EX","Economy","HQ","MarketCap","TradeValue")
seData = transform(seData,MarketCap = as.numeric(gsub(",","",MarketCap)))
seData = transform(seData,TradeValue = as.numeric(gsub(",","",TradeValue)))
I want to subset the stock exchanges in India, so I use:
> subset(seData,seData$Economy == "India")
[1] Rank EX Economy HQ MarketCap TradeValue
<0 rows> (or 0-length row.names)
> subset(seData,seData$Economy == " India")
[1] Rank EX Economy HQ MarketCap TradeValue
<0 rows> (or 0-length row.names)
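I get no rows back, even though I have confirmed that two rows should satisfy the condition. Yet I can easily do the same thing on another column, "EX":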
> subset(seData,seData$EX == "JSE Limited")
Rank EX Economy HQ MarketCap TradeValue
17 17 JSE Limited SouthAfrica Johannesburg 903 287
I've run other functions on them, and the two columns look exactly alike:
> sapply(seData,class)
Rank EX Economy HQ MarketCap TradeValue
"factor" "factor" "factor" "factor" "numeric" "numeric"
> levels(seData$Economy)
[1] " Australia" " Brazil" " Canada"
[4] " China" " Germany" " Hong Kong"
[7] " India" " Japan" " Russia"
...
> levels(seData$EX)
[1] "Australian Securities Exchange" "BME Spanish Exchanges"
[3] "BM&F Bovespa" "Bombay Stock Exchange"
[5] "Deutsche Börse" "Hong Kong Stock Exchange"
[7] "JSE Limited" "Korea Exchange"
...
What am I missing? What is wrong with the subset command I'm using? :(
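Most likely your data contains some odd characters, which is typical when parsing data from the web. For example, when I run your code, the value I see is not a plain 'India'. – 2013-02-14 06:46:32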
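One way to check for a hidden character like this is to look at the code points behind a value instead of its printed form. The sketch below is only an illustration: the `economy` vector is a made-up stand-in for `seData$Economy`, and the non-breaking space (U+00A0) is an assumption about what the hidden character might be, not something the thread confirms.

```r
# Hypothetical stand-in for seData$Economy. "\u00a0" (non-breaking space)
# is an ASSUMED hidden leading character, not confirmed by the thread.
economy <- factor(c("\u00a0India", "\u00a0Japan", "\u00a0India"))

# The value prints much like " India" but matches neither literal.
any(economy == "India")    # FALSE
any(economy == " India")   # FALSE

# Inspect the code points of the first value: the leading 160 is U+00A0,
# whereas an ordinary space would be 32.
first_value <- as.character(economy)[1]
utf8ToInt(first_value)     # 160 73 110 100 105 97
```

If the first code point turns out to be something other than 32, comparing against a string typed with a regular space will never match.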
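Also, in a `subset` call you don't need to repeat the name of the data set; that is, once your characters are sorted out, `subset(seData, Economy == "India")` should work. – Gregor 2013-02-14 06:52:18
Thanks @shujaa, that will save me some typing from now on – marty 2013-02-14 07:06:03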
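Putting the two comments together, a possible cleanup could look like the sketch below. It uses a toy data frame in place of the real `seData`, and it assumes the stray character is a non-breaking space (U+00A0); if the real data holds a different character, the pattern would need adjusting.

```r
# Toy data standing in for seData; the leading "\u00a0" is an ASSUMPTION
# about the hidden character in the parsed table.
seData <- data.frame(
  EX = c("Bombay Stock Exchange", "JSE Limited", "Hong Kong Stock Exchange"),
  Economy = factor(c("\u00a0India", "\u00a0South Africa", "\u00a0Hong Kong"))
)

# Strip leading and trailing regular or non-breaking spaces,
# then rebuild the factor so its levels are clean.
cleaned <- gsub("^[ \u00a0]+|[ \u00a0]+$", "", as.character(seData$Economy))
seData$Economy <- factor(cleaned)

# Inside subset(), columns can be referenced by bare name,
# so there is no need to write seData$Economy.
india <- subset(seData, Economy == "India")
india
```

Converting to character before cleaning matters here: editing the strings and re-creating the factor avoids leaving stale levels behind.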