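Difference between correctly/incorrectly classified instances in a decision tree and the confusion matrix in Weka

I have been using Weka's J48 decision tree to classify the frequencies of keywords in RSS feeds into target categories. I think I may have a problem reconciling the generated decision tree with the reported number of correctly classified instances and with the confusion matrix.
For example, one of my .arff files contains the following data extract: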
@attribute Keyword_1_nasa_Frequency numeric
@attribute Keyword_2_fish_Frequency numeric
@attribute Keyword_3_kill_Frequency numeric
@attribute Keyword_4_show_Frequency numeric
...
@attribute Keyword_64_fear_Frequency numeric
@attribute RSSFeedCategoryDescription {BFE,FCL,F,M, NCA, SNT,S}
@data
0,0,0,34,0,0,0,0,0,40,0,0,0,0,0,0,0,0,0,0,24,0,0,0,0,13,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,10,0,0,0,0,0,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
...
20,0,64,19,0,162,0,0,36,72,179,24,24,47,24,40,0,48,0,0,0,97,24,0,48,205,143,62,78,
0,0,216,0,36,24,24,0,0,24,0,0,0,0,140,24,0,0,0,0,72,176,0,0,144,48,0,38,0,284,
221,72,0,72,0,SNT
...
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,S
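And so on: there are 64 keywords (columns) and 570 rows in total, each row containing the frequencies of the keywords in a feed for one day. In this case there are 57 records per day over 10 days, giving 570 records in total to classify. Each keyword is prefixed with a surrogate number and suffixed with "Frequency", as in Keyword_1_nasa_Frequency.
I run the decision tree with the default parameters, using 10-fold cross-validation.
Weka reports the following: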
Correctly Classified Instances 210 36.8421 %
Incorrectly Classified Instances 360 63.1579 %
With the following confusion matrix:
=== Confusion Matrix ===
a b c d e f g <-- classified as
11 0 0 0 39 0 0 | a = BFE
0 0 0 0 60 0 0 | b = FCL
1 0 5 0 72 0 2 | c = F
0 0 1 0 69 0 0 | d = M
3 0 0 0 153 0 4 | e = NCA
0 0 0 0 90 10 0 | f = SNT
0 0 0 0 19 0 31 | g = S
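The arithmetic behind this matrix can be checked by hand or with a few lines of code. The sketch below is only an illustration, with the values copied from the matrix above and variable names chosen for this example. It sums the diagonal (correct predictions), the rows (actual class sizes) and the columns (how many instances were predicted as each class).

```python
# Confusion matrix copied from the Weka output above.
# Rows are actual classes, columns are predicted classes.
labels = ["BFE", "FCL", "F", "M", "NCA", "SNT", "S"]
matrix = [
    [11, 0, 0, 0, 39, 0, 0],
    [0, 0, 0, 0, 60, 0, 0],
    [1, 0, 5, 0, 72, 0, 2],
    [0, 0, 1, 0, 69, 0, 0],
    [3, 0, 0, 0, 153, 0, 4],
    [0, 0, 0, 0, 90, 10, 0],
    [0, 0, 0, 0, 19, 0, 31],
]

n = len(labels)
total = sum(sum(row) for row in matrix)
# Diagonal cells are the instances whose predicted class equals the actual class.
correct = sum(matrix[i][i] for i in range(n))
# Row sums: how many instances actually belong to each class.
actual = {labels[i]: sum(matrix[i]) for i in range(n)}
# Column sums: how many instances were predicted as each class.
predicted = {labels[j]: sum(row[j] for row in matrix) for j in range(n)}

print(f"correct: {correct} of {total} ({100 * correct / total:.4f} %)")
print("actual class sizes:", actual)
print("predicted class sizes:", predicted)
```

Read this way, 153 is the number of NCA instances that were predicted correctly, while the NCA column as a whole holds 502 instances, that is, everything the cross-validated models labelled NCA.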
The tree is as follows:
Keyword_22_health_Frequency <= 0
| Keyword_7_open_Frequency <= 0
| | Keyword_52_libya_Frequency <= 0
| | | Keyword_21_job_Frequency <= 0
| | | | Keyword_48_pic_Frequency <= 0
| | | | | Keyword_63_world_Frequency <= 0
| | | | | | Keyword_26_day_Frequency <= 0: NCA (461.0/343.0)
| | | | | | Keyword_26_day_Frequency > 0: BFE (8.0/3.0)
| | | | | Keyword_63_world_Frequency > 0
| | | | | | Keyword_31_gaddafi_Frequency <= 0: S (4.0/1.0)
| | | | | | Keyword_31_gaddafi_Frequency > 0: NCA (3.0)
| | | | Keyword_48_pic_Frequency > 0: F (7.0)
| | | Keyword_21_job_Frequency > 0: BFE (10.0/1.0)
| | Keyword_52_libya_Frequency > 0: NCA (31.0)
| Keyword_7_open_Frequency > 0
| | Keyword_31_gaddafi_Frequency <= 0: S (32.0/1.0)
| | Keyword_31_gaddafi_Frequency > 0: NCA (4.0)
Keyword_22_health_Frequency > 0: SNT (10.0)
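My question concerns reconciling the matrix with the tree, or vice versa. As far as I understand the results, a rating such as (461.0/343.0) indicates that 461 instances have been classified as NCA. But how can that be when the matrix shows only 153? I am not sure how to interpret this, so any help is welcome.
Thanks in advance.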
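One way to compare the leaf annotations with the reported totals is to add them up. In J48 output, a leaf label of the form (N/E) gives the number of instances that reach the leaf and, after the slash, how many of those do not belong to the predicted class; a leaf with no slash has no errors. The sketch below is a rough illustration rather than a Weka feature: the tree text is copied from the output above, and the regular expression and variable names are assumptions made for this example.

```python
import re

# Tree text copied from the J48 output above.
tree_text = """
Keyword_22_health_Frequency <= 0
| Keyword_7_open_Frequency <= 0
| | Keyword_52_libya_Frequency <= 0
| | | Keyword_21_job_Frequency <= 0
| | | | Keyword_48_pic_Frequency <= 0
| | | | | Keyword_63_world_Frequency <= 0
| | | | | | Keyword_26_day_Frequency <= 0: NCA (461.0/343.0)
| | | | | | Keyword_26_day_Frequency > 0: BFE (8.0/3.0)
| | | | | Keyword_63_world_Frequency > 0
| | | | | | Keyword_31_gaddafi_Frequency <= 0: S (4.0/1.0)
| | | | | | Keyword_31_gaddafi_Frequency > 0: NCA (3.0)
| | | | Keyword_48_pic_Frequency > 0: F (7.0)
| | | Keyword_21_job_Frequency > 0: BFE (10.0/1.0)
| | Keyword_52_libya_Frequency > 0: NCA (31.0)
| Keyword_7_open_Frequency > 0
| | Keyword_31_gaddafi_Frequency <= 0: S (32.0/1.0)
| | Keyword_31_gaddafi_Frequency > 0: NCA (4.0)
Keyword_22_health_Frequency > 0: SNT (10.0)
"""

# Matches ": LABEL (N)" or ": LABEL (N/E)" at the end of a leaf line.
leaf_pattern = re.compile(r":\s*(\w+)\s*\(([\d.]+)(?:/([\d.]+))?\)")

reached = 0.0          # instances reaching any leaf
errors = 0.0           # instances that disagree with their leaf's label
reached_by_label = {}  # instances reaching leaves of each predicted class

for label, n_text, e_text in leaf_pattern.findall(tree_text):
    n = float(n_text)
    e = float(e_text) if e_text else 0.0  # no "/E" part means zero errors
    reached += n
    errors += e
    reached_by_label[label] = reached_by_label.get(label, 0.0) + n

print("instances reaching leaves:", reached)
print("misclassified at leaves:", errors)
print("correct at leaves:", reached - errors)
print("reached, by predicted class:", reached_by_label)
```

The leaf counts cover all 570 instances with 349 errors, so they describe a tree fitted to the whole data set (221 correct), whereas the 210 correct instances in the summary come from the 10-fold cross-validation, in which each instance is predicted by a model that was not trained on it. The two sets of numbers are therefore not expected to match exactly.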
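Thank you very much for the clear explanation. I was worried about a potential problem with my data, as I have 72 .arff files to classify. I understand that the slight difference between the tree and the matrix comes down to the incorrectly classified instances, and I suppose there will always be some of those. But your last two sentences confuse me: I am using 10-fold validation, so there can only be one tree and one matrix per classification, can't there? – 2012-08-08 19:23:43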
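Yes, if you use the GUI, the output will be one tree and one matrix, both for the cross-validation test. I almost always use Weka from the command line, and in that mode it outputs a tree and a matrix for the test on the training data as well as for the cross-validation test. I will change my answer to reflect this. – stackoverflowuser2010 2012-08-08 19:31:48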
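Thanks again. In my case I generate the trees directly from Java and reformat them with GraphViz to improve their appearance. The only problem with this approach is that I don't know how to output the confusion matrix and the other details, so for now that part is done manually through the GUI. – 2012-08-08 19:34:27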