Handling categorical variables in sparklyr

I come from an R background, where categorical variables are handled behind the scenes (as factors). With sparklyr, I find string_indexer and onehotencoder very confusing.

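For example, I have some variables that are encoded as numeric in the original dataset but are actually categorical. I want to use them as categorical variables, but I am not sure I am doing it correctly.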

library(sparklyr) 
library(dplyr) 
sessionInfo() 
sc <- spark_connect(master = "local", version = spark_version) 
spark_version(sc) 
set.seed(1)  
exampleDF <- data.frame (ID = 1:10, Resp = sample(c(100:205), 10, replace = TRUE), 
        Numb = sample(1:10, 10)) 

example <- copy_to(sc, exampleDF) 
pred <- example %>% mutate(Resp = as.character(Resp)) %>% 
       sdf_mutate(Resp_cat = ft_string_indexer(Resp)) %>% 
       ml_decision_tree(response = "Resp_cat", features = "Numb") %>% 
       sdf_predict() 
pred

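The predictions from the model are not categorical; see the output below. Does this mean I also have to convert from prediction back to Resp_cat, and then to Resp?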

R version 3.4.0 (2017-04-21) 
Platform: x86_64-redhat-linux-gnu (64-bit) 
Running under: CentOS Linux 7 (Core) 

spark_version(sc) 
[1] ‘2.1.1.2.6.1.0’ 

Source: table<sparklyr_tmp_74e340c5607c> [?? x 6] 
Database: spark_connection 
      ID  Numb  Resp Resp_cat id74e35c6b2dbb prediction
   <int> <int> <chr>    <dbl>          <dbl>      <dbl>
 1     1    10   150        8              0   8.000000
 2     2     3   191        4              1   4.000000
 3     3     4   146        9              2   9.000000
 4     4     9   125        5              3   5.000000
 5     5     8   107        2              4   2.000000
 6     6     2   110        1              5   1.000000
 7     7     5   133        3              6   5.333333
 8     8     7   154        6              7   5.333333
 9     9     1   170        0              8   0.000000
10    10     6   143        7              9   5.333333

Source

2017-08-14 Kevin Zheng

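In general, Spark relies on column metadata when handling categorical data. In your pipeline this is handled by StringIndexer (ft_string_indexer). ML always predicts the label (the encoded index), not the original string. Normally you would convert it back with the IndexToString transformer, which sparklyr exposes as ft_index_to_string.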
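To make the relationship between labels and indices concrete, here is a minimal base-R sketch of what a string indexer and its inverse do. It does not call Spark or sparklyr; the vectors `resp`, `labels`, `resp_cat` and `decoded` are hypothetical names chosen for illustration. It assumes the default StringIndexer ordering, where the most frequent label gets index 0, and breaks ties alphabetically, which is an assumption of this sketch rather than a guarantee about Spark.

```r
# Base-R illustration only; no Spark connection is needed.
resp <- c("150", "191", "150", "125", "191", "150")

# Count occurrences of each label and order them by descending frequency.
# Ties are broken alphabetically here (an assumption of this sketch).
counts <- table(resp)
labels <- names(counts)[order(-as.integer(counts), names(counts))]

# Encode: 0-based position of each value in the label vector.
resp_cat <- match(resp, labels) - 1

# Decode: map each index back to its original string.
decoded <- labels[resp_cat + 1]
```

The decoding step only works because `labels` is still available. That is the piece of information the prediction column lacks in the pipeline above.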

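In Spark, IndexToString can use either a provided list of labels or Column metadata. Unfortunately, the sparklyr implementation is limited in two ways:

It can use only metadata, which is not set on the prediction column.
ft_string_indexer discards the trained model, so it cannot be used to extract the labels.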

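It is possible that I have missed something, but it looks like you have to map the predictions manually by joining them with the transformed data, for example: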

pred %>% 
    select(prediction=Resp_cat, Resp_prediction=Resp) %>% 
    distinct() %>% 
    right_join(pred)

Joining, by = "prediction" 
# Source: lazy query [?? x 9] 
# Database: spark_connection 
   prediction Resp_prediction    ID  Numb  Resp Resp_cat id777a79821e1e
        <dbl>           <chr> <int> <int> <chr>    <dbl>          <dbl>
 1          7             171     1     3   171        7              0
 2          0             153     2    10   153        0              1
 3          3             132     3     8   132        3              2
 4          5             122     4     7   122        5              3
 5          6             198     5     4   198        6              4
 6          2             164     6     9   164        2              5
 7          4             137     7     6   137        4              6
 8          1             184     8     5   184        1              7
 9          0             153     9     1   153        0              8
10          1             184    10     2   184        1              9
# ... with more rows, and 2 more variables: rawPrediction <list>, 
# probability <list>

Explanation:

pred %>% 
    select(prediction=Resp_cat, Resp_prediction=Resp) %>% 
    distinct()

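creates a mapping from the prediction (the encoded label) to the original label. We rename Resp_cat to prediction so that it can serve as the join key, and Resp to Resp_prediction to avoid a conflict with the actual Resp.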

Finally, we apply a right equi-join:

... %>% right_join(pred)
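The same lookup-and-join idea can be tried locally without Spark. The following is a minimal base-R sketch, not the sparklyr code itself: `pred_local` is a hypothetical stand-in for the prediction table with made-up values, and `merge(..., all.y = TRUE)` plays the role of `right_join(pred)`.

```r
# Local stand-in for the Spark prediction table (values are made up).
pred_local <- data.frame(
  ID         = 1:4,
  Resp       = c("150", "191", "146", "191"),
  Resp_cat   = c(1, 0, 2, 0),
  prediction = c(1, 0, 0, 2),
  stringsAsFactors = FALSE
)

# Lookup table: one row per (encoded label, original label) pair,
# renamed so that the encoded label can be used as the join key.
lookup <- unique(data.frame(
  prediction      = pred_local$Resp_cat,
  Resp_prediction = pred_local$Resp,
  stringsAsFactors = FALSE
))

# all.y = TRUE keeps every row of pred_local, like right_join(pred).
decoded <- merge(lookup, pred_local, by = "prediction", all.y = TRUE)
decoded <- decoded[order(decoded$ID),
                   c("ID", "Resp", "prediction", "Resp_prediction")]
```

A prediction that matches no encoded label, such as the fractional value 5.333333 produced by a regression tree, would end up with a missing Resp_prediction after the join.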

Note:

You should specify the type of tree explicitly:

ml_decision_tree(
    response = "Resp_cat", features = "Numb", type = "classification")

Source

2017-08-14 16:30:23 user6910411

This is a nice workaround. Thanks! I would like sparklyr to handle this internally, and I have opened a ticket for it: https://github.com/rstudio/sparklyr/issues/928
