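Count whether a word appears in each row of a 4-million-observation dataset

I am using R and writing a script to check whether any of about 2,000 words appears in each row of a data file with 4 million observations. The dataset (df) contains two columns: one with text (df$lead_paragraph) and one with dates (df$date).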

With the following, I can check whether any word in the list (p) appears in each row of the lead_paragraph column of df, and output the answer as a new column.

df$pcount<-((rowSums(sapply(p, grepl, df$lead_paragraph, 
    ignore.case=TRUE) == TRUE, na.rm=T) > 0) * 1)

However, if I include too many words in the list p, running the code crashes R.
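
One plausible reason for the crash (an assumption, since the question gives no error message or memory figures) is the size of the intermediate result: sapply() builds a logical matrix with one row per observation and one column per word, and R stores each logical value in 4 bytes. A rough estimate using the sizes stated in the question:

```r
# Back-of-the-envelope size of the logical matrix built by sapply(p, grepl, ...)
n_rows  <- 4e6           # observations, as stated in the question
n_words <- 2000          # approximate length of p, as stated in the question
bytes_per_logical <- 4   # R stores logical values as 4-byte integers

matrix_gib <- n_rows * n_words * bytes_per_logical / 1024^3
round(matrix_gib, 1)     # about 29.8 GiB
```

That is before the `== TRUE` comparison creates a second matrix of the same size, so any approach that avoids materialising the full matrix should help.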

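My fallback strategy is simply to break the list into pieces, but I wonder whether there is a better, more elegant coding solution. My inclination is to use a for loop, but everything I have read suggests that is not the preferred approach in R. I am quite new to R and not a very good coder, so I apologize if this is unclear.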

df$pcount1<-((rowSums(sapply(p[1:100], grepl, df$lead_paragraph,
    ignore.case=TRUE) == TRUE, na.rm=T) > 0) * 1)
df$pcount2<-((rowSums(sapply(p[101:200], grepl, df$lead_paragraph,
    ignore.case=TRUE) == TRUE, na.rm=T) > 0) * 1)
...
df$pcount22<-((rowSums(sapply(p[2101:2200], grepl, df$lead_paragraph,
    ignore.case=TRUE) == TRUE, na.rm=T) > 0) * 1)
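
The manual split above can also be written as a loop that keeps a single running logical vector instead of 22 separate columns. This is only a minimal sketch on toy data; the helper name any_word_in_rows, the chunk size, and the example data frame are assumptions for illustration, not part of the original post.

```r
# Hypothetical helper: flag elements of `text` that match any pattern in `words`,
# processing `words` in chunks so only an n x chunk_size matrix exists at a time.
any_word_in_rows <- function(text, words, chunk_size = 100) {
    hit <- logical(length(text))  # running result, starts all FALSE
    chunks <- split(words, ceiling(seq_along(words) / chunk_size))
    for (chunk in chunks) {
        m <- sapply(chunk, grepl, text, ignore.case = TRUE)
        # sapply() returns a plain vector when `text` has length 1
        m <- matrix(m, nrow = length(text))
        hit <- hit | (rowSums(m, na.rm = TRUE) > 0)
    }
    hit * 1
}

# Toy data standing in for df and p
df <- data.frame(lead_paragraph = c("The quick brown fox", "Nothing here", NA),
                 stringsAsFactors = FALSE)
p <- c("fox", "dog", "QUICK")
df$pcount <- any_word_in_rows(df$lead_paragraph, p, chunk_size = 2)
df$pcount
```

A for loop over a few dozen chunks is not a performance problem in R; the usual advice against loops is about looping over millions of individual elements.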

Source

2017-08-28 chydock

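A few things/hints, but definitely not a solution (yet). First, the bigger the data, the better off you are moving away from base R (maybe use 'data.table'?). Second, I would use the 'any' function, in which case you can skip the 'rowSums' part as well as the inequality and the multiplication. Third, do you know whether these words occur at random, or is there some pattern, e.g. at the beginning or the end? If so, that would simplify things greatly. Finally, try parsing the text to get rid of unnecessary memory usage.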

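Is the goal to count the number of occurrences, in each row, of any string in 'p'? That is: 'for each row of data frame x, count the N occurrences of any string in p and total them in a new column'?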

@CarlBoneri - Yes. Ultimately I only need to know whether any string in p appears in a given row of data (binary, true/false), but a count would suffice. – chydock

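I did not finish this... but it should point you in the right direction. Using the data.table package would be faster, but hopefully this gives you an idea of the process.

I recreated the dataset using text extracted from http://www.norvig.com/big.txt, placed into a data.frame named nrv_df together with random dates: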

library(stringi) 

> head(nrv_df) 
                  lead_para  date 
1  The Project Gutenberg EBook of The Adventures of Sherlock Holmes 2018-11-16 
2           by Sir Arthur Conan Doyle 2019-06-05 
3       15 in our series by Sir Arthur Conan Doyle 2017-08-08 
4 Copyright laws are changing all over the world Be sure to check the 2014-12-17 
5 copyright laws for your country before downloading or redistributing 2016-09-13 
6       this or any other Project Gutenberg eBook 2015-06-15 

> dim(nrv_df) 
[1] 103598  2 

I then randomly sampled words from the entire body to get 2000 unique words 
> length(p) 
[1] 2000 
> head(p) 
[1] "The"  "Project" "Gutenberg" "EBook"  "of"   "Adventures" 
> tail(p) 
[1] "accomplice" "engaged" "guessed" "row"  "moist"  "red"

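Then, to make use of the stringi package with a regular expression that matches whole words, I join every string in the vector p and collapse them with |, so that we are looking for any of the words with a word boundary before and after it: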

> p_join2 <- stri_join(sprintf("\\b%s\\b", p), collapse = "|") 
> p_join2 

[1] "\\bThe\\b|\\bProject\\b|\\bGutenberg\\b|\\bEBook\\b|\\bof\\b|\\bAdventures\\b|\\bSherlock\\b|\\bHolmes\\b|\\bby\\b|\\bSir\\b|\\bArthur\\b|\\bConan\\b|\\bDoyle\\b|\\b15\\b|\\bin\\b|\\bour\\b|\\bseries\\b|\\bCopyright\\b|\\blaws\\b|\\bare\\b|\\bchanging\\b|\\ball\\b|\\bover\\b|\\bthe\\b|\\bworld\\b|\\bBe\\b|\\bsure\\b|\\bto\\b|\\bcheck\\b|\\bcopyright\\b|\\bfor\\b|\\byour\\b|\\bcountry\\b|..."

Then simply count the words; you could assign the result with nrv_df$counts <- to add it as a column...

> stri_count_regex(nrv_df$lead_para[25000:26000], p_join2, stri_opts_regex(case_insensitive = TRUE)) 
[1] 12 11 8 13 7 7 6 7 6 8 12 1 6 7 8 3 5 3 5 5 5 4 7 5 5 5 5 5 10 2 8 13 5 8 9 7 6 5 7 5 9 8 7 5 7 8 5 6 0 8 6 
[52] 3 4 0 10 7 9 8 4 6 8 8 7 6 6 6 0 3 5 4 7 6 5 7 10 8 10 10 11
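
Since the question ultimately asks for a true/false flag, the counts can be reduced to 0/1, or stri_detect_regex can be used directly. Below is a small sketch on made-up strings; the vectors p_toy and txt are assumptions for illustration, not data from this answer. Note that words containing regex metacharacters (such as "." or "+") would need escaping before being joined this way.

```r
library(stringi)

# Toy stand-ins for p and the lead_para column
p_toy <- c("the", "fox")
txt   <- c("The fox ran", "Another story", "the other fox and the dog")

# Same construction as p_join2: each word wrapped in word boundaries, joined by |
pat <- stri_join(sprintf("\\b%s\\b", p_toy), collapse = "|")

# Number of whole-word matches per string, ignoring case
counts <- stri_count_regex(txt, pat,
                           opts_regex = stri_opts_regex(case_insensitive = TRUE))

# Binary flag: 1 if at least one word matched, else 0
flag <- as.integer(stri_detect_regex(txt, pat,
                                     opts_regex = stri_opts_regex(case_insensitive = TRUE)))
```

Here "Another" and "other" do not count as matches for "the", because the word boundaries require the whole word to match.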

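Edit:

Since the number of matches is of no consequence... first, a function that does the work for each paragraph and detects whether any of the strings in p2 occurs in lead_paragraph: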

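# i: a single paragraph; j: character vector of words to look for.
# Returns 1 if any word in j occurs in i as a literal substring, else 0.
# (omit_no_match is not an argument of stri_detect_fixed, so it is dropped.)
f <- function(i, j){
    if(any(stri_detect_fixed(i, j))){
        1
    }else {
        0
    }
}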
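
One thing to keep in mind with this function: stri_detect_fixed looks for literal substrings and is case-sensitive by default, so it can give different answers from the word-boundary, case-insensitive regex used earlier. A small illustration on made-up input, where detect_any is a hypothetical stand-in for f:

```r
library(stringi)

# Hypothetical stand-in for f(): 1 if any literal pattern occurs in the string
detect_any <- function(string, patterns) {
    as.integer(any(stri_detect_fixed(string, patterns)))
}

words <- c("the", "fox")

substring_hit <- detect_any("Another story", words)  # "the" occurs inside "Another"
case_miss     <- detect_any("THE FOX", words)        # fixed matching is case-sensitive

# The whole-word, case-insensitive regex treats both inputs the other way round
pat <- stri_join(sprintf("\\b%s\\b", words), collapse = "|")
regex_hits <- stri_detect_regex(c("Another story", "THE FOX"), pat,
                                opts_regex = stri_opts_regex(case_insensitive = TRUE))
```

Which behaviour is wanted depends on the word list; for a binary flag over whole words, the regex variant is closer to what the question describes.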

Now... using the parallel library on Linux. I am only testing 1000 rows here, since this is just an example:

library(parallel) 
library(stringi) 
> rst <- mcmapply(function(x){ 
    f(i = x, j = p2) 
}, vdf2$lead_paragraph[1:1000], 
mc.cores = detectCores() - 2, 
USE.NAMES = FALSE) 
> rst 
    [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
    [70] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
[139] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 
[208] 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
[277] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
[346] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 0 1 1 1 1 
[415] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
[484] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
[553] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
[622] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
[691] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
[760] 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
[829] 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
[898] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 
[967] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Source

2017-08-29 00:26:12

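Both solutions work well, thank you. As required, neither crashes when I include the whole dataset (4 million rows), and there is no need to split up the analysis. – chydock

This also works, and it is faster than the previous solution: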

library(corpus) 

# simulate the problem as in @carl-boneri's answer 
lead_para <- readLines("http://www.norvig.com/big.txt") 

# get a random sample of 2000 word types 
types <- text_types(lead_para, collapse = TRUE) 
p <- sample(types, 2000) 

# find whether each entry has at least one of the terms in `p` 
ix <- text_detect(lead_para, p)

Even using only a single core, it is more than 20 times faster:

system.time(ix <- text_detect(lead_para, p)) 
## user system elapsed 
## 0.231 0.008 0.240 

system.time(rst <- mcmapply(function(x) f(i = x, j = p_join2), 
          lead_para, mc.cores = detectCores() - 2, 
          USE.NAMES = FALSE)) 
## user system elapsed 
## 11.604 0.240 5.805

Source

2017-10-04 22:44:36
