2017-04-14

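Extracting the lines before and after a keyword in full text using R

I have PDF articles converted to text files, and I want to extract cancer-related information from them: the lines (or paragraphs) containing the word "cancer", together with the lines before and after them.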

abstracts <- lapply(mytxtfiles, function(i) { 
j <- paste0(scan(i, what = character()), collapse = " ") 
regmatches(j, gregexpr("(?m)(^[^\\r\\n]*\\R+){4}[cancer][^\\r\\n]*\\R+(^[^\\r\\n]*\\R+){4}", j, perl=TRUE))}) 

The regular expression above does not work.

'[cancer]' != 'cancer'! The former is a character class, the latter a literal string. – Jan
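To illustrate the point in the comment above, here is a minimal sketch of the difference between the character class and the literal. The test strings are made up for illustration.

```r
# '[cancer]' is a character class: it matches any ONE of the letters c, a, n, e, r.
class_hit <- grepl("[cancer]", "apple")             # "apple" contains 'a' and 'e'

# 'cancer' is a literal: it matches only that exact sequence of letters.
literal_miss <- grepl("cancer", "apple")
literal_hit  <- grepl("cancer", "anti-cancer drugs")

print(c(class_hit, literal_miss, literal_hit))
```

So the original pattern accepts almost any line, because nearly every line contains at least one of those five letters.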

If you use '\R', you must use 'perl = TRUE'.

Replace every '[^\r\n]*' with '.*', and '[cancer][^\\r\\n]*' with '.*cancer.*'. See ['(?m)(^.*\R+){4}.*cancer.*(\R+.*){4}'](https://regex101.com/r/Hbr9ep/1). If there may not be enough lines before or after the match, replace '{4}' with '{0,4}'.
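A minimal sketch of the regex suggested in the comments, under a few assumptions. First, 'scan(i, what = character())' splits the file on whitespace, so pasting the pieces back with " " leaves no line breaks for '\R' to match; the sketch therefore joins lines with "\n", as you would after 'readLines()'. Second, the sample lines and the context width 'n' are hypothetical, and the pattern is a slightly tightened variant of the one in the comment (one line break per line, non-capturing groups).

```r
# Hypothetical sample standing in for the lines of one text file (e.g. from readLines()).
lines <- c("intro", "background", "a study of cancer cells",
           "methods", "results", "end")
txt <- paste(lines, collapse = "\n")   # keep the line breaks

n <- 1L  # lines of context wanted on each side of the match

# Up to n full lines, then the line containing "cancer", then up to n more lines.
pattern <- sprintf("(?m)(?:^.*\\R){0,%d}.*cancer.*(?:\\R.*){0,%d}", n, n)

# \R is PCRE syntax, so perl = TRUE is required.
hits <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]
cat(hits, sep = "\n---\n")
```

Using '{0,n}' rather than '{n}' lets the match succeed when the keyword sits near the start or end of the file.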

Answer

Here is one approach:

library(textreadr) 
library(tidyverse) 

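# Return the indices of the lines matching `regex`, plus `n` lines of
# context on either side, clipped to the valid range of `var`.
loc <- function(var, regex, n = 1, ignore.case = TRUE){ 
    locs <- grep(regex, var, ignore.case = ignore.case) 
    out <- sort(unique(c(outer(locs, -n:n, "+")))) 
    out <- out[out > 0] 
    out[out <= length(var)] 
} 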

doc <- 'https://www.in.kpmg.com/pdf/Indian%20Pharma%20Outlook.pdf' %>% 
    read_pdf() %>% 
    slice(loc(text, 'cancer')) 

doc 

##    page_id element_id text
## 1       24         28 Ranjit Shahani applauds the National Pharmaceuticals Policy's proposal of public/private
## 2       24         29 partnerships (PPPs) to tackle life-threatening diseases such as cancer and HIV/AIDS, but
## 3       24         30 stresses that, in order for them to work, they should be voluntary, and the government
## 4       25          8 the availability of medicines to treat life-threatening diseases. It notes, for example, that
## 5       25          9 while an average estimate of the value of drugs to treat the country's cancer patients is
## 6       25         10 $1.11 billion, the market is in fact worth only $33.5 million. “The big gap indicates the
## 7       25         12 because of the high cost of these medicines,” says the Policy, which also calls for tax and
## 8       25         13 excise exemptions for anti-cancer drugs.
## 9       25         14 Another area for which PPPs are proposed is for drugs to treat HIV/AIDS, India's biggest health
## 10      32         19 Variegate Trading, a UB subsidiary. The firm's major products are in the anti-infective,
## 11      32         20 anti-inflammatory, cancer, diabetes and allergy market segments and, for the year ended
## 12      32         21 December 31, 2005, it reported net sales (excluding excise duty) up 9.9 percent to $181.1
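Thank you for showing me a different approach. Can we do this for multiple PDFs stored in a particular location? Also, with this I was able to extract the lines containing the word 'cancer', but not the lines before and after them. How do I extract the lines before and after the line containing the word 'cancer'?

Yes, you can do this for multiple PDFs. See the 'read_dir' function. I have already shown the lines before and after above, so I don't know what you mean by the lines before and after. For example, line 29 contains the word cancer, and I also included lines 28 and 30.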
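For the multiple-file case, here is a base-R sketch that works on plain text files rather than 'read_dir'. It assumes the PDFs have already been converted to '.txt' files in one directory; the helper name 'context_idx', the demo directory, and the file contents are all hypothetical.

```r
# Hypothetical helper: indices of matching lines plus n lines on each side.
context_idx <- function(x, regex, n = 1L, ignore.case = TRUE) {
  hits <- grep(regex, x, ignore.case = ignore.case)
  idx <- sort(unique(as.vector(outer(hits, -n:n, "+"))))
  idx[idx >= 1 & idx <= length(x)]   # drop indices outside the file
}

# Two throwaway text files stand in for the converted PDFs.
demo_dir <- file.path(tempdir(), "abstracts_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("a", "b cancer", "c", "d"), file.path(demo_dir, "one.txt"))
writeLines(c("e", "f", "g"), file.path(demo_dir, "two.txt"))

# Apply the same extraction to every text file in the directory.
files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
context <- lapply(files, function(f) {
  x <- readLines(f)
  x[context_idx(x, "cancer", n = 1L)]
})
names(context) <- basename(files)
print(context)
```

Files with no match yield an empty character vector, which makes them easy to filter out afterwards.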

Can we split the text into sentences? I am thinking of treating one complete sentence as one line.
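One way to approach the sentence question is sketched below, using only base R. It rejoins the wrapped lines into one string and splits after '.', '!' or '?' followed by whitespace. This is a naive rule that will mis-split abbreviations such as "e.g.", and the sample lines are invented for illustration.

```r
# Hypothetical wrapped lines, as they might come out of a PDF.
lines <- c("Drug prices remain high. Access to",
           "cancer medicines is limited in rural",
           "areas. Policy reform is under discussion.")

# Undo the line wrapping, then split wherever sentence-ending
# punctuation is followed by whitespace (lookbehind needs perl = TRUE).
txt <- paste(lines, collapse = " ")
sentences <- strsplit(txt, "(?<=[.!?])\\s+", perl = TRUE)[[1]]

# Find the sentences mentioning the keyword and keep one neighbour on each side.
hits <- grep("cancer", sentences, ignore.case = TRUE)
keep <- sort(unique(c(hits - 1, hits, hits + 1)))
keep <- keep[keep >= 1 & keep <= length(sentences)]
print(sentences[keep])
```

Because the split requires whitespace after the punctuation, figures such as "$1.11 billion" are not broken apart.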