2017-05-06

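Quanteda: applying a multi-level Yoshikoder dictionary

I am using quanteda for dictionary-based quantitative text analysis. I am building my own dictionary with Lowe's Yoshikoder. I can apply my Yoshikoder dictionary in quanteda (see below), but the function only takes the first level of the dictionary into account. I need to see the values for every category, including all subcategories (at least 4 levels). How can I do this?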

# load my Yoshikoder dictionary with multiple levels 
mydict <- dictionary(file = "mydictionary.ykd", 
format = "yoshikoder", concatenator = "_", tolower = TRUE, encoding = "auto") 

# apply dictionary 
mydfm <- dfm(mycorpus, dictionary = mydict) 
mydfm 
# problem: shows only results for the first level of the dictionary 

Answers

dfm_lookup (and tokens_lookup) have a levels argument that defaults to 1:5. Try applying the lookup separately:

mydfm <- dfm(mycorpus) 
dfm_lookup(mydfm, dictionary = mydict) 

Or:

mytoks <- tokens(mycorpus) 
mytoks <- tokens_lookup(mytoks, dictionary = mydict) 
dfm(mytoks) 
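
To see what the levels argument does independently of any Yoshikoder file, here is a minimal sketch with a small hand-built nested dictionary. The object `toy_dict`, its keys, and the two example texts are made up for illustration. It assumes a quanteda version (1.0 or later) in which `tokens()`, `dfm()`, and `dfm_lookup()` accept these arguments.

```r
library(quanteda)

# Hypothetical two-level dictionary, built by hand for illustration only
toy_dict <- dictionary(list(
  economy = list(
    state  = c("pension*", "welfare"),
    market = c("privat*", "compet*")
  ),
  environment = list(
    pro = c("recycl*", "green")
  )
))

# Two made-up documents
txt <- c(d1 = "Welfare and pensions versus private competition.",
         d2 = "Green recycling schemes.")
toy_dfm <- dfm(tokens(txt, remove_punct = TRUE))

# levels = 1 counts matches under the top-level keys only
dfm_top <- dfm_lookup(toy_dfm, dictionary = toy_dict, levels = 1)

# levels = 1:2 keeps the nested keys as separate features
dfm_nested <- dfm_lookup(toy_dfm, dictionary = toy_dict, levels = 1:2)

featnames(dfm_top)
featnames(dfm_nested)
```

Comparing the two sets of feature names is a quick way to check whether the nested levels of a dictionary were actually loaded: if both calls return the same features, the dictionary object itself probably contains only one level.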

Update:

This is now fixed in v0.9.9.55.

> library(quanteda) 
# Loading required package: quanteda 
# quanteda version 0.9.9.55 
# Using 7 of 8 cores for parallel computing 

> mydict <- dictionary(file = "~/Desktop/LaverGarryAJPS.ykd") 
> mydfm <- dfm(data_corpus_irishbudget2010, dictionary = mydict, verbose = TRUE) 
# Creating a dfm from a corpus ... 
# ... tokenizing texts 
# ... lowercasing 
# ... found 14 documents, 5,058 features 
# ... applying a dictionary consisting of 19 keys 
# ... created a 14 x 19 sparse dfm 
# ... complete. 
# Elapsed time: 0.422 seconds. 

> mydict 
# Dictionary object with 9 primary key entries and 2 nested levels. 
# - Economy: 
#  - +State+: 
#  - accommodation, age, ambulance, assist, benefit, care, class, classes, clinics, deprivation, disabilities, disadvantaged, elderly, establish, hardship, hunger, invest, investing, investment, patients, pension, poor, poorer, poorest, poverty, school, transport, vulnerable, carer*, child*, collective*, contribution*, cooperative*, co-operative*, educat*, equal*, fair*, guarantee*, health*, homeless*, hospital*, inequal*, means-test*, nurse*, rehouse*, re-house*, teach*, underfund*, unemploy*, widow* 
#  - =State=: 
#  - accountant, accounting, accounts, bargaining, electricity, fee, fees, import, imports, jobs, opportunity, performance, productivity, settlement, software, supply, trade, welfare, advert*, airline*, airport*, audit*, bank*, breadwinner*, budget*, buy*, cartel*, cash*, charge*, chemical*, commerce*, compensat*, consum*, cost*, credit*, customer*, debt*, deficit*, dwelling*, earn*, econ*, estate*, export*, financ*, hous*, industr*, lease*, loan*, manufactur*, mortgage*, negotiat*, partnership*, passenger*, pay*, port*, profession*, purchas*, railway*, rebate*, recession*, research*, revenue*, salar*, sell*, supplier*, telecom*, telephon*, tenan*, touris*, train*, wage*, work* 
#  - -State-: 
#  - assets, autonomy, bid, bidders, bidding, confidence, confiscatory, controlled, controlling, controls, corporate, deregulating, expensive, fund-holding, initiative, intrusive, monetary, money, private, privately, privatisations, privatised, privatising, profitable, risk, risks, savings, shares, sponsorship, taxable, taxes, tax-free, trading, value, barrier*, burden*, charit*, choice*, compet*, constrain*, contracting*, contractor*, corporation*, dismantl*, entrepreneur*, flexib*, franchise*, fundhold*, homestead*, investor*, liberali*, market*, own*, produce*, regulat*, retail*, sell*, simplif*, spend*, thrift*, volunt*, voucher* 
#  - Institutions: 
#  - Radical: 
#  - abolition, accountable, answerable, scrap, consult*, corrupt*, democratic*, elect*, implement*, modern*, monitor*, rebuild*, reexamine*, reform*, re-organi*, repeal*, replace*, representat*, scandal*, scrap*, scrutin*, transform*, voice* 
#  - Neutral: 
#  - assembly, headquarters, office, offices, official, opposition, queen, voting, westminster, administr*, advis*, agenc*, amalgamat*, appoint*, chair*, commission*, committee*, constituen*, council*, department*, directorate*, executive*, legislat*, mechanism*, minister*, operat*, organisation*, parliament*, presiden*, procedur*, process*, regist*, scheme*, secretariat*, sovereign*, subcommittee*, tribunal*, vote* 
#  - Conservative: 
#  - authority, legitimate, moratorium, whitehall, continu*, disrupt*, inspect*, jurisdiction*, manag*, rul*, strike* 
#  - Values: 
#  - Liberal: 
#  - innocent, inter-racial, rights, cruel*, discriminat*, human*, injustice*, minorit*, repressi*, sex* 
#  - Conservative: 
#  - defend, defended, defending, discipline, glories, glorious, grammar, heritage, integrity, maintain, majesty, marriage, past, pride, probity, professionalism, proud, histor*, honour*, immigra*, inherit*, jubilee*, leader*, obscen*, pornograph*, preserv*, principl*, punctual*, recapture*, reliab*, threat*, tradition* 
#  - Law and Order: 
#  - Liberal: 
#  - harassment, non-custodial 
#  - Conservative: 
#  - assaults, bail, court, courts, dealing, delinquen*, deter, disorder, fine, fines, firmness, police, policemen, policing, probation, prosecution, re-offend, ruc, sentence*, shop-lifting, squatting, uniformed, unlawful, victim*, burglar*, constab*, convict*, custod*, deter*, drug*, force*, fraud*, guard*, hooligan*, illegal*, intimidat*, joy-ride*, lawless*, magistrat*, offence*, officer*, penal*, prison*, punish*, seiz*, terror*, theft*, thug*, tough*, trafficker*, vandal*, vigilan* 
#  - Environment: 
#  - Pro: 
#  - car, catalytic, congestion, energy-saving, fur, green, husbanded, opencast, ozone, planet, population, re-use, toxic, warming, chemical*, chimney*, clean*, cyclist*, deplet*, ecolog*, emission*, environment*, habitat*, hedgerow*, litter*, open-cast*, recycl*, re-cycl* 
# 
#  ... 
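Thank you for your help. I tried both of your suggestions. The first one runs, but it still applies a dictionary with only 2 keys (meaning only the first level, even when I set levels = 5). The second results in the following error: "Error in qatd_cpp_tokens_lookup(x, keys, entries_id, keys_id, FALSE): not compatible with requested type". – Sera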

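In any case, I think the problem lies in the earlier step (loading the dictionary), because the dictionary is loaded as a list of only 2 ("2 keys"), so only the first level of the dictionary is counted...? I also tried the Laver and Garry dictionary in Yoshikoder format, with the same problem. However, if the Laver and Garry dictionary is loaded in Wordstat format, all levels are accounted for... – Sera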

Could you send the dictionary file and the dfm object so that we can test? –

While I fix this in quanteda, try this replacement function, which collapses the categories:

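library(xml2) 
library(quanteda)  # provides dictionary() 

# Read a Yoshikoder .ykd file and collapse nested categories into single 
# keys such as "Top>Middle>Leaf", joined by `sep`. 
read_dict_yoshikoder <- function(path, sep = ">") { 
    doc <- xml2::read_xml(path) 
    # pattern (word) nodes and their names 
    pats <- xml2::xml_find_all(doc, ".//pnode") 
    pnode_names <- xml2::xml_attr(pats, "name") 
    # build the full category path of one pattern node from its ancestors 
    get_pnode_path <- function(pn) { 
        pars <- xml2::xml_attr(xml2::xml_parents(pn), "name") 
        paste0(rev(na.omit(pars)), collapse = sep) 
    } 
    pnode_paths <- lapply(pats, get_pnode_path) 
    # group patterns by category path: one dictionary key per path 
    lst <- split(pnode_names, unlist(pnode_paths)) 
    dictionary(lst) 
} 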

Usage:

read_dict_yoshikoder("laver-garry-ajps.ykd") 

Dictionary object with 19 key entries. 
- Laver and Garry>Culture>High: art, artistic, dance, galler*, museum*, music*, opera*, theatre* 
- Laver and Garry>Culture>Popular: media 
- Laver and Garry>Culture>Sport: angler* 
- Laver and Garry>Environment>Con: produc* 
- Laver and Garry>Environment>Pro: car, catalytic, congestion, energy-saving, fur, green, husbanded, opencast, ozone, planet, population, re-use, toxic, warming, chemical*, chimney*, clean*, cyclist*, deplet*, ecolog*, emission*, environment*, habitat*, hedgerow*, litter*, open-cast*, recycl*, re-cycl* 
- Laver and Garry>Groups>Ethnic: race, asian*, buddhist*, ethnic*, raci* 

... 
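
A flattened dictionary of this kind can then be applied like any single-level dictionary. The sketch below does not read a .ykd file. It rebuilds three of the keys printed above by hand, and the object `flat_dict` and the two example texts are made up for illustration. It assumes a quanteda version (1.0 or later) in which `tokens()`, `dfm()`, and `dfm_lookup()` are available.

```r
library(quanteda)

# Hand-built stand-in for part of the flattened dictionary shown above
flat_dict <- dictionary(list(
  "Laver and Garry>Culture>High" = c("art", "artistic", "dance", "galler*",
                                     "museum*", "music*", "opera*", "theatre*"),
  "Laver and Garry>Culture>Popular" = "media",
  "Laver and Garry>Environment>Con" = "produc*"
))

# Two made-up documents
txt <- c(doc1 = "Museums and theatres received more funding than the media.",
         doc2 = "Producers objected to the new rules.")
toy_dfm <- dfm(tokens(txt, remove_punct = TRUE))

# Every collapsed key is a first-level key, so the default lookup reports it
flat_counts <- dfm_lookup(toy_dfm, dictionary = flat_dict)
flat_counts
```

Because the category path is encoded in the key name itself, this approach does not depend on nested-level support in the lookup functions.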
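Thank you for providing this alternative solution. However, when running the function there seems to be a problem with collapse = sep. The error says: Error in paste0(rev(na.omit(pars)), collapse = sep): promise already under evaluation: recursive default argument reference or earlier problems? Called from: paste0(rev(na.omit(pars)), collapse = sep) – Sera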

Oops. It should be fixed now. @Sera – conjugateprior

Thanks, this works as an alternative solution while the Yoshikoder reader in quanteda has not yet been fixed! – Sera