2017-09-27 167 views
-3

Parsing a text file to CSV in Python

I need help parsing a text file into CSV. My text file looks like this:

12: IBD08; ANALYSIS AND CHARACTERISATION OF THE FAECAL MICROBIAL DEGRADOME IN INFLAMMATORY BOWEL DISEASE 
Identifiers: BioSample: SAMEA3914946; SRA: ERS1102080 
Organism: Homo sapiens 
Attributes: 
    /sample name="ERS1102080" 
    /collection date="2011" 
    /environment biome="Intestine" 
    /environment feature="Colon" 
    /environment material="Faecal" 
    /geographic location (country and/or sea)="United Kingdom" 
    /host body product="Faeces" 
    /host disease status="Healthy" 
    /human gut environmental package="human-gut" 
    /investigation type="metagenome" 
    /latitude (raw)="51??31'03.3" 
    /longitude (raw)="0??10'25.2" 
    /project name="IBD gut" 
    /sequencing method="Illumina Miseq" 
Description: 
Multi 'omic analysis of the gut microbiome in IBD 
Accession: SAMEA3914946 ID: 5788180 
2: qiita_sid_833:833.Sweden.IBD.102A; 833.Sweden.IBD.102A 
Identifiers: BioSample: SAMEA3924619; SRA: ERS1111753 
Organism: gut metagenome 
Attributes: 
    /sample name="ERS1111753" 
    /sex="male" 
    /age="3.9" 
    /age group="2.0" 
    /age unit="years" 
    /altitude="0" 
    /anonymized name="Sweden.IBD.102A" 
    /antibiotics="definite_no" 
    /assigned from geo="False" 
    /barcodesequence="CTGCTATTCCTC" 
    /body habitat="UBERON:feces" 
    /body product="UBERON:feces" 
    /tissue="UBERON:feces" 
    /breed="Great_Dane" 
    /breed grouping="Working" 
    /collection date="1/30/12" 
    /collection timestamp="1/30/12" 
    /common name="gut metagenome" 
    /geographic location="Sweden: GAZ" 
    /depth="0" 
    /disease="IBD" 
    /dna extracted="True" 
    /elevation="13.02" 
    /emp status="NOT_EMP" 
    /environment biome="ENVO:urban biome" 
    /environment feature="ENVO:animal-associated habitat" 
    /env matter="ENVO:feces" 
    /experiment center="Texas A&M" 
    /experiment design description="Fecal samples from dogs of various breeds, places of origin, and severity of bowel disorder were sequencing to obtain a dog gut metagenome." 
    /experiment title="suchodolski_dog_ibd" 
    /gender specific="M" 
    /has extracted data="True" 
    /has physical specimen="True" 
    /histo="both" 
    /host="domestic dog" 
    /host="Canis lupus familiaris" 
    /host subject id="Sweden.IBD.102A" 
    /host taxonomy ID="9615" 
    /illumina technology="HiSeq" 
    /latitude="60.13" 
    /library construction protocol="This analysis was done as in Caporaso et al 2011 Genome research. The PCR primers F515 and R806 were developed against the V4 region of the 16S rRNA, both bacteria and archaea, which we determined would yield optimal community clustering with reads of this length The reverse PCR primer is barcoded with a 12-base error-correcting Golay code to facilitate multiplexing of up to 1,500 samples per lane, and both PCR primers contain sequencer adapter regions." 
    /linker="GT" 
    /linkerprimersequence="GTGCCAGCMGCCGCGGTAA" 
    /longitude="18.64" 
    /pcr primers="FWD:GTGCCAGCMGCCGCGGTAA; REV:GGACTACHVGGGTWTCTAAT" 
    /physical location="CCME" 
    /physical specimen location="Texas A&M" 
    /physical specimen remaining="False" 
    /platform="Illumina" 
    /platformchemistry="HiSeq_V4" 
    /pool name="R.K.1.20.12" 
    /primer plate="1" 
    /public="False" 
    /required sample info status="completed" 
    /run center="CCME" 
    /run date="1/30/12" 
    /run prefix="Suchodolski_dog_ibd" 
    /sample size="0.1, gram" 
    /sample center="Texas A&M" 
    /sample plate="IBD1" 
    /sequencing meth="sequencing by synthesis" 
    /size grouping="large" 
    /study center="Texas A&M" 
    /target gene="16S rRNA" 
    /target subfragment="V4" 
    /title="Suchodolski_dog_ibd" 
    /total mass="54.0" 
    /weight group="5.0" 
    /weight kg="54.0" 
    /well id="H6" 
Description: 
IBD1_Sweden_IBD_102A_H6_R.K.1.20.12 
Accession: SAMEA3924619 ID: 5507372 

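Output format:

Project name  BioSample  SRA  Organism  Sample name  etc...
IBD08  SAMEA3914946  ERS1102080  Homo sapiens  ERS1102080

Each project has different fields. How can I make a column for every field that appears in any project? Thanks in advance.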
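To make the first columns of the desired output concrete, here is a minimal Python 3 sketch that pulls the project name out of a header line and splits an `Identifiers:` line into separate values. The helper names `parse_header` and `parse_identifiers` are made up for illustration, and the sketch assumes every header has the form `N: NAME; TITLE` and every identifier is a `Key: value` pair separated by `;`, as in the two records shown above.

```python
import re


def parse_header(line):
    """Split a header such as '12: IBD08; SOME TITLE' into (name, title).

    Hypothetical helper: assumes the 'N: NAME; TITLE' layout from the sample.
    """
    match = re.match(r'\d+:\s*(.*?);\s*(.*)', line.strip())
    return match.groups() if match else (line.strip(), '')


def parse_identifiers(line):
    """Turn 'Identifiers: BioSample: X; SRA: Y' into {'BioSample': 'X', 'SRA': 'Y'}."""
    _, _, rest = line.partition(':')  # drop the leading 'Identifiers' label
    pairs = (item.split(':', 1) for item in rest.split(';') if ':' in item)
    return {key.strip(): value.strip() for key, value in pairs}


name, title = parse_header('12: IBD08; ANALYSIS AND CHARACTERISATION')
ids = parse_identifiers('Identifiers: BioSample: SAMEA3914946; SRA: ERS1102080')
print(name, ids['BioSample'], ids['SRA'])
```

The second sample header contains a colon inside the name (`qiita_sid_833:833.Sweden.IBD.102A`); the lazy match up to the first `;` keeps that name intact.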

+2

What output format are you trying to achieve? Please edit the question to include it.

+0

Possible duplicate of [Python parse CSV correctly](https://stackoverflow.com/questions/12296585/python-parse-csv-correctly)

Answers

0

Your two examples have very different fields, but you can still create a CSV containing all the fields you need, as follows:

from itertools import groupby, takewhile, ifilter  # ifilter exists in Python 2 only
import re
import csv

heading = None
sub_headings = ['Identifiers', 'Organism']
attribute_fields = []

# First scan to determine list of all used attribute_fields
with open('projects.txt') as f_projects:
    # Attribute lines are indented, so allow leading whitespace before the '/'
    re_attributes = re.compile(r'\s*/(.*?)=".*"')

    for line in f_projects:
        # '    /sample size="0.1, gram"'
        re_attribute = re_attributes.match(line)

        if re_attribute:
            attribute_fields.append(re_attribute.group(1))

# Remove duplicate attributes, sort and prefix the top fields
attribute_fields = ['Description', 'id', 'Accession', 'AccessionID'] + sorted(set(attribute_fields))

# The csv module in Python 2 expects the output file to be opened in binary mode
with open('projects.txt') as f_projects, open('output.csv', 'wb') as f_output:
    csv_output = csv.DictWriter(f_output, fieldnames=sub_headings + attribute_fields)
    csv_output.writeheader()

    skip_empty_lines = ifilter(lambda x: len(x.strip()), f_projects)

    # Header lines such as '12: IBD08; ...' start a new record; the lines
    # between two headers form one group
    for k, v in groupby(skip_empty_lines, lambda x: re.match(r'\d+: ', x)):
        if k:
            heading = next(v).strip()
        elif heading:
            row = {'id': heading}
            lines = list(v)

            for line_number, line in enumerate(lines):
                for sub_heading in sub_headings:
                    if line.startswith(sub_heading):
                        row[sub_heading] = line.split(':', 1)[1].strip()

                if line.startswith('Attributes:'):
                    # Consume the indented '/name="value"' lines that follow
                    for attribute in takewhile(lambda x: x.lstrip().startswith('/'), lines[line_number+1:]):
                        key, value = re.findall(r'/(.*?)="(.*?)"', attribute)[0]
                        row[key] = value

                if line.startswith('Description:'):
                    row['Description'] = lines[line_number+1].strip()    # use next line only

                # Accession: SAMN00030407\tID: 30407 (a tab or spaces before 'ID:')
                if line.startswith('Accession:'):
                    accession, accession_id = re.match(r'Accession:\s*(.*?)\s+ID:\s*(.*?)\s*$', line).groups()
                    row.update({'Accession': accession, 'AccessionID': accession_id})

            csv_output.writerow(row)

This produces a rather sparse output CSV, as follows:

Identifiers,Organism,Description,id,Accession,AccessionID,!16S_BarcodeSequence,"""PUBLIC""",16S_ForwardPrimer,16S_LinkerPrimerSequence,ArrayExpress-Species,ENA-CHECKLIST,ENA-FIRST-PUBLIC,ENA-LAST-UPDATE,HCA_MBT,HEIGHT,ITS2_BarcodeSequence,ITS2_LinkerPrimerSequence,PUBLIC,PlatformChemistry,Species,TOTAL_SCCA,WEIGHT,age,age at fmt,age group,age unit,age_unit,altitude,analyte type,anonymized name,anonymized_name,antibiotics,assigned from geo,assigned_from_geo,barcoded primer name,barcoded_primer_name,barcodesequence,bcs,bcs grouping,biomaterial provider,biospecimen repository,biospecimen repository sample id,body habitat,body mass index,body product,breed,breed grouping,calprotectin,cd behavior,cd location,cd resection,chemical administration,collection date,collection timestamp,common name,common_name,crude fiber 1000kcalg me group,cultivar,day since fmt,depth,description,detail,dewormed,diagnosis full,disease,disease control,dna extracted,donor group,donor kind,donor or patient,donor_recipient,ecotype,elevation,emp status,env matter,env_matter,environment biome,environment feature,environment material,environmental package,ethnicity,exp code,experiment center,experiment design description,experiment title,fecal date,fmt modality,g fat 1000kcal me group,g protein 1000kcal me group,gastrointestinal tract disorder,gender specific,geographic location,geographic location (country and/or sea),has extracted data,has physical specimen,health state,histo,histological type,hospitalized for fmt,host,host age,host body mass index,host body product,host disease,host disease status,host family relationship,host genotype,host sex,host subject id,host taxonomy ID,host tissue sampled,host-associated environmental package,human gut environmental package,ibd,ibd or not,ibd subtype,illumina technology,immune_state,immunocompromized,indiv g fat 1000kcal me group,indiv g protein 1000kcal me group,individual,indoor outdoor,inflammed,investigation type,isolate,isolation and growth condition,isolation source,lane,latitude,latitude (raw),latitude and longitude,library construction protocol,linker,linkerprimersequence,longitude,longitude (raw),marital status,mid,miscellaneous parameter,molecular data type,mouse_number,non barcoded linker,non barcoded primer,non barcoded primer name,non_barcoded_linker,non_barcoded_primer,non_barcoded_primer_name,number courses metronidazole,number fidaxo courses,number ivig,number prior episodes,number prior fmt,number recurrence after fmt,number std vanco courses,number vanco tapers,pathology,patient,patientnumber,pcr primers,pcr_primers,perc crude protein min group,percent crude fat min group,percent crude fiber max group,percent met cal carb group,percent met cal fat group,percent met cal protein group,perianal disease,perturbation,phenotype,physical location,physical specimen location,physical specimen remaining,platform,platformchemistry,pool name,postfmt cdi result,postfmt symptoms,prebotic source,primer plate,project name,protein source,public,race code,replicate,required sample info status,run center,run date,run prefix,sample center,sample collection device or method,sample id,sample name,sample no ngs nr,sample plate,sample size,sample storage temperature,sample type,sample_code,sample_id,sampling_time,secondary description,separate first,separate first and donor,seq_meth,sequencing meth,sequencing method,sex,size grouping,source material identifiers,state,state us,strain,study,study center,study design,study id,study name,subject,subject code,submitted sample id,submitted subject id,submitter handle,target gene,target subfragment,target_gene,target_subfragment,taxon id,taxon_id,terminal ileum,time_point,time_point_label,time_point_months,timepoint,tissue,title,total mass,travel history,treatment_parasite,uc extent,unknown,vanc plus rif chaser,visit_num,weight group,weight kg,well id,year diagnosed
BioSample: SAMEA3914960,Homo sapiens,Accession: SAMEA3914960 ID: 5788191,1: IBD22; ANALYSIS AND CHARACTERISATION OF THE FAECAL MICROBIAL DEGRADOME IN INFLAMMATORY BOWEL DISEASE,SAMEA3914960,5788191,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2011,,,,,,,,,,,,,,,,,,,,,,,,Intestine,Colon,Faecal,,,,,,,,,,,,,,United Kingdom,,,,,,,,,,Faeces,,Inflammatory Bowel Disease,,,,,,,,human-gut,,,,,,,,,,,,metagenome,,,,,,51??31'03.3,,,,,,0??10'25.2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,IBD gut,,,,,,,,,,,,ERS1102094,,,,,,,,,,,,,,Illumina Miseq,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, 
BioSample: SAMEA3914951; SRA: ERS1102085,Homo sapiens,Accession: SAMEA3914951 ID: 5788190,2: IBD13; ANALYSIS AND CHARACTERISATION OF THE FAECAL MICROBIAL DEGRADOME IN INFLAMMATORY BOWEL DISEASE,SAMEA3914951,5788190,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2011,,,,,,,,,,,,,,,,,,,,,,,,Intestine,Colon,Faecal,,,,,,,,,,,,,,United Kingdom,,,,,,,,,,Faeces,,Inflammatory Bowel Disease,,,,,,,,human-gut,,,,,,,,,,,,metagenome,,,,,,51??31'03.3,,,,,,0??10'25.2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,IBD gut,,,,,,,,,,,,ERS1102085,,,,,,,,,,,,,,Illumina Miseq,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, 

Tested with Python 2.7.12
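For readers on Python 3, where `itertools.ifilter` no longer exists and CSV files are opened in text mode with `newline=''`, the same idea can be sketched as below. This is not the tested script above, only an illustrative rewrite: the function names `parse_records` and `write_csv` and the inline `SAMPLE` text are assumptions, header lines are assumed to start with a number followed by `: `, and a repeated attribute (such as the two `/host` lines in the second record) keeps only its last value.

```python
import csv
import io
import re

HEADER_RE = re.compile(r'\d+: ')                    # e.g. '12: IBD08; ...'
ATTRIBUTE_RE = re.compile(r'\s*/(.*?)="(.*)"')      # e.g. '    /sex="male"'
ACCESSION_RE = re.compile(r'Accession:\s*(\S+)\s+ID:\s*(\S+)')


def parse_records(lines):
    """Yield one dict per record from an iterable of text lines."""
    row = None
    want_description = False
    for raw in lines:
        line = raw.rstrip('\n')
        if not line.strip():
            continue                                # skip blank lines
        if HEADER_RE.match(line):
            if row is not None:
                yield row                           # previous record is complete
            row = {'id': line.strip()}
            want_description = False
            continue
        if row is None:
            continue                                # text before the first header
        attribute = ATTRIBUTE_RE.match(line)
        accession = ACCESSION_RE.match(line)
        if attribute:
            row[attribute.group(1)] = attribute.group(2)
        elif accession:
            row['Accession'], row['AccessionID'] = accession.groups()
            want_description = False
        elif line.startswith(('Identifiers:', 'Organism:')):
            key, _, value = line.partition(':')
            row[key] = value.strip()
        elif line.startswith('Description:'):
            want_description = True                 # the next text line is the description
        elif want_description:
            row['Description'] = line.strip()
            want_description = False
    if row is not None:
        yield row


def write_csv(records, out_file, restval=''):
    """Write records with one column per field seen in any record."""
    records = list(records)
    top = ['id', 'Identifiers', 'Organism', 'Description', 'Accession', 'AccessionID']
    extra = sorted({key for rec in records for key in rec} - set(top))
    writer = csv.DictWriter(out_file, fieldnames=top + extra, restval=restval)
    writer.writeheader()
    writer.writerows(records)


SAMPLE = '''12: IBD08; ANALYSIS AND CHARACTERISATION
Identifiers: BioSample: SAMEA3914946; SRA: ERS1102080
Organism: Homo sapiens
Attributes:
    /sample name="ERS1102080"
    /collection date="2011"
Description:
Multi 'omic analysis of the gut microbiome in IBD
Accession: SAMEA3914946 ID: 5788180
2: qiita_sid_833:833.Sweden.IBD.102A; 833.Sweden.IBD.102A
Identifiers: BioSample: SAMEA3924619; SRA: ERS1111753
Organism: gut metagenome
Attributes:
    /sample name="ERS1111753"
    /sex="male"
Description:
IBD1_Sweden_IBD_102A_H6_R.K.1.20.12
Accession: SAMEA3924619 ID: 5507372
'''

records = list(parse_records(SAMPLE.splitlines()))
buffer = io.StringIO()
write_csv(records, buffer)
print(buffer.getvalue())
```

To run it on a real file, something like `with open('projects.txt') as src, open('output.csv', 'w', newline='') as dst: write_csv(parse_records(src), dst)` should work. Because the column list is built from the parsed records themselves, a single pass over the input is enough, at the cost of holding all records in memory.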

+0

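Thanks for your help, Martin Evans. I quoted two examples from my text file; I have about 2,000 projects like these. Will this work for all of them? If a project does not have some of these fields, will the value be treated as NA or left blank?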

+0

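Currently, if a project contains an unseen field, the script will stop and tell you which field is missing. You can then simply copy it into 'headings_field'. Alternatively, you can tell it to ignore missing fields. Also, missing fields are currently left blank. If you want 'NA', pass `restval='NA'` as a parameter to `DictWriter()`.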
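The two `DictWriter` options mentioned in this comment can be illustrated with a small Python 3 sketch. The rows and field names are invented for the example; `restval` and `extrasaction` are standard `csv.DictWriter` parameters, and `lineterminator='\n'` is set only to keep the printed output simple.

```python
import csv
import io

fieldnames = ['id', 'Organism', 'sex']
rows = [
    {'id': 'IBD08', 'Organism': 'Homo sapiens'},                           # lacks 'sex'
    {'id': '833.Sweden.IBD.102A', 'sex': 'male', 'breed': 'Great_Dane'},   # has extra 'breed'
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=fieldnames,
    restval='NA',             # written for any field a row does not contain
    extrasaction='ignore',    # drop keys not in fieldnames instead of raising ValueError
    lineterminator='\n',
)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
# id,Organism,sex
# IBD08,Homo sapiens,NA
# 833.Sweden.IBD.102A,NA,male
```

With the default `extrasaction='raise'`, the second row would raise a `ValueError` naming the unexpected `breed` key, which matches the "script will stop" behaviour described above.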

+0

Thanks, Martin Evans. I will try the code. :)