Insert a newline every 10 characters in a string using Julia

I want to insert a newline every 10 characters in a protein sequence:

seq="MSKNKSPLLNESEKMMSEMLPMKVSQSKLNYEEKVYIPTTIRNRKQHCFRRFFPYIALFQ"

In Perl, this is easy:

$seq=~s/(.{10})/$1\n/g ; # does the job! 

perl -e '$seq="MSKNKSPLLNESEKMMSEMLPMKVSQSKLNYEEKVYIPTTIRNRKQHCFRRFFPYIALFQ"; $seq=~s/(.{10})/$1\n/g; print $seq' 
MSKNKSPLLN 
ESEKMMSEML 
PMKVSQSKLN 
YEEKVYIPTT 
IRNRKQHCFR 
RFFPYIALFQ

In Julia,

replace(seq, r"(.{10})" , "\n")

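does not work, because I don't know a way to get the capture group (.{10}) and substitute it with itself + "\n". Each 10-character match is simply replaced by a newline: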

julia> replace(seq, r"(.{10})" , "\n") 
"\n\n\n\n\n\n"

So, to do this, I need two steps:

julia> a=matchall(r"(.{1,10})" ,seq) 
    6-element Array{SubString{UTF8String},1}: 
    "MSKNKSPLLN" 
    "ESEKMMSEML" 
    "PMKVSQSKLN" 
    "YEEKVYIPTT" 
    "IRNRKQHCFR" 
    "RFFPYIALFQ" 

    julia> b=join(a, "\n") 
    "MSKNKSPLLN\nESEKMMSEML\nPMKVSQSKLN\nYEEKVYIPTT\nIRNRKQHCFR\nRFFPYIALFQ" 

    julia> println(b) 
    MSKNKSPLLN 
    ESEKMMSEML 
    PMKVSQSKLN 
    YEEKVYIPTT 
    IRNRKQHCFR 
    RFFPYIALFQ 

# Caution :  
a=matchall(r"(.{10})" ,seq) # wrong if seq is not exactly a multiple of 10 ! 

julia> seq 
"MSKNKSPLLNESEKMMSEMLPMKVSQSKLNYEEKVYIPTTIRNRKQHCFRRFFPYIAL" 

julia> matchall(r"(.{10})" ,seq) 
5-element Array{SubString{UTF8String},1}: 
"MSKNKSPLLN" 
"ESEKMMSEML" 
"PMKVSQSKLN" 
"YEEKVYIPTT" 
"IRNRKQHCFR" 

julia> matchall(r"(.{1,10})" ,seq) 
6-element Array{SubString{UTF8String},1}: 
"MSKNKSPLLN" 
"ESEKMMSEML" 
"PMKVSQSKLN" 
"YEEKVYIPTT" 
"IRNRKQHCFR" 
"RFFPYIAL"
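For readers on a recent Julia: `matchall` was removed in Julia 1.0, and `eachmatch` is the usual replacement. The following is only a sketch of the same two-step approach under that assumption; the variable names `chunks` and `wrapped` are illustrative and not from the original post.

```julia
# Sketch for Julia 1.x, where `matchall` is no longer available.
seq = "MSKNKSPLLNESEKMMSEMLPMKVSQSKLNYEEKVYIPTTIRNRKQHCFRRFFPYIAL"

# `eachmatch` yields RegexMatch objects; `.match` is the matched SubString.
# The pattern .{1,10} also keeps the final, shorter chunk.
chunks = [m.match for m in eachmatch(r".{1,10}", seq)]

# Joining puts a newline between chunks, so there is no trailing newline.
wrapped = join(chunks, "\n")
println(wrapped)
```

As with the `matchall` version, using `.{10}` instead of `.{1,10}` would silently drop the last chunk when the length is not a multiple of 10.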

Is there a one-step solution, or a better (faster) way?

Just for fun, here is a benchmark with all these interesting answers! (Updated with Julia 0.5)

function loop(a) 
last = 0 
#create the interval, in your case 10 
salt = 10 
#iterate in string (starts in the 10th value, don't forget julia use 1 to first index) 
for i in salt:salt+1:length(a) 
    # replace the string for a new one with '\n' 
    a = string(a[1:i], '\n', a[i+1:length(a)]) 
    last = Int64(i) 
end 
# replace the rest 
a = string(a[1:length(a) - last % salt + 1], '\n', a[length(a) - last % salt + 2:length(a)]) 
println(a) 
end 

function regex1(seq) 
    a=matchall(r"(.{1,10})" ,seq) 
    b=join(a, "\n") 
    println(b) 
end 

function regex2(seq) 
    a=join(split(replace(seq, r"(.{10})", s"\1 ")), "\n") 
    println(a) 
end 

function regex3(seq) 
    a=replace(seq, r"(.{10})", Base.SubstitutionString("\\1\n")) 
    a= chomp(a) # because there is a new line at the end 
    println(a) 
end 

function intrapad(seq::String) 
    buf = IOBuffer((length(seq)*11)>>3) # big enough buffer 
    for i=1:10:length(seq) 
        write(buf,SubString(seq,i,i+9),'\n') 
    end 
    #return 
    print(takebuf_string(buf)) 
end 

function join_substring(seq) 
    a=join((SubString(seq,i,i+9) for i=1:10:length(seq)),'\n') 
    println(a) 
end 

seq="MSKNKSPLLNESEKMMSEMLPMKVSQSKLNYEEKVYIPTTIRNRKQHCFRRFFPYIALFQ" 

for i = 1:5 
    println("loop :") 
    @time loop(seq) 
    println("regex1 :") 
    @time regex1(seq) 
    println("regex2 :") 
    @time regex2(seq) 
    println("regex3 :") 
    @time regex3(seq) 
    println("intrapad :") 
    @time intrapad(seq) 
    println("join substring :") 
    @time join_substring(seq) 
end

I changed the benchmark to run @time 5 times, and I post here the results of the fifth @time run:

loop : 
MSKNKSPLLN 
ESEKMMSEML 
PMKVSQSKLN 
YEEKVYIPTT 
IRNRKQHCFR 
RFFPYIA 
LFQ 
    0.000013 seconds (53 allocations: 3.359 KB) 
regex1 : 
MSKNKSPLLN 
ESEKMMSEML 
PMKVSQSKLN 
YEEKVYIPTT 
IRNRKQHCFR 
RFFPYIALFQ 
    0.000013 seconds (49 allocations: 1.344 KB) 
regex2 : 
MSKNKSPLLN 
ESEKMMSEML 
PMKVSQSKLN 
YEEKVYIPTT 
IRNRKQHCFR 
RFFPYIALFQ 
    0.000017 seconds (47 allocations: 1.703 KB) 
regex3 : 
MSKNKSPLLN 
ESEKMMSEML 
PMKVSQSKLN 
YEEKVYIPTT 
IRNRKQHCFR 
RFFPYIALFQ 
    0.000013 seconds (31 allocations: 976 bytes) 
intrapad : 
MSKNKSPLLN 
ESEKMMSEML 
PMKVSQSKLN 
YEEKVYIPTT 
IRNRKQHCFR 
RFFPYIALFQ 
    0.000007 seconds (9 allocations: 608 bytes) 
join substring : 
MSKNKSPLLN 
ESEKMMSEML 
PMKVSQSKLN 
YEEKVYIPTT 
IRNRKQHCFR 
RFFPYIALFQ 
    0.000012 seconds (21 allocations: 800 bytes)

intrapad is now first ;)

2016-11-11 Fred

Not sure about another solution, but the 2 steps can be turned into a one-liner like this: `seq = "MSKNKSPLLNESEKMMSEMLPMKVSQSKLNYEEKVYIPTTIRNRKQHCFRRFFPYIALFQ";` `println(join(matchall(r"(.{10})", seq), "\n"));` – AbhiNickz

So I checked the docs again: `println(replace("ABHISHEKBHASKERMSEMLPMKVSQSKLNYEEKVYIPTTIRNRKQHCFRRFFPYIALFQ", r"(.{10})", s"a\g<0>sss"));` Here, if I replace sss with \n this should work, but according to the docs you "refer to the nth capture group by using \n", and that is the problem here. – AbhiNickz

Yes, @AbhiNickz, replace(seq, r"(.{10})", s"\g<0>\n") produces an error, but inserting a blank is a good solution: replace(seq, r"(.{10})", s"\g<0> ") is OK – Fred

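As @daycaster suggested, you can use s"\1" as a replacement string that supports capture groups. The problem is that the special s"" string syntax does not support special characters such as \n. You can work around this by constructing the SubstitutionString object manually, but then you need to escape the \ in \1: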

julia> replace(seq, r"(.{10})", Base.SubstitutionString("\\1\n")) 
"MSKNKSPLLN\nESEKMMSEML\nPMKVSQSKLN\nYEEKVYIPTT\nIRNRKQHCFR\nRFFPYIALFQ\n"
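On Julia 1.x the three-argument form of `replace` is gone and the pattern and replacement are passed as a `Pair`. The sketch below assumes a 1.x release and shows the same hand-built substitution string next to a function replacement; the names `wrapped` and `wrapped_fn` are illustrative only.

```julia
seq = "MSKNKSPLLNESEKMMSEMLPMKVSQSKLNYEEKVYIPTTIRNRKQHCFRRFFPYIALFQ"

# Pair form of `replace`. The substitution string is built by hand so that
# "\n" is a real newline while "\\1" still refers to capture group 1.
wrapped = replace(seq, r"(.{10})" => Base.SubstitutionString("\\1\n"))

# A function replacement sidesteps the escaping question: the function
# receives each matched substring and returns its replacement.
wrapped_fn = replace(seq, r".{10}" => (m -> m * "\n"))

# Both results end with a newline when the length is a multiple of 10,
# so `chomp` removes it before printing.
println(chomp(wrapped))
```

The function form is arguably easier to read, at the cost of one small string allocation per match.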

2016-11-11 16:34:22

Something like:

julia> split(replace(seq, r"(.{10})", s"\1 ")) 
6-element Array{SubString{String},1}: 
"MSKNKSPLLN" 
"ESEKMMSEML" 
"PMKVSQSKLN" 
"YEEKVYIPTT" 
"IRNRKQHCFR" 
"RFFPYIALFQ"

If you want the result as a single string, use join():

julia> join(split(replace(seq, r"(.{10})", s"\1 ")), "\n") 
"MSKNKSPLLN\nESEKMMSEML\nPMKVSQSKLN\nYEEKVYIPTT\nIRNRKQHCFR\nRFFPYIALFQ" 

julia> println(ans) 
MSKNKSPLLN 
ESEKMMSEML 
PMKVSQSKLN 
YEEKVYIPTT 
IRNRKQHCFR 
RFFPYIALFQ

2016-11-11 11:58:46 daycaster

The result is an array, just like matchall(r"(.{10})", seq) – Fred

This will fail for strings that contain spaces. –

@MattB. My protein sequences contain spaces? So that's why I'm always hungry...! – daycaster

I don't know how you could do this with a regex, but I think this solves your problem:

a = "oiaoueaoeuaoeuaoeuaoeuaoteuhasonetuhaonetuahounsaothunsaotuaosu" 
last = 0 
#create the interval, in your case 10 
salt = 10 
#iterate in string (starts in the 10th value, don't forget julia use 1 to first index) 
for i in salt:salt+1:length(a) 
    # replace the string for a new one with '\n' 
    a = string(a[1:i], '\n', a[i+1:length(a)]) 
    last = Int64(i) 
end 
# replace the rest 
a = string(a[1:length(a) - last % salt + 1], '\n', a[length(a) - last % salt + 2:length(a)]) 
println(a)

2016-11-11 14:15:16 pmargreff

More readable than the Perl version :) – daycaster

This will fail for strings that contain non-ASCII characters. –

@MattB. How can I fix it? – pmargreff
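One possible way to address the non-ASCII concern, offered only as a sketch for Julia 1.x: iterate over characters instead of indexing by byte position. The function name `wrap_chars` and its `width` parameter are hypothetical and do not appear in the thread.

```julia
# Hypothetical helper: insert a newline after every `width` characters,
# counting characters rather than bytes, so multi-byte UTF-8 text is safe.
function wrap_chars(seq::AbstractString, width::Integer=10)
    buf = IOBuffer()
    for (n, c) in enumerate(seq)
        # Start a new line before characters width+1, 2*width+1, ...
        if n > 1 && (n - 1) % width == 0
            print(buf, '\n')
        end
        print(buf, c)
    end
    return String(take!(buf))
end

println(wrap_chars("αβγδεζηθικλμ", 5))
```

Because the newline is written before the next chunk rather than after the current one, this version leaves no trailing newline, and it does not need a separate step for the remainder.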

If speed is a concern, it may be best to avoid heavier tools such as regular expressions and try something like this:

function intrapad(seq::String) 
    buf = IOBuffer((length(seq)*11)>>3) # big enough buffer 
    for i=1:10:length(seq) 
        write(buf,SubString(seq,i,i+9),'\n') 
    end 
    return takebuf_string(buf) 
end

The speed comes from using an IOBuffer and SubStrings to minimize allocations. Using the BenchmarkTools package we get:

julia> @benchmark intrapad(seq) 
BenchmarkTools.Trial: 
    memory estimate: 624.00 bytes 
    allocs estimate: 10 
    minimum time:  729.00 ns (0.00% GC) 
    median time:  767.00 ns (0.00% GC) 
    mean time:  862.99 ns (7.84% GC) 
    maximum time:  26.86 μs (96.21% GC) 

julia> @benchmark replace(seq, r"(.{10})", Base.SubstitutionString("\\1\n")) 
BenchmarkTools.Trial: 
    memory estimate: 720.00 bytes 
    allocs estimate: 26 
    minimum time:  2.18 μs (0.00% GC) 
    median time:  2.29 μs (0.00% GC) 
    mean time:  2.43 μs (3.85% GC) 
    maximum time:  531.31 μs (98.95% GC)

Only about a 2.5x speedup. The replace function is well implemented!
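The `intrapad` function above is written for Julia 0.5. A rough adaptation for Julia 1.x might look like the sketch below, assuming an ASCII-only sequence: `takebuf_string` became `String(take!(buf))`, the buffer size hint is passed as a keyword, and the end index is clamped because, as far as I know, current `SubString` raises a `BoundsError` on an out-of-range end index instead of clipping it. The name `intrapad_modern` and the `width` parameter are illustrative.

```julia
# Sketch of intrapad for Julia 1.x; assumes ASCII input (one byte per character).
function intrapad_modern(seq::String, width::Int=10)
    n = ncodeunits(seq)
    # Room for the sequence plus one newline per chunk.
    buf = IOBuffer(sizehint = n + cld(n, width))
    for i in 1:width:n
        # Clamp the end index so the last, shorter chunk stays in bounds.
        write(buf, SubString(seq, i, min(i + width - 1, n)), '\n')
    end
    return String(take!(buf))
end

print(intrapad_modern("MSKNKSPLLNESEKMMSEMLPMKVSQSKLNYEEKVYIPTTIRNRKQHCFRRFFPYIALFQ"))
```

Like the original, this writes a newline after every chunk, including the last one, so the result ends with a trailing newline.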

Another way to go without regular expressions is

join((SubString(seq,i,i+9) for i=1:10:length(seq)),'\n')

This is not as fast (10x slower on my machine, without the memory allocation penalty), but it is very readable.

2016-11-12 09:39:42

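These functions only work for ASCII strings, because they rely on byte-per-character indexing. But for something like genomic sequences that should be fine (or, in future versions of Julia, check the string) –

The last join example gives the following error: ERROR: LoadError: syntax: missing separator in tuple –

This is a version issue. The example works in 0.5. Are you on 0.4? –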
