Split a string at uppercase letters

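What is the Pythonic way to split a string before each occurrence of a character from a given set?

For example, I want to split 'TheLongAndWindingRoad' at every occurrence of an uppercase letter (possibly except the first), and obtain ['The', 'Long', 'And', 'Winding', 'Road'].

Edit: It should also split single occurrences, i.e. from 'ABC' I would like to obtain ['A', 'B', 'C'].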

2010-02-17 Federico A. Ramponi

Unfortunately, it is not possible to split on a zero-width match in Python. But you can use re.findall instead:

>>> import re 
>>> re.findall('[A-Z][^A-Z]*', 'TheLongAndWindingRoad') 
['The', 'Long', 'And', 'Winding', 'Road'] 
>>> re.findall('[A-Z][^A-Z]*', 'ABC') 
['A', 'B', 'C']
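The limitation described above held for the Python versions available when this answer was written. Since Python 3.7, re.split also splits on empty matches, so a zero-width lookahead can be used directly. The following is a minimal sketch assuming Python 3.7 or later; the function name split_before_uppercase is a hypothetical name chosen for illustration.

```python
import re

def split_before_uppercase(text):
    # (?=[A-Z]) is a zero-width lookahead that matches just before an
    # uppercase ASCII letter; (?<!^) skips the position at the start of
    # the string so no empty first element is produced.
    # Assumes Python 3.7+, where re.split accepts empty matches.
    return re.split(r'(?<!^)(?=[A-Z])', text)

print(split_before_uppercase('TheLongAndWindingRoad'))
print(split_before_uppercase('ABC'))
```

Unlike the findall pattern, this keeps a leading lowercase run such as 'the' in 'theLongRoad', because nothing is discarded between split points.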


2010-02-17 00:04:44

Note that this drops any characters before the first uppercase letter. 'theLongAndWindingRoad' would result in ['Long', 'And', 'Winding', 'Road'] – 2016-07-14 13:44:03

@MarcSchulder: If you need that case, just use '[a-zA-Z][^A-Z]*' as the regex. – knub 2017-02-10 14:01:43
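To illustrate the two comments above, here is a short sketch comparing both patterns on an input that starts with a lowercase letter. The sample string and variable names are assumptions made for the example.

```python
import re

text = 'theLongAndWindingRoad'

# Pattern from the answer: every match must start with an uppercase
# letter, so the leading 'the' is skipped entirely.
upper_only = re.findall('[A-Z][^A-Z]*', text)

# Pattern from the comment: a match may start with any ASCII letter,
# so the leading lowercase run is kept as its own piece.
any_start = re.findall('[a-zA-Z][^A-Z]*', text)

print(upper_only)  # ['Long', 'And', 'Winding', 'Road']
print(any_start)   # ['the', 'Long', 'And', 'Winding', 'Road']
```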

import re 
filter(None, re.split("([A-Z][^A-Z]*)", "TheLongAndWindingRoad"))

or

[s for s in re.split("([A-Z][^A-Z]*)", "TheLongAndWindingRoad") if s]


2010-02-17 00:07:51 Gabe

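This filter is totally unnecessary and buys you nothing over a direct regex split with a capture group: [s for s in re.compile(r"([A-Z][^A-Z]*)").split("TheLongAndWindingRoad") if s] gives ['The', 'Long', 'And', 'Winding', 'Road'] – smci 2013-06-29 22:15:21

@smci: This usage of 'filter' is exactly the same as the list comprehension with a condition. Do you have anything against it? – Gabe 2013-06-30 04:18:33

I know it can be replaced by a list comprehension with a condition, because I had just posted that code, and then you copied it. Here are three reasons the list comprehension is preferable: a) Legible idiom: list comprehensions are a more Pythonic idiom and read more clearly left to right than 'filter(lambdaconditionfunc, ...)'. b) In Python 3, 'filter()' returns an iterator, so the two are not totally equivalent. c) I expect 'filter()' to be slower too. – smci 2013-07-01 08:17:07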
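The difference discussed in these comments can be seen in a small sketch. It assumes Python 3, and the variable names are chosen only for illustration. It also shows why filtering is needed at all: re.split with a capture group returns empty strings between adjacent matches.

```python
import re

pieces = re.split("([A-Z][^A-Z]*)", "TheLongAndWindingRoad")
# The raw result interleaves the captured words with empty strings.
print(pieces)

# In Python 3, filter() returns a lazy iterator rather than a list.
lazy = filter(None, pieces)

# The list comprehension builds a list immediately.
eager = [s for s in pieces if s]

# Both yield the same items once the iterator is consumed.
lazy_items = list(lazy)
print(lazy_items == eager)  # True
```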

>>> import re 
>>> re.findall('[A-Z][a-z]*', 'TheLongAndWindingRoad') 
['The', 'Long', 'And', 'Winding', 'Road'] 

>>> re.findall('[A-Z][a-z]*', 'SplitAString') 
['Split', 'A', 'String'] 

>>> re.findall('[A-Z][a-z]*', 'ABC') 
['A', 'B', 'C']

If you want "It'sATest" to split into ["It's", 'A', 'Test'], change the regex to "[A-Z][a-z']*"


2010-02-17 00:14:03

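+1: First to get ABC working. I have now updated my answer too. – 2010-02-17 00:19:27

>>> re.findall('[A-Z][a-z]*', "It's about 70% of the Economy") -> ['It', 'Economy'] – ChristopheD 2010-02-17 00:50:46

@ChristopheD. The OP doesn't say how non-alphabetic characters should be treated. – 2010-02-17 01:00:11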
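To make the point about non-alphabetic characters concrete, the sketch below runs two of the patterns from this page against a sentence containing punctuation and digits. The sample sentence is an assumption for illustration; which result is "right" depends on requirements the question leaves open.

```python
import re

text = "It's about 70% of the Economy"

# Only an uppercase letter followed by lowercase letters is kept;
# the apostrophe, digits, spaces and lowercase words are dropped.
strict = re.findall('[A-Z][a-z]*', text)

# Everything up to the next uppercase letter is kept with the piece.
loose = re.findall('[A-Z][^A-Z]*', text)

print(strict)  # ['It', 'Economy']
print(loose)   # ["It's about 70% of the ", 'Economy']
```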

An alternative solution (if you dislike explicit regexes):

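s = 'TheLongAndWindingRoad'

# Indices of all uppercase characters.
pos = [i for i, e in enumerate(s) if e.isupper()]

parts = []
for j in range(len(pos)):
    try:
        parts.append(s[pos[j]:pos[j + 1]])
    except IndexError:
        # Last uppercase letter: take everything to the end of the string.
        parts.append(s[pos[j]:])

print(parts)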


2010-02-17 00:37:13 ChristopheD

A variation on @ChristopheD's solution:

s = 'TheLongAndWindingRoad'

# The appended 'A' is a sentinel that marks the end of the last word.
pos = [i for i, e in enumerate(s + 'A') if e.isupper()]
parts = [s[pos[j]:pos[j + 1]] for j in range(len(pos) - 1)]

print(parts)


2010-02-17 02:01:39 pwdyson

Nice one - this also works for non-Latin characters. The regex solutions shown here do not. – AlexVhr 2013-02-03 07:43:02
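The observation about non-Latin characters can be checked with a short sketch. The Cyrillic sample string is an assumption chosen for illustration; str.isupper() is Unicode-aware, whereas the character class [A-Z] covers only ASCII letters.

```python
import re

s = 'ПриветМир'  # two capitalised Cyrillic words, assumed sample input

# The ASCII-only character class finds no uppercase letters here.
regex_parts = re.findall('[A-Z][^A-Z]*', s)

# The index-based approach relies on str.isupper(), which handles
# Unicode; the appended 'A' is the end-of-string sentinel.
pos = [i for i, e in enumerate(s + 'A') if e.isupper()]
index_parts = [s[pos[j]:pos[j + 1]] for j in range(len(pos) - 1)]

print(regex_parts)  # []
print(index_parts)  # ['Привет', 'Мир']
```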

Here is an alternative regex solution. The problem can be rephrased as "how do I insert a space before each uppercase letter, before doing the split":

>>> import re
>>> s = "TheLongAndWindingRoad ABC A123B45"
>>> re.sub(r"([A-Z])", r" \1", s).split() 
['The', 'Long', 'And', 'Winding', 'Road', 'A', 'B', 'C', 'A123', 'B45']

This has the advantage of preserving all non-whitespace characters, which most of the other solutions do not.
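As a hedged illustration of that claim, the sketch below compares the space-insertion approach with one of the findall patterns from this page on an input containing digits. The variable names are assumptions for the example.

```python
import re

s = 'A123B45'

# findall with [A-Z][a-z]* keeps only letters, so the digits are lost.
letters_only = re.findall('[A-Z][a-z]*', s)

# Inserting a space before each uppercase letter and splitting on
# whitespace keeps every non-whitespace character.
preserved = re.sub(r"([A-Z])", r" \1", s).split()

print(letters_only)  # ['A', 'B']
print(preserved)     # ['A123', 'B45']
```

One trade-off worth noting: because the final step splits on whitespace, any spaces already present in the input also act as split points.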


2010-02-17 08:19:04

src = 'TheLongAndWindingRoad' 
glue = ' ' 

result = ''.join(glue + x if x.isupper() else x for x in src).strip(glue).split(glue)


2014-07-07 11:04:03 user3726655

Could you please add an explanation of why this is a good solution to the problem? – 2014-07-07 11:22:35

Sorry. I forgot the last step – user3726655 2014-07-08 12:34:45
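Since the answer itself carries no explanation, here is one possible reading of the one-liner, unpacked into separate steps. The intermediate variable names are assumptions added for illustration; the logic follows the posted expression.

```python
src = 'TheLongAndWindingRoad'
glue = ' '

# Step 1: put the glue character in front of every uppercase letter.
marked = ''.join(glue + x if x.isupper() else x for x in src)
# marked == ' The Long And Winding Road'

# Step 2: remove the glue that ended up before the first letter.
trimmed = marked.strip(glue)

# Step 3: split on the glue to obtain the individual words.
result = trimmed.split(glue)

print(result)  # ['The', 'Long', 'And', 'Winding', 'Road']
```

This only works as intended when the glue character does not already occur in the input string.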

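Another approach without using regex or enumerate. I think it is clearer and simpler, without chaining too many methods or using a long list comprehension that can be hard to read: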

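word = 'TheLongAndWindingRoad'
chars = list(word)  # avoid shadowing the built-in name `list`

# Prefix every uppercase letter except the first character with a space.
# Iterating by index (instead of list.index) also handles repeated letters.
for i in range(1, len(chars)):
    if chars[i].isupper():
        chars[i] = ' ' + chars[i]

fin_list = ''.join(chars).split(' ')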


2014-12-07 06:48:06 PieOhPah

An alternative approach using enumerate and isupper()

Code:

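strs = 'TheLongAndWindingRoad'
ind = 0  # start index of the current word
new_lst = []
# Start at index 1 so the first character never triggers a split.
for index, val in enumerate(strs[1:], 1):
    if val.isupper():
        new_lst.append(strs[ind:index])
        ind = index
# Append the final word, which no later uppercase letter closes.
if ind < len(strs):
    new_lst.append(strs[ind:])
print(new_lst)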

Output:

['The', 'Long', 'And', 'Winding', 'Road']


2016-02-10 12:50:56 The6thSense

Another one without regex, with the option to keep contiguous uppercase letters together if wanted:

def split_on_uppercase(s, keep_contiguous=False):
    """Split s before each uppercase character.

    Args:
        s (str): string
        keep_contiguous (bool): flag to indicate we want to
            keep contiguous uppercase chars together

    Returns:
        list of str: the parts of s, in order
    """
    string_length = len(s)
    # True when the character at index i has a lowercase neighbour;
    # i is read from the enclosing loop at call time.
    is_lower_around = (lambda: s[i - 1].islower() or
                       string_length > (i + 1) and s[i + 1].islower())

    start = 0
    parts = []
    for i in range(1, string_length):
        if s[i].isupper() and (not keep_contiguous or is_lower_around()):
            parts.append(s[start:i])
            start = i
    parts.append(s[start:])

    return parts

>>> split_on_uppercase('theLongWindingRoad') 
['the', 'Long', 'Winding', 'Road'] 
>>> split_on_uppercase('TheLongWindingRoad') 
['The', 'Long', 'Winding', 'Road'] 
>>> split_on_uppercase('TheLongWINDINGRoadT', True) 
['The', 'Long', 'WINDING', 'Road', 'T'] 
>>> split_on_uppercase('ABC') 
['A', 'B', 'C'] 
>>> split_on_uppercase('ABCD', True) 
['ABCD'] 
>>> split_on_uppercase('') 
[''] 
>>> split_on_uppercase('hello world') 
['hello world']


2016-11-02 14:37:25 Totoro

This is possible with the more_itertools.split_before tool.

import more_itertools as mit 

iterable = "TheLongAndWindingRoad" 
[ "".join(i) for i in mit.split_before(iterable, lambda s: s.isupper())] 
# ['The', 'Long', 'And', 'Winding', 'Road']

It should also split single occurrences, i.e. from 'ABC' I would like to obtain ['A', 'B', 'C'].

iterable = "ABC" 
[ "".join(i) for i in mit.split_before(iterable, lambda s: s.isupper())] 
# ['A', 'B', 'C']

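more_itertools is a third-party package with 60+ useful tools, including implementations of all of the original itertools recipes, which saves you from implementing them by hand.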
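For readers who cannot install the package, the idea can be approximated with the standard library alone. The generator below is a simplified sketch written for this page, not the library's actual implementation; the name split_before_sketch is hypothetical.

```python
def split_before_sketch(iterable, pred):
    # Yield lists of items, starting a new list just before each item
    # for which pred(item) is true. Hypothetical stand-in for
    # more_itertools.split_before, written for illustration only.
    buf = []
    for item in iterable:
        if pred(item) and buf:
            yield buf
            buf = []
        buf.append(item)
    if buf:
        yield buf

words = ["".join(chunk)
         for chunk in split_before_sketch("TheLongAndWindingRoad", str.isupper)]
print(words)  # ['The', 'Long', 'And', 'Winding', 'Road']
```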


2017-08-24 17:20:59 pylang
