2015-10-07 76 views

How to avoid splitting names on spaces in Python

I am trying to parse commodity pricing data from http://agmarknet.nic.in/ and store it in my database.

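I get the data in the form Ambala Cantt. 1.2 Bitter Gourd 1200 2000 1500, then I split it with split() and store it in the DB. But some names contain spaces, so split() also splits inside those names and breaks the row up as: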

['Ambala' ,'Cantt.', '1.2', 'Bitter', 'Gourd', '1200', '2000', '1500'] 

But I want it as:

['Ambala Cantt.', '1.2', 'Bitter Gourd', '1200', '2000', '1500'] 

I iterate over the data in a for-each loop and split each row. To solve this I tried a regular expression:

([c.strip() for c in re.match(r""" 
     (?P<market>[^0-9]+) 
     (?P<arrivals>[^ ]+) 
     (?P<variety>[^0-9]+) 
     (?P<min>[0-9]+) 
     \ (?P<max>[0-9]+) 
     \ (?P<modal>[0-9]+)""", 
     example, 
     re.VERBOSE 
    ).groups()]) 

The code block above works fine if I write example = "Ambala Cantt. 1.2 Bitter Gourd 1200 2000 1500", but if I put it inside the for-each loop over my rows, like this:

([c.strip() for c in re.match(r""" 
    (?P<market>[^0-9]+) 
    (?P<arrivals>[^ ]+) 
    (?P<variety>[^0-9]+) 
    (?P<min>[0-9]+) 
    \ (?P<max>[0-9]+) 
    \ (?P<modal>[0-9]+)""", 
    example, 
    re.VERBOSE 
).groups()]) 

I get an AttributeError: 'NoneType' object has no attribute 'groups'. My code looks like this:

# Python 2 code, as posted
import re
import httplib
import urllib
import unicodedata
from bs4 import BeautifulSoup as bs  # assumed import; the post only shows the name bs

params = urllib.urlencode({'cmm': 'Bitter gourd', 'mkt': '', 'search': ''})
headers = {'Cookie': 'ASPSESSIONIDCCRBQBBS=KKLPJPKCHLACHBKKJONGLPHE; ASP.NET_SessionId=kvxhkhqmjnauyz55ult4hx55; ASPSESSIONIDAASBRBAS=IEJPJLHDEKFKAMOENFOAPNIM','Origin': 'http://agmarknet.nic.in', 'Accept-Encoding': 'gzip, deflate', 'Accept-Language': 'en-GB,en-US;q=0.8,en;q=0.6','Upgrade-Insecure-Requests': '1','User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.93 Safari/537.36', 'Content-Type': 'application/x-www-form-urlencoded','Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'Cache-Control': 'max-age=0','Referer': 'http://agmarknet.nic.in/mark2_new.asp','Connection': 'keep-alive'}
conn = httplib.HTTPConnection("agmarknet.nic.in")
conn.request("POST", "/SearchCmmMkt.asp", params, headers)
response = conn.getresponse()
data = response.read()
soup = bs(data, "html.parser")
#print dir(soup)
z = []
y = []
w = []
x1 = []
test = []
trs = soup.findAll("tr")
# collect the normalized text of every table row
for tr in trs:
    c = unicodedata.normalize('NFKD', tr.text)
    y.append(str(c))
for x in y:
    #data1 = "Ambala 1.2 Onion 1200 2000 1500"
    # this is the call that raises the AttributeError when a row does not match
    x1 = ([c.strip() for c in re.match(r"""
        (?P<market>[^0-9]+)
        (?P<arrivals>[^ ]+)
        (?P<variety>[^0-9]+)
        (?P<min>[0-9]+)
        \ (?P<max>[0-9]+)
        \ (?P<modal>[0-9]+)""",
        x,
        re.VERBOSE
    ).groups()])
print x1

Can anyone help me get my data in the form ['Ambala Cantt.', '1.2', 'Bitter Gourd', '1200', '2000', '1500'] instead of ['Ambala', 'Cantt.', '1.2', 'Bitter', 'Gourd', '1200', '2000', '1500']?

I have updated my solution: shlex will do it for you – LetzerWille

Isn't the data really XML? Why not use an XML parser? – xtofl

Answers

Use the shlex module:

import re
import shlex

l = "Ambala Cantt. 1.2 Bitter Gourd 1200 2000 1500" 
# first put quotes around word pairs 
l = re.sub(r'([A-Z]\w+\s+\w+)',r'"\1"',l) 
# then split with shlex, it will not split inside the quoted strings 
l = shlex.split(l) 

['Ambala Cantt.', '1.2', 'Bitter Gourd', '1200', '2000', '1500'] 
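
One caveat with the pattern above: \w also matches digits, so a single-word name followed by a number (for example "Ambala 1.2 ...") would be quoted together with the start of that number. A minimal sketch of a stricter variant follows. The helper name split_row is hypothetical, and it assumes that names consist only of letters and dots; a name containing an apostrophe or other quote character would still confuse shlex.

```python
import re
import shlex

# Quote only runs of two or more purely alphabetic words (a trailing dot is
# allowed), so numbers are never pulled into a quoted name.
NAME_RUN = re.compile(r'([A-Za-z][A-Za-z.]*(?:\s+[A-Za-z][A-Za-z.]*)+)')


def split_row(line):
    """Split a row on whitespace while keeping multi-word names together."""
    quoted = NAME_RUN.sub(r'"\1"', line)
    # shlex.split does not split inside the quoted strings
    return shlex.split(quoted)


if __name__ == "__main__":
    print(split_row("Ambala Cantt. 1.2 Bitter Gourd 1200 2000 1500"))
    print(split_row("Ambala 1.2 Onion 1200 2000 1500"))
```

Single-word names are left unquoted here, so they fall through to the ordinary whitespace split.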

This can be run as a one-liner:

result = shlex.split(re.sub(r'([A-Z]\w+\s+\w+)',r'"\1"',"Ambala Cantt. 1.2 Bitter Gourd 1200 2000 1500"))

The OP doesn't want names to be split, so you shouldn't split 'Bitter Gourd' – FallenAngel

Thanks for pointing that out. Fixed. – LetzerWille
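
On the AttributeError in the question: re.match returns None when a string does not fit the pattern, and calling .groups() on None raises exactly that error. It seems likely that some rows of the scraped table (for instance a header row or an empty row) do not match, although that is an assumption about the page. A minimal sketch, written for Python 3 and using the hypothetical names ROW_PATTERN and parse_rows, that keeps the question's regex but skips rows that do not match:

```python
import re

# Same pattern as in the question, compiled once.
ROW_PATTERN = re.compile(r"""
    (?P<market>[^0-9]+)     # market name: everything before the first digit
    (?P<arrivals>[^ ]+)     # arrivals figure, up to the next space
    (?P<variety>[^0-9]+)    # variety name: everything before the next digit
    (?P<min>[0-9]+)
    \ (?P<max>[0-9]+)
    \ (?P<modal>[0-9]+)
    """, re.VERBOSE)


def parse_rows(rows):
    """Return one list of stripped fields per row that matches the pattern."""
    parsed = []
    for row in rows:
        match = ROW_PATTERN.match(row.strip())
        if match is None:
            # Row does not fit the pattern: skip it instead of calling
            # .groups() on None.
            continue
        parsed.append([field.strip() for field in match.groups()])
    return parsed


if __name__ == "__main__":
    sample = [
        "Market Arrivals Variety Minimum Maximum Modal",
        "Ambala Cantt. 1.2 Bitter Gourd 1200 2000 1500",
    ]
    for fields in parse_rows(sample):
        print(fields)
```

Collecting every parsed row in a list also avoids keeping only the last row, which is what assigning to a single variable inside the loop would do.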