2016-07-07 92 views
2

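pandas: Merge on a ByteArray column

Any ideas on how to join two pandas DataFrames on a commonly named bytearray field? The field in the source (Teradata) is an actual ByteArray, and on the Teradata side it cannot be coerced to a character type or to anything usable outside of Teradata.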

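The Teradata export reads into pandas DataFrames beautifully. However, I cannot merge the two tables on the commonly named field (DatabaseId), because that field is a byte array.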

(Both pandas, as pd, and itertools are imported.)

When I try a simple merge:

merge1 = pd.merge(tvm, dbase, on="DatabaseId") 

I get the error:

TypeError: type object argument after * must be a sequence, not itertools.imap 

I searched Stack Overflow and found a similar problem for joining on a cell containing a collection, so I tried:

dbase['DBID'] = dbase.DatabaseId.apply(lambda r: type(sorted(r.iteritems()))) 

But I get the error:

AttributeError: 'bytearray' object has no attribute 'iteritems' 
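
A likely reason for this second error is that the linked approach was written for cells holding a dict, whereas a bytearray is a plain sequence of integers and has no `iteritems` method. The following is a minimal sketch, assuming the only goal is a hashable stand-in for each cell; the sample value is taken from the data in this question, and the `lookup` dictionary is purely illustrative.

```python
# A bytearray iterates over its integer byte values,
# so tuple() and bytes() can consume it directly.
raw = bytearray([2, 0, 243, 185])

as_tuple = tuple(raw)   # immutable and hashable
as_bytes = bytes(raw)   # also immutable and hashable

# Both forms can serve as dictionary keys, which a bytearray cannot.
lookup = {as_tuple: "tuple key", as_bytes: "bytes key"}
```

Either form could be produced per cell with `Series.apply`, for example `dbase.DatabaseId.apply(tuple)`.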

Update: sample data

The data is collected through pandas using:

dbase = pd.read_sql('select databaseid, databasename from ud812.dbase sample 10', conn) 
conn is a connection to a teradata database 

The data types coming out of Teradata are VARCHAR for all columns except:

DatabaseID = bytearray (Byte(4)) 
TVMID = bytearray (Byte(4)) 

>>> dbase.dtypes 
DatabaseId  object 
DatabaseName object 
dtype: object 
>>> dbase 
     DatabaseId   DatabaseName 
0 [2, 0, 243, 185] PCDW_CRS_BBCONV3_TB 
1 [2, 0, 168, 114]   PAMLIF_TB 
2 [2, 0, 133, 153]  PADW_PRESN_TB 
3 [2, 0, 29, 184]  CEDW_MOBILE_TB 
4 [2, 0, 190, 183] CEDW_MODEL_SCORE_TB 
5 [2, 0, 71, 55]   PBBBAM_TB 
6 [2, 0, 169, 183]   CEDW_OCC_TB 
7 [2, 0, 201, 183] CCDW_DGTL_DEAL_TB 
8 [0, 0, 139, 8]   PRECDSS_TB 
9 [2, 0, 142, 203]    CDBDW_TB 
>>> 
>>> 
>>> tvm.dtypes 
TVMId   object 
DatabaseId object 
TVMName  object 
TableKind  object 
CreateText object 
dtype: object 
>>> tvm 
         TVMId  DatabaseId      TVMName \ 
0 [230, 1, 41, 11, 0, 0] [2, 0, 67, 183]    JCP_03538_112002 
1 [214, 1, 60, 133, 0, 0] [2, 0, 186, 52]  STL_AUTHNCTD_RULE_EXECN 
2 [193, 2, 59, 48, 0, 0] [2, 0, 225, 150]  uye177_Xsell_EM_OPCL_TB2 
3 [0, 2, 235, 154, 0, 0] [2, 0, 244, 181] PL_CALCD_INVSTR_MTHLY_HIST_ST 
4 [255, 1, 131, 76, 0, 0] [2, 0, 110, 63]   IMH867_AVA0803_SNAP 
5 [125, 1, 217, 138, 0, 0] [2, 0, 237, 153]   FD_ACCT_STMT_ADR_ST 
6 [224, 0, 80, 233, 0, 0] [2, 0, 243, 127]    EXP_SRCH_RSLT_DESC 
7 [208, 1, 72, 15, 0, 0]  [2, 0, 8, 57]  SGI_PAY_DENIED_SEP_112012 
8 [246, 0, 27, 61, 0, 0] [2, 0, 143, 130]      CR_INDIVD 
9 [186, 1, 242, 167, 0, 0] [0, 0, 244, 18]     wzu448_sb_apps 

    TableKind           CreateText 
0   T            None 
1   V CREATE VIEW ... ... ... ... ... ... ... ... ... 
2   T            None 
3   V CREATE VIEW ... ... ... ... ... ... ... ... ... 
4   T            None 
5   V CREATE VIEW ... ... ... ... ... ... ... ... ... 
6   V CREATE VIEW ... ... ... ... ... ... ... ... ... 
7   V CREATE VIEW ... ... ... ... ... ... ... ... ... 
8   V CREATE VIEW ... ... ... ... ... ... ... ... ... 
9   T            None 

What is the type of "tvm"? Can you provide a sample of your data? –


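Well, you can use the FROM_BYTES function to convert a BYTE to a string. The syntax is ugly because you need LPAD (leading zeros are dropped), TRANSLATE (the result is Unicode), and CAST (LPAD returns a VARCHAR(32000)): CAST(TRANSLATE(LPAD(FROM_BYTES(tvmid,'Base16'),12,'0') USING unicode_to_latin) AS VARCHAR(12)), where 12 is twice the number of bytes. – dnoeth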
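
The same hex-string idea can also be applied on the pandas side after the data has been read. The sketch below is an assumption-laden illustration, not the commenter's method: `to_hex_key` is a hypothetical helper name, and the lowercase output is not guaranteed to match the casing Teradata's FROM_BYTES produces.

```python
import binascii


def to_hex_key(raw):
    """Encode a bytearray or bytes value as a hex string (hypothetical helper)."""
    # hexlify returns bytes; decode to get an ordinary text string.
    return binascii.hexlify(bytes(raw)).decode("ascii")


# Sample TVMId value copied from the data in the question (6 bytes).
tvm_id = bytearray([230, 1, 41, 11, 0, 0])
hex_key = to_hex_key(tvm_id)  # two hex characters per byte
```

A column converted this way, for example with `tvm['TVMId'].apply(to_hex_key)`, holds ordinary strings, which pandas can use as merge keys.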

Answer

1

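Convert your bytearray values to their immutable cousin, bytes. A bytes object is hashable, whereas a bytearray is mutable and not hashable, which is the likely reason the merge cannot use it as a key: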

import pandas as pd 

# Create your example `dbase` 
DatabaseId_dbase = list(map(bytearray, [[2, 0, 243, 185], [2, 0, 168, 114], 
    [2, 0, 133, 153], [2, 0, 29, 184], [2, 0, 190, 183], [2, 0, 71, 55], 
    [2, 0, 169, 183], [2, 0, 201, 183], [0, 0, 139, 8], [2, 0, 142, 203]])) 
DatabaseName = ['PCDW_CRS_BBCONV3_TB', 'PAMLIF_TB', 'PADW_PRESN_TB', 
    'CEDW_MOBILE_TB', 'CEDW_MODEL_SCORE_TB', 'PBBBAM_TB', 'CEDW_OCC_TB', 
    'CCDW_DGTL_DEAL_TB', 'PRECDSS_TB', 'CDBDW_TB'] 
dbase = pd.DataFrame({'DatabaseId': DatabaseId_dbase, 
         'DatabaseName': DatabaseName}) 

# Create your example `tvm` 
DatabaseId_tvm = list(map(bytearray, [[2, 0, 67, 183], [2, 0, 186, 52], 
    [2, 0, 225, 150], [2, 0, 244, 181], [2, 0, 110, 63], [2, 0, 237, 153], 
    [2, 0, 243, 127], [2, 0, 243, 185], [2, 0, 143, 130], [0, 0, 244, 18]])) 
TVMId = list(map(bytearray, [[230, 1, 41, 11, 0, 0], [214, 1, 60, 133, 0, 0], 
    [193, 2, 59, 48, 0, 0], [0, 2, 235, 154, 0, 0], [255, 1, 131, 76, 0, 0], 
    [125, 1, 217, 138, 0, 0], [224, 0, 80, 233, 0, 0], [208, 1, 72, 15, 0, 0], 
    [246, 0, 27, 61, 0, 0], [186, 1, 242, 167, 0, 0]])) 
TVMName = ['JCP_03538_112002', 'STL_AUTHNCTD_RULE_EXECN', 
    'uye177_Xsell_EM_OPCL_TB2', 'PL_CALCD_INVSTR_MTHLY_HIST_ST', 
    'IMH867_AVA0803_SNAP', 'FD_ACCT_STMT_ADR_ST', 'EXP_SRCH_RSLT_DESC', 
    'SGI_PAY_DENIED_SEP_112012', 'CR_INDIVD', 'wzu448_sb_apps'] 
TableKind = ['T', 'V', 'T', 'V', 'T', 'V', 'V', 'V', 'V', 'T'] 
tvm = pd.DataFrame({'DatabaseId': DatabaseId_tvm, 'TVMId': TVMId, 
        'TVMName': TVMName, 'TableKind': TableKind}) 

# This line would fail with the following error 
#  TypeError: type object argument after * must be a sequence, not map 
# merge = pd.merge(tvm, dbase, on='DatabaseId') 

# Apply the `bytes` constructor to the `bytearray` columns  
dbase['DatabaseId'] = dbase['DatabaseId'].apply(bytes) 
tvm['DatabaseId'] = tvm['DatabaseId'].apply(bytes) 
tvm['TVMId'] = tvm['TVMId'].apply(bytes) 

# Now it works! 
merge = pd.merge(tvm, dbase, on='DatabaseId') 

The resulting merge:

DatabaseId      TVMId     TVMName \ 
0 b'\x02\x00\xf3\xb9' b'\xd0\x01H\x0f\x00\x00' SGI_PAY_DENIED_SEP_112012 

    TableKind   DatabaseName 
0   V PCDW_CRS_BBCONV3_TB 

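(I had to change the DatabaseId field of one of your tvm rows, otherwise the merge would have been empty. I also left out the CreateText column, which was too awkward to reproduce on SO.)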
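
When several columns hold bytearray values, the per-column conversion can be wrapped in a small helper. This is a minimal sketch under stated assumptions: `bytearray_columns_to_bytes` is a hypothetical name and not part of pandas, and the `table_a` / `table_b` values are made up for illustration; only the DatabaseId bytes and the database name come from the question.

```python
import pandas as pd


def bytearray_columns_to_bytes(df):
    """Return a copy of df with every bytearray cell replaced by bytes.

    Hypothetical helper, not part of pandas.
    """
    out = df.copy()
    for col in out.columns:
        # Only object-dtype columns can hold bytearray values.
        if out[col].dtype == object:
            out[col] = out[col].apply(
                lambda v: bytes(v) if isinstance(v, bytearray) else v
            )
    return out


# Illustrative frames; the TVMName values are invented for this sketch.
left = pd.DataFrame({
    "DatabaseId": [bytearray([2, 0, 243, 185]), bytearray([2, 0, 168, 114])],
    "TVMName": ["table_a", "table_b"],
})
right = pd.DataFrame({
    "DatabaseId": [bytearray([2, 0, 243, 185])],
    "DatabaseName": ["PCDW_CRS_BBCONV3_TB"],
})

merged = pd.merge(
    bytearray_columns_to_bytes(left),
    bytearray_columns_to_bytes(right),
    on="DatabaseId",
)
```

The helper works on a copy, so the original frames keep their bytearray values in case they are needed elsewhere.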


Worked beautifully, thank you! –