With:
import numpy as np
import h5py as h5
# Write: a variable-length string dataset plus a string attribute
file = h5.File('deleteme.hdf5','w')
dt = h5.special_dtype(vlen=str)   # variable-length unicode string dtype
dset = file.create_dataset("text",(3,),dtype=dt)
dset[:] = 'ø æ å'.split()
dset.attrs["1"] = "some text with ø, æ, å"
file.close()
# Read both back and print them
file = h5.File('deleteme.hdf5','r')
print(file['text'][:])
print(file['text'].attrs["1"])
file.close()
I see:
$ python3 stack44661467.py
['ø' 'æ' 'å']
some text with ø, æ, å
That is, h5py does see/interpret the strings as Unicode, both when writing and when reading.
With the dump utility:
$ h5dump deleteme.hdf5
HDF5 "deleteme.hdf5" {
GROUP "/" {
DATASET "text" {
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { (3)/(3) }
DATA {
(0): "\37777777703\37777777670", "\37777777703\37777777646",
(2): "\37777777703\37777777645"
}
ATTRIBUTE "1" {
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "some text with \37777777703\37777777670, \37777777703\37777777646, \37777777703\37777777645"
}
}
}
}
}
Note that in both cases the datatype is marked as UTF8:
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
That's what the docs say:
http://docs.h5py.org/en/latest/strings.html#variable-length-utf-8
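They can store any character a Python unicode string can store, with the exception of NULLs. In the file these are created as variable-length strings with character set H5T_CSET_UTF8.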
Let h5py (or another reader) worry about interpreting \37777777703\37777777670 as the proper Unicode character.
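To see why those escapes are harmless, here is a minimal sketch, not part of h5dump or h5py. It assumes the long octal numbers are single bytes printed as sign-extended 32-bit values, so only the low 8 bits matter (\37777777703 is 0xFFFFFFC3, i.e. byte 0xC3). The helper name decode_h5dump_escapes is hypothetical, and the regex assumes an escape is never immediately followed by a literal octal digit.

```python
import re

def decode_h5dump_escapes(text):
    """Turn h5dump-style octal escapes back into the UTF-8 text they encode.

    Assumption: each backslash-octal escape stands for one byte that was
    sign-extended before printing, so masking with 0xFF recovers the byte.
    """
    raw = bytearray()
    pos = 0
    for match in re.finditer(r'\\([0-7]+)', text):
        # Copy the plain text that precedes this escape unchanged.
        raw += text[pos:match.start()].encode('utf-8')
        # Keep only the low 8 bits of the octal value.
        raw.append(int(match.group(1), 8) & 0xFF)
        pos = match.end()
    raw += text[pos:].encode('utf-8')
    return raw.decode('utf-8')

print(decode_h5dump_escapes(r"\37777777703\37777777670"))
print(decode_h5dump_escapes(
    r"some text with \37777777703\37777777670, "
    r"\37777777703\37777777646, \37777777703\37777777645"))
```

Under that assumption the two bytes 0xC3 0xB8 are exactly 'ø'.encode('utf-8'), which matches the CSET H5T_CSET_UTF8 marking in the dump.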
Reading the characters with Python 3 'h5py' looks fine. I did use 'h5dump' to look at the output of your code. – hpaulj
'h5dump' also shows that the 'DATATYPE' of the string is 'CSET H5T_CSET_UTF8;' – hpaulj