2D Numpy string array with column access

I'm new to Python, Numpy and MatPlotLib.

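I want to create a 2D Numpy array from a CSV of mixed data types, but I will treat them all as strings. The killer is that I need to be able to access them with tuple indexing, such as [:, 5] to get the fifth column, or [5] to get the fifth row. Is there a way to do this? It seems to be a limitation of Numpy, due to the way memory access is computed.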

dataSet = np.loadtxt(open("adult.data.csv", "rb"), delimiter=" ,")
print dataSet[:, 4]  # <--- results in IndexError: Invalid Index

I have tried loadfromgen (genfromtxt), and I have tried dtype=str, dtype="a16" and dtype=object. Nothing works: either the data loads but without column access, or it does not load at all. GRRRR...

2016-01-22 unwrittenrainbow

Your delimiter is `" ,"`. Is that actually what separates the elements in each line of the input file? A space, *then* a comma? – user2357112

Yes, there is a space as well. I could use just a comma, but it doesn't matter. Edit: or at least I don't think it matters.... – unwrittenrainbow

What are `dataSet.shape` and `dataSet.dtype`? –

Simulating your file from the line in your comment, replicated several times (one string per line of the file):

In [8]: txt = b" 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K" 
In [9]: txt = [txt for _ in range(5)] 

In [10]: txt 
Out[10]: 
[b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K', 
b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K', 
b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K', 
b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K', 
b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K']

Load it with genfromtxt, using a comma as the delimiter. Let it choose the best dtype for each column:

In [12]: A=np.genfromtxt(txt, delimiter=',',dtype=None) 
In [13]: A 
Out[13]: 
array([ (39, b' State-gov', 77516, b' Bachelors', 13, b' Never-married', b' Adm-clerical', b' Not-in-family', b' White', b' Male', 2174, 0, 40, b' United-States', b' <=50K'), 
     (39, b' State-gov', 77516, b' Bachelors', 13, b' Never-married', b' Adm-clerical', b' Not-in-family', b' White', b' Male', 2174, 0, 40, b' United-States', b' <=50K'),...], 
     dtype=[('f0', '<i4'), ('f1', 'S10'), ('f2', '<i4'), ('f3', 'S10'), ('f4', '<i4'), ('f5', 'S14'), ('f6', 'S13'), ('f7', 'S14'), ('f8', 'S6'), ('f9', 'S5'), ('f10', '<i4'), ('f11', '<i4'), ('f12', '<i4'), ('f13', 'S14'), ('f14', 'S6')])

The result is a 1-D array of 5 elements with a compound dtype, with one named field per column:

In [14]: A.shape 
Out[14]: (5,) 
In [15]: A.dtype 
Out[15]: dtype([('f0', '<i4'), ('f1', 'S10'), ('f2', '<i4'), 
    ('f3', 'S10'), ('f4', '<i4'), ....])

Access a "column" by field name (not by column number):

In [16]: A['f4'] 
Out[16]: array([13, 13, 13, 13, 13])
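
If a positional index is still more convenient with the structured array, the field name can be looked up from `A.dtype.names`. The following is only a sketch: it reuses the sample line from above as a plain `str`, and the `encoding='utf-8'` argument and the variable names `lines`, `names`, `col4` and `row2` are assumptions added for illustration, not part of the session above.

```python
import numpy as np

# One sample line (same content as the simulated file above), repeated 5 times.
line = (" 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
        "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K")
lines = [line] * 5

# dtype=None lets genfromtxt pick a dtype per column, giving a 1-D structured array.
A = np.genfromtxt(lines, delimiter=',', dtype=None, encoding='utf-8')

names = A.dtype.names   # tuple of auto-generated field names: 'f0', 'f1', ...
col4 = A[names[4]]      # "column" 4, looked up by position; same as A['f4']
row2 = A[2]             # a single record holding all fields of the third row

print(col4)
print(row2[4])          # a field of one record can be indexed by position
```

Rows are indexed as usual (`A[2]`); only the columns need to go through the field names.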

Or load with dtype=str:

In [17]: A=np.genfromtxt(txt, delimiter=',',dtype=str) 
In [18]: A 
Out[18]: 
array([['39', ' State-gov', ' 77516', ' Bachelors', ' 13', 
     ' Never-married', ' Adm-clerical', ' Not-in-family', ' White', 
     ' Male', ' 2174', ' 0', ' 40', ' United-States', ' <=50K'], 
     ... 
     ' Male', ' 2174', ' 0', ' 40', ' United-States', ' <=50K']], 
     dtype='<U14') 
In [19]: A.dtype 
Out[19]: dtype('<U14') 
In [20]: A.shape 
Out[20]: (5, 15) 
In [21]: A[:,4] 
Out[21]: 
array([' 13', ' 13', ' 13', ' 13', ' 13'], 
     dtype='<U14')

Now it is a 2-D array with 15 columns, and it can be indexed by column number.
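
Note that the strings above keep the space that follows each comma. One way to drop it, sketched here under the assumption that the file looks like the sample line, is genfromtxt's `autostrip=True` option; a column of numeric strings can then be converted with `astype`. The names `lines` and `ages` and the `encoding='utf-8'` argument are illustrative additions.

```python
import numpy as np

# Same sample line as above, as a plain str, repeated 5 times.
line = (" 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
        "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K")
lines = [line] * 5

# autostrip=True removes the whitespace around every field.
A = np.genfromtxt(lines, delimiter=',', dtype=str, autostrip=True,
                  encoding='utf-8')

print(A.shape)    # 2-D: (rows, columns)
print(A[:, 4])    # one column, as strings without the leading space
print(A[4])       # one row

# A numeric column can still be converted when numbers are needed.
ages = A[:, 0].astype(int)
print(ages)
```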

With the wrong delimiter, it loads each line as a single string:

In [24]: A=np.genfromtxt(txt, delimiter=' ,',dtype=str) 
In [25]: A 
Out[25]: 
array([ '39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K', 
     ...], 
     dtype='<U127') 
In [26]: A.shape 
Out[26]: (5,)

This is a 1-D array with a long string dtype: each element is one whole line, so there is no column to index.

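A CSV file may load in various ways, some intentional, some not. You have to look at the results and try to understand them before blindly trying to index columns.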
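
As a rough illustration of that advice, the sketch below loads the same sample lines with both delimiters and inspects `ndim`, `shape` and `dtype` before indexing. The helper `describe` and the names `good` and `bad` are hypothetical, introduced only for this example.

```python
import numpy as np

def describe(arr):
    """Collect the properties worth checking before indexing columns."""
    return {"ndim": arr.ndim, "shape": arr.shape,
            "dtype": arr.dtype, "fields": arr.dtype.names}

line = (" 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
        "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K")
lines = [line] * 5

good = np.genfromtxt(lines, delimiter=',', dtype=str, encoding='utf-8')
bad = np.genfromtxt(lines, delimiter=' ,', dtype=str, encoding='utf-8')

print(describe(good))   # 2-D, so good[:, 4] is valid
print(describe(bad))    # 1-D, so there is no second axis to index

if bad.ndim == 2:
    print(bad[:, 4])
else:
    print("not a 2-D array; check the delimiter and dtype first")
```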

2016-01-23 02:34:56 hpaulj

Wow! Thank you so much for all this work. With it I was able to fix the problem, and now everything is happy. :) – unwrittenrainbow
