2016-11-23 75 views
0

Very long strings in a NumPy array in Python

I'm reading data from a CSV file using data = numpy.recfromtxt('table.csv', delimiter=';', dtype=str).

The table looks like this:

Name; Birthdate; Biography 
John; 1990; Lorem ipsum dolor sit amet, consectetur adipiscing elit. Hanc ergo intuens debet institutum illud quasi signum absolvere. Scrupulum, inquam, abeunti; Quae diligentissime contra Aristonem dicuntur a Chryippo. Quo tandem modo? 

Python and NumPy seem to have a problem with this long string. Any ideas how to solve this?

+3

What kind of _problems_ are you talking about? You should clarify a bit. – Lafexlos

+0

'recfromtxt' uses the more common 'genfromtxt'. The first line has 2 delimiters. The second has 3. How many fields do you expect? – hpaulj

Answers

1

You can use Python's pandas package.

Here is a simple idea of how to use it:

import pandas as pd 

data = pd.read_csv("file.csv", delimiter = ";") 

Hope this is what you want...

+0

What does this produce? – hpaulj

0

Please use the pandas package to read from the CSV:

import pandas as pd 
data = pd.read_csv('table.csv') 

pandas can handle long strings as well.

0

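I had no problem reading it, so perhaps your question is about formatting the text so that it is suitable for printing. Here are a couple of options.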

>>> import textwrap 
>>> a = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Hanc ergo intuens debet institutum illud quasi signum absolvere. Scrupulum, inquam, abeunti; Quae diligentissime contra Aristonem dicuntur a Chryippo. Quo tandem modo?" 
>>> txt = textwrap.wrap(a, width=70) 
>>> print(("{}\n"*len(txt)).format(*txt)) 
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Hanc ergo 
intuens debet institutum illud quasi signum absolvere. Scrupulum, 
inquam, abeunti; Quae diligentissime contra Aristonem dicuntur a 
Chryippo. Quo tandem modo? 

Or perhaps this one...

>>> txt2 = "\n".join([i for i in txt]) 
>>> print(txt2) 
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Hanc ergo 
intuens debet institutum illud quasi signum absolvere. Scrupulum, 
inquam, abeunti; Quae diligentissime contra Aristonem dicuntur a 
Chryippo. Quo tandem modo? 
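
As a possible third option, the standard library's textwrap.fill wraps and joins in one call. This is only a sketch: the sample string is a shortened copy of the text above, and the name wrapped is chosen here for illustration.

```python
import textwrap

# Shortened copy of the biography text, used only for illustration.
a = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit. Hanc ergo "
     "intuens debet institutum illud quasi signum absolvere.")

# fill() returns a single string with newlines already inserted,
# equivalent to "\n".join(textwrap.wrap(a, width=70)).
wrapped = textwrap.fill(a, width=70)
print(wrapped)
```
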
0

The error is:

In [67]: np.recfromtxt('stack40765849.txt', delimiter=';', dtype=str) 
--------------------------------------------------------------------------- 
ValueError        Traceback (most recent call last) 
<ipython-input-67-eab6d3192d4d> in <module>() 
----> 1 np.recfromtxt('stack40765849.txt', delimiter=';', dtype=str) 

/usr/lib/python3/dist-packages/numpy/lib/npyio.py in recfromtxt(fname, **kwargs) 
    1949  kwargs.setdefault("dtype", None) 
    1950  usemask = kwargs.get('usemask', False) 
-> 1951  output = genfromtxt(fname, **kwargs) 
    1952  if usemask: 
    1953   from numpy.ma.mrecords import MaskedRecords 
... 
ValueError: Some errors were detected ! 
    Line #2 (got 4 columns instead of 3) 

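(Note that recfromtxt uses genfromtxt, which is discussed a lot.)

The problem is not the length of the string, but the number of delimiters. The first line (a header?) has 2, which indicates that you want 3 columns or fields. But the second line has 3; the extra one is probably part of the text.

Reading the field names from the first line results in the same error.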
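
One way to confirm this before loading is to count the delimiters on each line. The following is a minimal sketch, not part of the original answer: the helper name delimiter_counts is made up for illustration, and the sample text is a shortened copy of the table from the question.

```python
import io

# Shortened copy of the table from the question (assumed sample data).
sample = io.StringIO(
    "Name; Birthdate; Biography\n"
    "John; 1990; Lorem ipsum dolor sit amet. Scrupulum, inquam, abeunti; Quae diligentissime\n"
)

def delimiter_counts(lines, delimiter=";"):
    """Return (line_number, delimiter_count) for each non-empty line."""
    return [
        (number, line.count(delimiter))
        for number, line in enumerate(lines, start=1)
        if line.strip()
    ]

counts = delimiter_counts(sample)
print(counts)  # [(1, 2), (2, 3)]
```

Any line whose count differs from the header's count is one that genfromtxt would reject.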

np.recfromtxt('stack40765849.txt', delimiter=';', dtype=str,names=True) 

pandas loads it as:

In [74]: data=pandas.read_csv('stack40765849.txt',delimiter=';') 
In [75]: data 
Out[75]: 
     Name           Birthdate \ 
John 1990 Lorem ipsum dolor sit amet, consectetur adipi... 

               Biography 
John Quae diligentissime contra Aristonem dicuntur... 

It does not give an error, but the result does not look right.

==================

If I change the ; inside the text to a . :

In [82]: np.genfromtxt('stack40765849_1.txt', delimiter=';', dtype=None, names=True) 
Out[82]: 
array((b'John', 1990, b' Lorem ipsum dolor sit amet, consectetur adipiscing elit. Hanc ergo intuens debet institutum illud quasi signum absolvere. Scrupulum, inquam, abeunti. Quae diligentissime contra Aristonem dicuntur a Chryippo. Quo tandem modo?'), 
     dtype=[('Name', 'S4'), ('Birthdate', '<i4'), ('Biography', 'S225')]) 

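I get a structured array (almost like a recarray) with 3 fields; the last one is very long and holds the full text. (b'...' denotes a byte string in Py3; it does not appear in the Py2 display.)

pandas produces something similar: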
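
To illustrate the byte string remark, here is a small sketch of how a Py3 bytes value relates to a regular string. It assumes the file is UTF-8 encoded, which the original post does not state; the names raw and text are illustrative.

```python
# A bytes value, as shown in the 'S4' field of the structured array above.
raw = b"John"

# Decoding turns it into a regular (unicode) str; UTF-8 is an assumption here.
text = raw.decode("utf-8")

print(repr(raw))   # b'John'
print(repr(text))  # 'John'
```
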

In [83]: data=pandas.read_csv('stack40765849_1.txt',delimiter=';') 
In [84]: data 
Out[84]: 
    Name Birthdate           Biography 
0 John  1990 Lorem ipsum dolor sit amet, consectetur adipi... 

Loading it properly as Py3 unicode:

In [91]: np.recfromtxt('stack40765849_1.txt', delimiter=';', dtype='U4,i,U255', names=True) 
Out[91]: 
rec.array(('John', 1990, ' Lorem ipsum dolor sit amet, consectetur adipiscing elit. Hanc ergo intuens debet institutum illud quasi signum absolvere. Scrupulum, inquam, abeunti. Quae diligentissime contra Aristonem dicuntur a Chryippo. Quo tandem modo?'), 
      dtype=[('Name', '<U4'), ('Birthdate', '<i4'), ('Biography', '<U255')]) 
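
If editing the file is not an option, one possible workaround is to split each line at most twice, so that any extra ; stays inside the last field. This is a sketch under the assumption that only the final column can contain the delimiter; the helper split_record and the shortened sample lines are hypothetical and not from the original post.

```python
def split_record(line, delimiter=";", n_fields=3):
    """Split a line into at most n_fields parts; extra delimiters stay in the last field."""
    parts = line.rstrip("\n").split(delimiter, n_fields - 1)
    return tuple(part.strip() for part in parts)

# Shortened sample lines modelled on the table in the question.
lines = [
    "Name; Birthdate; Biography",
    "John; 1990; Scrupulum, inquam, abeunti; Quae diligentissime contra Aristonem dicuntur a Chryippo.",
]

header = split_record(lines[0])
records = [split_record(line) for line in lines[1:]]

print(header)
print(records[0][2])
```

Every field comes back as a string, so the birthdate would still need an explicit int() conversion before numeric use.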