2017-04-09 58 views
-3

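Hi. I'm running into a problem while trying to tokenize Facebook comments stored in CSV format. My CSV data is ready and I have already managed to read the file. I'm new to Python and data mining. Question about the tokenizer and a data type problem

I'm using Anaconda3; Python 3.5. (My CSV data has about 20K cols and 1 row.)

The code: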

import csv
from nltk import sent_tokenize, word_tokenize
with open('facebook_comments_samsung.csv', 'r') as f:
    reader = csv.reader(f)
    your_list = list(reader)  # one list per CSV row
print(your_list)

The result looks like this:


[['comment_message'], ['b"Yet again been told a pack of lies by Samsung Customer services who have lost my daughters phone and couldn\'t care less. ANYONE WHO PURCHASES ANYTHING FROM THIS COMPANY NEEDS THEIR HEAD TESTED"'], ["b'You cannot really blame an entire brand worldwide for a problem caused by a branch. It is a problem yes, but address your local problem branch'"], ["b'Haha!! Sorry if they lost your daughters phone but I will always buy Samsung products no matter what.'"], ["b'Salim Gaji BEST REPLIE EVER \\xf0\\x9f\\x98\\x8e'"], ["b'<3 Bewafa zarge <3 \\r\\n\\n \\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x93\\r\\n\\xf0\\x9f\\x8e\\xad\\xf0\\x9f\\x91\\x89 AQIB-BOT.ML \\xf0\\x9f\\x91\\x88\\xf0\\x9f\\x8e\\xadMANUAL\\xe2\\x99\\xaaKing.Bot\\xe2\\x84\\xa2 \\r\\n\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x94\\xe2\\x80\\x93\\xe2\\x80\\x94'"], ["b'\\xf0\\x9f\\x8c\\x90 LATIF.ML \\xf0\\x9f\\x8c\\x90'"], ['b"I\'m just waiting here patiently for you guys to say that you\'ll be releasing the s8 and s8+ a week early, for those who pre-ordered. Wishful thinking \\xf0\\x9f\\x98\\x86. Can\'t wait!"'], ['b"That\'s some good positive thinking there sir."'], ["b'(y) #NextIsNow #DoWhatYouCant'"], ["b'looking good'"], ['b"I\'ve always thought that when I first set eyes on my first born that I\'d like it to be on the screen of a cameraphone at arms length rather than eye-to-eye while holding my child. 
Thank you Samsung for improving our species."'], ["b'cool story'"], ["b'I believe so!'"], ["b'superb'"], ["b'Nice'"], ["b'thanks for the share'"], ["b'awesome'"], ["b'How can I talk to Samsung'"], ["b'Wow'"], ["b'#DoWhatYouCant siempre grandes innovadores Samsung Mobile'"], ["b'I had a problem with my s7 edge when I first got it all fixed now. However when I went to the Samsung shop they were useless and rude they refused to help and said there is nothing they could do no wonder the shop was dead quiet'"], ["b'Zeeshan Khan Masti Khel'"], ["b'I dnt had any problem wd my phn'"], ["b'I have maybe just had a bad phone to start with until it got fixed eventually. I had to go to carphone warehouse they were very helpful'"], ["b'awesome'"], ["b'Ch Shuja Uddin'"], ["b'akhheeerrr'"], ["b'superb'"], ["b'nice story'"], ["b'thanks for the share'"], ["b'superb'"], ["b'thanks for the share'"], ['b"On February 18th 2017 I sent my phone away to with a screen issue. The lower part of the screen was flickering bright white. The phone had zero physical damage to the screen\\n\\nI receive an email from Samsung Quotations with a picture of my SIM tray. Upon phoning I was told my SIM tray was stuck inside the phone and was handed a \\xc2\\xa392.14 repair bill. There is no way that my SIM tray was stuck in the phone as I removed my SIM and memory card before sending the phone away.\\n\\nAfter numerous calls I finally gave in and agreed to pay the \\xc2\\xa392.14 on the understanding that my screen repair would also be covered in this cost. This was confirmed to me by the person on the phone.\\n\\nOn 
  • Sorry that the output is so awkward to read. My bad.

To continue, I wrote:

tokens = [word_tokenize(i) for i in your_list]
for i in tokens:
    print(i)

print(tokens)

This is the part where I get the following error:

C:\Program Files\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _slices_from_text(self, text) in line 1278 TypeError: expected string or bytes-like object 

What I want to do next is:

import nltk 
en = nltk.Text(tokens) 

print(len(en.tokens)) 
print(len(set(en.tokens))) 
en.vocab() 
en.plot(50) 
en.count('galaxy s8') 

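Finally, I want to draw a wordcloud based on the data.

I realize that every second of your time is precious, and I'm very sorry to ask for your help. I've been working on this for several days and can't find a solution that fits my problem. Thank you for reading.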

+1

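Hi. Welcome to Python development. Please avoid titling your question like this. The title is too long and contains some unimportant information. Anyway, I won't downvote your question for now. I hope you get an answer soon. – fameman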

+0

@fameman Thx. I'm still getting used to Stack. I'll do my best, and thx for your advice. –

+0

Sorry, I noticed a mistake in my description of the raw data. The data consists of 20k rows and 1 column. Thx. –

Answers

0

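The error you're getting is because your CSV file is converted to a list of lists: one list for each row in the file. The file contains only one column, so each list has a single element: the string containing the message you want to tokenize. To get past the error, unpack the sublists by using this line instead: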

tokens = [word_tokenize(row[0]) for row in your_list] 
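To make the shape of the data concrete, here is a minimal sketch of what `csv.reader` hands back and why `row[0]` is needed. The in-memory sample file, the helper name `tokenize_rows`, and the `has_header` flag are assumptions made for illustration; `str.split` is only a placeholder tokenizer so the sketch runs without NLTK, and skipping the `comment_message` header row is an extra step that the line above does not take.

```python
import csv
import io

# Hypothetical stand-in for facebook_comments_samsung.csv: one column plus a header.
sample_csv = io.StringIO("comment_message\nlooking good\nthanks for the share\n")

# csv.reader yields one list per row, so a one-column file gives one-element lists.
rows = list(csv.reader(sample_csv))
print(rows)  # [['comment_message'], ['looking good'], ['thanks for the share']]


def tokenize_rows(rows, tokenizer, has_header=True):
    """Apply `tokenizer` to the first cell of every non-empty data row."""
    data = rows[1:] if has_header else rows
    return [tokenizer(row[0]) for row in data if row]


# str.split is a placeholder; pass nltk's word_tokenize here for real tokenization.
tokens = tokenize_rows(rows, str.split)
print(tokens)  # [['looking', 'good'], ['thanks', 'for', 'the', 'share']]
```

Passing the whole `row` (a list) instead of `row[0]` (a string) is what triggers `TypeError: expected string or bytes-like object` in the tokenizer.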

After that, you need to learn some more Python, and learn how to inspect your program and your variables.
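One thing such an inspection would turn up in the printed rows: each cell is a string that itself starts with `b'` or `b"`, which suggests the comments were written to the CSV as the text form of `bytes` objects. The following is a rough sketch of undoing that before tokenizing, assuming the cells really are valid bytes literals encoded as UTF-8; the helper name `decode_bytes_repr` is hypothetical.

```python
import ast


def decode_bytes_repr(cell):
    """Turn text like "b'...'" back into a normal str; return other text unchanged."""
    if cell[:2] in ("b'", 'b"'):
        try:
            value = ast.literal_eval(cell)  # safely parses the bytes literal
        except (ValueError, SyntaxError):
            return cell  # not a valid literal after all, so keep the raw text
        if isinstance(value, bytes):
            # Assumption: the original comments were UTF-8 encoded.
            return value.decode("utf-8", errors="replace")
    return cell


# One cell as it would sit in your_list (backslashes are literal characters here).
raw = "b'BEST REPLIE EVER \\xf0\\x9f\\x98\\x8e'"
print(decode_bytes_repr(raw))
print(decode_bytes_repr("looking good"))  # looking good
```

Without a step like this, the tokenizer would see the `b'` prefix and the `\x..` escape sequences as part of the comment text.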

+0

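Thanks for your advice; I'm grateful to have got past the error. However, I'm now stuck on an error saying "unhashable type: 'list'". My code is as follows: type(tokens) # list /// en = nltk.Text(tokens) /// print(len(en.tokens)) # 19904 /// en.plot(50) # error –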

+0

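@Alex, `tokens` is now a list of lists of words. `nltk.Text` needs a flat sequence of tokens. [Here](http://stackoverflow.com/q/952914/699305) is how to flatten your list of lists. PS. Please follow my advice and read some basic Python tutorials (then read the nltk book, which walks you step by step through how to do these tasks). – alexis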
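A minimal sketch of the flattening step, using sample data invented for illustration. `collections.Counter` stands in for the frequency counting behind `en.plot`, on the assumption that the error comes from trying to count whole sublists as if they were single tokens.

```python
from collections import Counter
from itertools import chain

# Hypothetical per-comment tokens, shaped like word_tokenize applied to each row.
tokens = [["looking", "good"], ["thanks", "for", "the", "share"], ["good"]]

# Counting the sublists directly fails: a list cannot be used as a dictionary key.
try:
    Counter(tokens)
except TypeError as err:
    print("TypeError:", err)

# Flatten the list of lists into one sequence of words.
flat_tokens = list(chain.from_iterable(tokens))
print(flat_tokens)

counts = Counter(flat_tokens)
print(counts.most_common(1))  # [('good', 2)]
```

With a flat list in hand, `nltk.Text(flat_tokens)` receives the sequence of word strings it expects.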