2016-03-22

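I have been trying, without success, to run the file blei_lda.py from Chapter 4 of Building Machine Learning Systems with Python. I am using Python 2.7 with the Enthought Canopy GUI. Below is the actual file provided by the authors, and there are also multiple copies on GitHub. Can someone explain the unsupported operand error I get when running blei_lda.py from Building Machine Learning Systems with Python?

GitHub repository

The problem is that I keep getting this error: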

TypeError         Traceback (most recent call last) 
c:\users\matt\desktop\pythonprojects\pml\ch04\blei_lda.py in <module>() 
    for ti in range(model.num_topics): 
     words = model.show_topic(ti, 64) 
------>tf = sum(f for f, w in words) 
     with open('topics.txt', 'w') as output: 
     output.write('\n'.join('{}:{}'.format(w, int(1000. * f/tf)) for f, w in words)) 
     output.write("\n\n\n") 

TypeError: unsupported operand type(s) for +: 'int' and 'unicode' 

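I have tried to build a workaround, but could not find anything that works completely.

I have also searched the web and Stack Overflow for a solution, but I seem to be the only one having trouble running this file.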

# This code is supporting material for the book 
# Building Machine Learning Systems with Python 
# by Willi Richert and Luis Pedro Coelho 
# published by PACKT Publishing 
# 
# It is made available under the MIT License 

from __future__ import print_function 
from wordcloud import create_cloud 
try: 
    from gensim import corpora, models, matutils 
except: 
    print("import gensim failed.") 
    print() 
    print("Please install it") 
    raise 

import matplotlib.pyplot as plt 
import numpy as np 
from os import path 

NUM_TOPICS = 100 

# Check that data exists 
if not path.exists('./data/ap/ap.dat'): 
    print('Error: Expected data to be present at data/ap/') 
    print('Please cd into ./data & run ./download_ap.sh') 

# Load the data 
corpus = corpora.BleiCorpus('./data/ap/ap.dat', './data/ap/vocab.txt') 

# Build the topic model 
model = models.ldamodel.LdaModel(
    corpus, num_topics=NUM_TOPICS, id2word=corpus.id2word, alpha=None) 

# Iterate over all the topics in the model 
for ti in range(model.num_topics): 
    words = model.show_topic(ti, 64) 
    tf = sum(f for f, w in words) 
    with open('topics.txt', 'w') as output: 
     output.write('\n'.join('{}:{}'.format(w, int(1000. * f/tf)) for f, w in words)) 
     output.write("\n\n\n") 

# We first identify the most discussed topic, i.e., the one with the 
# highest total weight 

topics = matutils.corpus2dense(model[corpus], num_terms=model.num_topics) 
weight = topics.sum(1) 
max_topic = weight.argmax() 


# Get the top 64 words for this topic 
# Without the argument, show_topic would return only 10 words 
words = model.show_topic(max_topic, 64) 

# This function will actually check for the presence of pytagcloud and is otherwise a no-op 
create_cloud('cloud_blei_lda.png', words) 

num_topics_used = [len(model[doc]) for doc in corpus] 
fig,ax = plt.subplots() 
ax.hist(num_topics_used, np.arange(42)) 
ax.set_ylabel('Nr of documents') 
ax.set_xlabel('Nr of topics') 
fig.tight_layout() 
fig.savefig('Figure_04_01.png') 


# Now, repeat the same exercise using alpha=1.0 
# You can edit the constant below to play around with this parameter 
ALPHA = 1.0 

model1 = models.ldamodel.LdaModel(
    corpus, num_topics=NUM_TOPICS, id2word=corpus.id2word, alpha=ALPHA) 
num_topics_used1 = [len(model1[doc]) for doc in corpus] 

fig,ax = plt.subplots() 
ax.hist([num_topics_used, num_topics_used1], np.arange(42)) 
ax.set_ylabel('Nr of documents') 
ax.set_xlabel('Nr of topics') 

# The coordinates below were fit by trial and error to look good 
ax.text(9, 223, r'default alpha') 
ax.text(26, 156, 'alpha=1.0') 
fig.tight_layout() 
fig.savefig('Figure_04_02.png') 

Answer


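On this line: words = model.show_topic(ti, 64), words is a list of (unicode, float64) tuples,

for example [(u'school', 0.029515796999228502), (u'prom', 0.018586355008452897)]

So on the line tf = sum(f for f, w in words), f is bound to the unicode string and w to the float value. You are therefore trying to sum unicode values. sum() starts from the integer 0, so the first addition is 0 + u'school', which raises the unsupported operand type error for 'int' and 'unicode'.

Change this line to tf = sum(f for w, f in words) so that it sums the float values instead.

For the same reason, change this line to output.write('\n'.join('{}:{}'.format(w, int(1000. * f/tf)) for w, f in words))

The code snippet will then look like this: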
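The failure can be reproduced without gensim or the AP data. The following is a minimal sketch that uses only the two sample tuples quoted above; the variable name error_message is introduced here for illustration. It should behave the same way on Python 3, where the message names 'str' instead of 'unicode'.

```python
# Sample (word, weight) pairs in the order quoted in this answer
words = [(u'school', 0.029515796999228502), (u'prom', 0.018586355008452897)]

# Unpacking as in the original script: f receives the word, w the weight,
# so sum() tries to evaluate 0 + u'school' and fails.
error_message = None
try:
    tf = sum(f for f, w in words)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Unpacking in the order the tuples actually use: f now receives the weight.
tf = sum(f for w, f in words)
print(tf)
```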

for ti in range(model.num_topics):
    words = model.show_topic(ti, 64)
    tf = sum(f for w, f in words)
    with open('topics.txt', 'w') as output:
        output.write('\n'.join('{}:{}'.format(w, int(1000. * f/tf)) for w, f in words))
        output.write("\n\n\n")
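The book's script unpacks the tuples as (weight, word) while the installed gensim returns (word, weight), which suggests the order differs between gensim versions; that is an assumption here, not something confirmed in this thread. If the script has to run against either order, one possible sketch is to normalise the pairs before formatting them. The helper names as_word_weight_pairs and format_topic are hypothetical and not part of gensim or the book's code.

```python
from numbers import Number


def as_word_weight_pairs(topic):
    """Return (word, weight) pairs whichever order the input tuples use."""
    pairs = []
    for first, second in topic:
        if isinstance(first, Number):
            # Input was (weight, word)
            pairs.append((second, float(first)))
        else:
            # Input was (word, weight)
            pairs.append((first, float(second)))
    return pairs


def format_topic(topic):
    """Format one topic as 'word:score' lines, scores scaled to sum to about 1000."""
    pairs = as_word_weight_pairs(topic)
    total = sum(weight for _, weight in pairs)
    return '\n'.join(
        '{}:{}'.format(word, int(1000. * weight / total))
        for word, weight in pairs
    )


# Both orderings give the same text
print(format_topic([(u'a', 0.75), (u'b', 0.25)]))
print(format_topic([(0.75, u'a'), (0.25, u'b')]))
```

In the script, format_topic(model.show_topic(ti, 64)) could then replace the join expression. Note also that opening topics.txt with mode 'w' inside the loop truncates the file on every iteration, so only the last topic survives; opening the file once before the loop would keep all of them, if that is what is wanted.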