2016-11-27 101 views
-1

IndexError: index 5688 is out of bounds for axis 0 with size 3706

I am practicing building a recommender system. I took the data from http://grouplens.org/datasets/movielens/.

import numpy as np 
import pandas as pd 
header = ['user_id', 'item_id', 'rating', 'timestamp'] 
df = pd.read_csv('ml-1m/ratings.dat', sep='::', names=header) 
n_users = df.user_id.unique().shape[0] 
n_items = df.item_id.unique().shape[0] 
print ('Number of users = ' + str(n_users) + ' | Number of movies = ' + str(n_items)) 

Number of users = 6040 | Number of movies = 3706

from sklearn import cross_validation as cv 
train_data, test_data = cv.train_test_split(df, test_size=0.25) 

Then I try to build two user-item matrices, one for training and one for testing:

train_data_matrix = np.zeros((n_users, n_items)) 
for line in train_data.itertuples(): 
    train_data_matrix[line[1]-1, line[2]-1] = line[3] 

test_data_matrix = np.zeros((n_users, n_items)) 
for line in test_data.itertuples(): 
    test_data_matrix[line[1]-1, line[2]-1] = line[3] 

and I get this error (full traceback):

IndexError        Traceback (most recent call last) 
<ipython-input-39-180dea01cdf8> in <module>() 
     2 train_data_matrix = np.zeros((n_users, n_items)) 
     3 for line in train_data.itertuples(): 
----> 4  train_data_matrix[line[1]-1, line[2]-1] = line[3] 
     5 
     6 test_data_matrix = np.zeros((n_users, n_items)) 

IndexError: index 5688 is out of bounds for axis 0 with size 3706 

What is wrong?

P.S.

train_data.head() 
     user_id item_id rating  timestamp 
483019 2968 2268 5  971107926 
943582 5689 3615 3  963719230 
116153 752  1147 5  975458000 
103250 686  1704 5  975601762 
235333 1425 3752 4  1023560349 

P.P.S.

for line in train_data.itertuples(): 
    print (line) 
Pandas(Index=483019, user_id=2968, item_id=2268, rating=5, timestamp=971107926) 
Pandas(Index=943582, user_id=5689, item_id=3615, rating=3, timestamp=963719230) 
Pandas(Index=116153, user_id=752, item_id=1147, rating=5, timestamp=975458000) 
Pandas(Index=103250, user_id=686, item_id=1704, rating=5, timestamp=975601762) 

Answer

0

The error message tells us that train_data_matrix has shape (3706, N), while line[1]-1 is 5688.

IndexError: index 5688 is out of bounds for axis 0 with size 3706 
train_data_matrix[line[1]-1, line[2]-1] = line[3] 
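The same kind of failure can be reproduced on a toy array. This is only a sketch: the 3-row `matrix` below is a hypothetical stand-in for the 3706-row axis in the traceback, not the asker's data.

```python
import numpy as np

# Toy matrix with 3 rows, standing in for the 3706-row axis in the traceback.
matrix = np.zeros((3, 4))

try:
    # Valid row indices are 0, 1 and 2, so row 5 does not exist.
    matrix[5, 0] = 1.0
except IndexError as exc:
    message = str(exc)
    print(message)
```

NumPy reports the offending index, the axis, and the size of that axis, which is how the shape of train_data_matrix can be read off the error above.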

So the question is: why is line[1] equal to 5689? Or, in the larger context, why does train_data.itertuples() produce a line with a value this large?

I wonder whether you should be using this instead:

train_data_matrix[line[0]-1, line[1]-1] 

I'm not familiar with itertuples. What are the elements of line? What is the full shape of train_data?
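Going by the itertuples output shown in the question, the elements of line can be checked on a small frame. The sketch below rebuilds two rows from the train_data.head() output; the variable names `df` and `first` are made up for illustration.

```python
import pandas as pd

# Two rows copied from the train_data.head() output in the question.
df = pd.DataFrame(
    {
        "user_id": [2968, 5689],
        "item_id": [2268, 3615],
        "rating": [5, 3],
        "timestamp": [971107926, 963719230],
    },
    index=[483019, 943582],
)

# itertuples() yields namedtuples; with the default index=True,
# position 0 is the row label and the columns follow in order.
first = next(df.itertuples())
print(first[0], first[1], first[2], first[3])
```

So line[0] is the DataFrame row label (483019 here), line[1] is user_id, line[2] is item_id and line[3] is rating. Indexing the matrix with line[0]-1 would therefore use the row label, not the user ID.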


train_data_matrix is a matrix of the unique user ids by the unique movie ids. 5689 is a user id, see train_data.head() – Edward


I have added the answers to my question – Edward


But the rows of the matrix are indexed by row number, not by user ID. – hpaulj
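One way to act on that last comment is to translate raw IDs into contiguous positions before filling the matrix. The head() output in the question already shows item_id 3752, which cannot fit on an axis of size 3706, so the IDs evidently have gaps. The following is a minimal sketch on made-up toy data, assuming `pd.factorize` is an acceptable way to build the mapping; it is not the asker's or the answerer's code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings with gaps in both ID ranges.
ratings = pd.DataFrame({
    "user_id": [2968, 5689, 752, 5689],
    "item_id": [2268, 3615, 1147, 2268],
    "rating": [5, 3, 5, 4],
})

# factorize gives every distinct ID a code in 0..n-1
# (in order of first appearance) plus the array of distinct IDs.
user_idx, user_ids = pd.factorize(ratings["user_id"])
item_idx, item_ids = pd.factorize(ratings["item_id"])

# The matrix is sized by the number of distinct IDs and
# addressed by the codes, never by the raw IDs.
matrix = np.zeros((len(user_ids), len(item_ids)))
matrix[user_idx, item_idx] = ratings["rating"].to_numpy()

print(matrix.shape)
```

For a train/test split, the codes would presumably need to be computed on the full DataFrame before splitting, so that a given row or column refers to the same user or movie in both matrices.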
