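I have a DataFrame with two columns of strings, imported from a TSV file. Both columns need to be converted to ASCII (I want to pass the text through a CountVectorizer and TfidfTransformer pipeline in scikit-learn). How do I convert the two DataFrame columns to ASCII?
I have gone through dozens of posts, both on Stack Overflow and elsewhere, but cannot figure this one out. My code is below, including some of the things I have tried.
Any suggestions for getting this to work?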
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split

# Tried adding encoding="utf-8" to read_csv; that does not work
df = pd.read_csv(questions, usecols=[3, 4, 5], nrows=10, header=0, sep="\t")
y = df["is_duplicate"].values
X = df.drop("is_duplicate", axis=1).values
for col in X:
    X = X.encode('utf-8')  # does not work

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=21, stratify=y)

def flat_list(my_list):
    # Flatten a 2-D array of cells into a single list of strings
    return [str(item) for sublist in my_list for item in sublist]

def transform_data(trans_obj_list, dataset_splits):
    X_train = dataset_splits[0].astype(str)
    X_train = flat_list(X_train)
    for trfs in trans_obj_list:
        transformed_vector = trfs().fit(X_train)
    for x in range(0, len(dataset_splits)):
        dataset_splits[x] = flat_list(dataset_splits[x].astype(str))
    return dataset_splits

new_X_train, new_X_test = transform_data([CountVectorizer, TfidfTransformer],
                                         [X_train, X_test])
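
One possible way to approach the failing `X.encode('utf-8')` step is to convert each cell individually rather than the whole array, since `.encode` is a method of individual strings. The sketch below is only an illustration under assumptions: `to_ascii` is a hypothetical helper name, the sample rows are made up, and characters with no ASCII equivalent are silently dropped, which may or may not be acceptable for your data.

```python
import unicodedata


def to_ascii(text):
    """Return an ASCII-only version of text (hypothetical helper)."""
    # NFKD splits accented letters into a base letter plus a combining mark
    decomposed = unicodedata.normalize("NFKD", str(text))
    # errors="ignore" drops every character that still is not ASCII
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


# Made-up rows standing in for the two question columns
rows = [
    ["What is a café?", "Where is the nearest café?"],
    ["How do I write a résumé?", "What is a “smart quote”?"],
]

# Convert cell by cell instead of calling .encode on the whole array
ascii_rows = [[to_ascii(cell) for cell in row] for row in rows]
```

On a DataFrame, the same helper could probably be applied one column at a time, for example `df[col] = df[col].astype(str).map(to_ascii)`, where `col` is a placeholder for each text column name. Note that `astype(str)` turns missing values into the literal string `"nan"`, so those may need handling first.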
Please check my answer; otherwise, could you share a sample of your data? – MedAli
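
Regarding the vectorizing step in the question's `transform_data`: `TfidfTransformer` expects a matrix of counts rather than raw text, so fitting it directly on strings is unlikely to work. A minimal sketch of chaining the two steps with scikit-learn's `Pipeline` follows. The sample texts and the step names `"counts"` and `"tfidf"` are assumptions for illustration only. `CountVectorizer` also accepts `strip_accents="ascii"`, which may make a separate ASCII conversion unnecessary for accented characters.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Made-up texts standing in for the flattened question columns
train_texts = [
    "What is the café menu?",
    "How do I résumé a download?",
    "What is a menu?",
]
test_texts = ["Is the café open?"]

text_pipeline = Pipeline([
    # strip_accents="ascii" maps accented letters to their ASCII base letter
    ("counts", CountVectorizer(strip_accents="ascii")),
    # TfidfTransformer reweights the count matrix produced by the step above
    ("tfidf", TfidfTransformer()),
])

# Fit on the training texts only, then reuse the fitted vocabulary on the test texts
X_train_tfidf = text_pipeline.fit_transform(train_texts)
X_test_tfidf = text_pipeline.transform(test_texts)
```

Fitting on the training split and only transforming the test split keeps both matrices in the same feature space.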