MariaDB: duplicate rows are being inserted

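I have the following Python code that checks whether a MariaDB record already exists before inserting it. However, duplicates are still being inserted. Is something wrong with the code, or is there a better way to do this? I am new to using MariaDB from Python.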

import mysql.connector as mariadb 
from hashlib import sha1 

mariadb_connection = mariadb.connect(user='root', password='', database='tweets_db') 

# The values below are retrieved from Twitter API using Tweepy 
# For simplicity, I've provided some sample values 
id = '1a23bas' 
tweet = 'Clear skies' 
longitude = -84.361549 
latitude = 34.022003 
created_at = '2017-09-27' 
collected_at = '2017-09-27' 
collection_type = 'stream' 
lang = 'us-en' 
place_name = 'Roswell' 
country_code = 'USA' 
cronjob_tag = 'None' 
user_id = '23abask' 
user_name = 'tsoukalos' 
user_geoenabled = 0 
user_lang = 'us-en' 
user_location = 'Roswell' 
user_timezone = 'American/Eastern' 
user_verified = 1 
tweet_hash = sha1(tweet.encode('utf-8')).hexdigest()  # sha1() requires bytes in Python 3

cursor = mariadb_connection.cursor(buffered=True) 
cursor.execute("SELECT Count(id) FROM tweets WHERE tweet_hash = %s", (tweet_hash,)) 
if cursor.fetchone()[0] == 0: 
    cursor.execute("INSERT INTO tweets(id,tweet,tweet_hash,longitude,latitude,created_at,collected_at,collection_type,lang,place_name,country_code,cronjob_tag,user_id,user_name,user_geoenabled,user_lang,user_location,user_timezone,user_verified) VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)", (id,tweet,tweet_hash,longitude,latitude,created_at,collected_at,collection_type,lang,place_name,country_code,cronjob_tag,user_id,user_name,user_geoenabled,user_lang,user_location,user_timezone,user_verified)) 
    mariadb_connection.commit() 
    cursor.close() 
else:
    cursor.close()  # a row with this hash already exists, so nothing is inserted

Here is the table definition.

CREATE TABLE tweets (
    id VARCHAR(255) NOT NULL, 
    tweet VARCHAR(255) NOT NULL, 
    tweet_hash VARCHAR(255) DEFAULT NULL, 
    longitude FLOAT DEFAULT NULL, 
    latitude FLOAT DEFAULT NULL, 
    created_at DATETIME DEFAULT NULL, 
    collected_at DATETIME DEFAULT NULL, 
    collection_type enum('stream','search') DEFAULT NULL, 
    lang VARCHAR(10) DEFAULT NULL, 
    place_name VARCHAR(255) DEFAULT NULL, 
    country_code VARCHAR(5) DEFAULT NULL, 
    cronjob_tag VARCHAR(255) DEFAULT NULL, 
    user_id VARCHAR(255) DEFAULT NULL, 
    user_name VARCHAR(20) DEFAULT NULL, 
    user_geoenabled TINYINT(1) DEFAULT NULL, 
    user_lang VARCHAR(10) DEFAULT NULL, 
    user_location VARCHAR(255) DEFAULT NULL, 
    user_timezone VARCHAR(100) DEFAULT NULL, 
    user_verified TINYINT(1) DEFAULT NULL 
);

Source

2017-09-26 Ham Sam

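Can we see 'SHOW CREATE TABLE mytable' and the SQL that is actually generated?

Sure, I have updated the question with the actual code snippet and the CREATE TABLE syntax. Thanks.

If you are looking for unique tweets, make 'tweet' 'UNIQUE', or at least 'INDEXed'. The "hash" only adds complexity.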

Add a unique constraint to the tweet_hash field.

alter table tweets modify tweet_hash varchar(255) UNIQUE ;

Source

2017-09-27 14:46:13 sfgroups

I also had to add an exception handler to the Python code to ignore the duplicate-insert error: 'except mariadb.IntegrityError'
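The unique-constraint-plus-exception approach from this answer and the comment above could look roughly like the sketch below. To keep it runnable without a database server, it uses the standard-library sqlite3 module as a stand-in for MariaDB and a cut-down three-column table; that substitution is an assumption, not what the original poster ran. With mysql.connector the placeholders would be %s rather than ?, and the exception to catch would be mysql.connector.IntegrityError. The helper name insert_tweet is hypothetical.

```python
import sqlite3
from hashlib import sha1

# Assumption: sqlite3 stands in for MariaDB so the sketch runs without a server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tweets ("
    " id TEXT NOT NULL,"
    " tweet TEXT NOT NULL,"
    " tweet_hash TEXT UNIQUE)"  # the UNIQUE constraint performs the duplicate check
)


def insert_tweet(connection, tweet_id, tweet):
    """Hypothetical helper: return True if the row was stored, False if duplicate."""
    tweet_hash = sha1(tweet.encode("utf-8")).hexdigest()
    try:
        connection.execute(
            "INSERT INTO tweets (id, tweet, tweet_hash) VALUES (?, ?, ?)",
            (tweet_id, tweet, tweet_hash),
        )
        connection.commit()
        return True
    except sqlite3.IntegrityError:
        # The database rejected the row because tweet_hash already exists.
        connection.rollback()
        return False


first = insert_tweet(conn, "1a23bas", "Clear skies")
second = insert_tweet(conn, "9z87xyq", "Clear skies")  # same text, different id
row_count = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
```

Letting the database enforce uniqueness removes the gap between the SELECT and the INSERT in the question's code, which is where two concurrent writers could both see a count of zero.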

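Every table should have a PRIMARY KEY. Should id be it? (The CREATE TABLE does not say so.) By definition a PK is UNIQUE, so inserting a duplicate would cause an error.

Also:

Why have tweet_hash at all? Index tweet instead.
Do not say 255 when there is a specific limit smaller than that.
user_id and user_name should live in a separate "lookup" table, not in this table.
Does user_verified belong to the user, or to each tweet?
If you expect millions of tweets, this table needs to be made smaller and indexed, otherwise you will run into performance problems.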
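One possible reading of the suggestions above is sketched here: id becomes the PRIMARY KEY, the user columns move to a separate lookup table, and tweet gets an index. Again sqlite3 stands in for MariaDB so the sketch runs without a server, and the column sizes (32, 20) and the index name are illustrative assumptions, not values from the question.

```python
import sqlite3

# Assumption: sqlite3 stands in for MariaDB; sizes and names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite-specific switch for FK checks
conn.executescript("""
CREATE TABLE users (
    user_id       VARCHAR(32) PRIMARY KEY,
    user_name     VARCHAR(20),
    user_verified TINYINT
);
CREATE TABLE tweets (
    id      VARCHAR(32) PRIMARY KEY,   -- a PK is unique by definition
    tweet   VARCHAR(255) NOT NULL,
    user_id VARCHAR(32) REFERENCES users(user_id)
);
CREATE INDEX idx_tweets_tweet ON tweets (tweet);
""")

# The user is stored once; each tweet only carries the user_id.
conn.execute("INSERT INTO users VALUES (?, ?, ?)", ("23abask", "tsoukalos", 1))
conn.execute("INSERT INTO tweets VALUES (?, ?, ?)", ("1a23bas", "Clear skies", "23abask"))
conn.execute("INSERT INTO tweets VALUES (?, ?, ?)", ("2b34cbt", "Light rain", "23abask"))

try:
    # Re-inserting the same id violates the primary key.
    conn.execute("INSERT INTO tweets VALUES (?, ?, ?)", ("1a23bas", "Clear skies", "23abask"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

tweets_per_user = conn.execute(
    "SELECT u.user_name, COUNT(*) FROM tweets t "
    "JOIN users u ON u.user_id = t.user_id GROUP BY u.user_name"
).fetchall()
```

Whether user_verified belongs in users, as placed here, depends on the answer to the question raised in the list above.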

Source

2017-09-27 15:26:39

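Thanks for the good points. Here is some reasoning and a few changes based on your suggestions. 1) 'tweet_hash' allows a quick lookup instead of searching a full text string that contains multiple words. 2) I reduced the 'tweet_hash' size to 50; you correctly pointed out that the column sizes were not optimized. 3) You are right, I should restructure this into 2 tables. 4) I do have indexes, but the table does need to become smaller.

The randomness of a "hash" keeps it from being any faster. The overhead of the unnecessary column and index more than offsets any advantage.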
