我有一個表示有向圖一套Django的ORM模型,我試圖找回所有的相鄰頂點給定的頂點忽略的邊緣方向:查詢檢索鄰居太慢
class Vertex(models.Model):
pass
class Edge(models.Model):
orig = models.ForeignKey(Vertex, related_name='%(class)s_orig', null=True, blank=True)
dest = models.ForeignKey(Vertex, related_name='%(class)s_dest', null=True, blank=True)
# ... other data about this edge ...
的查詢Vertex.objects.filter(Q(edge_orig__dest=v) | Q(edge_dest__orig=v)).distinct()
返回正確的結果,但在我的情況下,執行需要很長的時間。
通常,對於我的應用程序,在任何給定時間都會有大約50-100個頂點,並且大約有一百萬條邊。即使它減少到只有20個頂點和100000層的邊緣,該查詢需要大約一分半鐘的執行:
for i in range(20):
Vertex().save()
vxs = list(Vertex.objects.all())
for i in tqdm.tqdm(range(100000)):
Edge(orig = random.sample(vxs,1)[0], dest = random.sample(vxs,1)[0]).save()
v = vxs[0]
def f1():
return list(Vertex.objects.filter(
Q(edge_orig__dest=v) | Q(edge_dest__orig=v)).distinct())
t1 = timeit.Timer(f1)
print(t1.timeit(number=1)) # 84.21138522100227
在另一方面,如果我起來將查詢分爲兩個部分,我可以得到完全相同的導致只有毫秒屈指可數:
def f2():
q1 = Vertex.objects.filter(edge_orig__dest=v).distinct()
q2 = Vertex.objects.filter(edge_dest__orig=v).distinct()
return list({x for x in itertools.chain(q1, q2)})
t2 = timeit.Timer(f2)
print(t2.timeit(number=100)/100) # 0.0109818680600074
這第二個版本雖然有問題:
- 這不是原子。邊界列表幾乎可以保證在我的應用程序中的兩個查詢之間發生變化,這意味着結果不會代表單個時間點。
- 我無法對結果執行額外的處理和聚合,無需手動循環。 (例如,如果我想要
Vertex.objects.filter(Q(edge_orig__dest=v) | Q(edge_dest__orig=v)).distinct().aggregate(avg=Avg('some_field'))
)
爲什麼第二個版本比第一個版本跑得快得多? 我該如何做到這一點,有沒有辦法讓第一個跑得足夠快,以供實際使用?
我目前正在使用Python 3.5.2,PostgreSQL 9.5.6和Django 1.11進行測試,但是如果這是我遇到的問題之一。
這裏是由每個查詢生成的SQL,以及Postgres的的explan:
第一個:
Vertex.objects.filter(Q(edge_orig__dest=v) | Q(edge_dest__orig=v))
SELECT DISTINCT "app_vertex"."id"
FROM "app_vertex"
LEFT OUTER JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id")
LEFT OUTER JOIN "app_edge" T4 ON ("app_vertex"."id" = T4."dest_id")
WHERE ("app_edge"."dest_id" = 1061
OR T4."orig_id" = 1061)
HashAggregate (cost=8275151.47..8275151.67 rows=20 width=4)
Group Key: app_vertex.id
-> Hash Left Join (cost=3183.45..8154147.45 rows=48401608 width=4)
Hash Cond: (app_vertex.id = app_edge.orig_id)
Filter: ((app_edge.dest_id = 1061) OR (t4.orig_id = 1061))
-> Hash Right Join (cost=1.45..2917.45 rows=100000 width=8)
Hash Cond: (t4.dest_id = app_vertex.id)
-> Seq Scan on app_edge t4 (cost=0.00..1541.00 rows=100000 width=8)
-> Hash (cost=1.20..1.20 rows=20 width=4)
-> Seq Scan on app_vertex (cost=0.00..1.20 rows=20 width=4)
-> Hash (cost=1541.00..1541.00 rows=100000 width=8)
-> Seq Scan on app_edge (cost=0.00..1541.00 rows=100000 width=8)
第二的:
Vertex.objects.filter(edge_orig__dest=v).distinct()
SELECT DISTINCT "app_vertex"."id"
FROM "app_vertex"
INNER JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id")
WHERE "app_edge"."dest_id" = 1061
HashAggregate (cost=1531.42..1531.62 rows=20 width=4)
Group Key: app_vertex.id
-> Hash Join (cost=848.11..1519.04 rows=4950 width=4)
Hash Cond: (app_edge.orig_id = app_vertex.id)
-> Bitmap Heap Scan on app_edge (cost=846.65..1449.53 rows=4950 width=4)
Recheck Cond: (dest_id = 1061)
-> Bitmap Index Scan on app_edge_dest_id_4254b90f (cost=0.00..845.42 rows=4950 width=0)
Index Cond: (dest_id = 1061)
-> Hash (cost=1.20..1.20 rows=20 width=4)
-> Seq Scan on app_vertex (cost=0.00..1.20 rows=20 width=4)
@ khampson的版本也需要一分半鐘才能運行,所以它也是不可行的。
Vertex.objects.raw(...)
SELECT DISTINCT "app_vertex"."id"
FROM "app_vertex"
JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id")
JOIN "app_edge" T4 ON ("app_vertex"."id" = T4."dest_id")
WHERE ("app_edge"."dest_id" = 1061
OR T4."orig_id" = 1061);
HashAggregate (cost=8275347.47..8275347.67 rows=20 width=4)
Group Key: app_vertex.id
-> Hash Join (cost=3183.45..8154343.45 rows=48401608 width=4)
Hash Cond: (app_vertex.id = app_edge.orig_id)
Join Filter: ((app_edge.dest_id = 1061) OR (t4.orig_id = 1061))
-> Hash Join (cost=1.45..2917.45 rows=100000 width=12)
Hash Cond: (t4.dest_id = app_vertex.id)
-> Seq Scan on app_edge t4 (cost=0.00..1541.00 rows=100000 width=8)
-> Hash (cost=1.20..1.20 rows=20 width=4)
-> Seq Scan on app_vertex (cost=0.00..1.20 rows=20 width=4)
-> Hash (cost=1541.00..1541.00 rows=100000 width=8)
-> Seq Scan on app_edge (cost=0.00..1541.00 rows=100000 width=8)
你能分享兩個版本正在生成的SQL嗎? – Hamms
@Hamms增加了SQL和Postgres對每個查詢的解釋。 – AJMansfield
我可能會誤解你的模型(實際上,我可能是),但你爲什麼要過濾'edge_orig__dest = v | edge_orig__dest = v',而不僅僅是'edge_dest = v | edge_origt = v'? – Hamms