查詢檢索鄰居太慢

我有一個表示有向圖一套Django的ORM模型，我試圖找回所有的相鄰頂點給定的頂點忽略的邊緣方向：查詢檢索鄰居太慢

class Vertex(models.Model): 
    pass 

class Edge(models.Model): 
    orig = models.ForeignKey(Vertex, related_name='%(class)s_orig', null=True, blank=True) 
    dest = models.ForeignKey(Vertex, related_name='%(class)s_dest', null=True, blank=True) 
    # ... other data about this edge ...

的查詢Vertex.objects.filter(Q(edge_orig__dest=v) | Q(edge_dest__orig=v)).distinct()返回正確的結果，但在我的情況下，執行需要很長的時間。

通常，對於我的應用程序，在任何給定時間都會有大約50-100個頂點，並且大約有一百萬條邊。即使它減少到只有20個頂點和100000層的邊緣，該查詢需要大約一分半鐘的執行：

for i in range(20): 
    Vertex().save() 

vxs = list(Vertex.objects.all()) 
for i in tqdm.tqdm(range(100000)): 
    Edge(orig = random.sample(vxs,1)[0], dest = random.sample(vxs,1)[0]).save() 

v = vxs[0] 
def f1(): 
    return list(Vertex.objects.filter(
     Q(edge_orig__dest=v) | Q(edge_dest__orig=v)).distinct()) 

t1 = timeit.Timer(f1) 

print(t1.timeit(number=1)) # 84.21138522100227

在另一方面，如果我起來將查詢分爲兩個部分，我可以得到完全相同的導致只有毫秒屈指可數：

def f2(): 
    q1 = Vertex.objects.filter(edge_orig__dest=v).distinct() 
    q2 = Vertex.objects.filter(edge_dest__orig=v).distinct() 
    return list({x for x in itertools.chain(q1, q2)}) 

t2 = timeit.Timer(f2) 
print(t2.timeit(number=100)/100) # 0.0109818680600074

這第二個版本雖然有問題：

這不是原子。邊界列表幾乎可以保證在我的應用程序中的兩個查詢之間發生變化，這意味着結果不會代表單個時間點。
我無法對結果執行額外的處理和聚合，無需手動循環。（例如，如果我想要Vertex.objects.filter(Q(edge_orig__dest=v) | Q(edge_dest__orig=v)).distinct().aggregate(avg=Avg('some_field'))）

爲什麼第二個版本比第一個版本跑得快得多？我該如何做到這一點，有沒有辦法讓第一個跑得足夠快，以供實際使用？

我目前正在使用Python 3.5.2，PostgreSQL 9.5.6和Django 1.11進行測試，但是如果這是我遇到的問題之一。

這裏是由每個查詢生成的SQL，以及Postgres的的explan：

第一個：

Vertex.objects.filter(Q(edge_orig__dest=v) | Q(edge_dest__orig=v)) 

SELECT DISTINCT "app_vertex"."id" 
FROM "app_vertex" 
LEFT OUTER JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id") 
LEFT OUTER JOIN "app_edge" T4 ON ("app_vertex"."id" = T4."dest_id") 
WHERE ("app_edge"."dest_id" = 1061 
     OR T4."orig_id" = 1061) 

HashAggregate (cost=8275151.47..8275151.67 rows=20 width=4) 
    Group Key: app_vertex.id 
    -> Hash Left Join (cost=3183.45..8154147.45 rows=48401608 width=4) 
     Hash Cond: (app_vertex.id = app_edge.orig_id) 
     Filter: ((app_edge.dest_id = 1061) OR (t4.orig_id = 1061)) 
     -> Hash Right Join (cost=1.45..2917.45 rows=100000 width=8) 
       Hash Cond: (t4.dest_id = app_vertex.id) 
       -> Seq Scan on app_edge t4 (cost=0.00..1541.00 rows=100000 width=8) 
       -> Hash (cost=1.20..1.20 rows=20 width=4) 
        -> Seq Scan on app_vertex (cost=0.00..1.20 rows=20 width=4) 
     -> Hash (cost=1541.00..1541.00 rows=100000 width=8) 
       -> Seq Scan on app_edge (cost=0.00..1541.00 rows=100000 width=8)

第二的：

Vertex.objects.filter(edge_orig__dest=v).distinct() 

SELECT DISTINCT "app_vertex"."id" 
FROM "app_vertex" 
INNER JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id") 
WHERE "app_edge"."dest_id" = 1061 

HashAggregate (cost=1531.42..1531.62 rows=20 width=4) 
    Group Key: app_vertex.id 
    -> Hash Join (cost=848.11..1519.04 rows=4950 width=4) 
     Hash Cond: (app_edge.orig_id = app_vertex.id) 
     -> Bitmap Heap Scan on app_edge (cost=846.65..1449.53 rows=4950 width=4) 
       Recheck Cond: (dest_id = 1061) 
       -> Bitmap Index Scan on app_edge_dest_id_4254b90f (cost=0.00..845.42 rows=4950 width=0) 
        Index Cond: (dest_id = 1061) 
     -> Hash (cost=1.20..1.20 rows=20 width=4) 
       -> Seq Scan on app_vertex (cost=0.00..1.20 rows=20 width=4)

@ khampson的版本也需要一分半鐘才能運行，所以它也是不可行的。

Vertex.objects.raw(...) 

SELECT DISTINCT "app_vertex"."id" 
FROM "app_vertex" 
JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id") 
JOIN "app_edge" T4 ON ("app_vertex"."id" = T4."dest_id") 
WHERE ("app_edge"."dest_id" = 1061 
     OR T4."orig_id" = 1061); 

HashAggregate (cost=8275347.47..8275347.67 rows=20 width=4) 
    Group Key: app_vertex.id 
    -> Hash Join (cost=3183.45..8154343.45 rows=48401608 width=4) 
     Hash Cond: (app_vertex.id = app_edge.orig_id) 
     Join Filter: ((app_edge.dest_id = 1061) OR (t4.orig_id = 1061)) 
     -> Hash Join (cost=1.45..2917.45 rows=100000 width=12) 
       Hash Cond: (t4.dest_id = app_vertex.id) 
       -> Seq Scan on app_edge t4 (cost=0.00..1541.00 rows=100000 width=8) 
       -> Hash (cost=1.20..1.20 rows=20 width=4) 
        -> Seq Scan on app_vertex (cost=0.00..1.20 rows=20 width=4) 
     -> Hash (cost=1541.00..1541.00 rows=100000 width=8) 
       -> Seq Scan on app_edge (cost=0.00..1541.00 rows=100000 width=8)

來源

2017-07-06 AJMansfield

你能分享兩個版本正在生成的SQL嗎？ – Hamms

@Hamms增加了SQL和Postgres對每個查詢的解釋。 – AJMansfield

我可能會誤解你的模型（實際上，我可能是），但你爲什麼要過濾'edge_orig__dest = v | edge_orig__dest = v'，而不僅僅是'edge_dest = v | edge_origt = v'？ – Hamms

這兩個查詢的查詢計劃完全不同。第一個（較慢）沒有觸及任何索引，並且正在執行兩個left join s，這兩個都會導致更多行被處理並返回。根據我對Django ORM語法意圖的理解，聽起來並不像你真的想在這裏做left join那樣。

我會建議考慮從Django的ORM內掉落下來到原SQL在這種情況下，和雜交兩個。例如如果你把第一個，並將其轉換爲這樣的事情：

SELECT DISTINCT "app_vertex"."id" 
FROM "app_vertex" 
JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id") 
JOIN "app_edge" T4 ON ("app_vertex"."id" = T4."dest_id") 
WHERE ("app_edge"."dest_id" = 1061 
     OR T4."orig_id" = 1061);

兩個問題有：如何該版本執行，以及它給你，你要尋找的結果？

欲瞭解更多關於原始查詢的信息，請查看的Django doc。this section。

響應從OP評論：

因爲我還建議查詢的查詢計劃表明，它沒有擊中任何索引。

對於涉及的列，您是否在兩個表上都有索引？我懷疑沒有，特別是對於這個特定的查詢，我們正在尋找一個單一的值，這意味着如果有索引，如果查詢計劃者確定順序掃描是一個更好的選擇，我會感到非常驚訝（OTOH，如果你是查找各種各樣的行，例如表中超過10％的行，查詢規劃師可能會正確做出這樣的決定）。

來源

2017-07-07 00:07:40 khampson

剛剛嘗試過，它仍然需要一分半鐘才能運行。我已經更新了我的問題以包含此版本的查詢的解釋。 – AJMansfield

@AJMansfield：添加到以上評論的後續回答中。 – khampson

我不確定是否有索引或沒有索引。這些表只是使用普通的django migrations直接從模型類創建的，所以我不確定它是否創建了索引。 – AJMansfield

我提出的另一個查詢可能是：

# Get edges which contain Vertex v, "only" optimizes fields returned 
edges = Edge.objects.filter(Q(orig=v) | Q(dest=v)).only('orig_id', 'dest_id') 
# Get set of vertex id's to discard duplicates 
vertex_ids = {*edges.values_list('orig_id', flat=True), *edges_values_list('dest_id', flat=True)} 
# Get list of vertices, excluding the original vertex 
vertices = Vertex.objects.filter(pk__in=vertex_ids).exclude(pk=v.pk)

這應該不需要任何連接和不應該從你所提到的競爭條件受到影響。

來源

2017-07-07 03:32:33 ARJMP

查詢檢索鄰居太慢

回答

相關問題