How can I speed up a program that finds the shortest path between two Wikipedia articles?

I recently wrote a program that finds the shortest path between two Wikipedia articles. The problem is that fetching all the links on a page and putting them into the graph takes a very long time; finding the path is the easy part. Basically, what I am doing is this:
startingPage = 'Lisbon'
target = 'Adolf Hitler'
graph = nx.DiGraph()
graph.add_node(startingPage)
found = pages.import_page(graph, startingPage, target)
while found != True:
    # Expand every leaf, i.e. every node whose links have not been fetched yet.
    for node in list(graph):
        if graph.out_degree(node) == 0:
            found = pages.import_page(graph, node, target)
            if found == True:
                break
And my import_page function is this one:
import json
import urllib.parse
import urllib.request as url

def import_page(graph, starting, target):
    general_str = 'https://en.wikipedia.org/w/api.php?action=query&prop=links&pllimit=max&format=json&titles='
    data_str = general_str + urllib.parse.quote(starting)  # percent-encode the title
    response = url.urlopen(data_str)
    data = json.loads(response.read())
    pageId = list(data['query']['pages'].keys())[0]
    print(starting)  # progress trace
    if pageId == '-1':  # the page does not exist on Wikipedia
        return False
    if 'links' not in data['query']['pages'][pageId]:  # the page has no links in it
        return False
    for jsonObject in data['query']['pages'][pageId]['links']:
        graph.add_node(jsonObject['title'])
        graph.add_edge(starting, jsonObject['title'])
        if jsonObject['title'] == target:
            return True
    # Keep requesting continuation batches until the API reports 'batchcomplete'.
    while 'batchcomplete' not in data:
        continueId = data['continue']['plcontinue']
        continue_str = data_str + '&plcontinue=' + urllib.parse.quote(continueId)
        response = url.urlopen(continue_str)
        data = json.loads(response.read())
        for jsonObject in data['query']['pages'][pageId]['links']:
            graph.add_node(jsonObject['title'])
            graph.add_edge(starting, jsonObject['title'])
            if jsonObject['title'] == target:
                return True
    return False
The problem is that for any target more than 2 or 3 links away, this takes a huge amount of time. Any ideas on how I can speed it up?
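One likely bottleneck is that every page costs at least one HTTP round trip. As a sketch (not part of the original code): the MediaWiki query API accepts up to 50 pipe-separated titles per request, so a whole frontier of pages can be fetched in batches instead of one at a time. fetch_links_batch below is a hypothetical helper built on that, using the same plcontinue continuation pattern as import_page above.

import json
import urllib.parse
import urllib.request

API = 'https://en.wikipedia.org/w/api.php'

def fetch_links_batch(titles):
    # Fetch the outgoing links of up to 50 titles in a single request.
    # Returns {title: [linked titles]}; note the API may normalize titles
    # (e.g. underscores to spaces), so the keys follow the API's output.
    params = {
        'action': 'query',
        'prop': 'links',
        'pllimit': 'max',
        'format': 'json',
        'titles': '|'.join(titles),  # up to 50 titles per request
    }
    links = {}
    while True:
        query = urllib.parse.urlencode(params)
        response = urllib.request.urlopen(API + '?' + query)
        data = json.loads(response.read())
        for page in data['query']['pages'].values():
            for link in page.get('links', []):
                links.setdefault(page['title'], []).append(link['title'])
        if 'continue' not in data:
            break
        params.update(data['continue'])  # carries plcontinue into the next request
    return links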
Why do you care about the edges? Don't you just want to keep a list of active nodes and a list of visited nodes? Another question (possibly just down to my not knowing the graph library): is it valid to modify the graph while iterating over its nodes? –
Also, the way it's written here, your 'import_page' really is 'import_Adolf_Hitler_page' :) –
Yes, I've already updated it, I forgot to update it here! –
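For what it's worth, here is a minimal sketch of the frontier/visited idea from the first comment, assuming a hypothetical fetch_links(title) helper that returns one article's outgoing links (for example, a thin wrapper over the batched request above). It is a plain breadth-first search with parent pointers, so no graph object is needed until the path is reconstructed.

from collections import deque

def shortest_path(start, target, fetch_links):
    # fetch_links(title) is assumed to return the list of titles the
    # article links to; this function only does the BFS bookkeeping.
    parent = {start: None}          # doubles as the visited set
    frontier = deque([start])
    while frontier:
        page = frontier.popleft()
        if page == target:
            # Walk the parent pointers back to reconstruct the path.
            path = []
            while page is not None:
                path.append(page)
                page = parent[page]
            return path[::-1]
        for link in fetch_links(page):
            if link not in parent:  # skip already-visited titles
                parent[link] = page
                frontier.append(link)
    return None  # no path found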