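I am trying to scrape all the file paths from this link: https://github.com/themichaelusa/Trinitum/find/master, without using the GitHub API at all. Scraping the file paths from the GitHub repo yields a 400 response, although viewing the same URL in a browser works fine.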

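The page at that link contains a data-url attribute in its HTML (on the table with id='tree-finder-results' and class='tree-browser css-truncate'), which is used to build a URL like this: https://github.com/themichaelusa/Trinitum/tree-list/45a2ca7145369bee6c31a54c30fca8d3f0aae6cd

That URL displays this dictionary: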

{"paths":["Examples/advanced_example.py","Examples/basic_example.py","LICENSE","README.md","Trinitum/AsyncManager.py","Trinitum/Constants.py","Trinitum/DatabaseManager.py","Trinitum/Diagnostics.py","Trinitum/Order.py","Trinitum/Pipeline.py","Trinitum/Position.py","Trinitum/RSU.py","Trinitum/Strategy.py","Trinitum/TradingInstance.py","Trinitum/Trinitum.py","Trinitum/Utilities.py","Trinitum/__init__.py","setup.cfg","setup.py"]} 

when you view it in a browser such as Chrome. However, a GET request from my code yields <Response [400]>.

Here is the code I am using:

import requests
from bs4 import BeautifulSoup

username, repo = 'themichaelusa', 'Trinitum'
ghURL = 'https://github.com'
url = ghURL + '/{}/{}/find/master'.format(username, repo)
html = requests.get(url)
soup = BeautifulSoup(html.text, "lxml")
# The table inside this div carries the data-url attribute
repoContent = soup.find('div', class_='tree-finder clearfix')
fileLinksURL = ghURL + str(repoContent.find('table').attrs['data-url'])
filePaths = requests.get(fileLinksURL)
print(filePaths)  # this is the request that comes back as a 400

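I'm not sure what is wrong. My theory is that the first link sets a cookie that allows the second link to display the file paths of the repo we are targeting. I'm just not sure how to do this in code. I would really appreciate some pointers!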
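One way to test the cookie theory is to send both requests through a single cookie jar, so that anything set by the first response is sent back with the second. The following is only a sketch using the standard library (a requests.Session would play the same role). The names find_data_url, make_opener and fetch_paths, and the extra_headers parameter, are hypothetical helpers introduced here, and nothing in it is confirmed to remove the 400.

```python
import json
from html.parser import HTMLParser
from http.cookiejar import CookieJar
from urllib.parse import urljoin
from urllib.request import HTTPCookieProcessor, Request, build_opener


class DataUrlFinder(HTMLParser):
    """Remember the data-url of the table with id 'tree-finder-results'."""

    def __init__(self):
        super().__init__()
        self.data_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table" and attrs.get("id") == "tree-finder-results":
            self.data_url = attrs.get("data-url")


def find_data_url(html):
    """Return the data-url attribute value, or None if the table is absent."""
    finder = DataUrlFinder()
    finder.feed(html)
    return finder.data_url


def make_opener():
    """Build an opener whose requests all share one cookie jar."""
    return build_opener(HTTPCookieProcessor(CookieJar()))


def fetch_paths(find_url, opener=None, extra_headers=None):
    """Fetch the find page, then the tree-list URL, with shared cookies.

    extra_headers is a hypothetical hook for headers copied from the
    browser's developer tools, in case cookies alone are not enough.
    Note that opener.open raises HTTPError on a 400 response.
    """
    opener = opener or make_opener()
    with opener.open(find_url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    data_url = find_data_url(html)
    if data_url is None:
        raise ValueError("no data-url attribute found in the page")
    req = Request(urljoin(find_url, data_url), headers=extra_headers or {})
    with opener.open(req) as resp:
        return json.loads(resp.read().decode("utf-8"))["paths"]
```

Calling fetch_paths('https://github.com/themichaelusa/Trinitum/find/master') would then exercise the theory. If it still fails, the request headers the browser sends are the next thing to compare.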

Did you notice that 'Examples/advanced_example.py' is not relative to 'https://github.com/themichaelusa/Trinitum/find/master', but to 'https://github.com/themichaelusa/Trinitum/blob/master'?
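To illustrate the point about the base URL, here is a small sketch that turns the "paths" dictionary from the question into full file URLs. The helper name blob_urls and the constant BLOB_BASE are made up for this example, and the trailing slash on the base is what makes urljoin append instead of replacing the last segment.

```python
import json
from urllib.parse import quote, urljoin

# Assumed base for file pages; note the trailing slash.
BLOB_BASE = "https://github.com/themichaelusa/Trinitum/blob/master/"


def blob_urls(tree_list_json, base=BLOB_BASE):
    """Map each relative path in the tree-list JSON to a URL under base."""
    paths = json.loads(tree_list_json)["paths"]
    # quote() leaves '/' alone but escapes characters such as spaces.
    return [urljoin(base, quote(path)) for path in paths]


sample = '{"paths":["Examples/advanced_example.py","LICENSE"]}'
for file_url in blob_urls(sample):
    print(file_url)

# Joining against the find page instead drops its last segment ('master'):
print(urljoin("https://github.com/themichaelusa/Trinitum/find/master",
              "Examples/advanced_example.py"))
```

The last line prints a URL under /find/ with 'master' missing, which is why the paths need the blob/master base.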

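My suggestion is to use the browser's developer tools to check carefully which requests are actually sent, then print 'url' and 'fileLinksURL' and compare them.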

Answer

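Give this a go. The links to the .py files are generated dynamically, so to capture them you need to use Selenium. I think this is the output you expected.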

from urllib.parse import urljoin

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://github.com/themichaelusa/Trinitum/find/master'

# Load the page in a real browser so the JavaScript-generated links exist
driver = webdriver.Chrome()
driver.get(url)
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()

# Resolve each link's href against the page URL to get an absolute URL
for link in soup.select('#tree-finder-results .js-tree-finder-path'):
    print(urljoin(url, link['href']))

Partial results:

https://github.com/themichaelusa/Trinitum/blob/master 
https://github.com/themichaelusa/Trinitum/blob/master/Examples/advanced_example.py 
https://github.com/themichaelusa/Trinitum/blob/master/Examples/basic_example.py 
https://github.com/themichaelusa/Trinitum/blob/master/LICENSE 
https://github.com/themichaelusa/Trinitum/blob/master/README.md 
https://github.com/themichaelusa/Trinitum/blob/master/Trinitum/AsyncManager.py 

@Michael Usachenko, did you try this code? – SIM
