Scraping file paths from a GitHub repo yields a 400 response, but viewing in the browser works fine

I'm trying to scrape all the file paths from this link: https://github.com/themichaelusa/Trinitum/find/master, without using the GitHub API at all.

The HTML of the page above contains a data-url attribute (on the table with id='tree-finder-results', class='tree-browser css-truncate'), which is used to build a URL like this: https://github.com/themichaelusa/Trinitum/tree-list/45a2ca7145369bee6c31a54c30fca8d3f0aae6cd

That URL displays this dictionary:

{"paths":["Examples/advanced_example.py","Examples/basic_example.py","LICENSE","README.md","Trinitum/AsyncManager.py","Trinitum/Constants.py","Trinitum/DatabaseManager.py","Trinitum/Diagnostics.py","Trinitum/Order.py","Trinitum/Pipeline.py","Trinitum/Position.py","Trinitum/RSU.py","Trinitum/Strategy.py","Trinitum/TradingInstance.py","Trinitum/Trinitum.py","Trinitum/Utilities.py","Trinitum/__init__.py","setup.cfg","setup.py"]}

That is what a browser such as Chrome shows. However, a GET request sent from code yields <Response [400]>.

Here is the code I'm using:

import requests
from bs4 import BeautifulSoup

username, repo = 'themichaelusa', 'Trinitum'
ghURL = 'https://github.com'
url = ghURL + '/{}/{}/find/master'.format(username, repo)
html = requests.get(url)
soup = BeautifulSoup(html.text, "lxml")
# The results table carries the tree-list path in its data-url attribute
repoContent = soup.find('div', class_='tree-finder clearfix')
fileLinksURL = ghURL + str(repoContent.find('table').attrs['data-url'])
filePaths = requests.get(fileLinksURL)
print(filePaths)  # prints <Response [400]>

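I'm not sure what's wrong. My theory is that the first link creates a cookie that allows the second link to display the file paths of the repo being targeted. I'm just not sure how to achieve this in code. I'd really appreciate some pointers!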
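One way to test the cookie theory is to send both requests through a single `requests.Session`, which stores any cookies from the first response and sends them with the second. The sketch below is only an illustration under stated assumptions: the `X-Requested-With` and `Referer` headers are a guess at what a browser's background request looks like, and nothing in this thread confirms that either the cookie or the headers are what the server checks. The helper names (`DataUrlFinder`, `find_data_url`, `fetch_paths`) are hypothetical, and the page markup may have changed since the question was asked. The standard-library `HTMLParser` is used only to keep the sketch free of extra dependencies; BeautifulSoup, as in the question, would work equally well.

```python
from html.parser import HTMLParser

import requests

GH_URL = "https://github.com"


class DataUrlFinder(HTMLParser):
    """Collects the data-url attribute of the tree-finder results table."""

    def __init__(self):
        super().__init__()
        self.data_url = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "table" and attributes.get("id") == "tree-finder-results":
            self.data_url = attributes.get("data-url")


def find_data_url(html_text):
    """Return the table's data-url value, or None if the table is absent."""
    finder = DataUrlFinder()
    finder.feed(html_text)
    return finder.data_url


def fetch_paths(username, repo, branch="master"):
    """Fetch the finder page and the tree list through one shared session."""
    finder_url = "{}/{}/{}/find/{}".format(GH_URL, username, repo, branch)
    with requests.Session() as session:
        # Cookies set by this response are stored on the session...
        page = session.get(finder_url)
        page.raise_for_status()
        data_url = find_data_url(page.text)
        if data_url is None:
            raise RuntimeError("no data-url found on " + finder_url)
        # ...and are sent automatically with this second request.
        # ASSUMPTION: these headers imitate a browser's background request;
        # the thread does not confirm that they are required.
        response = session.get(
            GH_URL + data_url,
            headers={
                "X-Requested-With": "XMLHttpRequest",
                "Referer": finder_url,
            },
        )
        response.raise_for_status()
        return response.json()["paths"]
```

Calling `fetch_paths('themichaelusa', 'Trinitum')` would then either return the list of paths or raise on a 4xx/5xx status, which makes it easy to tell whether the shared session changed the outcome.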

Source

2017-10-05 Michael Usachenko

Did you notice that 'Examples/advanced_example.py' is not relative to 'https://github.com/themichaelusa/Trinitum/find/master', but to 'https://github.com/themichaelusa/Trinitum/blob/master'?
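To illustrate the point about the base URL, here is a minimal sketch that turns the "paths" dictionary from the question into full links by joining each entry onto the blob/master prefix. The function name `blob_urls` and the shortened sample string are made up for illustration; only the prefix and the path values come from the thread.

```python
import json
from urllib.parse import quote

BLOB_BASE = "https://github.com/themichaelusa/Trinitum/blob/master/"


def blob_urls(tree_list_json, base=BLOB_BASE):
    """Join every entry of the "paths" list onto the blob/master prefix."""
    paths = json.loads(tree_list_json)["paths"]
    # quote() leaves "/" intact and escapes characters such as spaces
    return [base + quote(path) for path in paths]


# Shortened copy of the dictionary shown in the question
sample = '{"paths":["Examples/advanced_example.py","LICENSE","Trinitum/__init__.py"]}'
for link in blob_urls(sample):
    print(link)
```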

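My suggestion is to use the browser's developer tools to check carefully which requests are actually sent, then print 'url' and 'fileLinksURL' and compare them.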
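The comparison suggested above can be done on the Python side without sending anything: `requests` can prepare a request and expose the headers it would send, which can then be set next to what the browser's network tab shows. This is a rough sketch; `default_request_headers` is a hypothetical helper, and the two URLs are the ones quoted in the question.

```python
import requests


def default_request_headers(url, extra_headers=None):
    """Return the headers requests would send for a GET, without sending it."""
    session = requests.Session()
    request = requests.Request("GET", url, headers=extra_headers or {})
    # prepare_request merges the session defaults into the request
    prepared = session.prepare_request(request)
    return dict(prepared.headers)


url = "https://github.com/themichaelusa/Trinitum/find/master"
fileLinksURL = ("https://github.com/themichaelusa/Trinitum/tree-list/"
                "45a2ca7145369bee6c31a54c30fca8d3f0aae6cd")

for name, target in (("url", url), ("fileLinksURL", fileLinksURL)):
    print(name, "=", target)
    for header, value in sorted(default_request_headers(target).items()):
        print("   ", header + ":", value)
```

Any header or cookie that the browser sends for the tree-list request but that is missing from this output is a candidate explanation for the 400, and can be tried by passing it through `extra_headers`.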

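Give this a go. The links to the .py files are generated dynamically, so to catch them you need to use Selenium. I think this is what you were expecting.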

from urllib.parse import urljoin

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://github.com/themichaelusa/Trinitum/find/master'
driver = webdriver.Chrome()
driver.get(url)
# Parse the HTML after the browser has run the page's JavaScript
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()
# Each result link is relative; resolve it against the page URL
for link in soup.select('#tree-finder-results .js-tree-finder-path'):
    print(urljoin(url, link['href']))

Partial results:

https://github.com/themichaelusa/Trinitum/blob/master 
https://github.com/themichaelusa/Trinitum/blob/master/Examples/advanced_example.py 
https://github.com/themichaelusa/Trinitum/blob/master/Examples/basic_example.py 
https://github.com/themichaelusa/Trinitum/blob/master/LICENSE 
https://github.com/themichaelusa/Trinitum/blob/master/README.md 
https://github.com/themichaelusa/Trinitum/blob/master/Trinitum/AsyncManager.py

Source

2017-10-05 07:43:40 SIM

@Michael Usachenko, have you tried this code? – SIM
